diff --git a/.nojekyll b/.nojekyll
new file mode 100644
index 00000000..e69de29b
diff --git a/cache.json b/cache.json
new file mode 100644
index 00000000..61744909
--- /dev/null
+++ b/cache.json
@@ -0,0 +1 @@
+{"2024-12-12T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.09614v1","updated":"2024-12-12T18:59:41Z","published":"2024-12-12T18:59:41Z","title":"Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge\n Graph-Based RAG","summary":" We introduce a novel approach to enhance the capabilities of text-to-image\nmodels by incorporating a graph-based RAG. Our system dynamically retrieves\ndetailed character information and relational data from the knowledge graph,\nenabling the generation of visually accurate and contextually rich images. This\ncapability significantly improves upon the limitations of existing T2I models,\nwhich often struggle with the accurate depiction of complex or culturally\nspecific subjects due to dataset constraints. Furthermore, we propose a novel\nself-correcting mechanism for text-to-image models to ensure consistency and\nfidelity in visual outputs, leveraging the rich context from the graph to guide\ncorrections. Our qualitative and quantitative experiments demonstrate that\nContext Canvas significantly enhances the capabilities of popular models such\nas Flux, Stable Diffusion, and DALL-E, and improves the functionality of\nControlNet for fine-grained image editing tasks. To our knowledge, Context\nCanvas represents the first application of graph-based RAG in enhancing T2I\nmodels, representing a significant advancement for producing high-fidelity,\ncontext-aware multi-faceted images.\n","authors":["Kavana Venkatesh","Yusuf Dalva","Ismini Lourentzou","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2412.09614v1.pdf","comment":"Project Page: https://context-canvas.github.io/"},{"id":"http://arxiv.org/abs/2412.09612v1","updated":"2024-12-12T18:59:40Z","published":"2024-12-12T18:59:40Z","title":"Olympus: A Universal Task Router for Computer Vision Tasks","summary":" We introduce Olympus, a new approach that transforms Multimodal Large\nLanguage Models (MLLMs) into a unified framework capable of handling a wide\narray of computer vision tasks. Utilizing a controller MLLM, Olympus delegates\nover 20 specialized tasks across images, videos, and 3D objects to dedicated\nmodules. This instruction-based routing enables complex workflows through\nchained actions without the need for training heavy generative models. Olympus\neasily integrates with existing MLLMs, expanding their capabilities with\ncomparable performance. Experimental results demonstrate that Olympus achieves\nan average routing accuracy of 94.75% across 20 tasks and precision of 91.82%\nin chained action scenarios, showcasing its effectiveness as a universal task\nrouter that can solve a diverse range of computer vision tasks. Project page:\nhttps://github.com/yuanze-lin/Olympus_page\n","authors":["Yuanze Lin","Yunsheng Li","Dongdong Chen","Weijian Xu","Ronald Clark","Philip H. S. Torr"],"pdf_url":"https://arxiv.org/pdf/2412.09612v1.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2412.09605v1","updated":"2024-12-12T18:59:27Z","published":"2024-12-12T18:59:27Z","title":"AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web\n Tutorials","summary":" Graphical User Interface (GUI) agents hold great potential for automating\ncomplex tasks across diverse digital environments, from web applications to\ndesktop software. However, the development of such agents is hindered by the\nlack of high-quality, multi-step trajectory data required for effective\ntraining. 
Existing approaches rely on expensive and labor-intensive human\nannotation, making them unsustainable at scale. To address this challenge, we\npropose AgentTrek, a scalable data synthesis pipeline that generates\nhigh-quality GUI agent trajectories by leveraging web tutorials. Our method\nautomatically gathers tutorial-like texts from the internet, transforms them\ninto task goals with step-by-step instructions, and employs a visual-language\nmodel agent to simulate their execution in a real digital environment. A\nVLM-based evaluator ensures the correctness of the generated trajectories. We\ndemonstrate that training GUI agents with these synthesized trajectories\nsignificantly improves their grounding and planning performance over the\ncurrent models. Moreover, our approach is more cost-efficient compared to\ntraditional human annotation methods. This work underscores the potential of\nguided replay with web tutorials as a viable strategy for large-scale GUI agent\ntraining, paving the way for more capable and autonomous digital agents.\n","authors":["Yiheng Xu","Dunjie Lu","Zhennan Shen","Junli Wang","Zekun Wang","Yuchen Mao","Caiming Xiong","Tao Yu"],"pdf_url":"https://arxiv.org/pdf/2412.09605v1.pdf","comment":"https://agenttrek.github.io"},{"id":"http://arxiv.org/abs/2412.09601v1","updated":"2024-12-12T18:59:11Z","published":"2024-12-12T18:59:11Z","title":"TimeRefine: Temporal Grounding with Time Refining Video LLM","summary":" Video temporal grounding aims to localize relevant temporal boundaries in a\nvideo given a textual prompt. Recent work has focused on enabling Video LLMs to\nperform video temporal grounding via next-token prediction of temporal\ntimestamps. However, accurately localizing timestamps in videos remains\nchallenging for Video LLMs when relying solely on temporal token prediction.\nOur proposed TimeRefine addresses this challenge in two ways. First, instead of\ndirectly predicting the start and end timestamps, we reformulate the temporal\ngrounding task as a temporal refining task: the model first makes rough\npredictions and then refines them by predicting offsets to the target segment.\nThis refining process is repeated multiple times, through which the model\nprogressively self-improves its temporal localization accuracy. Second, to\nenhance the model's temporal perception capabilities, we incorporate an\nauxiliary prediction head that penalizes the model more if a predicted segment\ndeviates further from the ground truth, thus encouraging the model to make\ncloser and more accurate predictions. Our plug-and-play method can be\nintegrated into most LLM-based temporal grounding approaches. The experimental\nresults demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on\nthe ActivityNet and Charades-STA datasets, respectively. Code and pretrained\nmodels will be released.\n","authors":["Xizi Wang","Feng Cheng","Ziyang Wang","Huiyu Wang","Md Mohaiminul Islam","Lorenzo Torresani","Mohit Bansal","Gedas Bertasius","David Crandall"],"pdf_url":"https://arxiv.org/pdf/2412.09601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09596v1","updated":"2024-12-12T18:58:30Z","published":"2024-12-12T18:58:30Z","title":"InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for\n Long-term Streaming Video and Audio Interactions","summary":" Creating AI systems that can interact with environments over long periods,\nsimilar to human cognition, has been a longstanding research goal. 
Recent\nadvancements in multimodal large language models (MLLMs) have made significant\nstrides in open-world understanding. However, the challenge of continuous and\nsimultaneous streaming perception, memory, and reasoning remains largely\nunexplored. Current MLLMs are constrained by their sequence-to-sequence\narchitecture, which limits their ability to process inputs and generate\nresponses simultaneously, akin to being unable to think while perceiving.\nFurthermore, relying on long contexts to store historical data is impractical\nfor long-term interactions, as retaining all information becomes costly and\ninefficient. Therefore, rather than relying on a single foundation model to\nperform all functions, this project draws inspiration from the concept of the\nSpecialized Generalist AI and introduces disentangled streaming perception,\nreasoning, and memory mechanisms, enabling real-time interaction with streaming\nvideo and audio input. The proposed framework InternLM-XComposer2.5-OmniLive\n(IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module:\nProcesses multimodal information in real-time, storing key details in memory\nand triggering reasoning in response to user queries. (2) Multi-modal Long\nMemory Module: Integrates short-term and long-term memory, compressing\nshort-term memories into long-term ones for efficient retrieval and improved\naccuracy. (3) Reasoning Module: Responds to queries and executes reasoning\ntasks, coordinating with the perception and memory modules. This project\nsimulates human-like cognition, enabling multimodal large language models to\nprovide continuous and adaptive service over time.\n","authors":["Pan Zhang","Xiaoyi Dong","Yuhang Cao","Yuhang Zang","Rui Qian","Xilin Wei","Lin Chen","Yifei Li","Junbo Niu","Shuangrui Ding","Qipeng Guo","Haodong Duan","Xin Chen","Han Lv","Zheng Nie","Min Zhang","Bin Wang","Wenwei Zhang","Xinyue Zhang","Jiaye Ge","Wei Li","Jingwen Li","Zhongying Tu","Conghui He","Xingcheng Zhang","Kai Chen","Yu Qiao","Dahua Lin","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09596v1.pdf","comment":"Github Repo:\n https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-OmniLive"},{"id":"http://arxiv.org/abs/2412.09587v1","updated":"2024-12-12T18:55:53Z","published":"2024-12-12T18:55:53Z","title":"OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets\n in 50+ Languages","summary":" We present OpenNER 1.0, a standardized collection of openly available named\nentity recognition (NER) datasets. OpenNER contains 34 datasets spanning 51\nlanguages, annotated in varying named entity ontologies. We correct annotation\nformat issues, standardize the original datasets into a uniform representation,\nmap entity type names to be more consistent across corpora, and provide the\ncollection in a structure that enables research in multilingual and\nmulti-ontology NER. 
We provide baseline models using three pretrained\nmultilingual language models to compare the performance of recent models and\nfacilitate future research in NER.\n","authors":["Chester Palen-Michel","Maxwell Pickering","Maya Kruse","Jonne Sälevä","Constantine Lignos"],"pdf_url":"https://arxiv.org/pdf/2412.09587v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09578v1","updated":"2024-12-12T18:53:46Z","published":"2024-12-12T18:53:46Z","title":"DISHONEST: Dissecting misInformation Spread using Homogeneous sOcial\n NEtworks and Semantic Topic classification","summary":" The emergence of the COVID-19 pandemic resulted in a significant rise in the\nspread of misinformation on online platforms such as Twitter. Oftentimes this\ngrowth is blamed on the idea of the \"echo chamber.\" However, the behavior said\nto characterize these echo chambers exists in two dimensions. The first is in a\nuser's social interactions, where they are said to stick with the same clique\nof like-minded users. The second is in the content of their posts, where they\nare said to repeatedly espouse homogeneous ideas. In this study, we link the\ntwo by using Twitter's network of retweets to study social interactions and\ntopic modeling to study tweet content. In order to measure the diversity of a\nuser's interactions over time, we develop a novel metric to track the speed at\nwhich they travel through the social network. The application of these analysis\nmethods to misinformation-focused data from the pandemic demonstrates\ncorrelation between social behavior and tweet content. We believe this\ncorrelation supports the common intuition about how antisocial users behave,\nand further suggests that it holds even in subcommunities already rife with\nmisinformation.\n","authors":["Caleb Stam","Emily Saldanha","Mahantesh Halappanavar","Anurag Acharya"],"pdf_url":"https://arxiv.org/pdf/2412.09578v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09572v1","updated":"2024-12-12T18:52:40Z","published":"2024-12-12T18:52:40Z","title":"DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through\n Diverse Perspectives and Multi-Agent Interaction","summary":" Quantifying the uncertainty in the factual parametric knowledge of Large\nLanguage Models (LLMs), especially in a black-box setting, poses a significant\nchallenge. Existing methods, which gauge a model's uncertainty through\nevaluating self-consistency in responses to the original query, do not always\ncapture true uncertainty. Models might respond consistently to the origin query\nwith a wrong answer, yet respond correctly to varied questions from different\nperspectives about the same query, and vice versa. In this paper, we propose a\nnovel method, DiverseAgentEntropy, for evaluating a model's uncertainty using\nmulti-agent interaction under the assumption that if a model is certain, it\nshould consistently recall the answer to the original query across a diverse\ncollection of questions about the same original query. We further implement an\nabstention policy to withhold responses when uncertainty is high. 
Our method\noffers a more accurate prediction of the model's reliability and further\ndetects hallucinations, outperforming other self-consistency-based methods.\nAdditionally, it demonstrates that existing models often fail to consistently\nretrieve the correct answer to the same query under diverse varied questions\neven when knowing the correct answer.\n","authors":["Yu Feng","Phu Mon Htut","Zheng Qi","Wei Xiao","Manuel Mager","Nikolaos Pappas","Kishaloy Halder","Yang Li","Yassine Benajiba","Dan Roth"],"pdf_url":"https://arxiv.org/pdf/2412.09572v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09569v1","updated":"2024-12-12T18:51:13Z","published":"2024-12-12T18:51:13Z","title":"JuStRank: Benchmarking LLM Judges for System Ranking","summary":" Given the rapid progress of generative AI, there is a pressing need to\nsystematically compare and choose between the numerous models and\nconfigurations available. The scale and versatility of such evaluations make\nthe use of LLM-based judges a compelling solution for this challenge.\nCrucially, this approach requires first to validate the quality of the LLM\njudge itself. Previous work has focused on instance-based assessment of LLM\njudges, where a judge is evaluated over a set of responses, or response pairs,\nwhile being agnostic to their source systems. We argue that this setting\noverlooks critical factors affecting system-level ranking, such as a judge's\npositive or negative bias towards certain systems. To address this gap, we\nconduct the first large-scale study of LLM judges as system rankers. System\nscores are generated by aggregating judgment scores over multiple system\noutputs, and the judge's quality is assessed by comparing the resulting system\nranking to a human-based ranking. Beyond overall judge assessment, our analysis\nprovides a fine-grained characterization of judge behavior, including their\ndecisiveness and bias.\n","authors":["Ariel Gera","Odellia Boni","Yotam Perlitz","Roy Bar-Haim","Lilach Eden","Asaf Yehudai"],"pdf_url":"https://arxiv.org/pdf/2412.09569v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09563v1","updated":"2024-12-12T18:48:51Z","published":"2024-12-12T18:48:51Z","title":"Does Representation Matter? Exploring Intermediate Layers in Large\n Language Models","summary":" Understanding what defines a good representation in large language models\n(LLMs) is fundamental to both theoretical understanding and practical\napplications. In this paper, we investigate the quality of intermediate\nrepresentations in various LLM architectures, including Transformers and State\nSpace Models (SSMs). We find that intermediate layers often yield more\ninformative representations for downstream tasks than the final layers. To\nmeasure the representation quality, we adapt and apply a suite of metrics -\nsuch as prompt entropy, curvature, and augmentation-invariance - originally\nproposed in other contexts. Our empirical study reveals significant\narchitectural differences, how representations evolve throughout training, and\nhow factors like input randomness and prompt length affect each layer. Notably,\nwe observe a bimodal pattern in the entropy of some intermediate layers and\nconsider potential explanations tied to training data. 
Overall, our results\nilluminate the internal mechanics of LLMs and guide strategies for\narchitectural optimization and training.\n","authors":["Oscar Skean","Md Rifat Arefin","Yann LeCun","Ravid Shwartz-Ziv"],"pdf_url":"https://arxiv.org/pdf/2412.09563v1.pdf","comment":"Accepted to 2024 NeurIPs Workshop on Machine Learning and Compression"},{"id":"http://arxiv.org/abs/2412.09560v1","updated":"2024-12-12T18:46:38Z","published":"2024-12-12T18:46:38Z","title":"Foundational Large Language Models for Materials Research","summary":" Materials discovery and development are critical for addressing global\nchallenges. Yet, the exponential growth in materials science literature\ncomprising vast amounts of textual data has created significant bottlenecks in\nknowledge extraction, synthesis, and scientific reasoning. Large Language\nModels (LLMs) offer unprecedented opportunities to accelerate materials\nresearch through automated analysis and prediction. Still, their effective\ndeployment requires domain-specific adaptation for understanding and solving\ndomain-relevant tasks. Here, we present LLaMat, a family of foundational models\nfor materials science developed through continued pretraining of LLaMA models\non an extensive corpus of materials literature and crystallographic data.\nThrough systematic evaluation, we demonstrate that LLaMat excels in\nmaterials-specific NLP and structured information extraction while maintaining\ngeneral linguistic capabilities. The specialized LLaMat-CIF variant\ndemonstrates unprecedented capabilities in crystal structure generation,\npredicting stable crystals with high coverage across the periodic table.\nIntriguingly, despite LLaMA-3's superior performance in comparison to LLaMA-2,\nwe observe that LLaMat-2 demonstrates unexpectedly enhanced domain-specific\nperformance across diverse materials science tasks, including structured\ninformation extraction from text and tables, more particularly in crystal\nstructure generation, a potential adaptation rigidity in overtrained LLMs.\nAltogether, the present work demonstrates the effectiveness of domain\nadaptation towards developing practically deployable LLM copilots for materials\nresearch. Beyond materials science, our findings reveal important\nconsiderations for domain adaptation of LLMs, such as model selection, training\nmethodology, and domain-specific performance, which may influence the\ndevelopment of specialized scientific AI systems.\n","authors":["Vaibhav Mishra","Somaditya Singh","Dhruv Ahlawat","Mohd Zaki","Vaibhav Bihani","Hargun Singh Grover","Biswajit Mishra","Santiago Miret"," Mausam","N. M. Anoop Krishnan"],"pdf_url":"https://arxiv.org/pdf/2412.09560v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20535v2","updated":"2024-12-12T18:45:33Z","published":"2024-05-30T23:20:25Z","title":"Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large\n Language Models Reasoning","summary":" Instruction Fine-Tuning (IFT) significantly enhances the zero-shot\ncapabilities of pretrained Large Language Models (LLMs). While coding data is\nknown to boost LLM reasoning abilities during pretraining, its role in\nactivating internal reasoning capacities during IFT remains understudied. This\npaper investigates a key question: How does coding data impact LLMs' reasoning\ncapacities during IFT stage? To explore this, we thoroughly examine the impact\nof coding data across different coding data proportions, model families, sizes,\nand reasoning domains, from various perspectives. 
Specifically, we create three\nIFT datasets with increasing coding data proportions, fine-tune six LLM\nbackbones across different families and scales on these datasets, evaluate the\ntuned models' performance across twelve tasks in three reasoning domains, and\nanalyze the outcomes from three broad-to-granular perspectives: overall,\ndomain-level, and task-specific. Our holistic analysis provides valuable\ninsights into each perspective. First, coding data tuning enhances the overall\nreasoning capabilities of LLMs across different model families and scales.\nMoreover, while the impact of coding data varies by domain, it shows consistent\ntrends within each domain across different model families and scales.\nAdditionally, coding data generally provides comparable task-specific benefits\nacross model families, with optimal proportions in IFT datasets being\ntask-dependent.\n","authors":["Xinlu Zhang","Zhiyu Zoey Chen","Xi Ye","Xianjun Yang","Lichang Chen","William Yang Wang","Linda Ruth Petzold"],"pdf_url":"https://arxiv.org/pdf/2405.20535v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17251v2","updated":"2024-12-12T18:26:45Z","published":"2024-10-22T17:59:57Z","title":"Altogether: Image Captioning via Re-aligning Alt-text","summary":" This paper focuses on creating synthetic data to improve the quality of image\ncaptions. Existing works typically have two shortcomings. First, they caption\nimages from scratch, ignoring existing alt-text metadata, and second, lack\ntransparency if the captioners' training data (e.g. GPT) is unknown. In this\npaper, we study a principled approach Altogether based on the key idea to edit\nand re-align existing alt-texts associated with the images. To generate\ntraining data, we perform human annotation where annotators start with the\nexisting alt-text and re-align it to the image content in multiple rounds,\nconsequently constructing captions with rich visual concepts. This differs from\nprior work that carries out human annotation as a one-time description task\nsolely based on images and annotator knowledge. We train a captioner on this\ndata that generalizes the process of re-aligning alt-texts at scale. Our\nresults show our Altogether approach leads to richer image captions that also\nimprove text-to-image generation and zero-shot image classification tasks.\n","authors":["Hu Xu","Po-Yao Huang","Xiaoqing Ellen Tan","Ching-Feng Yeh","Jacob Kahn","Christine Jou","Gargi Ghosh","Omer Levy","Luke Zettlemoyer","Wen-tau Yih","Shang-Wen Li","Saining Xie","Christoph Feichtenhofer"],"pdf_url":"https://arxiv.org/pdf/2410.17251v2.pdf","comment":"accepted by EMNLP 2024; Meta CLIP 1.2 Data Engine"},{"id":"http://arxiv.org/abs/2412.08268v2","updated":"2024-12-12T17:32:23Z","published":"2024-12-11T10:35:45Z","title":"LCFO: Long Context and Long Form Output Dataset and Benchmarking","summary":" This paper presents the Long Context and Form Output (LCFO) benchmark, a\nnovel evaluation framework for assessing gradual summarization and summary\nexpansion capabilities across diverse domains. LCFO consists of long input\ndocuments (5k words average length), each of which comes with three summaries\nof different lengths (20%, 10%, and 5% of the input text), as well as\napproximately 15 questions and answers (QA) related to the input content.\nNotably, LCFO also provides alignments between specific QA pairs and\ncorresponding summaries in 7 domains. 
The primary motivation behind providing\nsummaries of different lengths is to establish a controllable framework for\ngenerating long texts from shorter inputs, i.e. summary expansion. To establish\nan evaluation metric framework for summarization and summary expansion, we\nprovide human evaluation scores for human-generated outputs, as well as results\nfrom various state-of-the-art large language models (LLMs). GPT-4o-mini\nachieves best human scores among automatic systems in both summarization and\nsummary expansion tasks (~ +10% and +20%, respectively). It even surpasses\nhuman output quality in the case of short summaries (~ +7%). Overall automatic\nmetrics achieve low correlations with human evaluation scores (~ 0.4) but\nmoderate correlation on specific evaluation aspects such as fluency and\nattribution (~ 0.6). The LCFO benchmark offers a standardized platform for\nevaluating summarization and summary expansion performance, as well as\ncorresponding automatic metrics, thereby providing an important evaluation\nframework to advance generative AI.\n","authors":["Marta R. Costa-jussà","Pierre Andrews","Mariano Coria Meglioli","Joy Chen","Joe Chuang","David Dale","Christophe Ropers","Alexandre Mourachko","Eduardo Sánchez","Holger Schwenk","Tuan Tran","Arina Turkatenko","Carleigh Wood"],"pdf_url":"https://arxiv.org/pdf/2412.08268v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.01304v2","updated":"2024-12-12T17:26:04Z","published":"2024-03-02T20:25:50Z","title":"Improving the Validity of Automatically Generated Feedback via\n Reinforcement Learning","summary":" Automatically generating feedback via large language models (LLMs) in\nintelligent tutoring systems and online learning platforms has the potential to\nimprove the learning outcomes of many students. However, both feedback\ngeneration and evaluation are challenging: feedback content has to be valid\nespecially in subjects like math, which requires models to understand the\nproblem, the solution, and where the student's error lies. Feedback also has to\nbe pedagogically valid to reflect effective tutoring strategies, such as\nexplaining possible misconceptions and encouraging the student, among other\ndesirable features. In this work, we address both problems of automatically\ngenerating and evaluating feedback while considering both correctness and\nalignment. First, we propose a rubric for evaluating math feedback and show\nthat GPT-4 is able to effectively use it to annotate human-written and\nLLM-generated feedback. Second, we propose a framework for feedback generation\nthat optimizes both correctness and alignment using reinforcement learning\n(RL). Specifically, we use GPT-4's annotations to create preferences over\nfeedback pairs in an augmented dataset for training via direct preference\noptimization (DPO). 
We show that our methods significantly increase the\ncorrectness and alignment of generated feedback with Llama 2, an open-source\nLLM, qualitatively analyze our generation and evaluation systems using case\nstudies, and outline several areas for future work.\n","authors":["Alexander Scarlatos","Digory Smith","Simon Woodhead","Andrew Lan"],"pdf_url":"https://arxiv.org/pdf/2403.01304v2.pdf","comment":"Best student paper award, Published in AIED 2024: The 25th\n International Conference on Artificial Intelligence in Education"},{"id":"http://arxiv.org/abs/2412.09467v1","updated":"2024-12-12T17:15:49Z","published":"2024-12-12T17:15:49Z","title":"Audios Don't Lie: Multi-Frequency Channel Attention Mechanism for Audio\n Deepfake Detection","summary":" With the rapid development of artificial intelligence technology, the\napplication of deepfake technology in the audio field has gradually increased,\nresulting in a wide range of security risks. Especially in the financial and\nsocial security fields, the misuse of deepfake audios has raised serious\nconcerns. To address this challenge, this study proposes an audio deepfake\ndetection method based on multi-frequency channel attention mechanism (MFCA)\nand 2D discrete cosine transform (DCT). By processing the audio signal into a\nmelspectrogram, using MobileNet V2 to extract deep features, and combining it\nwith the MFCA module to weight different frequency channels in the audio\nsignal, this method can effectively capture the fine-grained frequency domain\nfeatures in the audio signal and enhance the Classification capability of fake\naudios. Experimental results show that compared with traditional methods, the\nmodel proposed in this study shows significant advantages in accuracy,\nprecision,recall, F1 score and other indicators. Especially in complex audio\nscenarios, this method shows stronger robustness and generalization\ncapabilities and provides a new idea for audio deepfake detection and has\nimportant practical application value. In the future, more advanced audio\ndetection technologies and optimization strategies will be explored to further\nimprove the accuracy and generalization capabilities of audio deepfake\ndetection.\n","authors":["Yangguang Feng"],"pdf_url":"https://arxiv.org/pdf/2412.09467v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09460v1","updated":"2024-12-12T17:11:22Z","published":"2024-12-12T17:11:22Z","title":"The Impact of Copyrighted Material on Large Language Models: A Norwegian\n Perspective","summary":" The use of copyrighted materials in training generative language models\nraises critical legal and ethical questions. This paper presents a framework\nfor and the results of empirically assessing the impact of copyrighted\nmaterials on the performance of large language models (LLMs) for Norwegian. We\nfound that both books and newspapers contribute positively when the models are\nevaluated on a diverse set of Norwegian benchmarks, while fiction works\npossibly lead to decreased performance. 
Our experiments could inform the\ncreation of a compensation scheme for authors whose works contribute to AI\ndevelopment.\n","authors":["Javier de la Rosa","Vladislav Mikhailov","Lemei Zhang","Freddy Wetjen","David Samuel","Peng Liu","Rolv-Arild Braaten","Petter Mæhlum","Magnus Breder Birkenes","Andrey Kutuzov","Tita Enstad","Svein Arne Brygfjeld","Jon Atle Gulla","Stephan Oepen","Erik Velldal","Wilfred Østgulen","Liljia Øvrelid","Aslak Sira Myhre"],"pdf_url":"https://arxiv.org/pdf/2412.09460v1.pdf","comment":"pre-print, under review"},{"id":"http://arxiv.org/abs/2411.05231v2","updated":"2024-12-12T16:40:18Z","published":"2024-11-07T22:51:47Z","title":"Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams","summary":" Recent advances in generative artificial intelligence (AI) have shown promise\nin accurately grading open-ended student responses. However, few prior works\nhave explored grading handwritten responses due to a lack of data and the\nchallenge of combining visual and textual information. In this work, we\nleverage state-of-the-art multi-modal AI models, in particular GPT-4o, to\nautomatically grade handwritten responses to college-level math exams. Using\nreal student responses to questions in a probability theory exam, we evaluate\nGPT-4o's alignment with ground-truth scores from human graders using various\nprompting techniques. We find that while providing rubrics improves alignment,\nthe model's overall accuracy is still too low for real-world settings, showing\nthere is significant room for growth in this task.\n","authors":["Adriana Caraeni","Alexander Scarlatos","Andrew Lan"],"pdf_url":"https://arxiv.org/pdf/2411.05231v2.pdf","comment":"Published in LAK 2025: The 15th International Learning Analytics and\n Knowledge Conference"},{"id":"http://arxiv.org/abs/2412.09429v1","updated":"2024-12-12T16:35:05Z","published":"2024-12-12T16:35:05Z","title":"From Intention To Implementation: Automating Biomedical Research via\n LLMs","summary":" Conventional biomedical research is increasingly labor-intensive due to the\nexponential growth of scientific literature and datasets. Artificial\nintelligence (AI), particularly Large Language Models (LLMs), has the potential\nto revolutionize this process by automating various steps. Still, significant\nchallenges remain, including the need for multidisciplinary expertise,\nlogicality of experimental design, and performance measurements. This paper\nintroduces BioResearcher, the first end-to-end automated system designed to\nstreamline the entire biomedical research process involving dry lab\nexperiments. BioResearcher employs a modular multi-agent architecture,\nintegrating specialized agents for search, literature processing, experimental\ndesign, and programming. By decomposing complex tasks into logically related\nsub-tasks and utilizing a hierarchical learning approach, BioResearcher\neffectively addresses the challenges of multidisciplinary requirements and\nlogical complexity. Furthermore, BioResearcher incorporates an LLM-based\nreviewer for in-process quality control and introduces novel evaluation metrics\nto assess the quality and automation of experimental protocols. BioResearcher\nsuccessfully achieves an average execution success rate of 63.07% across eight\npreviously unmet research objectives. The generated protocols averagely\noutperform typical agent systems by 22.0% on five quality metrics. 
The system\ndemonstrates significant potential to reduce researchers' workloads and\naccelerate biomedical discoveries, paving the way for future innovations in\nautomated research systems.\n","authors":["Yi Luo","Linghang Shi","Yihao Li","Aobo Zhuang","Yeyun Gong","Ling Liu","Lin Chen"],"pdf_url":"https://arxiv.org/pdf/2412.09429v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11149v5","updated":"2024-12-12T16:33:01Z","published":"2024-09-17T13:03:12Z","title":"SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with\n Customisable Fairness Calibration","summary":" The development of unbiased large language models is widely recognized as\ncrucial, yet existing benchmarks fall short in detecting biases due to limited\nscope, contamination, and lack of a fairness baseline. SAGED(bias) is the first\nholistic benchmarking pipeline to address these problems. The pipeline\nencompasses five core stages: scraping materials, assembling benchmarks,\ngenerating responses, extracting numeric features, and diagnosing with\ndisparity metrics. SAGED includes metrics for max disparity, such as impact\nratio, and bias concentration, such as Max Z-scores. Noticing that metric tool\nbias and contextual bias in prompts can distort evaluation, SAGED implements\ncounterfactual branching and baseline calibration for mitigation. For\ndemonstration, we use SAGED on G20 Countries with popular 8b-level models\nincluding Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we\nfind that while Mistral and Qwen2 show lower max disparity and higher bias\nconcentration than Gemma2 and Llama3.1, all models are notably biased against\ncountries like Russia and (except for Qwen2) China. With further experiments to\nhave models role-playing U.S. presidents, we see bias amplifies and shifts in\nheterogeneous directions. Moreover, we see Qwen2 and Mistral not engage in\nrole-playing, while Llama3.1 and Gemma2 role-play Trump notably more\nintensively than Biden and Harris, indicating role-playing performance bias in\nthese models.\n","authors":["Xin Guan","Nathaniel Demchak","Saloni Gupta","Ze Wang","Ediz Ertekin Jr.","Adriano Koshiyama","Emre Kazim","Zekun Wu"],"pdf_url":"https://arxiv.org/pdf/2409.11149v5.pdf","comment":"COLING 2025 Main Conference"},{"id":"http://arxiv.org/abs/2412.09416v1","updated":"2024-12-12T16:24:35Z","published":"2024-12-12T16:24:35Z","title":"Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical\n Ability Assessment of LLM-Powered AI Tutors","summary":" In this paper, we investigate whether current state-of-the-art large language\nmodels (LLMs) are effective as AI tutors and whether they demonstrate\npedagogical abilities necessary for good AI tutoring in educational dialogues.\nPrevious efforts towards evaluation have been limited to subjective protocols\nand benchmarks. To bridge this gap, we propose a unified evaluation taxonomy\nwith eight pedagogical dimensions based on key learning sciences principles,\nwhich is designed to assess the pedagogical value of LLM-powered AI tutor\nresponses grounded in student mistakes or confusion in the mathematical domain.\nWe release MRBench -- a new evaluation benchmark containing 192 conversations\nand 1,596 responses from seven state-of-the-art LLM-based and human tutors,\nproviding gold annotations for eight pedagogical dimensions. 
We assess\nreliability of the popular Prometheus2 LLM as an evaluator and analyze each\ntutor's pedagogical abilities, highlighting which LLMs are good tutors and\nwhich ones are more suitable as question-answering systems. We believe that the\npresented taxonomy, benchmark, and human-annotated labels will streamline the\nevaluation process and help track the progress in AI tutors' development.\n","authors":["Kaushal Kumar Maurya","KV Aditya Srivatsa","Kseniia Petukhova","Ekaterina Kochmar"],"pdf_url":"https://arxiv.org/pdf/2412.09416v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.09415v1","updated":"2024-12-12T16:23:12Z","published":"2024-12-12T16:23:12Z","title":"Text Generation Models for Luxembourgish with Limited Data: A Balanced\n Multilingual Strategy","summary":" This paper addresses the challenges in developing language models for\nless-represented languages, with a focus on Luxembourgish. Despite its active\ndevelopment, Luxembourgish faces a digital data scarcity, exacerbated by\nLuxembourg's multilingual context. We propose a novel text generation model\nbased on the T5 architecture, combining limited Luxembourgish data with equal\namounts, in terms of size and type, of German and French data. We hypothesise\nthat a model trained on Luxembourgish, German, and French will improve the\nmodel's cross-lingual transfer learning capabilities and outperform monolingual\nand large multilingual models. To verify this, the study at hand explores\nwhether multilingual or monolingual training is more beneficial for\nLuxembourgish language generation. For the evaluation, we introduce LuxGen, a\ntext generation benchmark that is the first of its kind for Luxembourgish.\n","authors":["Alistair Plum","Tharindu Ranasinghe","Christoph Purschke"],"pdf_url":"https://arxiv.org/pdf/2412.09415v1.pdf","comment":"Accepted at VarDial 2025"},{"id":"http://arxiv.org/abs/2412.09413v1","updated":"2024-12-12T16:20:36Z","published":"2024-12-12T16:20:36Z","title":"Imitate, Explore, and Self-Improve: A Reproduction Report on\n Slow-thinking Reasoning Systems","summary":" Recently, slow-thinking reasoning systems, such as o1, have demonstrated\nremarkable capabilities in solving complex reasoning tasks. These systems\ntypically engage in an extended thinking process before responding to a query,\nallowing them to generate more thorough, accurate, and well-reasoned solutions.\nThese systems are primarily developed and maintained by industry, with their\ncore techniques not publicly disclosed. In response, an increasing number of\nstudies from the research community aim to explore the technical foundations\nunderlying these powerful reasoning systems. Building on these prior efforts,\nthis paper presents a reproduction report on implementing o1-like reasoning\nsystems. We introduce an \"imitate, explore, and self-improve\" framework as our\nprimary technical approach to train the reasoning model. In the initial phase,\nwe use distilled long-form thought data to fine-tune the reasoning model,\nenabling it to invoke a slow-thinking mode. The model is then encouraged to\nexplore challenging problems by generating multiple rollouts, which can result\nin increasingly more high-quality trajectories that lead to correct answers.\nFurthermore, the model undergoes self-improvement by iteratively refining its\ntraining dataset. To verify the effectiveness of this approach, we conduct\nextensive experiments on three challenging benchmarks. 
The experimental results\ndemonstrate that our approach achieves competitive performance compared to\nindustry-level reasoning systems on these benchmarks.\n","authors":["Yingqian Min","Zhipeng Chen","Jinhao Jiang","Jie Chen","Jia Deng","Yiwen Hu","Yiru Tang","Jiapeng Wang","Xiaoxue Cheng","Huatong Song","Wayne Xin Zhao","Zheng Liu","Zhongyuan Wang","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2412.09413v1.pdf","comment":"Technical Report on Slow Thinking with LLMs: Part II"},{"id":"http://arxiv.org/abs/2412.00426v2","updated":"2024-12-12T16:19:14Z","published":"2024-11-30T10:52:24Z","title":"Few-Shot Domain Adaptation for Named-Entity Recognition via Joint\n Constrained k-Means and Subspace Selection","summary":" Named-entity recognition (NER) is a task that typically requires large\nannotated datasets, which limits its applicability across domains with varying\nentity definitions. This paper addresses few-shot NER, aiming to transfer\nknowledge to new domains with minimal supervision. Unlike previous approaches\nthat rely solely on limited annotated data, we propose a weakly supervised\nalgorithm that combines small labeled datasets with large amounts of unlabeled\ndata. Our method extends the k-means algorithm with label supervision, cluster\nsize constraints and domain-specific discriminative subspace selection. This\nunified framework achieves state-of-the-art results in few-shot NER on several\nEnglish datasets.\n","authors":["Ayoub Hammal","Benno Uthayasooriyar","Caio Corro"],"pdf_url":"https://arxiv.org/pdf/2412.00426v2.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2411.18564v2","updated":"2024-12-12T16:03:30Z","published":"2024-11-27T18:04:05Z","title":"Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs","summary":" Large Language Models (LLMs) have demonstrated remarkable capabilities across\nvarious tasks, yet they often struggle with spatial reasoning. This paper\npresents a novel neural-symbolic framework that enhances LLMs' spatial\nreasoning abilities through iterative feedback between LLMs and Answer Set\nProgramming (ASP). We evaluate our approach on two benchmark datasets: StepGame\nand SparQA, implementing three distinct strategies: (1) direct prompting\nbaseline, (2) Facts+Rules prompting, and (3) DSPy-based LLM+ASP pipeline with\niterative refinement. Our experimental results demonstrate that the LLM+ASP\npipeline significantly outperforms baseline methods, achieving an average 82%\naccuracy on StepGame and 69% on SparQA, marking improvements of 40-50% and\n8-15% respectively over direct prompting. The success stems from three key\ninnovations: (1) effective separation of semantic parsing and logical reasoning\nthrough a modular pipeline, (2) iterative feedback mechanism between LLMs and\nASP solvers that improves program rate, and (3) robust error handling that\naddresses parsing, grounding, and solving failures. 
Additionally, we propose\nFacts+Rules as a lightweight alternative that achieves comparable performance\non complex SparQA dataset, while reducing computational overhead.Our analysis\nacross different LLM architectures (Deepseek, Llama3-70B, GPT-4.0 mini)\ndemonstrates the framework's generalizability and provides insights into the\ntrade-offs between implementation complexity and reasoning capability,\ncontributing to the development of more interpretable and reliable AI systems.\n","authors":["Rong Wang","Kun Sun","Jonas Kuhn"],"pdf_url":"https://arxiv.org/pdf/2411.18564v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09383v1","updated":"2024-12-12T15:50:55Z","published":"2024-12-12T15:50:55Z","title":"Neural Text Normalization for Luxembourgish using Real-Life Variation\n Data","summary":" Orthographic variation is very common in Luxembourgish texts due to the\nabsence of a fully-fledged standard variety. Additionally, developing NLP tools\nfor Luxembourgish is a difficult task given the lack of annotated and parallel\ndata, which is exacerbated by ongoing standardization. In this paper, we\npropose the first sequence-to-sequence normalization models using the ByT5 and\nmT5 architectures with training data obtained from word-level real-life\nvariation data. We perform a fine-grained, linguistically-motivated evaluation\nto test byte-based, word-based and pipeline-based models for their strengths\nand weaknesses in text normalization. We show that our sequence model using\nreal-life variation data is an effective approach for tailor-made normalization\nin Luxembourgish.\n","authors":["Anne-Marie Lutgen","Alistair Plum","Christoph Purschke","Barbara Plank"],"pdf_url":"https://arxiv.org/pdf/2412.09383v1.pdf","comment":"Accepted at VarDial 2025"},{"id":"http://arxiv.org/abs/2412.09378v1","updated":"2024-12-12T15:46:43Z","published":"2024-12-12T15:46:43Z","title":"From Bench to Bedside: A Review of Clinical Trialsin Drug Discovery and\n Development","summary":" Clinical trials are an indispensable part of the drug development process,\nbridging the gap between basic research and clinical application. During the\ndevelopment of new drugs, clinical trials are used not only to evaluate the\nsafety and efficacy of the drug but also to explore its dosage, treatment\nregimens, and potential side effects. This review discusses the various stages\nof clinical trials, including Phase I (safety assessment), Phase II\n(preliminary efficacy evaluation), Phase III (large-scale validation), and\nPhase IV (post-marketing surveillance), highlighting the characteristics of\neach phase and their interrelationships. Additionally, the paper addresses the\nmajor challenges encountered in clinical trials, such as ethical issues,\nsubject recruitment difficulties, diversity and representativeness concerns,\nand proposes strategies for overcoming these challenges. With the advancement\nof technology, innovative technologies such as artificial intelligence, big\ndata, and digitalization are gradually transforming clinical trial design and\nimplementation, improving trial efficiency and data quality. The article also\nlooks forward to the future of clinical trials, particularly the impact of\nemerging therapies such as gene therapy and immunotherapy on trial design, as\nwell as the importance of regulatory reforms and global collaboration. 
In\nconclusion, the core role of clinical trials in drug development will continue\nto drive the progress of innovative drug development and clinical treatment.\n","authors":["Tianyang Wang","Ming Liu","Benji Peng","Xinyuan Song","Charles Zhang","Xintian Sun","Qian Niu","Junyu Liu","Silin Chen","Keyu Chen","Ming Li","Pohsun Feng","Ziqian Bi","Yunze Wang","Yichao Zhang","Cheng Fei","Lawrence KQ Yan"],"pdf_url":"https://arxiv.org/pdf/2412.09378v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2411.06908v2","updated":"2024-12-12T15:40:54Z","published":"2024-11-11T12:11:36Z","title":"EVQAScore: Efficient Video Question Answering Data Evaluation","summary":" Video question-answering (QA) is a core task in video understanding.\nEvaluating the quality of video QA and video caption data quality for training\nvideo large language models (VideoLLMs) is an essential challenge. Although\nvarious methods have been proposed for assessing video caption quality, there\nremains a lack of dedicated evaluation methods for Video QA. To address this\ngap, we introduce EVQAScore, a reference-free method that leverages keyword\nextraction to assess both video caption and video QA data quality.\nAdditionally, we incorporate frame sampling and rescaling techniques to enhance\nthe efficiency and robustness of our evaluation, this enables our score to\nevaluate the quality of extremely long videos. Our approach achieves\nstate-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for\nSpearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on\nthe VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using\nEVQAScore for data selection, we achieved SOTA results with only 12.5\\% of the\noriginal data volume, outperforming the previous SOTA method PAC-S and 100\\% of\ndata.\n","authors":["Hao Liang","Zirong Chen","Wentao Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.06908v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09370v1","updated":"2024-12-12T15:38:34Z","published":"2024-12-12T15:38:34Z","title":"Word Sense Linking: Disambiguating Outside the Sandbox","summary":" Word Sense Disambiguation (WSD) is the task of associating a word in a given\ncontext with its most suitable meaning among a set of possible candidates.\nWhile the task has recently witnessed renewed interest, with systems achieving\nperformances above the estimated inter-annotator agreement, at the time of\nwriting it still struggles to find downstream applications. We argue that one\nof the reasons behind this is the difficulty of applying WSD to plain text.\nIndeed, in the standard formulation, models work under the assumptions that a)\nall the spans to disambiguate have already been identified, and b) all the\npossible candidate senses of each span are provided, both of which are\nrequirements that are far from trivial. In this work, we present a new task\ncalled Word Sense Linking (WSL) where, given an input text and a reference\nsense inventory, systems have to both identify which spans to disambiguate and\nthen link them to their most suitable meaning.We put forward a\ntransformer-based architecture for the task and thoroughly evaluate both its\nperformance and those of state-of-the-art WSD systems scaled to WSL,\niteratively relaxing the assumptions of WSD. 
We hope that our work will foster\neasier integration of lexical semantics into downstream applications.\n","authors":["Andrei Stefan Bejgu","Edoardo Barba","Luigi Procopio","Alberte Fernández-Castro","Roberto Navigli"],"pdf_url":"https://arxiv.org/pdf/2412.09370v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09362v1","updated":"2024-12-12T15:29:36Z","published":"2024-12-12T15:29:36Z","title":"Falcon-UI: Understanding GUI Before Following User Instructions","summary":" Pursuing human-like interaction for Graphical User Interface (GUI) agents\nrequires understanding the GUI context and following user instructions.\nHowever, existing works typically couple these two aspects and focus more on\ninstruct-following abilities, while ignoring the importance of understanding\nthe GUI context. In this paper, we introduce an instruction-free GUI navigation\ndataset, termed Insight-UI Dataset, to enhance model comprehension of GUI\nenvironments. Insight-UI Dataset is automatically generated from the Common\nCrawl corpus, simulating various platforms -- including iOS, Android, Windows,\nand Linux -- across multiple resolutions on 312K domains. Although GUI\ninteractions vary by context, diverse interfaces share common internal\npatterns, such as clicking an item to view its details. It implies the\nfeasibility of independent GUI operation learning, followed by joint\noptimization with instruction tuning. Thereby, we develop the GUI agent model\nFalcon-UI, which is initially pretrained on Insight-UI Dataset and subsequently\nfine-tuned on Android and Web GUI datasets, including AITW, AITZ, Android\nControl, and Mind2Web. With 7 billion parameters, Falcon-UI achieves accuracy\ncomparable to the 72 billion-parameter Qwen2VL on AITZ, validating the\nalignment between GUI context comprehension and agent performance. Our code and\ndataset will be open-sourced.\n","authors":["Huawen Shen","Chang Liu","Gengluo Li","Xinlong Wang","Yu Zhou","Can Ma","Xiangyang Ji"],"pdf_url":"https://arxiv.org/pdf/2412.09362v1.pdf","comment":"18 pages, 14 figures"},{"id":"http://arxiv.org/abs/2412.09353v1","updated":"2024-12-12T15:22:03Z","published":"2024-12-12T15:22:03Z","title":"Causal Graphical Models for Vision-Language Compositional Understanding","summary":" Recent work has empirically shown that Vision-Language Models (VLMs) struggle\nto fully understand the compositional properties of the human language, usually\nmodeling an image caption as a \"bag of words\". As a result, they perform poorly\non compositional tasks, which require a deeper understanding of the different\nentities of a sentence (subject, verb, etc.) jointly with their mutual\nrelationships in order to be solved. In this paper, we model the dependency\nrelations among textual and visual tokens using a Causal Graphical Model (CGM),\nbuilt using a dependency parser, and we train a decoder conditioned by the VLM\nvisual encoder. Differently from standard autoregressive or parallel\npredictions, our decoder's generative process is partially-ordered following\nthe CGM structure. This structure encourages the decoder to learn only the main\ncausal dependencies in a sentence discarding spurious correlations. 
Using\nextensive experiments on five compositional benchmarks, we show that our method\nsignificantly outperforms all the state-of-the-art compositional approaches by\na large margin, and it also improves over methods trained using much larger\ndatasets.\n","authors":["Fiorenzo Parascandolo","Nicholas Moratelli","Enver Sangineto","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2412.09353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09341v1","updated":"2024-12-12T15:09:44Z","published":"2024-12-12T15:09:44Z","title":"Training LayoutLM from Scratch for Efficient Named-Entity Recognition in\n the Insurance Domain","summary":" Generic pre-trained neural networks may struggle to produce good results in\nspecialized domains like finance and insurance. This is due to a domain\nmismatch between training data and downstream tasks, as in-domain data are\noften scarce due to privacy constraints. In this work, we compare different\npre-training strategies for LayoutLM. We show that using domain-relevant\ndocuments improves results on a named-entity recognition (NER) problem using a\nnovel dataset of anonymized insurance-related financial documents called\nPayslips. Moreover, we show that we can achieve competitive results using a\nsmaller and faster model.\n","authors":["Benno Uthayasooriyar","Antoine Ly","Franck Vermet","Caio Corro"],"pdf_url":"https://arxiv.org/pdf/2412.09341v1.pdf","comment":"Coling 2025 workshop (FinNLP)"},{"id":"http://arxiv.org/abs/2403.02285v2","updated":"2024-12-12T15:06:07Z","published":"2024-03-04T18:15:14Z","title":"Detection of Non-recorded Word Senses in English and Swedish","summary":" This study addresses the task of Unknown Sense Detection in English and\nSwedish. The primary objective of this task is to determine whether the meaning\nof a particular word usage is documented in a dictionary or not. For this\npurpose, sense entries are compared with word usages from modern and historical\ncorpora using a pre-trained Word-in-Context embedder that allows us to model\nthis task in a few-shot scenario. Additionally, we use human annotations on the\ntarget corpora to adapt hyperparameters and evaluate our models using 5-fold\ncross-validation. Compared to a random sample from a corpus, our model is able\nto considerably increase the detected number of word usages with non-recorded\nsenses.\n","authors":["Jonathan Lautenschlager","Emma Sköldberg","Simon Hengchen","Dominik Schlechtweg"],"pdf_url":"https://arxiv.org/pdf/2403.02285v2.pdf","comment":"9 pages"},{"id":"http://arxiv.org/abs/2408.09849v2","updated":"2024-12-12T14:56:20Z","published":"2024-08-19T09:51:02Z","title":"Importance Weighting Can Help Large Language Models Self-Improve","summary":" Large language models (LLMs) have shown remarkable capability in numerous\ntasks and applications. However, fine-tuning LLMs using high-quality datasets\nunder external supervision remains prohibitively expensive. In response, LLM\nself-improvement approaches have been vibrantly developed recently. The typical\nparadigm of LLM self-improvement involves training LLM on self-generated data,\npart of which may be detrimental and should be filtered out due to the unstable\ndata quality. While current works primarily employs filtering strategies based\non answer correctness, in this paper, we demonstrate that filtering out correct\nbut with high distribution shift extent (DSE) samples could also benefit the\nresults of self-improvement. 
Given that the actual sample distribution is\nusually inaccessible, we propose a new metric called DS weight to approximate\nDSE, inspired by the Importance Weighting methods. Consequently, we integrate\nDS weight with self-consistency to comprehensively filter the self-generated\nsamples and fine-tune the language model. Experiments show that with only a\ntiny valid set (up to 5\\% size of the training set) to compute DS weight, our\napproach can notably promote the reasoning ability of current LLM\nself-improvement methods. The resulting performance is on par with methods that\nrely on external supervision from pre-trained reward models.\n","authors":["Chunyang Jiang","Chi-min Chan","Wei Xue","Qifeng Liu","Yike Guo"],"pdf_url":"https://arxiv.org/pdf/2408.09849v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09318v1","updated":"2024-12-12T14:43:03Z","published":"2024-12-12T14:43:03Z","title":"Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction","summary":" LLMs can generate human-like dialogues, yet their ability to simulate early\nchild-adult interactions remains largely unexplored. In this paper, we examined\nhow effectively LLMs can capture the distinctive features of child-caregiver\nlanguage in interaction, using both static and interactive benchmarking\nmethods. We found that state-of-the-art LLMs like Llama 3 and GPT-4o can\napproximate child-caregiver dialogues at the word and utterance level, but they\nstruggle to reproduce the child and caregiver's discursive patterns, exaggerate\nalignment, and fail to reach the level of diversity shown by humans. The\nbroader goal of this work is to initiate the development of a comprehensive\nbenchmark for LLMs in child-oriented applications.\n","authors":["Jing Liu","Abdellah Fourtassi"],"pdf_url":"https://arxiv.org/pdf/2412.09318v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.18446v2","updated":"2024-12-12T13:48:44Z","published":"2024-09-27T05:06:43Z","title":"Exploring Language Model Generalization in Low-Resource Extractive QA","summary":" In this paper, we investigate Extractive Question Answering (EQA) with Large\nLanguage Models (LLMs) under domain drift, i.e., can LLMs generalize to domains\nthat require specific knowledge such as medicine and law in a zero-shot fashion\nwithout additional in-domain training? To this end, we devise a series of\nexperiments to explain the performance gap empirically. Our findings suggest\nthat: (a) LLMs struggle with dataset demands of closed domains such as\nretrieving long answer spans; (b) Certain LLMs, despite showing strong overall\nperformance, display weaknesses in meeting basic requirements as discriminating\nbetween domain-specific senses of words which we link to pre-processing\ndecisions; (c) Scaling model parameters is not always effective for cross\ndomain generalization; and (d) Closed-domain datasets are quantitatively much\ndifferent than open-domain EQA datasets and current LLMs struggle to deal with\nthem. 
Our findings point out important directions for improving existing LLMs.\n","authors":["Saptarshi Sengupta","Wenpeng Yin","Preslav Nakov","Shreya Ghosh","Suhang Wang"],"pdf_url":"https://arxiv.org/pdf/2409.18446v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.09282v1","updated":"2024-12-12T13:45:11Z","published":"2024-12-12T13:45:11Z","title":"CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of\n LLMs","summary":" Powerful large language models (LLMs) are increasingly expected to be\ndeployed with lower computational costs, enabling their capabilities on\nresource-constrained devices. Post-training quantization (PTQ) has emerged as a\nstar approach to achieve this ambition, with best methods compressing weights\nto less than 2 bit on average. In this paper, we propose Channel-Relaxed Vector\nQuantization (CRVQ), a novel technique that significantly improves the\nperformance of PTQ baselines at the cost of only minimal additional bits. This\nstate-of-the-art extreme compression method achieves its results through two\nkey innovations: (1) carefully selecting and reordering a very small subset of\ncritical weight channels, and (2) leveraging multiple codebooks to relax the\nconstraint of critical channels. With our method, we demonstrate a 38.9%\nimprovement over the current strongest sub-2-bit PTQ baseline, enabling nearer\nlossless 1-bit compression. Furthermore, our approach offers flexible\ncustomization of quantization bit-width and performance, providing a wider\nrange of deployment options for diverse hardware platforms.\n","authors":["Yuzhuang Xu","Shiyu Ji","Qingfu Zhu","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2412.09282v1.pdf","comment":"5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2412.09280v1","updated":"2024-12-12T13:42:58Z","published":"2024-12-12T13:42:58Z","title":"Learning to Solve Domain-Specific Calculation Problems with\n Knowledge-Intensive Programs Generator","summary":" Domain Large Language Models (LLMs) are developed for domain-specific tasks\nbased on general LLMs. But it still requires professional knowledge to\nfacilitate the expertise for some domain-specific tasks. In this paper, we\ninvestigate into knowledge-intensive calculation problems. We find that the\nmath problems to be challenging for LLMs, when involving complex\ndomain-specific rules and knowledge documents, rather than simple formulations\nof terminologies. Therefore, we propose a pipeline to solve the domain-specific\ncalculation problems with Knowledge-Intensive Programs Generator more\neffectively, named as KIPG. It generates knowledge-intensive programs according\nto the domain-specific documents. For each query, key variables are extracted,\nthen outcomes which are dependent on domain knowledge are calculated with the\nprograms. By iterative preference alignment, the code generator learns to\nimprove the logic consistency with the domain knowledge. Taking legal domain as\nan example, we have conducted experiments to prove the effectiveness of our\npipeline, and extensive analysis on the modules. 
We also find that the code\ngenerator is also adaptable to other domains, without training on the new\nknowledge.\n","authors":["Chengyuan Liu","Shihang Wang","Lizhi Qing","Jun Lin","Ji Zhang","Fei Wu","Kun Kuang"],"pdf_url":"https://arxiv.org/pdf/2412.09280v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.16995v2","updated":"2024-12-12T13:33:56Z","published":"2023-10-25T20:48:16Z","title":"TOP-Training: Target-Oriented Pretraining for Medical Extractive\n Question Answering","summary":" We study extractive question-answering in the medical domain (Medical-EQA).\nThis problem has two main challenges: (i) domain specificity, as most AI models\nlack necessary domain knowledge, and (ii) extraction-based answering style,\nwhich restricts most autoregressive LLMs due to potential hallucinations. To\nhandle those challenges, we propose TOP-Training, a target-oriented\npre-training paradigm that stands out among all domain adaptation techniques\nwith two desirable features: (i) TOP-Training moves one step further than\npopular domain-oriented fine-tuning since it not only moves closer to the\ntarget domain, but also familiarizes itself with the target dataset, and (ii)\nit does not assume the existence of a large set of unlabeled instances from the\ntarget domain. Specifically, for a target Medical-EQA dataset, we extract its\nentities and leverage large language models (LLMs) to generate synthetic texts\ncontaining those entities; we then demonstrate that pretraining on this\nsynthetic text data yields better performance on the target Medical-EQA\nbenchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new\npretraining technique to effectively adapt LLMs to better solve a target\nproblem, (ii) TOP-Training has a wide application scope because it does not\nrequire the target problem to have a large set of unlabeled data, and (iii) our\nexperiments highlight the limitations of autoregressive LLMs, emphasizing\nTOP-Training as a means to unlock the true potential of bidirectional LLMs.\n","authors":["Saptarshi Sengupta","Connor Heaton","Shreya Ghosh","Wenpeng Yin","Preslav Nakov","Suhang Wang"],"pdf_url":"https://arxiv.org/pdf/2310.16995v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.09269v1","updated":"2024-12-12T13:31:58Z","published":"2024-12-12T13:31:58Z","title":"Towards Understanding the Robustness of LLM-based Evaluations under\n Perturbations","summary":" Traditional evaluation metrics like BLEU and ROUGE fall short when capturing\nthe nuanced qualities of generated text, particularly when there is no single\nground truth. In this paper, we explore the potential of Large Language Models\n(LLMs), specifically Google Gemini 1, to serve as automatic evaluators for\nnon-standardized metrics in summarization and dialog-based tasks. We conduct\nexperiments across multiple prompting strategies to examine how LLMs fare as\nquality evaluators when compared with human judgments on the SummEval and USR\ndatasets, asking the model to generate both a score as well as a justification\nfor the score. Furthermore, we explore the robustness of the LLM evaluator by\nusing perturbed inputs. 
Our findings suggest that while LLMs show promise,\ntheir alignment with human evaluators is limited, they are not robust against\nperturbations and significant improvements are required for their standalone\nuse as reliable evaluators for subjective metrics.\n","authors":["Manav Chaudhary","Harshit Gupta","Savita Bhat","Vasudeva Varma"],"pdf_url":"https://arxiv.org/pdf/2412.09269v1.pdf","comment":"Accepted at ICON 2024"},{"id":"http://arxiv.org/abs/2412.09263v1","updated":"2024-12-12T13:21:09Z","published":"2024-12-12T13:21:09Z","title":"First Train to Generate, then Generate to Train: UnitedSynT5 for\n Few-Shot NLI","summary":" Natural Language Inference (NLI) tasks require identifying the relationship\nbetween sentence pairs, typically classified as entailment, contradiction, or\nneutrality. While the current state-of-the-art (SOTA) model, Entailment\nFew-Shot Learning (EFL), achieves a 93.1% accuracy on the Stanford Natural\nLanguage Inference (SNLI) dataset, further advancements are constrained by the\ndataset's limitations. To address this, we propose a novel approach leveraging\nsynthetic data augmentation to enhance dataset diversity and complexity. We\npresent UnitedSynT5, an advanced extension of EFL that leverages a T5-based\ngenerator to synthesize additional premise-hypothesis pairs, which are\nrigorously cleaned and integrated into the training data. These augmented\nexamples are processed within the EFL framework, embedding labels directly into\nhypotheses for consistency. We train a GTR-T5-XL model on this expanded\ndataset, achieving a new benchmark of 94.7% accuracy on the SNLI dataset,\n94.01% accuracy on the E-SNLI dataset, and 92.57% accuracy on the MultiNLI\ndataset, surpassing the previous SOTA models. This research demonstrates the\npotential of synthetic data augmentation in improving NLI models, offering a\npath forward for further advancements in natural language understanding tasks.\n","authors":["Sourav Banerjee","Anush Mahajan","Ayushi Agarwal","Eishkaran Singh"],"pdf_url":"https://arxiv.org/pdf/2412.09263v1.pdf","comment":"14 pages"},{"id":"http://arxiv.org/abs/2412.09247v1","updated":"2024-12-12T12:57:55Z","published":"2024-12-12T12:57:55Z","title":"Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by\n Utilizing Generative LLMs","summary":" Satire detection is essential for accurately extracting opinions from textual\ndata and combating misinformation online. However, the lack of diverse corpora\nfor satire leads to the problem of stylistic bias which impacts the models'\ndetection performances. This study proposes a debiasing approach for satire\ndetection, focusing on reducing biases in training data by utilizing generative\nlarge language models. The approach is evaluated in both cross-domain (irony\ndetection) and cross-lingual (English) settings. Results show that the\ndebiasing method enhances the robustness and generalizability of the models for\nsatire and irony detection tasks in Turkish and English. However, its impact on\ncausal language models, such as Llama-3.1, is limited. 
Additionally, this work\ncurates and presents the Turkish Satirical News Dataset with detailed human\nannotations, with case studies on classification, debiasing, and\nexplainability.\n","authors":["Asli Umay Ozturk","Recep Firat Cekinel","Pinar Karagoz"],"pdf_url":"https://arxiv.org/pdf/2412.09247v1.pdf","comment":"Accepted to BUCC2025 Workshop @COLING2025"},{"id":"http://arxiv.org/abs/2408.02976v2","updated":"2024-12-12T12:52:51Z","published":"2024-08-06T06:16:00Z","title":"Empathy Level Alignment via Reinforcement Learning for Empathetic\n Response Generation","summary":" Empathetic response generation, aiming to understand the user's situation and\nfeelings and respond empathically, is crucial in building human-like dialogue\nsystems. Traditional approaches typically employ maximum likelihood estimation\nas the optimization objective during training, yet fail to align the empathy\nlevels between generated and target responses. To this end, we propose an\nempathetic response generation framework using reinforcement learning (EmpRL).\nThe framework develops an effective empathy reward function and generates\nempathetic responses by maximizing the expected reward through reinforcement\nlearning. EmpRL utilizes the pre-trained T5 model as the generator and further\nfine-tunes it to initialize the policy. To align the empathy levels between\ngenerated and target responses within a given context, an empathy reward\nfunction containing three empathy communication mechanisms -- emotional\nreaction, interpretation, and exploration -- is constructed using pre-designed\nand pre-trained empathy identifiers. During reinforcement learning training,\nthe proximal policy optimization algorithm is used to fine-tune the policy,\nenabling the generation of empathetic responses. Both automatic and human\nevaluations demonstrate that the proposed EmpRL framework significantly\nimproves the quality of generated responses, enhances the similarity in empathy\nlevels between generated and target responses, and produces empathetic\nresponses covering both affective and cognitive aspects.\n","authors":["Hui Ma","Bo Zhang","Bo Xu","Jian Wang","Hongfei Lin","Xiao Sun"],"pdf_url":"https://arxiv.org/pdf/2408.02976v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16048v3","updated":"2024-12-12T12:01:43Z","published":"2024-02-25T10:13:04Z","title":"How Likely Do LLMs with CoT Mimic Human Reasoning?","summary":" Chain-of-thought emerges as a promising technique for eliciting reasoning\ncapabilities from Large Language Models (LLMs). However, it does not always\nimprove task performance or accurately represent reasoning processes, leaving\nunresolved questions about its usage. In this paper, we diagnose the underlying\nmechanism by comparing the reasoning process of LLMs with humans, using causal\nanalysis to understand the relationships between the problem instruction,\nreasoning, and the answer in LLMs. Our empirical study reveals that LLMs often\ndeviate from the ideal causal chain, resulting in spurious correlations and\npotential consistency errors (inconsistent reasoning and answers). We also\nexamine various factors influencing the causal structure, finding that\nin-context learning with examples strengthens it, while post-training\ntechniques like supervised fine-tuning and reinforcement learning on human\nfeedback weaken it. To our surprise, the causal structure cannot be\nstrengthened by enlarging the model size only, urging research on new\ntechniques. 
We hope that this preliminary study will shed light on\nunderstanding and improving the reasoning process in LLM.\n","authors":["Guangsheng Bao","Hongbo Zhang","Cunxiang Wang","Linyi Yang","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.16048v3.pdf","comment":"COLING 2025 Camera Version (8 pages, 3 figures, 18 tables)"},{"id":"http://arxiv.org/abs/2412.09203v1","updated":"2024-12-12T11:57:59Z","published":"2024-12-12T11:57:59Z","title":"CleanComedy: Creating Friendly Humor through Generative Techniques","summary":" Humor generation is a challenging task in natural language processing due to\nlimited resources and the quality of existing datasets. Available humor\nlanguage resources often suffer from toxicity and duplication, limiting their\neffectiveness for training robust models. This paper proposes CleanComedy, a\nspecialized, partially annotated toxicity-filtered corpus of English and\nRussian jokes collected from various sources. We study the effectiveness of our\ndata filtering approach through a survey on humor and toxicity levels in\nvarious joke groups. In addition, we study advances in computer humor\ngeneration by comparing jokes written by humans with various groups of\ngenerative jokes, including our baseline models trained on the CleanComedy\ndatasets.\n","authors":["Dmitry Vikhorev","Daria Galimzianova","Svetlana Gorovaia","Elizaveta Zhemchuzhina","Ivan P. Yamshchikov"],"pdf_url":"https://arxiv.org/pdf/2412.09203v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08163v2","updated":"2024-12-12T11:42:11Z","published":"2024-12-11T07:37:26Z","title":"NLPineers@ NLU of Devanagari Script Languages 2025: Hate Speech\n Detection using Ensembling of BERT-based models","summary":" This paper explores hate speech detection in Devanagari-scripted languages,\nfocusing on Hindi and Nepali, for Subtask B of the CHIPSAL@COLING 2025 Shared\nTask. Using a range of transformer-based models such as XLM-RoBERTa, MURIL, and\nIndicBERT, we examine their effectiveness in navigating the nuanced boundary\nbetween hate speech and free expression. Our best performing model, implemented\nas ensemble of multilingual BERT models achieve Recall of 0.7762 (Rank 3/31 in\nterms of recall) and F1 score of 0.6914 (Rank 17/31). To address class\nimbalance, we used backtranslation for data augmentation, and cosine similarity\nto preserve label consistency after augmentation. This work emphasizes the need\nfor hate speech detection in Devanagari-scripted languages and presents a\nfoundation for further research.\n","authors":["Anmol Guragain","Nadika Poudel","Rajesh Piryani","Bishesh Khanal"],"pdf_url":"https://arxiv.org/pdf/2412.08163v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.13516v5","updated":"2024-12-12T11:29:32Z","published":"2024-02-21T03:58:49Z","title":"ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity\n within Large Language Models","summary":" Activation sparsity refers to the existence of considerable\nweakly-contributed elements among activation outputs. As a prevalent property\nof the models using the ReLU activation function, activation sparsity has been\nproven a promising paradigm to boost model inference efficiency. Nevertheless,\nmost large language models (LLMs) adopt activation functions without intrinsic\nactivation sparsity (e.g., GELU and Swish). 
Some recent efforts have explored\nintroducing ReLU or its variants as the substitutive activation function to\nhelp LLMs achieve activation sparsity and inference acceleration, but few can\nsimultaneously obtain high sparsity and comparable model performance. This\npaper introduces a simple and effective sparsification method named \"ProSparse\"\nto push LLMs for higher activation sparsity while maintaining comparable\nperformance. Specifically, after substituting the activation function of LLMs\nwith ReLU, ProSparse adopts progressive sparsity regularization with a factor\nsmoothly increasing along the multi-stage sine curves. This can enhance\nactivation sparsity and mitigate performance degradation by avoiding radical\nshifts in activation distributions. With ProSparse, we obtain high sparsity of\n89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size\nMiniCPM-1B, respectively, achieving comparable performance to their original\nSwish-activated versions. These present the most sparsely activated models\namong open-source LLaMA versions and competitive end-size models, considerably\nsurpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference\nacceleration experiments further demonstrate the significant practical\nacceleration potential of LLMs with higher activation sparsity, obtaining up to\n4.52$\\times$ inference speedup.\n","authors":["Chenyang Song","Xu Han","Zhengyan Zhang","Shengding Hu","Xiyu Shi","Kuai Li","Chen Chen","Zhiyuan Liu","Guangli Li","Tao Yang","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2402.13516v5.pdf","comment":"19 pages, 4 figures, 9 tables"},{"id":"http://arxiv.org/abs/2412.04100v2","updated":"2024-12-12T11:12:03Z","published":"2024-12-05T12:10:42Z","title":"Missing Melodies: AI Music Generation and its \"Nearly\" Complete Omission\n of the Global South","summary":" Recent advances in generative AI have sparked renewed interest and expanded\npossibilities for music generation. However, the performance and versatility of\nthese systems across musical genres are heavily influenced by the availability\nof training data. We conducted an extensive analysis of over one million hours\nof audio datasets used in AI music generation research and manually reviewed\nmore than 200 papers from eleven prominent AI and music conferences and\norganizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,\nNeurIPS, NIME, SMC) to identify a critical gap in the fair representation and\ninclusion of the musical genres of the Global South in AI research. Our\nfindings reveal a stark imbalance: approximately 86% of the total dataset hours\nand over 93% of researchers focus primarily on music from the Global North.\nHowever, around 40% of these datasets include some form of non-Western music,\ngenres from the Global South account for only 14.6% of the data. Furthermore,\napproximately 51% of the papers surveyed concentrate on symbolic music\ngeneration, a method that often fails to capture the cultural nuances inherent\nin music from regions such as South Asia, the Middle East, and Africa. As AI\nincreasingly shapes the creation and dissemination of music, the significant\nunderrepresentation of music genres in datasets and research presents a serious\nthreat to global musical diversity. 
We also propose some important steps to\nmitigate these risks and foster a more inclusive future for AI-driven music\ngeneration.\n","authors":["Atharva Mehta","Shivam Chauhan","Monojit Choudhury"],"pdf_url":"https://arxiv.org/pdf/2412.04100v2.pdf","comment":"Submitted to CACM, 12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.09173v1","updated":"2024-12-12T11:03:25Z","published":"2024-12-12T11:03:25Z","title":"ReFF: Reinforcing Format Faithfulness in Language Models across Varied\n Tasks","summary":" Following formatting instructions to generate well-structured content is a\nfundamental yet often unmet capability for large language models (LLMs). To\nstudy this capability, which we refer to as format faithfulness, we present\nFormatBench, a comprehensive format-related benchmark. Compared to previous\nformat-related benchmarks, FormatBench involves a greater variety of tasks in\nterms of application scenes (traditional NLP tasks, creative works, autonomous\nagency tasks), human-LLM interaction styles (single-turn instruction,\nmulti-turn chat), and format types (inclusion, wrapping, length, coding).\nMoreover, each task in FormatBench is attached with a format checker program.\nExtensive experiments on the benchmark reveal that state-of-the-art open- and\nclosed-source LLMs still suffer from severe deficiency in format faithfulness.\nBy virtue of the decidable nature of formats, we propose to Reinforce Format\nFaithfulness (ReFF) to help LLMs generate formatted output as instructed\nwithout compromising general quality. Without any annotated data, ReFF can\nsubstantially improve the format faithfulness rate (e.g., from 21.6% in\noriginal LLaMA3 to 95.0% on caption segmentation task), while keep the general\nquality comparable (e.g., from 47.3 to 46.4 in F1 scores). Combined with\nlabeled training data, ReFF can simultaneously improve both format faithfulness\n(e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from\n47.3 to 61.6 in F1 scores). We further offer an interpretability analysis to\nexplain how ReFF improves both format faithfulness and general quality.\n","authors":["Jiashu Yao","Heyan Huang","Zeming Liu","Haoyu Wen","Wei Su","Boao Qian","Yuhang Guo"],"pdf_url":"https://arxiv.org/pdf/2412.09173v1.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2404.04108v2","updated":"2024-12-12T10:56:35Z","published":"2024-04-05T14:04:07Z","title":"Large language models as oracles for instantiating ontologies with\n domain-specific knowledge","summary":" Background. Endowing intelligent systems with semantic data commonly requires\ndesigning and instantiating ontologies with domain-specific knowledge.\nEspecially in the early phases, those activities are typically performed\nmanually by human experts possibly leveraging on their own experience. The\nresulting process is therefore time-consuming, error-prone, and often biased by\nthe personal background of the ontology designer. Objective. To mitigate that\nissue, we propose a novel domain-independent approach to automatically\ninstantiate ontologies with domain-specific knowledge, by leveraging on large\nlanguage models (LLMs) as oracles. Method. Starting from (i) an initial schema\ncomposed by inter-related classes and properties and (ii) a set of query\ntemplates, our method queries the LLM multiple times, and generates instances\nfor both classes and properties from its replies. Thus, the ontology is\nautomatically filled with domain-specific knowledge, compliant to the initial\nschema. 
As a result, the ontology is quickly and automatically enriched with\nmanifold instances, which experts may consider to keep, adjust, discard, or\ncomplement according to their own needs and expertise. Contribution. We\nformalise our method in general way and instantiate it over various LLMs, as\nwell as on a concrete case study. We report experiments rooted in the\nnutritional domain where an ontology of food meals and their ingredients is\nautomatically instantiated from scratch, starting from a categorisation of\nmeals and their relationships. There, we analyse the quality of the generated\nontologies and compare ontologies attained by exploiting different LLMs.\nExperimentally, our approach achieves a quality metric that is up to five times\nhigher than the state-of-the-art, while reducing erroneous entities and\nrelations by up to ten times. Finally, we provide a SWOT analysis of the\nproposed method.\n","authors":["Giovanni Ciatto","Andrea Agiollo","Matteo Magnini","Andrea Omicini"],"pdf_url":"https://arxiv.org/pdf/2404.04108v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09165v1","updated":"2024-12-12T10:50:26Z","published":"2024-12-12T10:50:26Z","title":"When Text Embedding Meets Large Language Model: A Comprehensive Survey","summary":" Text embedding has become a foundational technology in natural language\nprocessing (NLP) during the deep learning era, driving advancements across a\nwide array of downstream tasks. While many natural language understanding\nchallenges can now be modeled using generative paradigms and leverage the\nrobust generative and comprehension capabilities of large language models\n(LLMs), numerous practical applications, such as semantic matching, clustering,\nand information retrieval, continue to rely on text embeddings for their\nefficiency and effectiveness. In this survey, we categorize the interplay\nbetween LLMs and text embeddings into three overarching themes: (1)\nLLM-augmented text embedding, enhancing traditional embedding methods with\nLLMs; (2) LLMs as text embedders, utilizing their innate capabilities for\nembedding generation; and (3) Text embedding understanding with LLMs,\nleveraging LLMs to analyze and interpret embeddings. By organizing these\nefforts based on interaction patterns rather than specific downstream\napplications, we offer a novel and systematic overview of contributions from\nvarious research and application domains in the era of LLMs. Furthermore, we\nhighlight the unresolved challenges that persisted in the pre-LLM era with\npre-trained language models (PLMs) and explore the emerging obstacles brought\nforth by LLMs. Building on this analysis, we outline prospective directions for\nthe evolution of text embedding, addressing both theoretical and practical\nopportunities in the rapidly advancing landscape of NLP.\n","authors":["Zhijie Nie","Zhangchi Feng","Mingxin Li","Cunwang Zhang","Yanzhao Zhang","Dingkun Long","Richong Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09165v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2405.20612v2","updated":"2024-12-12T10:46:44Z","published":"2024-05-31T03:59:15Z","title":"UniBias: Unveiling and Mitigating LLM Bias through Internal Attention\n and FFN Manipulation","summary":" Large language models (LLMs) have demonstrated impressive capabilities in\nvarious tasks using the in-context learning (ICL) paradigm. 
However, their\neffectiveness is often compromised by inherent bias, leading to prompt\nbrittleness, i.e., sensitivity to design settings such as example selection,\norder, and prompt formatting. Previous studies have addressed LLM bias through\nexternal adjustment of model outputs, but the internal mechanisms that lead to\nsuch bias remain unexplored. Our work delves into these mechanisms,\nparticularly investigating how feedforward neural networks (FFNs) and attention\nheads result in the bias of LLMs. By Interpreting the contribution of\nindividual FFN vectors and attention heads, we identify the biased LLM\ncomponents that skew LLMs' prediction toward specific labels. To mitigate these\nbiases, we introduce UniBias, an inference-only method that effectively\nidentifies and eliminates biased FFN vectors and attention heads. Extensive\nexperiments across 12 NLP datasets demonstrate that UniBias significantly\nenhances ICL performance and alleviates prompt brittleness of LLMs.\n","authors":["Hanzhang Zhou","Zijian Feng","Zixiao Zhu","Junlang Qian","Kezhi Mao"],"pdf_url":"https://arxiv.org/pdf/2405.20612v2.pdf","comment":"Accepted to NeurIPS 2024"},{"id":"http://arxiv.org/abs/2407.10510v2","updated":"2024-12-12T10:40:22Z","published":"2024-07-15T08:06:37Z","title":"TCM-FTP: Fine-Tuning Large Language Models for Herbal Prescription\n Prediction","summary":" Traditional Chinese medicine (TCM) has relied on specific combinations of\nherbs in prescriptions to treat various symptoms and signs for thousands of\nyears. Predicting TCM prescriptions poses a fascinating technical challenge\nwith significant practical implications. However, this task faces limitations\ndue to the scarcity of high-quality clinical datasets and the complex\nrelationship between symptoms and herbs. To address these issues, we introduce\n\\textit{DigestDS}, a novel dataset comprising practical medical records from\nexperienced experts in digestive system diseases. We also propose a method,\nTCM-FTP (TCM Fine-Tuning Pre-trained), to leverage pre-trained large language\nmodels (LLMs) via supervised fine-tuning on \\textit{DigestDS}. Additionally, we\nenhance computational efficiency using a low-rank adaptation technique.\nMoreover, TCM-FTP incorporates data augmentation by permuting herbs within\nprescriptions, exploiting their order-agnostic nature. Impressively, TCM-FTP\nachieves an F1-score of 0.8031, significantly outperforming previous methods.\nFurthermore, it demonstrates remarkable accuracy in dosage prediction,\nachieving a normalized mean square error of 0.0604. In contrast, LLMs without\nfine-tuning exhibit poor performance. Although LLMs have demonstrated\nwide-ranging capabilities, our work underscores the necessity of fine-tuning\nfor TCM prescription prediction and presents an effective way to accomplish\nthis.\n","authors":["Xingzhi Zhou","Xin Dong","Chunhao Li","Yuning Bai","Yulong Xu","Ka Chun Cheung","Simon See","Xinpeng Song","Runshun Zhang","Xuezhong Zhou","Nevin L. Zhang"],"pdf_url":"https://arxiv.org/pdf/2407.10510v2.pdf","comment":"Camera-ready version to be published in BIBM 2024"},{"id":"http://arxiv.org/abs/2412.08237v2","updated":"2024-12-12T10:01:11Z","published":"2024-12-11T09:38:50Z","title":"TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch","summary":" It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS\nworks typically employ complex data processing pipelines to obtain high-quality\ntraining data. 
These sophisticated pipelines require excellent models at each\nstage (e.g., speech denoising, speech enhancement, speaker diarization, and\npunctuation models), which themselves demand high-quality training data and are\nrarely open-sourced. Even with state-of-the-art models, issues persist, such as\nincomplete background noise removal and misalignment between punctuation and\nactual speech pauses. Moreover, the stringent filtering strategies often retain\nonly 10-30\\% of the original data, significantly impeding data scaling efforts.\nIn this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to\ndesign a simplified yet effective TTS data processing pipeline that maintains\ndata quality while substantially reducing data acquisition costs, achieving a\ndata retention rate of over 50\\%. Beyond data scaling challenges, LLM-based TTS\nsystems also incur higher deployment costs compared to conventional approaches.\nCurrent systems typically use LLMs solely for text-to-token generation, while\nrequiring separate models (e.g., flow matching models) for token-to-waveform\ngeneration, which cannot be directly executed by LLM inference engines, further\ncomplicating deployment. To address these challenges, we eliminate redundant\nmodules in both LLM and flow components, replacing the flow model backbone with\nan LLM architecture. Building upon this simplified flow backbone, we propose a\nunified architecture for both streaming and non-streaming inference,\nsignificantly reducing deployment costs. Finally, we explore the feasibility of\nunifying TTS and ASR tasks using the same data for training, thanks to the\nsimplified pipeline and the S3Tokenizer that reduces the quality requirements\nfor TTS training data.\n","authors":["Xingchen Song","Mengtao Xing","Changwei Ma","Shengqiang Li","Di Wu","Binbin Zhang","Fuping Pan","Dinghao Zhou","Yuekai Zhang","Shun Lei","Zhendong Peng","Zhiyong Wu"],"pdf_url":"https://arxiv.org/pdf/2412.08237v2.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2412.09102v1","updated":"2024-12-12T09:29:59Z","published":"2024-12-12T09:29:59Z","title":"PolyIPA -- Multilingual Phoneme-to-Grapheme Conversion Model","summary":" This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme\nconversion model designed for multilingual name transliteration, onomastic\nresearch, and information retrieval. The model leverages two helper models\ndeveloped for data augmentation: IPA2vec for finding soundalikes across\nlanguages, and similarIPA for handling phonetic notation variations. Evaluated\non a test set that spans multiple languages and writing systems, the model\nachieves a mean Character Error Rate of 0.055 and a character-level BLEU score\nof 0.914, with particularly strong performance on languages with shallow\northographies. 
The implementation of beam search further improves practical\nutility, with top-3 candidates reducing the effective error rate by 52.7\\% (to\nCER: 0.026), demonstrating the model's effectiveness for cross-linguistic\napplications.\n","authors":["Davor Lauc"],"pdf_url":"https://arxiv.org/pdf/2412.09102v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.09131v5","updated":"2024-12-12T09:22:14Z","published":"2024-03-14T06:49:16Z","title":"ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between\n Professional and Non-Professional Responses","summary":" Large Language Models (LLMs) have demonstrated efficacy in various linguistic\napplications, including question answering and controlled text generation.\nHowever, studies into their ability to switch between opposite styles of\nresponses in professional domains remain underexplored. This study introduces a\nnovel approach, named ProSwitch, which enables a language model to switch\nbetween professional and non-professional answers, by tuning and evaluating\nthrough the guidance of domain and style knowledge. ProSwitch unfolds in three\nphases: LLM-augmented preparation to collect domain knowledge and QA pairs,\ninstruction tuning to optimize LLMs with multiple levels of knowledge, and\ncomprehensive evaluation to assess both style discrimination and\nreference-based quality of the generated text. Comparative analysis of\nProSwitch against general and specialized LLMs reveals that our approach\noutperforms baselines in switching between professional and non-professional\nresponses.\n","authors":["Chang Zong","Yuyan Chen","Weiming Lu","Jian Shao","Yongfeng Huang","Heng Chang","Yueting Zhuang"],"pdf_url":"https://arxiv.org/pdf/2403.09131v5.pdf","comment":"8 pages main body, 16 pages total"},{"id":"http://arxiv.org/abs/2412.09094v1","updated":"2024-12-12T09:22:04Z","published":"2024-12-12T09:22:04Z","title":"Filter-then-Generate: Large Language Models with Structure-Text Adapter\n for Knowledge Graph Completion","summary":" Large Language Models (LLMs) present massive inherent knowledge and superior\nsemantic comprehension capability, which have revolutionized various tasks in\nnatural language processing. Despite their success, a critical gap remains in\nenabling LLMs to perform knowledge graph completion (KGC). Empirical evidence\nsuggests that LLMs consistently perform worse than conventional KGC approaches,\neven through sophisticated prompt design or tailored instruction-tuning.\nFundamentally, applying LLMs on KGC introduces several critical challenges,\nincluding a vast set of entity candidates, hallucination issue of LLMs, and\nunder-exploitation of the graph structure. To address these challenges, we\npropose a novel instruction-tuning-based method, namely FtG. Specifically, we\npresent a \\textit{filter-then-generate} paradigm and formulate the KGC task\ninto a multiple-choice question format. In this way, we can harness the\ncapability of LLMs while mitigating the issue casused by hallucinations.\nMoreover, we devise a flexible ego-graph serialization prompt and employ a\nstructure-text adapter to couple structure and text information in a\ncontextualized manner. Experimental results demonstrate that FtG achieves\nsubstantial performance gain compared to existing state-of-the-art methods. 
The\ninstruction dataset and code are available at\n\\url{https://github.com/LB0828/FtG}.\n","authors":["Ben Liu","Jihai Zhang","Fangquan Lin","Cheng Yang","Min Peng"],"pdf_url":"https://arxiv.org/pdf/2412.09094v1.pdf","comment":"COLING 2025 Main Conference"},{"id":"http://arxiv.org/abs/2412.09084v1","updated":"2024-12-12T09:11:45Z","published":"2024-12-12T09:11:45Z","title":"Evaluating Pixel Language Models on Non-Standardized Languages","summary":" We explore the potential of pixel-based models for transfer learning from\nstandard languages to dialects. These models convert text into images that are\ndivided into patches, enabling a continuous vocabulary representation that\nproves especially useful for out-of-vocabulary words common in dialectal data.\nUsing German as a case study, we compare the performance of pixel-based models\nto token-based models across various syntactic and semantic tasks. Our results\nshow that pixel-based models outperform token-based models in part-of-speech\ntagging, dependency parsing and intent detection for zero-shot dialect\nevaluation by up to 26 percentage points in some scenarios, though not in\nStandard German. However, pixel-based models fall short in topic\nclassification. These findings emphasize the potential of pixel-based models\nfor handling dialectal data, though further research should be conducted to\nassess their effectiveness in various linguistic contexts.\n","authors":["Alberto Muñoz-Ortiz","Verena Blaschke","Barbara Plank"],"pdf_url":"https://arxiv.org/pdf/2412.09084v1.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2411.00062v2","updated":"2024-12-12T09:11:30Z","published":"2024-10-31T08:15:32Z","title":"Evolving Alignment via Asymmetric Self-Play","summary":" Current RLHF frameworks for aligning large language models (LLMs) typically\nassume a fixed prompt distribution, which is sub-optimal and limits the\nscalability of alignment and generalizability of models. To address this, we\nintroduce a general open-ended RLHF framework that casts alignment as an\nasymmetric game between two players: (i) a creator that generates increasingly\ninformative prompt distributions using reward signals, and (ii) a solver that\nlearns to produce more preferred responses on prompts produced by the creator.\nThis framework of Evolving Alignment via Asymmetric Self-Play (eva), results in\na simple and efficient approach that can utilize any existing RLHF algorithm\nfor scalable alignment. eva outperforms state-of-the-art methods on widely-used\nbenchmarks, without the need of any additional human crafted prompts.\nSpecifically, eva improves the win rate of Gemma-2-9B-it on Arena-Hard from\n51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7%\nwith SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and\nmatching claude-3-opus. This improvement is persistent even when new human\ncrafted prompts are introduced. Finally, we show eva is effective and robust\nunder various ablation settings.\n","authors":["Ziyu Ye","Rishabh Agarwal","Tianqi Liu","Rishabh Joshi","Sarmishta Velury","Quoc V. 
Le","Qijun Tan","Yuan Liu"],"pdf_url":"https://arxiv.org/pdf/2411.00062v2.pdf","comment":"35 pages, spotlight @ neurips language gamification workshop"},{"id":"http://arxiv.org/abs/2412.09078v1","updated":"2024-12-12T09:01:18Z","published":"2024-12-12T09:01:18Z","title":"Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning","summary":" Large Language Models (LLMs) have shown remarkable abilities across various\nlanguage tasks, but solving complex reasoning problems remains a challenge.\nWhile existing methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT)\nenhance reasoning by decomposing problems or structuring prompts, they\ntypically perform a single pass of reasoning and may fail to revisit flawed\npaths, compromising accuracy. To address this, we propose a novel reasoning\nframework called Forest-of-Thought (FoT), which integrates multiple reasoning\ntrees to leverage collective decision-making for solving complex logical\nproblems. FoT utilizes sparse activation strategies to select the most relevant\nreasoning paths, improving both efficiency and accuracy. Additionally, we\nintroduce a dynamic self-correction strategy that enables real-time error\ncorrection and learning from past mistakes, as well as consensus-guided\ndecision making strategies to optimize correctness and computational resources.\nExperimental results demonstrate that the FoT framework, combined with these\nstrategies, significantly enhances the reasoning capabilities of LLMs, enabling\nthem to solve complex tasks with greater precision and efficiency.\n","authors":["Zhenni Bi","Kai Han","Chuanjian Liu","Yehui Tang","Yunhe Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09078v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2412.09049v1","updated":"2024-12-12T08:19:01Z","published":"2024-12-12T08:19:01Z","title":"Dial-In LLM: Human-Aligned Dialogue Intent Clustering with\n LLM-in-the-loop","summary":" The discovery of customer intention from dialogue plays an important role in\nautomated support system. However, traditional text clustering methods are\npoorly aligned with human perceptions due to the shift from embedding distance\nto semantic distance, and existing quantitative metrics for text clustering may\nnot accurately reflect the true quality of intent clusters. In this paper, we\nleverage the superior language understanding capabilities of Large Language\nModels (LLMs) for designing better-calibrated intent clustering algorithms. We\nfirst establish the foundation by verifying the robustness of fine-tuned LLM\nutility in semantic coherence evaluation and cluster naming, resulting in an\naccuracy of 97.50% and 94.40%, respectively, when compared to the human-labeled\nground truth. Then, we propose an iterative clustering algorithm that\nfacilitates cluster-level refinement and the continuous discovery of\nhigh-quality intent clusters. Furthermore, we present several LLM-in-the-loop\nsemi-supervised clustering techniques tailored for intent discovery from\ncustomer service dialogue. Experiments on a large-scale industrial dataset\ncomprising 1,507 intent clusters demonstrate the effectiveness of the proposed\ntechniques. 
The methods outperformed existing counterparts, achieving 6.25%\nimprovement in quantitative metrics and 12% enhancement in application-level\nperformance when constructing an intent classifier.\n","authors":["Mengze Hong","Yuanfeng Song","Di Jiang","Wailing Ng","Yanjie Sun","Chen Jason Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09046v1","updated":"2024-12-12T08:15:16Z","published":"2024-12-12T08:15:16Z","title":"Multi-Task Learning with LLMs for Implicit Sentiment Analysis:\n Data-level and Task-level Automatic Weight Learning","summary":" Implicit sentiment analysis (ISA) presents significant challenges due to the\nabsence of salient cue words. Previous methods have struggled with insufficient\ndata and limited reasoning capabilities to infer underlying opinions.\nIntegrating multi-task learning (MTL) with large language models (LLMs) offers\nthe potential to enable models of varying sizes to reliably perceive and\nrecognize genuine opinions in ISA. However, existing MTL approaches are\nconstrained by two sources of uncertainty: data-level uncertainty, arising from\nhallucination problems in LLM-generated contextual information, and task-level\nuncertainty, stemming from the varying capacities of models to process\ncontextual information. To handle these uncertainties, we introduce MT-ISA, a\nnovel MTL framework that enhances ISA by leveraging the generation and\nreasoning capabilities of LLMs through automatic MTL. Specifically, MT-ISA\nconstructs auxiliary tasks using generative LLMs to supplement sentiment\nelements and incorporates automatic MTL to fully exploit auxiliary data. We\nintroduce data-level and task-level automatic weight learning (AWL), which\ndynamically identifies relationships and prioritizes more reliable data and\ncritical tasks, enabling models of varying sizes to adaptively learn\nfine-grained weights based on their reasoning capabilities. We investigate\nthree strategies for data-level AWL, while also introducing homoscedastic\nuncertainty for task-level AWL. Extensive experiments reveal that models of\nvarying sizes achieve an optimal balance between primary prediction and\nauxiliary tasks in MT-ISA. This underscores the effectiveness and adaptability\nof our approach.\n","authors":["Wenna Lai","Haoran Xie","Guandong Xu","Qing Li"],"pdf_url":"https://arxiv.org/pdf/2412.09046v1.pdf","comment":"11 pages, 6 figures, and 6 tables"},{"id":"http://arxiv.org/abs/2412.09045v1","updated":"2024-12-12T08:13:32Z","published":"2024-12-12T08:13:32Z","title":"Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain\n Chinese Word Segmentation","summary":" Inspired by early research on exploring naturally annotated data for Chinese\nWord Segmentation (CWS), and also by recent research on integration of speech\nand text processing, this work for the first time proposes to explicitly mine\nword boundaries from speech-text parallel data. We employ the Montreal Forced\nAligner (MFA) toolkit to perform character-level alignment on speech-text data,\ngiving pauses as candidate word boundaries. Based on detailed analysis of\ncollected pauses, we propose an effective probability-based strategy for\nfiltering unreliable word boundaries. To more effectively utilize word\nboundaries as extra training data, we also propose a robust complete-then-train\n(CTT) strategy. We conduct cross-domain CWS experiments on two target domains,\ni.e., ZX and AISHELL2. 
We have annotated about 1,000 sentences as the\nevaluation data of AISHELL2. Experiments demonstrate the effectiveness of our\nproposed approach.\n","authors":["Xuebin Wang","Lei Zhang","Zhenghua Li","Shilin Zhou","Chen Gong","Yang Hou"],"pdf_url":"https://arxiv.org/pdf/2412.09045v1.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2405.16884v3","updated":"2024-12-12T08:05:10Z","published":"2024-05-27T07:05:27Z","title":"Match, Compare, or Select? An Investigation of Large Language Models for\n Entity Matching","summary":" Entity matching (EM) is a critical step in entity resolution (ER). Recently,\nentity matching based on large language models (LLMs) has shown great promise.\nHowever, current LLM-based entity matching approaches typically follow a binary\nmatching paradigm that ignores the global consistency among record\nrelationships. In this paper, we investigate various methodologies for\nLLM-based entity matching that incorporate record interactions from different\nperspectives. Specifically, we comprehensively compare three representative\nstrategies: matching, comparing, and selecting, and analyze their respective\nadvantages and challenges in diverse scenarios. Based on our findings, we\nfurther design a compound entity matching framework (ComEM) that leverages the\ncomposition of multiple strategies and LLMs. ComEM benefits from the advantages\nof different sides and achieves improvements in both effectiveness and\nefficiency. Experimental results on 8 ER datasets and 10 LLMs verify the\nsuperiority of incorporating record interactions through the selecting\nstrategy, as well as the further cost-effectiveness brought by ComEM.\n","authors":["Tianshu Wang","Xiaoyang Chen","Hongyu Lin","Xuanang Chen","Xianpei Han","Hao Wang","Zhenyu Zeng","Le Sun"],"pdf_url":"https://arxiv.org/pdf/2405.16884v3.pdf","comment":"Accepted at COLING 2025. Our code is available at\n https://github.com/tshu-w/ComEM"},{"id":"http://arxiv.org/abs/2406.13282v3","updated":"2024-12-12T08:00:36Z","published":"2024-06-19T07:23:33Z","title":"Understanding the RoPE Extensions of Long-Context LLMs: An Attention\n Perspective","summary":" Enabling LLMs to handle lengthy context is currently a research hotspot. Most\nLLMs are built upon rotary position embedding (RoPE), a popular position\nencoding method. Therefore, a prominent path is to extrapolate the RoPE trained\non comparably short texts to far longer texts. A heavy bunch of efforts have\nbeen dedicated to boosting the extrapolation via extending the formulations of\nthe RoPE, however, few of them have attempted to showcase their inner workings\ncomprehensively. In this paper, we are driven to offer a straightforward yet\nin-depth understanding of RoPE extensions from an attention perspective and on\ntwo benchmarking tasks. 
A broad array of experiments reveals several valuable\nfindings: 1) Maintaining attention patterns to those at the pretrained length\nimproves extrapolation; 2) Large attention uncertainty leads to retrieval\nerrors; 3) Using longer continual pretraining lengths for RoPE extensions could\nreduce attention uncertainty and significantly enhance extrapolation.\n","authors":["Meizhi Zhong","Chen Zhang","Yikun Lei","Xikai Liu","Yan Gao","Yao Hu","Kehai Chen","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.13282v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09036v1","updated":"2024-12-12T07:52:56Z","published":"2024-12-12T07:52:56Z","title":"ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based\n on Layer Uncertainty","summary":" Large Language models (LLMs) have become a research hotspot. To accelerate\nthe inference of LLMs, storing computed caches in memory has become the\nstandard technique. However, as the inference length increases, growing KV\ncaches might lead to out-of-memory issues. Many existing methods address this\nissue through KV cache compression, primarily by preserving key tokens\nthroughout all layers to reduce information loss. Most of them allocate a\nuniform budget size for each layer to retain. However, we observe that the\nminimum budget sizes needed to retain essential information vary across layers\nand models based on the perspectives of attention and hidden state output.\nBuilding on this observation, this paper proposes a simple yet effective KV\ncache compression method that leverages layer uncertainty to allocate budget\nsize for each layer. Experimental results show that the proposed method can\nreduce memory usage of the KV caches to only $\\sim$20\\% when compared to Full\nKV inference while achieving nearly lossless performance.\n","authors":["Meizhi Zhong","Xikai Liu","Chen Zhang","Yikun Lei","Yan Gao","Yao Hu","Kehai Chen","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09036v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11147v2","updated":"2024-12-12T07:50:43Z","published":"2024-09-17T12:58:29Z","title":"Reasoning Graph Enhanced Exemplars Retrieval for In-Context Learning","summary":" Large language models (LLMs) have exhibited remarkable few-shot learning\ncapabilities and unified the paradigm of NLP tasks through the in-context\nlearning (ICL) technique. Despite the success of ICL, the quality of the\nexemplar demonstrations can significantly influence the LLM's performance.\nExisting exemplar selection methods mainly focus on the semantic similarity\nbetween queries and candidate exemplars. On the other hand, the logical\nconnections between reasoning steps can be beneficial to depict the\nproblem-solving process as well. In this paper, we proposes a novel method\nnamed Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first quires LLM\nto generate an initial response, then expresses intermediate problem-solving\nsteps to a graph structure. After that, it employs graph kernel to select\nexemplars with semantic and structural similarity. Extensive experiments\ndemonstrate the structural relationship is helpful to the alignment of queries\nand candidate exemplars. The efficacy of RGER on math and logit reasoning tasks\nshowcases its superiority over state-of-the-art retrieval-based approaches. 
Our\ncode is released at https://github.com/Yukang-Lin/RGER.\n","authors":["Yukang Lin","Bingchen Zhong","Shuoran Jiang","Joanna Siebert","Qingcai Chen"],"pdf_url":"https://arxiv.org/pdf/2409.11147v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09034v1","updated":"2024-12-12T07:49:06Z","published":"2024-12-12T07:49:06Z","title":"Dialogue Language Model with Large-Scale Persona Data Engineering","summary":" Maintaining persona consistency is paramount in the application of\nopen-domain dialogue systems, as exemplified by models like ChatGPT. Despite\nsignificant advancements, the limited scale and diversity of current persona\ndialogue datasets remain challenges to achieving robust persona-consistent\ndialogue models. In this study, drawing inspiration from the success of\nlarge-scale pre-training, we introduce PPDS, an open-domain persona dialogue\nsystem that employs extensive generative pre-training on a persona dialogue\ndataset to enhance persona consistency. Specifically, we present a persona\nextraction model designed to autonomously and precisely generate vast persona\ndialogue datasets. Additionally, we unveil a pioneering persona augmentation\ntechnique to address the invalid persona bias inherent in the constructed\ndataset. Both quantitative and human evaluations consistently highlight the\nsuperior response quality and persona consistency of our proposed model,\nunderscoring its effectiveness.\n","authors":["Mengze Hong","Chen Zhang","Chaotao Chen","Rongzhong Lian","Di Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.09034v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.07890v2","updated":"2024-12-12T07:45:33Z","published":"2024-07-10T17:57:58Z","title":"Training on the Test Task Confounds Evaluation and Emergence","summary":" We study a fundamental problem in the evaluation of large language models\nthat we call training on the test task. Unlike wrongful practices like training\non the test data, leakage, or data contamination, training on the test task is\nnot a malpractice. Rather, the term describes a growing set of practices that\nutilize knowledge about evaluation tasks at training time. We demonstrate that\ntraining on the test task confounds both relative model evaluations and claims\nabout emergent capabilities. We argue that the seeming superiority of one model\nfamily over another may be explained by a different degree of training on the\ntest task. To this end, we propose an effective method to adjust for the effect\nof training on the test task on benchmark evaluations. Put simply, to fine-tune\neach model under comparison on the same task-relevant data before evaluation.\nWe then show that instances of emergent behavior disappear gradually as models\ntrain on the test task. Our work promotes a new perspective on the evaluation\nof large language models with broad implications for benchmarking and the study\nof emergent capabilities\n","authors":["Ricardo Dominguez-Olmedo","Florian E. Dorner","Moritz Hardt"],"pdf_url":"https://arxiv.org/pdf/2407.07890v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09025v1","updated":"2024-12-12T07:40:55Z","published":"2024-12-12T07:40:55Z","title":"Shiksha: A Technical Domain focused Translation Dataset and Model for\n Indian Languages","summary":" Neural Machine Translation (NMT) models are typically trained on datasets\nwith limited exposure to Scientific, Technical and Educational domains.\nTranslation models thus, in general, struggle with tasks that involve\nscientific understanding or technical jargon. 
Their performance is found to be\neven worse for low-resource Indian languages. Finding a translation dataset\nthat tends to these domains in particular, poses a difficult challenge. In this\npaper, we address this by creating a multilingual parallel corpus containing\nmore than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality\ntranslation pairs across 8 Indian languages. We achieve this by bitext mining\nhuman-translated transcriptions of NPTEL video lectures. We also finetune and\nevaluate NMT models using this corpus and surpass all other publicly available\nmodels at in-domain tasks. We also demonstrate the potential for generalizing\nto out-of-domain translation tasks by improving the baseline by over 2 BLEU on\naverage for these Indian languages on the Flores+ benchmark. We are pleased to\nrelease our model and dataset via this link: https://huggingface.co/SPRINGLab.\n","authors":["Advait Joglekar","Srinivasan Umesh"],"pdf_url":"https://arxiv.org/pdf/2412.09025v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01268v2","updated":"2024-12-12T07:29:41Z","published":"2024-10-02T06:24:51Z","title":"Deep Learning and Machine Learning, Advancing Big Data Analytics and\n Management: Unveiling AI's Potential Through Tools, Techniques, and\n Applications","summary":" Artificial intelligence (AI), machine learning, and deep learning have become\ntransformative forces in big data analytics and management, enabling\ngroundbreaking advancements across diverse industries. This article delves into\nthe foundational concepts and cutting-edge developments in these fields, with a\nparticular focus on large language models (LLMs) and their role in natural\nlanguage processing, multimodal reasoning, and autonomous decision-making.\nHighlighting tools such as ChatGPT, Claude, and Gemini, the discussion explores\ntheir applications in data analysis, model design, and optimization.\n The integration of advanced algorithms like neural networks, reinforcement\nlearning, and generative models has enhanced the capabilities of AI systems to\nprocess, visualize, and interpret complex datasets. Additionally, the emergence\nof technologies like edge computing and automated machine learning (AutoML)\ndemocratizes access to AI, empowering users across skill levels to engage with\nintelligent systems. This work also underscores the importance of ethical\nconsiderations, transparency, and fairness in the deployment of AI\ntechnologies, paving the way for responsible innovation.\n Through practical insights into hardware configurations, software\nenvironments, and real-world applications, this article serves as a\ncomprehensive resource for researchers and practitioners. 
By bridging\ntheoretical underpinnings with actionable strategies, it showcases the\npotential of AI and LLMs to revolutionize big data management and drive\nmeaningful advancements across domains such as healthcare, finance, and\nautonomous systems.\n","authors":["Pohsun Feng","Ziqian Bi","Yizhu Wen","Xuanhe Pan","Benji Peng","Ming Liu","Jiawei Xu","Keyu Chen","Junyu Liu","Caitlyn Heqi Yin","Sen Zhang","Jinlang Wang","Qian Niu","Ming Li","Tianyang Wang"],"pdf_url":"https://arxiv.org/pdf/2410.01268v2.pdf","comment":"This book contains 155 pages and 9 figures"},{"id":"http://arxiv.org/abs/2412.09014v1","updated":"2024-12-12T07:24:16Z","published":"2024-12-12T07:24:16Z","title":"Improvement in Sign Language Translation Using Text CTC Alignment","summary":" Current sign language translation (SLT) approaches often rely on gloss-based\nsupervision with Connectionist Temporal Classification (CTC), limiting their\nability to handle non-monotonic alignments between sign language video and\nspoken text. In this work, we propose a novel method combining joint\nCTC/Attention and transfer learning. The joint CTC/Attention introduces\nhierarchical encoding and integrates CTC with the attention mechanism during\ndecoding, effectively managing both monotonic and non-monotonic alignments.\nMeanwhile, transfer learning helps bridge the modality gap between vision and\nlanguage in SLT. Experimental results on two widely adopted benchmarks,\nRWTH-PHOENIX-Weather 2014 T and CSL-Daily, show that our method achieves\nresults comparable to state-of-the-art and outperforms the pure-attention\nbaseline. Additionally, this work opens a new door for future research into\ngloss-free SLT using text-based CTC alignment.\n","authors":["Sihan Tan","Taro Miyazaki","Nabeela Khan","Kazuhiro Nakadai"],"pdf_url":"https://arxiv.org/pdf/2412.09014v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09012v1","updated":"2024-12-12T07:23:52Z","published":"2024-12-12T07:23:52Z","title":"What Makes Cryptic Crosswords Challenging for LLMs?","summary":" Cryptic crosswords are puzzles that rely on general knowledge and the\nsolver's ability to manipulate language on different levels, dealing with\nvarious types of wordplay. Previous research suggests that solving such puzzles\nis challenging even for modern NLP models, including Large Language Models\n(LLMs). However, there is little to no research on the reasons for their poor\nperformance on this task. In this paper, we establish the benchmark results for\nthree popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance\non this task is still significantly below that of humans. We also investigate\nwhy these models struggle to achieve superior performance. We release our code\nand introduced datasets at\nhttps://github.com/bodasadallah/decrypting-crosswords.\n","authors":["Abdelrahman Sadallah","Daria Kotova","Ekaterina Kochmar"],"pdf_url":"https://arxiv.org/pdf/2412.09012v1.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2411.07474v2","updated":"2024-12-12T07:21:19Z","published":"2024-11-12T01:26:41Z","title":"Controlled Evaluation of Syntactic Knowledge in Multilingual Language\n Models","summary":" Language models (LMs) are capable of acquiring elements of human-like\nsyntactic knowledge. Targeted syntactic evaluation tests have been employed to\nmeasure how well they form generalizations about syntactic phenomena in\nhigh-resource languages such as English. 
However, we still lack a thorough\nunderstanding of LMs' capacity for syntactic generalizations in low-resource\nlanguages, which are responsible for much of the diversity of syntactic\npatterns worldwide. In this study, we develop targeted syntactic evaluation\ntests for three low-resource languages (Basque, Hindi, and Swahili) and use\nthem to evaluate five families of open-access multilingual Transformer LMs. We\nfind that some syntactic tasks prove relatively easy for LMs while others\n(agreement in sentences containing indirect objects in Basque, agreement across\na prepositional phrase in Swahili) are challenging. We additionally uncover\nissues with publicly available Transformers, including a bias toward the\nhabitual aspect in Hindi in multilingual BERT and underperformance compared to\nsimilar-sized models in XGLM-4.5B.\n","authors":["Daria Kryvosheieva","Roger Levy"],"pdf_url":"https://arxiv.org/pdf/2411.07474v2.pdf","comment":"LoResLM workshop at COLING 2025"},{"id":"http://arxiv.org/abs/2406.11073v2","updated":"2024-12-12T06:44:12Z","published":"2024-06-16T21:02:02Z","title":"Exploring the Limitations of Detecting Machine-Generated Text","summary":" Recent improvements in the quality of the generations by large language\nmodels have spurred research into identifying machine-generated text. Such work\noften presents high-performing detectors. However, humans and machines can\nproduce text in different styles and domains, yet the performance impact of\nsuch on machine generated text detection systems remains unclear. In this\npaper, we audit the classification performance for detecting machine-generated\ntext by evaluating on texts with varying writing styles. We find that\nclassifiers are highly sensitive to stylistic changes and differences in text\ncomplexity, and in some cases degrade entirely to random classifiers. We\nfurther find that detection systems are particularly susceptible to misclassify\neasy-to-read texts while they have high performance for complex texts, leading\nto concerns about the reliability of detection systems. We recommend that\nfuture work attends to stylistic factors and reading difficulty levels of\nhuman-written and machine-generated text.\n","authors":["Jad Doughman","Osama Mohammed Afzal","Hawau Olamide Toyin","Shady Shehata","Preslav Nakov","Zeerak Talat"],"pdf_url":"https://arxiv.org/pdf/2406.11073v2.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2412.08985v1","updated":"2024-12-12T06:38:40Z","published":"2024-12-12T06:38:40Z","title":"Assessing the Robustness of Retrieval-Augmented Generation Systems in\n K-12 Educational Question Answering with Knowledge Discrepancies","summary":" Retrieval-Augmented Generation (RAG) systems have demonstrated remarkable\npotential as question answering systems in the K-12 Education domain, where\nknowledge is typically queried within the restricted scope of authoritative\ntextbooks. However, the discrepancy between textbooks and the parametric\nknowledge in Large Language Models (LLMs) could undermine the effectiveness of\nRAG systems. To systematically investigate the robustness of RAG systems under\nsuch knowledge discrepancies, we present EduKDQA, a question answering dataset\nthat simulates knowledge discrepancies in real applications by applying\nhypothetical knowledge updates in answers and source documents. 
EduKDQA\nincludes 3,005 questions covering five subjects, under a comprehensive question\ntypology from the perspective of context utilization and knowledge integration.\nWe conducted extensive experiments on retrieval and question answering\nperformance. We find that most RAG systems suffer from a substantial\nperformance drop in question answering with knowledge discrepancies, while\nquestions that require integration of contextual knowledge and parametric\nknowledge pose a challenge to LLMs.\n","authors":["Tianshi Zheng","Weihan Li","Jiaxin Bai","Weiqi Wang","Yangqiu Song"],"pdf_url":"https://arxiv.org/pdf/2412.08985v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2409.09269v3","updated":"2024-12-12T06:26:09Z","published":"2024-09-14T02:29:36Z","title":"Guiding Vision-Language Model Selection for Visual Question-Answering\n Across Tasks, Domains, and Knowledge Types","summary":" Visual Question-Answering (VQA) has become key to user experience,\nparticularly after improved generalization capabilities of Vision-Language\nModels (VLMs). But evaluating VLMs for an application requirement using a\nstandardized framework in practical settings is still challenging. This paper\naims to solve that using an end-to-end framework. We present VQA360 - a novel\ndataset derived from established VQA benchmarks, annotated with task types,\napplication domains, and knowledge types, for a comprehensive evaluation. We\nalso introduce GoEval, a multimodal evaluation metric developed using GPT-4o,\nachieving a correlation factor of 56.71% with human judgments. Our experiments\nwith state-of-the-art VLMs reveal that no single model excels universally,\nthus, making a right choice a key design decision. Proprietary models such as\nGemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source\nmodels like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive\nstrengths, while providing additional advantages. Our framework can also be\nextended to other tasks.\n","authors":["Neelabh Sinha","Vinija Jain","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2409.09269v3.pdf","comment":"Accepted at The First Workshop of Evaluation of Multi-Modal\n Generation (EvalMG) in 31st International Conference on Computational\n Linguistics (COLING), 2025. 8 pages + references + 6 pages of Appendix"},{"id":"http://arxiv.org/abs/2412.08316v2","updated":"2024-12-12T06:18:37Z","published":"2024-12-11T11:53:14Z","title":"Rumor Detection on Social Media with Temporal Propagation Structure\n Optimization","summary":" Traditional methods for detecting rumors on social media primarily focus on\nanalyzing textual content, often struggling to capture the complexity of online\ninteractions. Recent research has shifted towards leveraging graph neural\nnetworks to model the hierarchical conversation structure that emerges during\nrumor propagation. However, these methods tend to overlook the temporal aspect\nof rumor propagation and may disregard potential noise within the propagation\nstructure. In this paper, we propose a novel approach that incorporates\ntemporal information by constructing a weighted propagation tree, where the\nweight of each edge represents the time interval between connected posts.\nDrawing upon the theory of structural entropy, we transform this tree into a\ncoding tree. This transformation aims to preserve the essential structure of\nrumor propagation while reducing noise. 
Finally, we introduce a recursive\nneural network to learn from the coding tree for rumor veracity prediction.\nExperimental results on two common datasets demonstrate the superiority of our\napproach.\n","authors":["Xingyu Peng","Junran Wu","Ruomei Liu","Ke Xu"],"pdf_url":"https://arxiv.org/pdf/2412.08316v2.pdf","comment":"COLING'25"},{"id":"http://arxiv.org/abs/2409.18417v2","updated":"2024-12-12T06:18:36Z","published":"2024-09-27T03:15:07Z","title":"VickreyFeedback: Cost-efficient Data Construction for Reinforcement\n Learning from Human Feedback","summary":" This paper addresses the cost-efficiency aspect of Reinforcement Learning\nfrom Human Feedback (RLHF). RLHF leverages datasets of human preferences over\noutputs of large language models (LLM)s to instill human expectations into\nLLMs. Although preference annotation comes with a monetized cost, the economic\nutility of a preference dataset has not been considered by far. What\nexacerbates this situation is that, given complex intransitive or cyclic\nrelationships in preference datasets, existing algorithms for fine-tuning LLMs\nare still far from capturing comprehensive preferences. This raises severe\ncost-efficiency concerns in production environments, where preference data\naccumulate over time. In this paper, we discuss the fine-tuning of LLMs as a\nmonetized economy and introduce an auction mechanism to improve the efficiency\nof preference data collection in dollar terms. We show that introducing an\nauction mechanism can play an essential role in enhancing the cost-efficiency\nof RLHF, while maintaining satisfactory model performance. Experimental results\ndemonstrate that our proposed auction-based protocol is cost-effective for\nfine-tuning LLMs concentrating on high-quality feedback.\n","authors":["Guoxi Zhang","Jiuding Duan"],"pdf_url":"https://arxiv.org/pdf/2409.18417v2.pdf","comment":"16 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.08972v1","updated":"2024-12-12T06:08:46Z","published":"2024-12-12T06:08:46Z","title":"RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World\n Scenarios","summary":" This paper introduces RuleArena, a novel and challenging benchmark designed\nto evaluate the ability of large language models (LLMs) to follow complex,\nreal-world rules in reasoning. Covering three practical domains -- airline\nbaggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs'\nproficiency in handling intricate natural language instructions that demand\nlong-context understanding, logical reasoning, and accurate mathematical\ncomputation. Two key attributes distinguish RuleArena from traditional\nrule-based reasoning benchmarks: (1) it extends beyond standard first-order\nlogic representations, and (2) it is grounded in authentic, practical\nscenarios, providing insights into the suitability and reliability of LLMs for\nreal-world applications. Our findings reveal several notable limitations in\nLLMs: (1) they struggle to identify and apply the appropriate rules, frequently\nbecoming confused by similar but distinct regulations, (2) they cannot\nconsistently perform accurate mathematical computations, even when they\ncorrectly identify the relevant rules, and (3) in general, they perform poorly\nin the benchmark. 
These results highlight significant challenges in advancing\nLLMs' rule-guided reasoning capabilities in real-life applications.\n","authors":["Ruiwen Zhou","Wenyue Hua","Liangming Pan","Sitao Cheng","Xiaobao Wu","En Yu","William Yang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08972v1.pdf","comment":"Data and Codes are available at\n https://github.com/skyriver-2000/RuleArena"},{"id":"http://arxiv.org/abs/2412.08970v1","updated":"2024-12-12T06:04:31Z","published":"2024-12-12T06:04:31Z","title":"Reasoning-Aware Query-Focused Summarization over Multi-Table Data","summary":" Query-focused summarization over multi-table data is a challenging yet\ncritical task for extracting precise and relevant information from structured\ndata. Existing methods often rely on complex preprocessing steps and struggle\nto generalize across domains or handle the logical reasoning required for\nmulti-table queries. In this paper, we propose QueryTableSummarizer++, an\nend-to-end generative framework leveraging large language models (LLMs)\nenhanced with table-aware pre-training, query-aligned fine-tuning, and\nreinforcement learning with feedback. Our method eliminates the need for\nintermediate serialization steps and directly generates query-relevant\nsummaries. Experiments on a benchmark dataset demonstrate that\nQueryTableSummarizer++ significantly outperforms state-of-the-art baselines in\nterms of BLEU, ROUGE, and F1-score. Additional analyses highlight its\nscalability, generalization across domains, and robust handling of complex\nqueries. Human evaluation further validates the superior quality and practical\napplicability of the generated summaries, establishing QueryTableSummarizer++\nas a highly effective solution for multi-table summarization tasks.\n","authors":["Xiaochuan Lin","Xiangyong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.08970v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08955v1","updated":"2024-12-12T05:36:51Z","published":"2024-12-12T05:36:51Z","title":"Align, Generate, Learn: A Novel Closed-Loop Framework for Cross-Lingual\n In-Context Learning","summary":" Cross-lingual in-context learning (XICL) has emerged as a transformative\nparadigm for leveraging large language models (LLMs) to tackle multilingual\ntasks, especially for low-resource languages. However, existing approaches\noften rely on external retrievers or task-specific fine-tuning, limiting their\nscalability and generalizability. In this paper, we propose a novel\nself-supervised framework that harnesses the generative capabilities of LLMs to\ninternally select and utilize task-relevant examples. Our method introduces two\nkey objectives: a retrieval-generation alignment loss to optimize the quality\nof selected examples and a semantic coherence loss to ensure cross-lingual\nconsistency. Through extensive experiments on multilingual benchmarks, our\napproach achieves state-of-the-art performance, significantly outperforming\nexisting baselines. Further analysis highlights its robustness across diverse\nlanguage families and its ability to generalize to unseen tasks. Human\nevaluations confirm the superior fluency, relevance, and semantic correctness\nof outputs generated by our method. 
This work provides a scalable, effective,\nand generalizable solution for cross-lingual in-context learning.\n","authors":["Mateo Alejandro Rojas","Rafael Carranza"],"pdf_url":"https://arxiv.org/pdf/2412.08955v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.15650v3","updated":"2024-12-12T05:34:36Z","published":"2024-04-24T05:08:55Z","title":"Return of EM: Entity-driven Answer Set Expansion for QA Evaluation","summary":" Recently, directly using large language models (LLMs) has been shown to be\nthe most reliable method to evaluate QA models. However, it suffers from\nlimited interpretability, high cost, and environmental harm. To address these,\nwe propose to use soft EM with entity-driven answer set expansion. Our approach\nexpands the gold answer set to include diverse surface forms, based on the\nobservation that the surface forms often follow particular patterns depending\non the entity type. The experimental results show that our method outperforms\ntraditional evaluation methods by a large margin. Moreover, the reliability of\nour evaluation method is comparable to that of LLM-based ones, while offering\nthe benefits of high interpretability and reduced environmental harm.\n","authors":["Dongryeol Lee","Minwoo Lee","Kyungmin Min","Joonsuk Park","Kyomin Jung"],"pdf_url":"https://arxiv.org/pdf/2404.15650v3.pdf","comment":"Accepted at COLING 2025 (16 pages, 4 figures, 11 tables)"},{"id":"http://arxiv.org/abs/2402.06126v4","updated":"2024-12-12T05:28:56Z","published":"2024-02-09T01:18:16Z","title":"Learn To be Efficient: Build Structured Sparsity in Large Language\n Models","summary":" Large Language Models (LLMs) have achieved remarkable success with their\nbillion-level parameters, yet they incur high inference overheads. The\nemergence of activation sparsity in LLMs provides a natural approach to reduce\nthis cost by involving only parts of the parameters for inference. However,\nexisting methods only focus on utilizing this naturally formed activation\nsparsity in a post-training setting, overlooking the potential for further\namplifying this inherent sparsity. In this paper, we hypothesize that LLMs can\nlearn to be efficient by achieving more structured activation sparsity. To\nachieve this, we introduce a novel training algorithm, Learn-To-be-Efficient\n(LTE), designed to train efficiency-aware LLMs to learn to activate fewer\nneurons and achieve a better trade-off between sparsity and performance.\nFurthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based\nmodels, LTE can also be applied to LLMs like LLaMA using non-ReLU activations.\nExtensive evaluation on language understanding, language generation, and\ninstruction tuning tasks show that LTE consistently outperforms SOTA baselines.\nAlong with our hardware-aware custom kernel implementation, LTE reduces\nLLaMA2-7B inference latency by 25% at 50% sparsity.\n","authors":["Haizhong Zheng","Xiaoyan Bai","Xueshen Liu","Z. Morley Mao","Beidi Chen","Fan Lai","Atul Prakash"],"pdf_url":"https://arxiv.org/pdf/2402.06126v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08948v1","updated":"2024-12-12T05:26:43Z","published":"2024-12-12T05:26:43Z","title":"Mojito: Motion Trajectory and Intensity Control for Video Generation","summary":" Recent advancements in diffusion models have shown great promise in producing\nhigh-quality video content. 
However, efficiently training diffusion models\ncapable of integrating directional guidance and controllable motion intensity\nremains a challenging and under-explored area. This paper introduces Mojito, a\ndiffusion model that incorporates both \\textbf{Mo}tion tra\\textbf{j}ectory and\n\\textbf{i}ntensi\\textbf{t}y contr\\textbf{o}l for text to video generation.\nSpecifically, Mojito features a Directional Motion Control module that\nleverages cross-attention to efficiently direct the generated object's motion\nwithout additional training, alongside a Motion Intensity Modulator that uses\noptical flow maps generated from videos to guide varying levels of motion\nintensity. Extensive experiments demonstrate Mojito's effectiveness in\nachieving precise trajectory and intensity control with high computational\nefficiency, generating motion patterns that closely match specified directions\nand intensities, providing realistic dynamics that align well with natural\nmotion in real-world scenarios.\n","authors":["Xuehai He","Shuohang Wang","Jianwei Yang","Xiaoxia Wu","Yiping Wang","Kuan Wang","Zheng Zhan","Olatunji Ruwase","Yelong Shen","Xin Eric Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08948v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08946v1","updated":"2024-12-12T05:22:49Z","published":"2024-12-12T05:22:49Z","title":"MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for\n Multi-Task Learning","summary":" Recently, LoRA has emerged as a crucial technique for fine-tuning large\npre-trained models, yet its performance in multi-task learning scenarios often\nfalls short. In contrast, the MoE architecture presents a natural solution to\nthis issue. However, it introduces challenges such as mutual interference of\ndata across multiple domains and knowledge forgetting of various tasks.\nAdditionally, MoE significantly increases the number of parameters, posing a\ncomputational cost challenge. Therefore, in this paper, we propose MoSLD, a\nmixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these\nchallenges by sharing the upper projection matrix in LoRA among different\nexperts, encouraging the model to learn general knowledge across tasks, while\nstill allowing the lower projection matrix to focus on the unique features of\neach task. The application of dropout alleviates the imbalanced update of\nparameter matrix and mitigates parameter overfitting in LoRA. Extensive\nexperiments demonstrate that our model exhibits excellent performance in both\nsingle-task and multi-task scenarios, with robust out-of-domain generalization\ncapabilities.\n","authors":["Lulu Zhao","Weihao Zeng","Xiaofeng Shi","Hua Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.08946v1.pdf","comment":"Accept by COLING 2025"},{"id":"http://arxiv.org/abs/2402.10882v6","updated":"2024-12-12T05:18:18Z","published":"2024-02-16T18:36:36Z","title":"Universal Prompt Optimizer for Safe Text-to-Image Generation","summary":" Text-to-Image (T2I) models have shown great performance in generating images\nbased on textual prompts. However, these models are vulnerable to unsafe input\nto generate unsafe content like sexual, harassment and illegal-activity images.\nExisting studies based on image checker, model fine-tuning and embedding\nblocking are impractical in real-world applications. Hence, we propose the\nfirst universal prompt optimizer for safe T2I (POSI) generation in black-box\nscenario. We first construct a dataset consisting of toxic-clean prompt pairs\nby GPT-3.5 Turbo. 
To guide the optimizer to have the ability of converting\ntoxic prompt to clean prompt while preserving semantic information, we design a\nnovel reward function measuring toxicity and text alignment of generated images\nand train the optimizer through Proximal Policy Optimization. Experiments show\nthat our approach can effectively reduce the likelihood of various T2I models\nin generating inappropriate images, with no significant impact on text\nalignment. It is also flexible to be combined with methods to achieve better\nperformance. Our code is available at https://github.com/wu-zongyu/POSI.\n","authors":["Zongyu Wu","Hongcheng Gao","Yueze Wang","Xiang Zhang","Suhang Wang"],"pdf_url":"https://arxiv.org/pdf/2402.10882v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08285v2","updated":"2024-12-12T05:10:43Z","published":"2024-12-11T11:00:33Z","title":"Adaptive Prompting for Continual Relation Extraction: A Within-Task\n Variance Perspective","summary":" To address catastrophic forgetting in Continual Relation Extraction (CRE),\nmany current approaches rely on memory buffers to rehearse previously learned\nknowledge while acquiring new tasks. Recently, prompt-based methods have\nemerged as potent alternatives to rehearsal-based strategies, demonstrating\nstrong empirical performance. However, upon analyzing existing prompt-based\napproaches for CRE, we identified several critical limitations, such as\ninaccurate prompt selection, inadequate mechanisms for mitigating forgetting in\nshared parameters, and suboptimal handling of cross-task and within-task\nvariances. To overcome these challenges, we draw inspiration from the\nrelationship between prefix-tuning and mixture of experts, proposing a novel\napproach that employs a prompt pool for each task, capturing variations within\neach task while enhancing cross-task variances. Furthermore, we incorporate a\ngenerative model to consolidate prior knowledge within shared parameters,\neliminating the need for explicit data storage. Extensive experiments validate\nthe efficacy of our approach, demonstrating superior performance over\nstate-of-the-art prompt-based and rehearsal-free methods in continual relation\nextraction.\n","authors":["Minh Le","Tien Ngoc Luu","An Nguyen The","Thanh-Thien Le","Trang Nguyen","Tung Thanh Nguyen","Linh Ngo Van","Thien Huu Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.08285v2.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2411.07870v5","updated":"2024-12-12T05:02:11Z","published":"2024-11-12T15:26:17Z","title":"Trustful LLMs: Customizing and Grounding Text Generation with Knowledge\n Bases and Dual Decoders","summary":" Although people are impressed by the content generation skills of large\nlanguage models, the use of LLMs, such as ChatGPT, is limited by the domain\ngrounding of the content. The correctness and groundedness of the generated\ncontent need to be based on a verified context, such as results from\nRetrieval-Augmented Generation (RAG). One important issue when adapting LLMs to\na customized domain is that the generated responses are often incomplete, or\nthe additions are not verified and may even be hallucinated. Prior studies on\nhallucination detection have focused on evaluation metrics, which are not\neasily adaptable to dynamic domains and can be vulnerable to attacks like\njail-breaking. 
In this work, we propose 1) a post-processing algorithm that\nleverages knowledge triplets in RAG context to correct hallucinations and 2) a\ndual-decoder model that fuses RAG context to guide the generation process.\n","authors":["Xiaofeng Zhu","Jaya Krishna Mandivarapu"],"pdf_url":"https://arxiv.org/pdf/2411.07870v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07334v2","updated":"2024-12-12T05:01:04Z","published":"2024-12-10T09:25:39Z","title":"Frame Representation Hypothesis: Multi-Token LLM Interpretability and\n Concept-Guided Text Generation","summary":" Interpretability is a key challenge in fostering trust for Large Language\nModels (LLMs), which stems from the complexity of extracting reasoning from\nmodel's parameters. We present the Frame Representation Hypothesis, a\ntheoretically robust framework grounded in the Linear Representation Hypothesis\n(LRH) to interpret and control LLMs by modeling multi-token words. Prior\nresearch explored LRH to connect LLM representations with linguistic concepts,\nbut was limited to single token analysis. As most words are composed of several\ntokens, we extend LRH to multi-token words, thereby enabling usage on any\ntextual data with thousands of concepts. To this end, we propose words can be\ninterpreted as frames, ordered sequences of vectors that better capture\ntoken-word relationships. Then, concepts can be represented as the average of\nword frames sharing a common concept. We showcase these tools through Top-k\nConcept-Guided Decoding, which can intuitively steer text generation using\nconcepts of choice. We verify said ideas on Llama 3.1, Gemma 2, and Phi 3\nfamilies, demonstrating gender and language biases, exposing harmful content,\nbut also potential to remediate them, leading to safer and more transparent\nLLMs. Code is available at\nhttps://github.com/phvv-me/frame-representation-hypothesis.git\n","authors":["Pedro H. V. Valois","Lincon S. Souza","Erica K. Shimomoto","Kazuhiro Fukui"],"pdf_url":"https://arxiv.org/pdf/2412.07334v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08937v1","updated":"2024-12-12T04:58:32Z","published":"2024-12-12T04:58:32Z","title":"Multi-Scale Heterogeneous Text-Attributed Graph Datasets From Diverse\n Domains","summary":" Heterogeneous Text-Attributed Graphs (HTAGs), where different types of\nentities are not only associated with texts but also connected by diverse\nrelationships, have gained widespread popularity and application across various\ndomains. However, current research on text-attributed graph learning\npredominantly focuses on homogeneous graphs, which feature a single node and\nedge type, thus leaving a gap in understanding how methods perform on HTAGs.\nOne crucial reason is the lack of comprehensive HTAG datasets that offer\noriginal textual content and span multiple domains of varying sizes. To this\nend, we introduce a collection of challenging and diverse benchmark datasets\nfor realistic and reproducible evaluation of machine learning models on HTAGs.\nOur HTAG datasets are multi-scale, span years in duration, and cover a wide\nrange of domains, including movie, community question answering, academic,\nliterature, and patent networks. We further conduct benchmark experiments on\nthese datasets with various graph neural networks. 
All source data, dataset\nconstruction codes, processed HTAGs, data loaders, benchmark codes, and\nevaluation setup are publicly available at GitHub and Hugging Face.\n","authors":["Yunhui Liu","Qizhuo Xie","Jinwei Shi","Jiaxu Shen","Tieke He"],"pdf_url":"https://arxiv.org/pdf/2412.08937v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08920v1","updated":"2024-12-12T04:06:54Z","published":"2024-12-12T04:06:54Z","title":"From Text to Trajectory: Exploring Complex Constraint Representation and\n Decomposition in Safe Reinforcement Learning","summary":" Safe reinforcement learning (RL) requires the agent to finish a given task\nwhile obeying specific constraints. Giving constraints in natural language form\nhas great potential for practical scenarios due to its flexible transfer\ncapability and accessibility. Previous safe RL methods with natural language\nconstraints typically need to design cost functions manually for each\nconstraint, which requires domain expertise and lacks flexibility. In this\npaper, we harness the dual role of text in this task, using it not only to\nprovide constraint but also as a training signal. We introduce the\nTrajectory-level Textual Constraints Translator (TTCT) to replace the manually\ndesigned cost function. Our empirical results demonstrate that TTCT effectively\ncomprehends textual constraint and trajectory, and the policies trained by TTCT\ncan achieve a lower violation rate than the standard cost function. Extra\nstudies are conducted to demonstrate that the TTCT has zero-shot transfer\ncapability to adapt to constraint-shift environments.\n","authors":["Pusen Dong","Tianchen Zhu","Yue Qiu","Haoyi Zhou","Jianxin Li"],"pdf_url":"https://arxiv.org/pdf/2412.08920v1.pdf","comment":"Accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.08905v1","updated":"2024-12-12T03:37:41Z","published":"2024-12-12T03:37:41Z","title":"Phi-4 Technical Report","summary":" We present phi-4, a 14-billion parameter language model developed with a\ntraining recipe that is centrally focused on data quality. Unlike most language\nmodels, where pre-training is based primarily on organic data sources such as\nweb content or code, phi-4 strategically incorporates synthetic data throughout\nthe training process. While previous models in the Phi family largely distill\nthe capabilities of a teacher model (specifically GPT-4), phi-4 substantially\nsurpasses its teacher model on STEM-focused QA capabilities, giving evidence\nthat our data-generation and post-training techniques go beyond distillation.\nDespite minimal changes to the phi-3 architecture, phi-4 achieves strong\nperformance relative to its size -- especially on reasoning-focused benchmarks\n-- due to improved data, training curriculum, and innovations in the\npost-training scheme.\n","authors":["Marah Abdin","Jyoti Aneja","Harkirat Behl","Sébastien Bubeck","Ronen Eldan","Suriya Gunasekar","Michael Harrison","Russell J. Hewett","Mojan Javaheripi","Piero Kauffmann","James R. Lee","Yin Tat Lee","Yuanzhi Li","Weishung Liu","Caio C. T. 
Mendes","Anh Nguyen","Eric Price","Gustavo de Rosa","Olli Saarikivi","Adil Salim","Shital Shah","Xin Wang","Rachel Ward","Yue Wu","Dingli Yu","Cyril Zhang","Yi Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.08905v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04144v2","updated":"2024-12-12T03:30:34Z","published":"2024-12-05T13:12:51Z","title":"If You Can't Use Them, Recycle Them: Optimizing Merging at Scale\n Mitigates Performance Tradeoffs","summary":" Model merging has shown great promise at combining expert models, but the\nbenefit of merging is unclear when merging ``generalist'' models trained on\nmany tasks. We explore merging in the context of large (~100B) models, by\nrecycling checkpoints that exhibit tradeoffs among different tasks. Such\ncheckpoints are often created in the process of developing a frontier model,\nand many suboptimal ones are usually discarded. Given a pool of model\ncheckpoints obtained from different training runs (e.g., different stages,\nobjectives, hyperparameters, and data mixtures), which naturally show tradeoffs\nacross different language capabilities (e.g., instruction following vs. code\ngeneration), we investigate whether merging can recycle such suboptimal models\ninto a Pareto-optimal one. Our optimization algorithm tunes the weight of each\ncheckpoint in a linear combination, resulting in a Pareto-optimal models that\noutperforms both individual models and merge-based baselines. Further analysis\nshows that good merges tend to include almost all checkpoints with non-zero\nweights, indicating that even seemingly bad initial checkpoints can contribute\nto good final merges.\n","authors":["Muhammad Khalifa","Yi-Chern Tan","Arash Ahmadian","Tom Hosking","Honglak Lee","Lu Wang","Ahmet Üstün","Tom Sherborne","Matthias Gallé"],"pdf_url":"https://arxiv.org/pdf/2412.04144v2.pdf","comment":"13 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.11641v2","updated":"2024-12-12T03:26:57Z","published":"2024-01-22T01:06:17Z","title":"Revolutionizing Finance with LLMs: An Overview of Applications and\n Insights","summary":" In recent years, Large Language Models (LLMs) like ChatGPT have seen\nconsiderable advancements and have been applied in diverse fields. Built on the\nTransformer architecture, these models are trained on extensive datasets,\nenabling them to understand and generate human language effectively. In the\nfinancial domain, the deployment of LLMs is gaining momentum. These models are\nbeing utilized for automating financial report generation, forecasting market\ntrends, analyzing investor sentiment, and offering personalized financial\nadvice. Leveraging their natural language processing capabilities, LLMs can\ndistill key insights from vast financial data, aiding institutions in making\ninformed investment choices and enhancing both operational efficiency and\ncustomer satisfaction. In this study, we provide a comprehensive overview of\nthe emerging integration of LLMs into various financial tasks. Additionally, we\nconducted holistic tests on multiple financial tasks through the combination of\nnatural language instructions. Our findings show that GPT-4 effectively follow\nprompt instructions across various financial tasks. 
This survey and evaluation\nof LLMs in the financial domain aim to deepen the understanding of LLMs'\ncurrent role in finance for both financial practitioners and LLM researchers,\nidentify new research and application prospects, and highlight how these\ntechnologies can be leveraged to solve practical challenges in the finance\nindustry.\n","authors":["Huaqin Zhao","Zhengliang Liu","Zihao Wu","Yiwei Li","Tianze Yang","Peng Shu","Shaochen Xu","Haixing Dai","Lin Zhao","Hanqi Jiang","Yi Pan","Junhao Chen","Yifan Zhou","Gengchen Mai","Ninghao Liu","Tianming Liu"],"pdf_url":"https://arxiv.org/pdf/2401.11641v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08900v1","updated":"2024-12-12T03:24:49Z","published":"2024-12-12T03:24:49Z","title":"AI-assisted Knowledge Discovery in Biomedical Literature to Support\n Decision-making in Precision Oncology","summary":" The delivery of appropriate targeted therapies to cancer patients requires\nthe complete analysis of the molecular profiling of tumors and the patient's\nclinical characteristics in the context of existing knowledge and recent\nfindings described in biomedical literature and several other sources. We\nevaluated the potential contributions of specific natural language processing\nsolutions to support knowledge discovery from biomedical literature. Two models\nfrom the Bidirectional Encoder Representations from Transformers (BERT) family,\ntwo Large Language Models, and PubTator 3.0 were tested for their ability to\nsupport the named entity recognition (NER) and the relation extraction (RE)\ntasks. PubTator 3.0 and the BioBERT model performed best in the NER task (best\nF1-score equal to 0.93 and 0.89, respectively), while BioBERT outperformed all\nother solutions in the RE task (best F1-score 0.79) and a specific use case it\nwas applied to by recognizing nearly all entity mentions and most of the\nrelations.\n","authors":["Ting He","Kory Kreimeyer","Mimi Najjar","Jonathan Spiker","Maria Fatteh","Valsamo Anagnostou","Taxiarchis Botsis"],"pdf_url":"https://arxiv.org/pdf/2412.08900v1.pdf","comment":"Accepted at AMIA Annual Symposium 2024"},{"id":"http://arxiv.org/abs/2410.07561v2","updated":"2024-12-12T02:47:05Z","published":"2024-10-10T02:58:52Z","title":"AI-Press: A Multi-Agent News Generating and Feedback Simulation System\n Powered by Large Language Models","summary":" The rise of various social platforms has transformed journalism. The growing\ndemand for news content has led to the increased use of large language models\n(LLMs) in news production due to their speed and cost-effectiveness. However,\nLLMs still encounter limitations in professionalism and ethical judgment in\nnews generation. Additionally, predicting public feedback is usually difficult\nbefore news is released. To tackle these challenges, we introduce AI-Press, an\nautomated news drafting and polishing system based on multi-agent collaboration\nand Retrieval-Augmented Generation. We develop a feedback simulation system\nthat generates public feedback considering demographic distributions. 
Through\nextensive quantitative and qualitative evaluations, our system shows\nsignificant improvements in news-generating capabilities and verifies the\neffectiveness of public feedback simulation.\n","authors":["Xiawei Liu","Shiyue Yang","Xinnong Zhang","Haoyu Kuang","Libo Sun","Yihang Yang","Siming Chen","Xuanjing Huang","Zhongyu Wei"],"pdf_url":"https://arxiv.org/pdf/2410.07561v2.pdf","comment":"18 pages, 4 figures"},{"id":"http://arxiv.org/abs/2403.11802v4","updated":"2024-12-12T02:45:29Z","published":"2024-03-18T14:01:45Z","title":"Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark\n for Evaluating Long-Context Large Language Models","summary":" Despite recent efforts to develop large language models with robust\nlong-context capabilities, the lack of long-context benchmarks means that\nrelatively little is known about their performance. To alleviate this gap, in\nthis paper, we propose \\textbf{Counting-Stars}, a multi-evidence,\nposition-aware, and scalable benchmark designed to evaluate the multi-evidence\nretrieval capabilities of long-context LLMs. \\textbf{Counting-Stars} comprises\ntwo counting-based multiple pieces of evidence retrieval tasks: searching and\nreasoning. Using Counting-Stars, we conducted experiments to evaluate several\nlong-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4,\nand Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro\nachieves the best overall results, while GPT-4 Turbo exhibits the most stable\nperformance across various tasks. Furthermore, our analysis of these LLMs,\nwhich have been extended to handle long-context scenarios, indicates that\nsignificant room for improvement remains as the length of the input context and\nthe complexity of the tasks increase.\n","authors":["Mingyang Song","Mao Zheng","Xuan Luo"],"pdf_url":"https://arxiv.org/pdf/2403.11802v4.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2406.12644v4","updated":"2024-12-12T02:37:52Z","published":"2024-06-18T14:12:27Z","title":"Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for\n Large Language Models Aligned with Human Cognitive Principles","summary":" Assessing the effectiveness of large language models (LLMs) in performing\ndifferent tasks is crucial for understanding their strengths and weaknesses.\nThis paper presents Hierarchical Prompting Taxonomy (HPT), grounded on human\ncognitive principles and designed to assess LLMs by examining the cognitive\ndemands of various tasks. The HPT utilizes the Hierarchical Prompting Framework\n(HPF), which structures five unique prompting strategies in a hierarchical\norder based on their cognitive requirement on LLMs when compared to human\nmental capabilities. It assesses the complexity of tasks with the Hierarchical\nPrompting Index (HPI), which demonstrates the cognitive competencies of LLMs\nacross diverse datasets and offers insights into the cognitive demands that\ndatasets place on different LLMs. This approach enables a comprehensive\nevaluation of an LLMs problem solving abilities and the intricacy of a dataset,\noffering a standardized metric for task complexity. Extensive experiments with\nmultiple datasets and LLMs show that HPF enhances LLM performance by 2% to 63%\ncompared to baseline performance, with GSM8k being the most cognitively complex\ntask among reasoning and coding tasks with an average HPI of 3.20 confirming\nthe effectiveness of HPT. 
To support future research and reproducibility in\nthis domain, the implementations of HPT and HPF are available here.\n","authors":["Devichand Budagam","Ashutosh Kumar","Mahsa Khoshnoodi","Sankalp KJ","Vinija Jain","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2406.12644v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08864v1","updated":"2024-12-12T01:52:25Z","published":"2024-12-12T01:52:25Z","title":"A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning\n Instructions","summary":" Synthesizing high-quality reasoning data for continual training has been\nproven to be effective in enhancing the performance of Large Language Models\n(LLMs). However, previous synthetic approaches struggle to easily scale up data\nand incur high costs in the pursuit of high quality. In this paper, we propose\nthe Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable\nframework for high-quality reasoning data synthesis. Inspired by knowledge\ngraphs, we extracted knowledge points from seed data and constructed a\nknowledge point relationships graph to explore their interconnections. By\nexploring the implicit relationships among knowledge, our method achieves\n$\\times$255 data expansion. Furthermore, GSDP led by open-source models,\nachieves synthesis quality comparable to GPT-4-0613 while maintaining\n$\\times$100 lower costs. To tackle the most challenging mathematical reasoning\ntask, we present the GSDP-MATH dataset comprising over 1.91 million pairs of\nmath problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B based on\nMistral-7B achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating\nthe effectiveness of our method. The dataset and models trained in this paper\nwill be available.\n","authors":["Jiankang Wang","Jianjun Xu","Xiaorui Wang","Yuxin Wang","Mengting Xing","Shancheng Fang","Zhineng Chen","Hongtao Xie","Yongdong Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.08864v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.07329v2","updated":"2024-12-12T01:35:21Z","published":"2024-07-10T02:56:55Z","title":"Probability of Differentiation Reveals Brittleness of Homogeneity Bias\n in GPT-4","summary":" Homogeneity bias in Large Language Models (LLMs) refers to their tendency to\nhomogenize the representations of some groups compared to others. Previous\nstudies documenting this bias have predominantly used encoder models, which may\nhave inadvertently introduced biases. To address this limitation, we prompted\nGPT-4 to generate single word/expression completions associated with 18\nsituation cues-specific, measurable elements of environments that influence how\nindividuals perceive situations and compared the variability of these\ncompletions using probability of differentiation. This approach directly\nassessed homogeneity bias from the model's outputs, bypassing encoder models.\nAcross five studies, we find that homogeneity bias is highly volatile across\nsituation cues and writing prompts, suggesting that the bias observed in past\nwork may reflect those within encoder models rather than LLMs. Furthermore, we\nfind that homogeneity bias in LLMs is brittle, as even minor and arbitrary\nchanges in prompts can significantly alter the expression of biases. Future\nwork should further explore how variations in syntactic features and topic\nchoices in longer text generations influence homogeneity bias in LLMs.\n","authors":["Messi H. J. Lee","Calvin K. 
Lai"],"pdf_url":"https://arxiv.org/pdf/2407.07329v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06287v2","updated":"2024-12-12T01:20:14Z","published":"2024-12-09T08:19:28Z","title":"PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking\n Large Language Models","summary":" The emergence of Large Language Models (LLMs) in the medical domain has\nstressed a compelling need for standard datasets to evaluate their\nquestion-answering (QA) performance. Although there have been several benchmark\ndatasets for medical QA, they either cover common knowledge across different\ndepartments or are specific to another department rather than pediatrics.\nMoreover, some of them are limited to objective questions and do not measure\nthe generation capacity of LLMs. Therefore, they cannot comprehensively assess\nthe QA ability of LLMs in pediatrics. To fill this gap, we construct\nPediaBench, the first Chinese pediatric dataset for LLM evaluation.\nSpecifically, it contains 4,565 objective questions and 1,632 subjective\nquestions spanning 12 pediatric disease groups. It adopts an integrated scoring\ncriterion based on different difficulty levels to thoroughly assess the\nproficiency of an LLM in instruction following, knowledge understanding,\nclinical case analysis, etc. Finally, we validate the effectiveness of\nPediaBench with extensive experiments on 20 open-source and commercial LLMs.\nThrough an in-depth analysis of experimental results, we offer insights into\nthe ability of LLMs to answer pediatric questions in the Chinese context,\nhighlighting their limitations for further improvements. Our code and data are\npublished at https://github.com/ACMISLab/PediaBench.\n","authors":["Qian Zhang","Panfeng Chen","Jiali Li","Linkun Feng","Shuyu Liu","Heng Zhao","Mei Chen","Hui Li","Yanhao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.06287v2.pdf","comment":"21 pages, 12 figures"},{"id":"http://arxiv.org/abs/2412.08846v1","updated":"2024-12-12T00:52:11Z","published":"2024-12-12T00:52:11Z","title":"Exploring Large Language Models on Cross-Cultural Values in Connection\n with Training Methodology","summary":" Large language models (LLMs) closely interact with humans, and thus need an\nintimate understanding of the cultural values of human society. In this paper,\nwe explore how open-source LLMs make judgments on diverse categories of\ncultural values across countries, and its relation to training methodology such\nas model sizes, training corpus, alignment, etc. Our analysis shows that LLMs\ncan judge socio-cultural norms similar to humans but less so on social systems\nand progress. In addition, LLMs tend to judge cultural values biased toward\nWestern culture, which can be improved with training on the multilingual\ncorpus. We also find that increasing model size helps a better understanding of\nsocial values, but smaller models can be enhanced by using synthetic data. Our\nanalysis reveals valuable insights into the design methodology of LLMs in\nconnection with their understanding of cultural values.\n","authors":["Minsang Kim","Seungjun Baek"],"pdf_url":"https://arxiv.org/pdf/2412.08846v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.16594v3","updated":"2024-12-12T00:35:03Z","published":"2024-11-25T17:28:44Z","title":"From Generation to Judgment: Opportunities and Challenges of\n LLM-as-a-judge","summary":" Assessment and evaluation have long been critical challenges in artificial\nintelligence (AI) and natural language processing (NLP). 
However, traditional\nmethods, whether matching-based or embedding-based, often fall short of judging\nsubtle attributes and delivering satisfactory results. Recent advancements in\nLarge Language Models (LLMs) inspire the \"LLM-as-a-judge\" paradigm, where LLMs\nare leveraged to perform scoring, ranking, or selection across various tasks\nand applications. This paper provides a comprehensive survey of LLM-based\njudgment and assessment, offering an in-depth overview to advance this emerging\nfield. We begin by giving detailed definitions from both input and output\nperspectives. Then we introduce a comprehensive taxonomy to explore\nLLM-as-a-judge from three dimensions: what to judge, how to judge and where to\njudge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and\nhighlight key challenges and promising directions, aiming to provide valuable\ninsights and inspire future research in this promising research area. Paper\nlist and more resources about LLM-as-a-judge can be found at\n\\url{https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge} and\n\\url{https://llm-as-a-judge.github.io}.\n","authors":["Dawei Li","Bohan Jiang","Liangjie Huang","Alimohammad Beigi","Chengshuai Zhao","Zhen Tan","Amrita Bhattacharjee","Yuxuan Jiang","Canyu Chen","Tianhao Wu","Kai Shu","Lu Cheng","Huan Liu"],"pdf_url":"https://arxiv.org/pdf/2411.16594v3.pdf","comment":"v3: add missing citations; 32 pages, 5 figures"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2412.09625v1","updated":"2024-12-12T18:59:59Z","published":"2024-12-12T18:59:59Z","title":"Illusion3D: 3D Multiview Illusion with 2D Diffusion Priors","summary":" Automatically generating multiview illusions is a compelling challenge, where\na single piece of visual content offers distinct interpretations from different\nviewing perspectives. Traditional methods, such as shadow art and wire art,\ncreate interesting 3D illusions but are limited to simple visual outputs (i.e.,\nfigure-ground or line drawing), restricting their artistic expressiveness and\npractical versatility. Recent diffusion-based illusion generation methods can\ngenerate more intricate designs but are confined to 2D images. In this work, we\npresent a simple yet effective approach for creating 3D multiview illusions\nbased on user-provided text prompts or images. Our method leverages a\npre-trained text-to-image diffusion model to optimize the textures and geometry\nof neural 3D representations through differentiable rendering. When viewed from\nmultiple angles, this produces different interpretations. We develop several\ntechniques to improve the quality of the generated 3D multiview illusions. We\ndemonstrate the effectiveness of our approach through extensive experiments and\nshowcase illusion generation with diverse 3D forms.\n","authors":["Yue Feng","Vaibhav Sanjay","Spencer Lutz","Badour AlBahar","Songwei Ge","Jia-Bin Huang"],"pdf_url":"https://arxiv.org/pdf/2412.09625v1.pdf","comment":"Project page: https://3d-multiview-illusion.github.io/"},{"id":"http://arxiv.org/abs/2412.09626v1","updated":"2024-12-12T18:59:59Z","published":"2024-12-12T18:59:59Z","title":"FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free\n Scale Fusion","summary":" Visual diffusion models achieve remarkable progress, yet they are typically\ntrained at limited resolutions due to the lack of high-resolution data and\nconstrained computation resources, hampering their ability to generate\nhigh-fidelity images or videos at higher resolutions. 
Recent efforts have\nexplored tuning-free strategies to exhibit the untapped potential\nhigher-resolution visual generation of pre-trained models. However, these\nmethods are still prone to producing low-quality visual content with repetitive\npatterns. The key obstacle lies in the inevitable increase in high-frequency\ninformation when the model generates visual content exceeding its training\nresolution, leading to undesirable repetitive patterns deriving from the\naccumulated errors. To tackle this challenge, we propose FreeScale, a\ntuning-free inference paradigm to enable higher-resolution visual generation\nvia scale fusion. Specifically, FreeScale processes information from different\nreceptive scales and then fuses it by extracting desired frequency components.\nExtensive experiments validate the superiority of our paradigm in extending the\ncapabilities of higher-resolution visual generation for both image and video\nmodels. Notably, compared with the previous best-performing method, FreeScale\nunlocks the generation of 8k-resolution images for the first time.\n","authors":["Haonan Qiu","Shiwei Zhang","Yujie Wei","Ruihang Chu","Hangjie Yuan","Xiang Wang","Yingya Zhang","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09626v1.pdf","comment":"Project Page: http://haonanqiu.com/projects/FreeScale.html"},{"id":"http://arxiv.org/abs/2412.09627v1","updated":"2024-12-12T18:59:59Z","published":"2024-12-12T18:59:59Z","title":"Doe-1: Closed-Loop Autonomous Driving with Large World Model","summary":" End-to-end autonomous driving has received increasing attention due to its\npotential to learn from large amounts of data. However, most existing methods\nare still open-loop and suffer from weak scalability, lack of high-order\ninteractions, and inefficient decision-making. In this paper, we explore a\nclosed-loop framework for autonomous driving and propose a large Driving wOrld\nmodEl (Doe-1) for unified perception, prediction, and planning. We formulate\nautonomous driving as a next-token generation problem and use multi-modal\ntokens to accomplish different tasks. Specifically, we use free-form texts\n(i.e., scene descriptions) for perception and generate future predictions\ndirectly in the RGB space with image tokens. For planning, we employ a\nposition-aware tokenizer to effectively encode action into discrete tokens. We\ntrain a multi-modal transformer to autoregressively generate perception,\nprediction, and planning tokens in an end-to-end and unified manner.\nExperiments on the widely used nuScenes dataset demonstrate the effectiveness\nof Doe-1 in various tasks including visual question-answering,\naction-conditioned video generation, and motion planning. Code:\nhttps://github.com/wzzheng/Doe.\n","authors":["Wenzhao Zheng","Zetian Xia","Yuanhui Huang","Sicheng Zuo","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.09627v1.pdf","comment":"Code is available at: https://github.com/wzzheng/Doe"},{"id":"http://arxiv.org/abs/2412.09624v1","updated":"2024-12-12T18:59:57Z","published":"2024-12-12T18:59:57Z","title":"GenEx: Generating an Explorable World","summary":" Understanding, navigating, and exploring the 3D physical real world has long\nbeen a central challenge in the development of artificial intelligence. In this\nwork, we take a step toward this goal by introducing GenEx, a system capable of\nplanning complex embodied world exploration, guided by its generative\nimagination that forms priors (expectations) about the surrounding\nenvironments. 
GenEx generates an entire 3D-consistent imaginative environment\nfrom as little as a single RGB image, bringing it to life through panoramic\nvideo streams. Leveraging scalable 3D world data curated from Unreal Engine,\nour generative model is grounded in the physical world. It captures a continuous\n360-degree environment with little effort, offering a boundless landscape for\nAI agents to explore and interact with. GenEx achieves high-quality world\ngeneration, robust loop consistency over long trajectories, and demonstrates\nstrong 3D capabilities such as consistency and active 3D mapping. Powered by\ngenerative imagination of the world, GPT-assisted agents are equipped to\nperform complex embodied tasks, including both goal-agnostic exploration and\ngoal-driven navigation. These agents utilize predictive expectation regarding\nunseen parts of the physical world to refine their beliefs, simulate different\noutcomes based on potential decisions, and make more informed choices. In\nsummary, we demonstrate that GenEx provides a transformative platform for\nadvancing embodied AI in imaginative spaces and brings potential for extending\nthese capabilities to real-world exploration.\n","authors":["Taiming Lu","Tianmin Shu","Junfei Xiao","Luoxin Ye","Jiahao Wang","Cheng Peng","Chen Wei","Daniel Khashabi","Rama Chellappa","Alan Yuille","Jieneng Chen"],"pdf_url":"https://arxiv.org/pdf/2412.09624v1.pdf","comment":"Website: GenEx.world"},{"id":"http://arxiv.org/abs/2412.09623v1","updated":"2024-12-12T18:59:56Z","published":"2024-12-12T18:59:56Z","title":"OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video\n Generation","summary":" As virtual reality gains popularity, the demand for controllable creation of\nimmersive and dynamic omnidirectional videos (ODVs) is increasing. While\nprevious text-to-ODV generation methods achieve impressive results, they\nstruggle with content inaccuracies and inconsistencies due to reliance solely\non textual inputs. Although recent motion control techniques provide\nfine-grained control for video generation, directly applying these methods to\nODVs often results in spatial distortion and unsatisfactory performance,\nespecially with complex spherical motions. To tackle these challenges, we\npropose OmniDrag, the first approach enabling both scene- and object-level\nmotion control for accurate, high-quality omnidirectional image-to-video\ngeneration. Building on pretrained video diffusion models, we introduce an\nomnidirectional control module, which is jointly fine-tuned with temporal\nattention layers to effectively handle complex spherical motion. In addition,\nwe develop a novel spherical motion estimator that accurately extracts\nmotion-control signals and allows users to perform drag-style ODV generation by\nsimply drawing handle and target points. We also present a new dataset, named\nMove360, addressing the scarcity of ODV data with large scene and object\nmotions. Experiments demonstrate the significant superiority of OmniDrag in\nachieving holistic scene-level and fine-grained object-level control for ODV\ngeneration. 
The project page is available at\nhttps://lwq20020127.github.io/OmniDrag.\n","authors":["Weiqi Li","Shijie Zhao","Chong Mou","Xuhan Sheng","Zhenyu Zhang","Qian Wang","Junlin Li","Li Zhang","Jian Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09623v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09622v1","updated":"2024-12-12T18:59:55Z","published":"2024-12-12T18:59:55Z","title":"LoRACLR: Contrastive Adaptation for Customization of Diffusion Models","summary":" Recent advances in text-to-image customization have enabled high-fidelity,\ncontext-rich generation of personalized images, allowing specific concepts to\nappear in a variety of scenarios. However, current methods struggle with\ncombining multiple personalized models, often leading to attribute entanglement\nor requiring separate training to preserve concept distinctiveness. We present\nLoRACLR, a novel approach for multi-concept image generation that merges\nmultiple LoRA models, each fine-tuned for a distinct concept, into a single,\nunified model without additional individual fine-tuning. LoRACLR uses a\ncontrastive objective to align and merge the weight spaces of these models,\nensuring compatibility while minimizing interference. By enforcing distinct yet\ncohesive representations for each concept, LoRACLR enables efficient, scalable\nmodel composition for high-quality, multi-concept image synthesis. Our results\nhighlight the effectiveness of LoRACLR in accurately merging multiple concepts,\nadvancing the capabilities of personalized image generation.\n","authors":["Enis Simsar","Thomas Hofmann","Federico Tombari","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2412.09622v1.pdf","comment":"Project page: https://loraclr.github.io/"},{"id":"http://arxiv.org/abs/2412.09620v1","updated":"2024-12-12T18:59:54Z","published":"2024-12-12T18:59:54Z","title":"Learning Camera Movement Control from Real-World Drone Videos","summary":" This study seeks to automate camera movement control for filming existing\nsubjects into attractive videos, contrasting with the creation of non-existent\ncontent by directly generating the pixels. We select drone videos as our test\ncase due to their rich and challenging motion patterns, distinctive viewing\nangles, and precise controls. Existing AI videography methods struggle with\nlimited appearance diversity in simulation training, high costs of recording\nexpert operations, and difficulties in designing heuristic-based goals to cover\nall scenarios. To avoid these issues, we propose a scalable method that\ninvolves collecting real-world training data to improve diversity, extracting\ncamera trajectories automatically to minimize annotation costs, and training an\neffective architecture that does not rely on heuristics. Specifically, we\ncollect 99k high-quality trajectories by running 3D reconstruction on online\nvideos, connecting camera poses from consecutive frames to formulate 3D camera\npaths, and using Kalman filter to identify and remove low-quality data.\nMoreover, we introduce DVGFormer, an auto-regressive transformer that leverages\nthe camera path and images from all past frames to predict camera movement in\nthe next frame. We evaluate our system across 38 synthetic natural scenes and 7\nreal city 3D scans. We show that our system effectively learns to perform\nchallenging camera movements such as navigating through obstacles, maintaining\nlow altitude to increase perceived speed, and orbiting towers and buildings,\nwhich are very useful for recording high-quality videos. 
Data and code are\navailable at dvgformer.github.io.\n","authors":["Yunzhong Hou","Liang Zheng","Philip Torr"],"pdf_url":"https://arxiv.org/pdf/2412.09620v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09621v1","updated":"2024-12-12T18:59:54Z","published":"2024-12-12T18:59:54Z","title":"Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos","summary":" Learning to understand dynamic 3D scenes from imagery is crucial for\napplications ranging from robotics to scene reconstruction. Yet, unlike other\nproblems where large-scale supervised training has enabled rapid progress,\ndirectly supervising methods for recovering 3D motion remains challenging due\nto the fundamental difficulty of obtaining ground truth annotations. We present\na system for mining high-quality 4D reconstructions from internet stereoscopic,\nwide-angle videos. Our system fuses and filters the outputs of camera pose\nestimation, stereo depth estimation, and temporal tracking methods into\nhigh-quality dynamic 3D reconstructions. We use this method to generate\nlarge-scale data in the form of world-consistent, pseudo-metric 3D point clouds\nwith long-term motion trajectories. We demonstrate the utility of this data by\ntraining a variant of DUSt3R to predict structure and 3D motion from real-world\nimage pairs, showing that training on our reconstructed data enables\ngeneralization to diverse real-world scenes. Project page:\nhttps://stereo4d.github.io\n","authors":["Linyi Jin","Richard Tucker","Zhengqi Li","David Fouhey","Noah Snavely","Aleksander Holynski"],"pdf_url":"https://arxiv.org/pdf/2412.09621v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09619v1","updated":"2024-12-12T18:59:53Z","published":"2024-12-12T18:59:53Z","title":"SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices\n with Efficient Architectures and Training","summary":" Existing text-to-image (T2I) diffusion models face several limitations,\nincluding large model sizes, slow runtime, and low-quality generation on mobile\ndevices. This paper aims to address all of these challenges by developing an\nextremely small and fast T2I model that generates high-resolution and\nhigh-quality images on mobile platforms. We propose several techniques to\nachieve this goal. First, we systematically examine the design choices of the\nnetwork architecture to reduce model parameters and latency, while ensuring\nhigh-quality generation. Second, to further improve generation quality, we\nemploy cross-architecture knowledge distillation from a much larger model,\nusing a multi-level approach to guide the training of our model from scratch.\nThird, we enable a few-step generation by integrating adversarial guidance with\nknowledge distillation. For the first time, our model, SnapGen, demonstrates the\ngeneration of 1024x1024 px images on a mobile device in around 1.4 seconds. On\nImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for\n256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our\nmodel, with merely 379M parameters, surpasses large-scale models with billions\nof parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x\nsmaller than IF-XL).\n","authors":["Dongting Hu","Jierun Chen","Xijie Huang","Huseyin Coskun","Arpit Sahni","Aarush Gupta","Anujraaj Goyal","Dishani Lahiri","Rajesh Singh","Yerlan Idelbayev","Junli Cao","Yanyu Li","Kwang-Ting Cheng","S. -H. 
Gary Chan","Mingming Gong","Sergey Tulyakov","Anil Kag","Yanwu Xu","Jian Ren"],"pdf_url":"https://arxiv.org/pdf/2412.09619v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09618v1","updated":"2024-12-12T18:59:48Z","published":"2024-12-12T18:59:48Z","title":"EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via\n Multimodal LLM","summary":" Significant achievements in personalization of diffusion models have been\nwitnessed. Conventional tuning-free methods mostly encode multiple reference\nimages by averaging their image embeddings as the injection condition, but such\nan image-independent operation cannot perform interaction among images to\ncapture consistent visual elements within multiple references. Although the\ntuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent\nelements within multiple images through the training process, it necessitates\nspecific finetuning for each distinct image group. This paper introduces\nEasyRef, a novel plug-and-play adaptation method that enables diffusion models\nto be conditioned on multiple reference images and the text prompt. To\neffectively exploit consistent visual elements within multiple images, we\nleverage the multi-image comprehension and instruction-following capabilities\nof the multimodal large language model (MLLM), prompting it to capture\nconsistent visual elements based on the instruction. Besides, injecting the\nMLLM's representations into the diffusion process through adapters can easily\ngeneralize to unseen domains, mining the consistent visual elements within\nunseen data. To mitigate computational costs and enhance fine-grained detail\npreservation, we introduce an efficient reference aggregation strategy and a\nprogressive training scheme. Finally, we introduce MRBench, a new\nmulti-reference image generation benchmark. Experimental results demonstrate\nEasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based\nmethods like LoRA, achieving superior aesthetic quality and robust zero-shot\ngeneralization across diverse domains.\n","authors":["Zhuofan Zong","Dongzhi Jiang","Bingqi Ma","Guanglu Song","Hao Shao","Dazhong Shen","Yu Liu","Hongsheng Li"],"pdf_url":"https://arxiv.org/pdf/2412.09618v1.pdf","comment":"Tech report"},{"id":"http://arxiv.org/abs/2412.09616v1","updated":"2024-12-12T18:59:46Z","published":"2024-12-12T18:59:46Z","title":"V2PE: Improving Multimodal Long-Context Capability of Vision-Language\n Models with Variable Visual Position Encoding","summary":" Vision-Language Models (VLMs) have shown promising capabilities in handling\nvarious multimodal tasks, yet they struggle in long-context scenarios,\nparticularly in tasks involving videos, high-resolution images, or lengthy\nimage-text documents. In our work, we first conduct an empirical analysis of\nthe long-context capabilities of VLMs using our augmented long-context\nmultimodal datasets. Our findings reveal that directly applying the positional\nencoding mechanism used for textual tokens to visual tokens is suboptimal, and\nVLM performance degrades sharply when the position encoding exceeds the model's\ncontext window. To address this, we propose Variable Visual Position Encoding\n(V2PE), a novel positional encoding approach that employs variable and smaller\nincrements for visual tokens, enabling more efficient management of long\nmultimodal sequences. 
Our experiments demonstrate the effectiveness of V2PE in\nenhancing VLMs' ability to effectively understand and reason over long\nmultimodal contexts. We further integrate V2PE with our augmented long-context\nmultimodal datasets to fine-tune the open-source VLM, InternVL2. The fine-tuned\nmodel achieves strong performance on both standard and long-context multimodal\ntasks. Notably, when the sequence length of the training dataset is increased\nto 256K tokens, the model is capable of processing multimodal sequences up to\n1M tokens, highlighting its potential for real-world long-context applications.\n","authors":["Junqi Ge","Ziyi Chen","Jintao Lin","Jinguo Zhu","Xihui Liu","Jifeng Dai","Xizhou Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.09616v1.pdf","comment":"The code and models will be available at\n https://github.com/OpenGVLab/V2PE"},{"id":"http://arxiv.org/abs/2412.09614v1","updated":"2024-12-12T18:59:41Z","published":"2024-12-12T18:59:41Z","title":"Context Canvas: Enhancing Text-to-Image Diffusion Models with Knowledge\n Graph-Based RAG","summary":" We introduce a novel approach to enhance the capabilities of text-to-image\nmodels by incorporating a graph-based RAG. Our system dynamically retrieves\ndetailed character information and relational data from the knowledge graph,\nenabling the generation of visually accurate and contextually rich images. This\ncapability significantly improves upon the limitations of existing T2I models,\nwhich often struggle with the accurate depiction of complex or culturally\nspecific subjects due to dataset constraints. Furthermore, we propose a novel\nself-correcting mechanism for text-to-image models to ensure consistency and\nfidelity in visual outputs, leveraging the rich context from the graph to guide\ncorrections. Our qualitative and quantitative experiments demonstrate that\nContext Canvas significantly enhances the capabilities of popular models such\nas Flux, Stable Diffusion, and DALL-E, and improves the functionality of\nControlNet for fine-grained image editing tasks. To our knowledge, Context\nCanvas represents the first application of graph-based RAG in enhancing T2I\nmodels, representing a significant advancement for producing high-fidelity,\ncontext-aware multi-faceted images.\n","authors":["Kavana Venkatesh","Yusuf Dalva","Ismini Lourentzou","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2412.09614v1.pdf","comment":"Project Page: https://context-canvas.github.io/"},{"id":"http://arxiv.org/abs/2412.09611v1","updated":"2024-12-12T18:59:40Z","published":"2024-12-12T18:59:40Z","title":"FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers","summary":" Rectified flow models have emerged as a dominant approach in image\ngeneration, showcasing impressive capabilities in high-quality image synthesis.\nHowever, despite their effectiveness in visual generation, rectified flow\nmodels often struggle with disentangled editing of images. This limitation\nprevents the ability to perform precise, attribute-specific modifications\nwithout affecting unrelated aspects of the image. In this paper, we introduce\nFluxSpace, a domain-agnostic image editing method leveraging a representation\nspace with the ability to control the semantics of images generated by\nrectified flow transformers, such as Flux. 
By leveraging the representations\nlearned by the transformer blocks within the rectified flow models, we propose\na set of semantically interpretable representations that enable a wide range of\nimage editing tasks, from fine-grained image editing to artistic creation. This\nwork offers a scalable and effective image editing approach, along with its\ndisentanglement capabilities.\n","authors":["Yusuf Dalva","Kavana Venkatesh","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2412.09611v1.pdf","comment":"Project Page: https://fluxspace.github.io"},{"id":"http://arxiv.org/abs/2412.09612v1","updated":"2024-12-12T18:59:40Z","published":"2024-12-12T18:59:40Z","title":"Olympus: A Universal Task Router for Computer Vision Tasks","summary":" We introduce Olympus, a new approach that transforms Multimodal Large\nLanguage Models (MLLMs) into a unified framework capable of handling a wide\narray of computer vision tasks. Utilizing a controller MLLM, Olympus delegates\nover 20 specialized tasks across images, videos, and 3D objects to dedicated\nmodules. This instruction-based routing enables complex workflows through\nchained actions without the need for training heavy generative models. Olympus\neasily integrates with existing MLLMs, expanding their capabilities with\ncomparable performance. Experimental results demonstrate that Olympus achieves\nan average routing accuracy of 94.75% across 20 tasks and precision of 91.82%\nin chained action scenarios, showcasing its effectiveness as a universal task\nrouter that can solve a diverse range of computer vision tasks. Project page:\nhttps://github.com/yuanze-lin/Olympus_page\n","authors":["Yuanze Lin","Yunsheng Li","Dongdong Chen","Weijian Xu","Ronald Clark","Philip H. S. Torr"],"pdf_url":"https://arxiv.org/pdf/2412.09612v1.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2412.09613v1","updated":"2024-12-12T18:59:40Z","published":"2024-12-12T18:59:40Z","title":"PVC: Progressive Visual Token Compression for Unified Image and Video\n Processing in Large Vision-Language Models","summary":" Large Vision-Language Models (VLMs) have been extended to understand both\nimages and videos. Visual token compression is leveraged to reduce the\nconsiderable token length of visual inputs. To meet the needs of different\ntasks, existing high-performance models usually process images and videos\nseparately with different token compression strategies, limiting the\ncapabilities of combining images and videos. To this end, we extend each image\ninto a \"static\" video and introduce a unified token compression strategy called\nProgressive Visual Token Compression (PVC), where the tokens of each frame are\nprogressively encoded and adaptively compressed to supplement the information\nnot extracted from previous frames. Video tokens are efficiently compressed\nby exploiting the inherent temporal redundancy. Images are repeated as static\nvideos, and the spatial details can be gradually supplemented in multiple\nframes. PVC unifies the token compression of images and videos. With a limited\nnumber of tokens per frame (64 tokens by default), spatial details and temporal\nchanges can still be preserved. Experiments show that our model achieves\nstate-of-the-art performance across various video understanding benchmarks,\nincluding long video tasks and fine-grained short video tasks. 
Meanwhile, our\nunified token compression strategy incurs no performance loss on image\nbenchmarks, particularly in detail-sensitive tasks.\n","authors":["Chenyu Yang","Xuan Dong","Xizhou Zhu","Weijie Su","Jiahao Wang","Hao Tian","Zhe Chen","Wenhai Wang","Lewei Lu","Jifeng Dai"],"pdf_url":"https://arxiv.org/pdf/2412.09613v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09608v1","updated":"2024-12-12T18:59:34Z","published":"2024-12-12T18:59:34Z","title":"Representing Long Volumetric Video with Temporal Gaussian Hierarchy","summary":" This paper aims to address the challenge of reconstructing long volumetric\nvideos from multi-view RGB videos. Recent dynamic view synthesis methods\nleverage powerful 4D representations, like feature grids or point cloud\nsequences, to achieve high-quality rendering results. However, they are\ntypically limited to short (1~2s) video clips and often suffer from large\nmemory footprints when dealing with longer videos. To solve this issue, we\npropose a novel 4D representation, named Temporal Gaussian Hierarchy, to\ncompactly model long volumetric videos. Our key observation is that there are\ngenerally various degrees of temporal redundancy in dynamic scenes, which\nconsist of areas changing at different speeds. Motivated by this, our approach\nbuilds a multi-level hierarchy of 4D Gaussian primitives, where each level\nseparately describes scene regions with different degrees of content change,\nand adaptively shares Gaussian primitives to represent unchanged scene content\nover different temporal segments, thus effectively reducing the number of\nGaussian primitives. In addition, the tree-like structure of the Gaussian\nhierarchy allows us to efficiently represent the scene at a particular moment\nwith a subset of Gaussian primitives, leading to nearly constant GPU memory\nusage during the training or rendering regardless of the video length.\nExtensive experimental results demonstrate the superiority of our method over\nalternative methods in terms of training cost, rendering speed, and storage\nusage. To our knowledge, this work is the first approach capable of efficiently\nhandling minutes of volumetric video data while maintaining state-of-the-art\nrendering quality. Our project page is available at:\nhttps://zju3dv.github.io/longvolcap.\n","authors":["Zhen Xu","Yinghao Xu","Zhiyuan Yu","Sida Peng","Jiaming Sun","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.09608v1.pdf","comment":"SIGGRAPH Asia 2024 (TOG). Project page:\n https://zju3dv.github.io/longvolcap"},{"id":"http://arxiv.org/abs/2412.09607v1","updated":"2024-12-12T18:59:31Z","published":"2024-12-12T18:59:31Z","title":"Spectral Image Tokenizer","summary":" Image tokenizers map images to sequences of discrete tokens, and are a\ncrucial component of autoregressive transformer-based image generation. The\ntokens are typically associated with spatial locations in the input image,\narranged in raster scan order, which is not ideal for autoregressive modeling.\nIn this paper, we propose to tokenize the image spectrum instead, obtained from\na discrete wavelet transform (DWT), such that the sequence of tokens represents\nthe image in a coarse-to-fine fashion. 
Our tokenizer brings several advantages:\n1) it leverages that natural images are more compressible at high frequencies,\n2) it can take and reconstruct images of different resolutions without\nretraining, 3) it improves the conditioning for next-token prediction --\ninstead of conditioning on a partial line-by-line reconstruction of the image,\nit takes a coarse reconstruction of the full image, 4) it enables partial\ndecoding where the first few generated tokens can reconstruct a coarse version\nof the image, 5) it enables autoregressive models to be used for image\nupsampling. We evaluate the tokenizer reconstruction metrics as well as\nmultiscale image generation, text-guided image upsampling and editing.\n","authors":["Carlos Esteves","Mohammed Suhail","Ameesh Makadia"],"pdf_url":"https://arxiv.org/pdf/2412.09607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09606v1","updated":"2024-12-12T18:59:28Z","published":"2024-12-12T18:59:28Z","title":"Feat2GS: Probing Visual Foundation Models with Gaussian Splatting","summary":" Given that visual foundation models (VFMs) are trained on extensive datasets\nbut often limited to 2D images, a natural question arises: how well do they\nunderstand the 3D world? With the differences in architecture and training\nprotocols (i.e., objectives, proxy tasks), a unified framework to fairly and\ncomprehensively probe their 3D awareness is urgently needed. Existing works on\n3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or\ntwo-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately,\nthese tasks ignore texture awareness, and require 3D data as ground-truth,\nwhich limits the scale and diversity of their evaluation set. To address these\nissues, we introduce Feat2GS, which readout 3D Gaussians attributes from VFM\nfeatures extracted from unposed images. This allows us to probe 3D awareness\nfor geometry and texture via novel view synthesis, without requiring 3D data.\nAdditionally, the disentanglement of 3DGS parameters - geometry\n($\\boldsymbol{x}, \\alpha, \\Sigma$) and texture ($\\boldsymbol{c}$) - enables\nseparate analysis of texture and geometry awareness. Under Feat2GS, we conduct\nextensive experiments to probe the 3D awareness of several VFMs, and\ninvestigate the ingredients that lead to a 3D aware VFM. Building on these\nfindings, we develop several variants that achieve state-of-the-art across\ndiverse datasets. This makes Feat2GS useful for probing VFMs, and as a\nsimple-yet-effective baseline for novel-view synthesis. Code and data will be\nmade available at https://fanegg.github.io/Feat2GS/.\n","authors":["Yue Chen","Xingyu Chen","Anpei Chen","Gerard Pons-Moll","Yuliang Xiu"],"pdf_url":"https://arxiv.org/pdf/2412.09606v1.pdf","comment":"Project Page: https://fanegg.github.io/Feat2GS/"},{"id":"http://arxiv.org/abs/2412.09604v1","updated":"2024-12-12T18:59:26Z","published":"2024-12-12T18:59:26Z","title":"SynerGen-VL: Towards Synergistic Image Understanding and Generation with\n Vision Experts and Token Folding","summary":" The remarkable success of Large Language Models (LLMs) has extended to the\nmultimodal domain, achieving outstanding performance in image understanding and\ngeneration. Recent efforts to develop unified Multimodal Large Language Models\n(MLLMs) that integrate these capabilities have shown promising results.\nHowever, existing approaches often involve complex designs in model\narchitecture or training pipeline, increasing the difficulty of model training\nand scaling. 
In this paper, we propose SynerGen-VL, a simple yet powerful\nencoder-free MLLM capable of both image understanding and generation. To\naddress challenges identified in existing encoder-free unified MLLMs, we\nintroduce the token folding mechanism and the vision-expert-based progressive\nalignment pretraining strategy, which effectively support high-resolution image\nunderstanding while reducing training complexity. After being trained on\nlarge-scale mixed image-text data with a unified next-token prediction\nobjective, SynerGen-VL achieves or surpasses the performance of existing\nencoder-free unified MLLMs with comparable or smaller parameter sizes, and\nnarrows the gap with task-specific state-of-the-art models, highlighting a\npromising path toward future unified MLLMs. Our code and models shall be\nreleased.\n","authors":["Hao Li","Changyao Tian","Jie Shao","Xizhou Zhu","Zhaokai Wang","Jinguo Zhu","Wenhan Dou","Xiaogang Wang","Hongsheng Li","Lewei Lu","Jifeng Dai"],"pdf_url":"https://arxiv.org/pdf/2412.09604v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09603v1","updated":"2024-12-12T18:59:25Z","published":"2024-12-12T18:59:25Z","title":"Do Multimodal Large Language Models See Like Humans?","summary":" Multimodal Large Language Models (MLLMs) have achieved impressive results on\nvarious vision tasks, leveraging recent advancements in large language models.\nHowever, a critical question remains unaddressed: do MLLMs perceive visual\ninformation similarly to humans? Current benchmarks lack the ability to\nevaluate MLLMs from this perspective. To address this challenge, we introduce\nHVSBench, a large-scale benchmark designed to assess the alignment between\nMLLMs and the human visual system (HVS) on fundamental vision tasks that mirror\nhuman vision. HVSBench curated over 85K multimodal samples, spanning 13\ncategories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing,\nFree-Viewing, and Searching. Extensive experiments demonstrate the\neffectiveness of our benchmark in providing a comprehensive evaluation of\nMLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models\nshow significant room for improvement, with most achieving only moderate\nresults. Our experiments reveal that HVSBench presents a new and significant\nchallenge for cutting-edge MLLMs. We believe that HVSBench will facilitate\nresearch on human-aligned and explainable MLLMs, marking a key step in\nunderstanding how MLLMs perceive and process visual information.\n","authors":["Jiaying Lin","Shuquan Ye","Rynson W. H. Lau"],"pdf_url":"https://arxiv.org/pdf/2412.09603v1.pdf","comment":"Project page: https://jiaying.link/HVSBench/"},{"id":"http://arxiv.org/abs/2412.09602v1","updated":"2024-12-12T18:59:13Z","published":"2024-12-12T18:59:13Z","title":"Hidden Biases of End-to-End Driving Datasets","summary":" End-to-end driving systems have made rapid progress, but have so far not been\napplied to the challenging new CARLA Leaderboard 2.0. Further, while there is a\nlarge body of literature on end-to-end architectures and training strategies,\nthe impact of the training dataset is often overlooked. In this work, we make a\nfirst attempt at end-to-end driving for Leaderboard 2.0. Instead of\ninvestigating architectures, we systematically analyze the training dataset,\nleading to new insights: (1) Expert style significantly affects downstream\npolicy performance. 
(2) In complex data sets, the frames should not be weighted\non the basis of simplistic criteria such as class frequencies. (3) Instead,\nestimating whether a frame changes the target labels compared to previous\nframes can reduce the size of the dataset without removing important\ninformation. By incorporating these findings, our model ranks first and second\nrespectively on the map and sensors tracks of the 2024 CARLA Challenge, and\nsets a new state-of-the-art on the Bench2Drive test routes. Finally, we uncover\na design flaw in the current evaluation metrics and propose a modification for\nfuture challenges. Our dataset, code, and pre-trained models are publicly\navailable at https://github.com/autonomousvision/carla_garage.\n","authors":["Julian Zimmerlin","Jens Beißwenger","Bernhard Jaeger","Andreas Geiger","Kashyap Chitta"],"pdf_url":"https://arxiv.org/pdf/2412.09602v1.pdf","comment":"Technical report for the CVPR 2024 Workshop on Foundation Models for\n Autonomous Systems. Runner-up of the track 'CARLA Autonomous Driving\n Challenge' in the 2024 Autonomous Grand Challenge\n (https://opendrivelab.com/challenge2024/)"},{"id":"http://arxiv.org/abs/2412.09601v1","updated":"2024-12-12T18:59:11Z","published":"2024-12-12T18:59:11Z","title":"TimeRefine: Temporal Grounding with Time Refining Video LLM","summary":" Video temporal grounding aims to localize relevant temporal boundaries in a\nvideo given a textual prompt. Recent work has focused on enabling Video LLMs to\nperform video temporal grounding via next-token prediction of temporal\ntimestamps. However, accurately localizing timestamps in videos remains\nchallenging for Video LLMs when relying solely on temporal token prediction.\nOur proposed TimeRefine addresses this challenge in two ways. First, instead of\ndirectly predicting the start and end timestamps, we reformulate the temporal\ngrounding task as a temporal refining task: the model first makes rough\npredictions and then refines them by predicting offsets to the target segment.\nThis refining process is repeated multiple times, through which the model\nprogressively self-improves its temporal localization accuracy. Second, to\nenhance the model's temporal perception capabilities, we incorporate an\nauxiliary prediction head that penalizes the model more if a predicted segment\ndeviates further from the ground truth, thus encouraging the model to make\ncloser and more accurate predictions. Our plug-and-play method can be\nintegrated into most LLM-based temporal grounding approaches. The experimental\nresults demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on\nthe ActivityNet and Charades-STA datasets, respectively. Code and pretrained\nmodels will be released.\n","authors":["Xizi Wang","Feng Cheng","Ziyang Wang","Huiyu Wang","Md Mohaiminul Islam","Lorenzo Torresani","Mohit Bansal","Gedas Bertasius","David Crandall"],"pdf_url":"https://arxiv.org/pdf/2412.09601v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09600v1","updated":"2024-12-12T18:59:01Z","published":"2024-12-12T18:59:01Z","title":"Owl-1: Omni World Model for Consistent Long Video Generation","summary":" Video generation models (VGMs) have received extensive attention recently and\nserve as promising candidates for general-purpose large vision models. While\nthey can only generate short videos each time, existing methods achieve long\nvideo generation by iteratively calling the VGMs, using the last-frame output\nas the condition for the next-round generation. 
However, the last frame only\ncontains short-term fine-grained information about the scene, resulting in\ninconsistency in the long horizon. To address this, we propose an Omni World\nmodeL (Owl-1) to produce long-term coherent and comprehensive conditions for\nconsistent long video generation. As videos are observations of the underlying\nevolving world, we propose to model the long-term developments in a latent\nspace and use VGMs to film them into videos. Specifically, we represent the\nworld with a latent state variable which can be decoded into explicit video\nobservations. These observations serve as a basis for anticipating temporal\ndynamics which in turn update the state variable. The interaction between\nevolving dynamics and persistent state enhances the diversity and consistency\nof the long videos. Extensive experiments show that Owl-1 achieves comparable\nperformance with SOTA methods on VBench-I2V and VBench-Long, validating its\nability to generate high-quality video observations. Code:\nhttps://github.com/huang-yh/Owl.\n","authors":["Yuanhui Huang","Wenzhao Zheng","Yuan Gao","Xin Tao","Pengfei Wan","Di Zhang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.09600v1.pdf","comment":"Code is available at: https://github.com/huang-yh/Owl"},{"id":"http://arxiv.org/abs/2412.09599v1","updated":"2024-12-12T18:59:00Z","published":"2024-12-12T18:59:00Z","title":"RatBodyFormer: Rodent Body Surface from Keypoints","summary":" Rat behavior modeling goes to the heart of many scientific studies, yet the\ntextureless body surface evades automatic analysis as it literally has no\nkeypoints that detectors can find. The movement of the body surface, however,\nis a rich source of information for deciphering the rat behavior. We introduce\ntwo key contributions to automatically recover densely 3D sampled rat body\nsurface points, passively. The first is RatDome, a novel multi-camera system\nfor rat behavior capture, and a large-scale dataset captured with it that\nconsists of pairs of 3D keypoints and 3D body surface points. The second is\nRatBodyFormer, a novel network to transform detected keypoints to 3D body\nsurface points. RatBodyFormer is agnostic to the exact locations of the 3D body\nsurface points in the training data and is trained with masked-learning. We\nexperimentally validate our framework with a number of real-world experiments.\nOur results collectively serve as a novel foundation for automated rat behavior\nanalysis and will likely have far-reaching implications for biomedical and\nneuroscientific research.\n","authors":["Ayaka Higami","Karin Oshima","Tomoyo Isoguchi Shiramatsu","Hirokazu Takahashi","Shohei Nobuhara","Ko Nishino"],"pdf_url":"https://arxiv.org/pdf/2412.09599v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09597v1","updated":"2024-12-12T18:58:42Z","published":"2024-12-12T18:58:42Z","title":"LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video\n Generation Priors","summary":" Single-image 3D reconstruction remains a fundamental challenge in computer\nvision due to inherent geometric ambiguities and limited viewpoint information.\nRecent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D\npriors learned from large-scale video data. However, leveraging these priors\neffectively faces three key challenges: (1) degradation in quality across large\ncamera motions, (2) difficulties in achieving precise camera control, and (3)\ngeometric distortions inherent to the diffusion process that damage 3D\nconsistency. 
We address these challenges by proposing LiftImage3D, a framework\nthat effectively releases LVDMs' generative priors while ensuring 3D\nconsistency. Specifically, we design an articulated trajectory strategy to\ngenerate video frames, which decomposes video sequences with large camera\nmotions into ones with controllable small motions. Then we use robust neural\nmatching models, i.e. MASt3R, to calibrate the camera poses of generated frames\nand produce corresponding point clouds. Finally, we propose a distortion-aware\n3D Gaussian splatting representation, which can learn independent distortions\nbetween frames and output undistorted canonical Gaussians. Extensive\nexperiments demonstrate that LiftImage3D achieves state-of-the-art performance\non three challenging datasets, i.e. LLFF, DL3DV, and Tanks and Temples, and\ngeneralizes well to diverse in-the-wild images, from cartoon illustrations to\ncomplex real-world scenes.\n","authors":["Yabo Chen","Chen Yang","Jiemin Fang","Xiaopeng Zhang","Lingxi Xie","Wei Shen","Wenrui Dai","Hongkai Xiong","Qi Tian"],"pdf_url":"https://arxiv.org/pdf/2412.09597v1.pdf","comment":"Project page: https://liftimage3d.github.io/"},{"id":"http://arxiv.org/abs/2406.09390v2","updated":"2024-12-12T18:58:34Z","published":"2024-06-13T17:59:05Z","title":"LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living","summary":" Current Large Language Vision Models (LLVMs) trained on web videos perform\nwell in general video understanding but struggle with fine-grained details,\ncomplex human-object interactions (HOI), and view-invariant representation\nlearning essential for Activities of Daily Living (ADL). This limitation stems\nfrom a lack of specialized ADL video instruction-tuning datasets and\ninsufficient modality integration to capture discriminative action\nrepresentations. To address this, we propose a semi-automated framework for\ncurating ADL datasets, creating ADL-X, a multiview, multimodal RGBS\ninstruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM\nintegrating videos, 3D skeletons, and HOIs to model ADL's complex\nspatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of\nall modalities yields suboptimal results; thus, we propose a Multimodal\nProgressive (MMPro) training strategy, incorporating modalities in stages\nfollowing a curriculum. We also establish ADL MCQ and video description\nbenchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL\nachieves state-of-the-art performance across ADL benchmarks. Code and data will\nbe made publicly available at: https://adl-x.github.io/.\n","authors":["Dominick Reilly","Rajatsubhra Chakraborty","Arkaprava Sinha","Manish Kumar Govind","Pu Wang","Francois Bremond","Le Xue","Srijan Das"],"pdf_url":"https://arxiv.org/pdf/2406.09390v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09596v1","updated":"2024-12-12T18:58:30Z","published":"2024-12-12T18:58:30Z","title":"InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for\n Long-term Streaming Video and Audio Interactions","summary":" Creating AI systems that can interact with environments over long periods,\nsimilar to human cognition, has been a longstanding research goal. Recent\nadvancements in multimodal large language models (MLLMs) have made significant\nstrides in open-world understanding. However, the challenge of continuous and\nsimultaneous streaming perception, memory, and reasoning remains largely\nunexplored. 
Current MLLMs are constrained by their sequence-to-sequence\narchitecture, which limits their ability to process inputs and generate\nresponses simultaneously, akin to being unable to think while perceiving.\nFurthermore, relying on long contexts to store historical data is impractical\nfor long-term interactions, as retaining all information becomes costly and\ninefficient. Therefore, rather than relying on a single foundation model to\nperform all functions, this project draws inspiration from the concept of the\nSpecialized Generalist AI and introduces disentangled streaming perception,\nreasoning, and memory mechanisms, enabling real-time interaction with streaming\nvideo and audio input. The proposed framework InternLM-XComposer2.5-OmniLive\n(IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module:\nProcesses multimodal information in real-time, storing key details in memory\nand triggering reasoning in response to user queries. (2) Multi-modal Long\nMemory Module: Integrates short-term and long-term memory, compressing\nshort-term memories into long-term ones for efficient retrieval and improved\naccuracy. (3) Reasoning Module: Responds to queries and executes reasoning\ntasks, coordinating with the perception and memory modules. This project\nsimulates human-like cognition, enabling multimodal large language models to\nprovide continuous and adaptive service over time.\n","authors":["Pan Zhang","Xiaoyi Dong","Yuhang Cao","Yuhang Zang","Rui Qian","Xilin Wei","Lin Chen","Yifei Li","Junbo Niu","Shuangrui Ding","Qipeng Guo","Haodong Duan","Xin Chen","Han Lv","Zheng Nie","Min Zhang","Bin Wang","Wenwei Zhang","Xinyue Zhang","Jiaye Ge","Wei Li","Jingwen Li","Zhongying Tu","Conghui He","Xingcheng Zhang","Kai Chen","Yu Qiao","Dahua Lin","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09596v1.pdf","comment":"Github Repo:\n https://github.com/InternLM/InternLM-XComposer/tree/main/InternLM-XComposer-2.5-OmniLive"},{"id":"http://arxiv.org/abs/2412.09593v1","updated":"2024-12-12T18:58:09Z","published":"2024-12-12T18:58:09Z","title":"Neural LightRig: Unlocking Accurate Object Normal and Material\n Estimation with Multi-Light Diffusion","summary":" Recovering the geometry and materials of objects from a single image is\nchallenging due to its under-constrained nature. In this paper, we present\nNeural LightRig, a novel framework that boosts intrinsic estimation by\nleveraging auxiliary multi-lighting conditions from 2D diffusion priors.\nSpecifically, 1) we first leverage illumination priors from large-scale\ndiffusion models to build our multi-light diffusion model on a synthetic\nrelighting dataset with dedicated designs. This diffusion model generates\nmultiple consistent images, each illuminated by point light sources in\ndifferent directions. 2) By using these varied lighting images to reduce\nestimation uncertainty, we train a large G-buffer model with a U-Net backbone\nto accurately predict surface normals and materials. Extensive experiments\nvalidate that our approach significantly outperforms state-of-the-art methods,\nenabling accurate surface normal and PBR material estimation with vivid\nrelighting effects. 
Code and dataset are available on our project page at\nhttps://projects.zxhezexin.com/neural-lightrig.\n","authors":["Zexin He","Tengfei Wang","Xin Huang","Xingang Pan","Ziwei Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09593v1.pdf","comment":"Project page: https://projects.zxhezexin.com/neural-lightrig"},{"id":"http://arxiv.org/abs/2412.09586v1","updated":"2024-12-12T18:55:30Z","published":"2024-12-12T18:55:30Z","title":"Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders","summary":" We address the problem of gaze target estimation, which aims to predict where\na person is looking in a scene. Predicting a person's gaze target requires\nreasoning both about the person's appearance and the contents of the scene.\nPrior works have developed increasingly complex, hand-crafted pipelines for\ngaze target estimation that carefully fuse features from separate scene\nencoders, head encoders, and auxiliary models for signals like depth and pose.\nMotivated by the success of general-purpose feature extractors on a variety of\nvisual tasks, we propose Gaze-LLE, a novel transformer framework that\nstreamlines gaze target estimation by leveraging features from a frozen DINOv2\nencoder. We extract a single feature representation for the scene, and apply a\nperson-specific positional prompt to decode gaze with a lightweight module. We\ndemonstrate state-of-the-art performance across several gaze benchmarks and\nprovide extensive analysis to validate our design choices. Our code is\navailable at: http://github.com/fkryan/gazelle .\n","authors":["Fiona Ryan","Ajay Bati","Sangmin Lee","Daniel Bolya","Judy Hoffman","James M. Rehg"],"pdf_url":"https://arxiv.org/pdf/2412.09586v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09585v1","updated":"2024-12-12T18:55:18Z","published":"2024-12-12T18:55:18Z","title":"OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary\n Embedding Distillation","summary":" The standard practice for developing contemporary MLLMs is to feed features\nfrom vision encoder(s) into the LLM and train with natural language\nsupervision. In this work, we posit an overlooked opportunity to optimize the\nintermediate LLM representations through a vision perspective (objective),\ni.e., solely natural language supervision is sub-optimal for the MLLM's visual\nunderstanding ability. To that end, we propose OLA-VLM, the first approach\ndistilling knowledge into the LLM's hidden representations from a set of target\nvisual representations. Firstly, we formulate the objective during the\npretraining stage in MLLMs as a coupled optimization of predictive visual\nembedding and next text-token prediction. Secondly, we investigate MLLMs\ntrained solely with natural language supervision and identify a positive\ncorrelation between the quality of visual representations within these models\nand their downstream performance. Moreover, upon probing our OLA-VLM, we\nobserve improved representation quality owing to the embedding optimization.\nThirdly, we demonstrate that our OLA-VLM outperforms the single and\nmulti-encoder baselines, proving our approach's superiority over explicitly\nfeeding the corresponding features to the LLM. Particularly, OLA-VLM boosts\nperformance by an average margin of up to 2.5% on various benchmarks, with a\nnotable improvement of 8.7% on the Depth task in CV-Bench. 
Our code is\nopen-sourced at https://github.com/SHI-Labs/OLA-VLM .\n","authors":["Jitesh Jain","Zhengyuan Yang","Humphrey Shi","Jianfeng Gao","Jianwei Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09585v1.pdf","comment":"Project Page: https://praeclarumjj3.github.io/ola_vlm/"},{"id":"http://arxiv.org/abs/2412.09582v1","updated":"2024-12-12T18:54:48Z","published":"2024-12-12T18:54:48Z","title":"Neptune: The Long Orbit to Benchmarking Long Video Understanding","summary":" This paper describes a semi-automatic pipeline to generate challenging\nquestion-answer-decoy sets for understanding long videos. Many existing video\ndatasets and models are focused on short clips (10s-30s). While some long video\ndatasets do exist, they can often be solved by powerful image models applied\nper frame (and often to very few frames) in a video, and are usually manually\nannotated at high cost. In order to mitigate both these problems, we propose a\nscalable dataset creation pipeline which leverages large models (VLMs and\nLLMs), to automatically generate dense, time-aligned video captions, as well as\ntough question answer decoy sets for video segments (up to 15 minutes in\nlength). Our dataset Neptune covers a broad range of long video reasoning\nabilities and consists of a subset that emphasizes multimodal reasoning. Since\nexisting metrics for open-ended question answering are either rule-based or may\nrely on proprietary models, we provide a new open source model-based metric GEM\nto score open-ended responses on Neptune. Benchmark evaluations reveal that\nmost current open-source long video models perform poorly on Neptune,\nparticularly on questions testing temporal ordering, counting and state\nchanges. Through Neptune, we aim to spur the development of more advanced\nmodels capable of understanding long videos. The dataset is available at\nhttps://github.com/google-deepmind/neptune\n","authors":["Arsha Nagrani","Mingda Zhang","Ramin Mehran","Rachel Hornung","Nitesh Bharadwaj Gundavarapu","Nilpa Jha","Austin Myers","Xingyi Zhou","Boqing Gong","Cordelia Schmid","Mikhail Sirotenko","Yukun Zhu","Tobias Weyand"],"pdf_url":"https://arxiv.org/pdf/2412.09582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09573v1","updated":"2024-12-12T18:52:53Z","published":"2024-12-12T18:52:53Z","title":"FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D\n Reconstruction","summary":" Existing sparse-view reconstruction models heavily rely on accurate known\ncamera poses. However, deriving camera extrinsics and intrinsics from\nsparse-view images presents significant challenges. In this work, we present\nFreeSplatter, a highly scalable, feed-forward reconstruction framework capable\nof generating high-quality 3D Gaussians from uncalibrated sparse-view images\nand recovering their camera parameters in mere seconds. FreeSplatter is built\nupon a streamlined transformer architecture, comprising sequential\nself-attention blocks that facilitate information exchange among multi-view\nimage tokens and decode them into pixel-wise 3D Gaussian primitives. The\npredicted Gaussian primitives are situated in a unified reference frame,\nallowing for high-fidelity 3D modeling and instant camera parameter estimation\nusing off-the-shelf solvers. To cater to both object-centric and scene-level\nreconstruction, we train two model variants of FreeSplatter on extensive\ndatasets. 
In both scenarios, FreeSplatter outperforms state-of-the-art\nbaselines in terms of reconstruction quality and pose estimation accuracy.\nFurthermore, we showcase FreeSplatter's potential in enhancing the productivity\nof downstream applications, such as text/image-to-3D content creation.\n","authors":["Jiale Xu","Shenghua Gao","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2412.09573v1.pdf","comment":"Project page: https://bluestyle97.github.io/projects/freesplatter/"},{"id":"http://arxiv.org/abs/2409.19069v3","updated":"2024-12-12T18:48:25Z","published":"2024-09-27T18:11:00Z","title":"Localizing Memorization in SSL Vision Encoders","summary":" Recent work on studying memorization in self-supervised learning (SSL)\nsuggests that even though SSL encoders are trained on millions of images, they\nstill memorize individual data points. While effort has been put into\ncharacterizing the memorized data and linking encoder memorization to\ndownstream utility, little is known about where the memorization happens inside\nSSL encoders. To close this gap, we propose two metrics for localizing\nmemorization in SSL encoders on a per-layer (layermem) and per-unit basis\n(unitmem). Our localization methods are independent of the downstream task, do\nnot require any label information, and can be performed in a forward pass. By\nlocalizing memorization in various encoder architectures (convolutional and\ntransformer-based) trained on diverse datasets with contrastive and\nnon-contrastive SSL frameworks, we find that (1) while SSL memorization\nincreases with layer depth, highly memorizing units are distributed across the\nentire encoder, (2) a significant fraction of units in SSL encoders experiences\nsurprisingly high memorization of individual data points, which is in contrast\nto models trained under supervision, (3) atypical (or outlier) data points\ncause much higher layer and unit memorization than standard data points, and\n(4) in vision transformers, most memorization happens in the fully-connected\nlayers. Finally, we show that localizing memorization in SSL has the potential\nto improve fine-tuning and to inform pruning strategies.\n","authors":["Wenhao Wang","Adam Dziedzic","Michael Backes","Franziska Boenisch"],"pdf_url":"https://arxiv.org/pdf/2409.19069v3.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.08486v2","updated":"2024-12-12T18:43:39Z","published":"2024-12-11T15:51:14Z","title":"Learning Flow Fields in Attention for Controllable Person Image\n Generation","summary":" Controllable person image generation aims to generate a person image\nconditioned on reference images, allowing precise control over the person's\nappearance or pose. However, prior methods often distort fine-grained textural\ndetails from the reference image, despite achieving high overall image quality.\nWe attribute these distortions to inadequate attention to corresponding regions\nin the reference image. To address this, we thereby propose learning flow\nfields in attention (Leffa), which explicitly guides the target query to attend\nto the correct reference key in the attention layer during training.\nSpecifically, it is realized via a regularization loss on top of the attention\nmap within a diffusion-based baseline. Our extensive experiments show that\nLeffa achieves state-of-the-art performance in controlling appearance (virtual\ntry-on) and pose (pose transfer), significantly reducing fine-grained detail\ndistortion while maintaining high image quality. 
Additionally, we show that our\nloss is model-agnostic and can be used to improve the performance of other\ndiffusion models.\n","authors":["Zijian Zhou","Shikun Liu","Xiao Han","Haozhe Liu","Kam Woh Ng","Tian Xie","Yuren Cong","Hang Li","Mengmeng Xu","Juan-Manuel Pérez-Rúa","Aditya Patel","Tao Xiang","Miaojing Shi","Sen He"],"pdf_url":"https://arxiv.org/pdf/2412.08486v2.pdf","comment":"github: https://github.com/franciszzj/Leffa, demo:\n https://huggingface.co/spaces/franciszzj/Leffa, model:\n https://huggingface.co/franciszzj/Leffa"},{"id":"http://arxiv.org/abs/2405.04211v3","updated":"2024-12-12T18:42:37Z","published":"2024-05-07T11:24:37Z","title":"Leveraging Medical Foundation Model Features in Graph Neural\n Network-Based Retrieval of Breast Histopathology Images","summary":" Breast cancer is the most common cancer type in women worldwide. Early\ndetection and appropriate treatment can significantly reduce its impact. While\nhistopathology examinations play a vital role in rapid and accurate diagnosis,\nthey often require experienced medical experts for proper recognition and\ncancer grading. Automated image retrieval systems have the potential to assist\npathologists in identifying cancerous tissues, thereby accelerating the\ndiagnostic process. Nevertheless, proposing an accurate image retrieval model\nis challenging due to considerable variability among the tissue and cell\npatterns in histological images. In this work, we leverage the features from\nfoundation models in a novel attention-based adversarially regularized\nvariational graph autoencoder model for breast histological image retrieval.\nOur results confirm the superior performance of models trained with foundation\nmodel features compared to those using pre-trained convolutional neural\nnetworks (up to 7.7% and 15.5% for mAP and mMV, respectively), with the\npre-trained general-purpose self-supervised model for computational pathology\n(UNI) delivering the best overall performance. By evaluating two publicly\navailable histology image datasets of breast cancer, our top-performing model,\ntrained with UNI features, achieved average mAP/mMV scores of 96.7%/91.5% and\n97.6%/94.2% for the BreakHis and BACH datasets, respectively. Our proposed\nretrieval model has the potential to be used in clinical settings to enhance\ndiagnostic performance and ultimately benefit patients.\n","authors":["Nematollah Saeidi","Hossein Karshenas","Bijan Shoushtarian","Sepideh Hatamikia","Ramona Woitek","Amirreza Mahbod"],"pdf_url":"https://arxiv.org/pdf/2405.04211v3.pdf","comment":"29 pages"},{"id":"http://arxiv.org/abs/2412.09551v1","updated":"2024-12-12T18:41:20Z","published":"2024-12-12T18:41:20Z","title":"Video Creation by Demonstration","summary":" We explore a novel video creation experience, namely Video Creation by\nDemonstration. Given a demonstration video and a context image from a different\nscene, we generate a physically plausible video that continues naturally from\nthe context image and carries out the action concepts from the demonstration.\nTo enable this capability, we present $\\delta$-Diffusion, a self-supervised\ntraining approach that learns from unlabeled videos by conditional future frame\nprediction. Unlike most existing video generation controls that are based on\nexplicit signals, we adopt the form of implicit latent control for maximal\nflexibility and expressiveness required by general videos. 
By leveraging a\nvideo foundation model with an appearance bottleneck design on top, we extract\naction latents from demonstration videos for conditioning the generation\nprocess with minimal appearance leakage. Empirically, $\\delta$-Diffusion\noutperforms related baselines in terms of both human preference and large-scale\nmachine evaluations, and demonstrates potentials towards interactive world\nsimulation. Sampled video generation results are available at\nhttps://delta-diffusion.github.io/.\n","authors":["Yihong Sun","Hao Zhou","Liangzhe Yuan","Jennifer J. Sun","Yandong Li","Xuhui Jia","Hartwig Adam","Bharath Hariharan","Long Zhao","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09551v1.pdf","comment":"Project page at https://delta-diffusion.github.io/"},{"id":"http://arxiv.org/abs/2412.09549v1","updated":"2024-12-12T18:40:20Z","published":"2024-12-12T18:40:20Z","title":"Exemplar Masking for Multimodal Incremental Learning","summary":" Multimodal incremental learning needs to digest the information from multiple\nmodalities while concurrently learning new knowledge without forgetting the\npreviously learned information. There are numerous challenges for this task,\nmainly including the larger storage size of multimodal data in exemplar-based\nmethods and the computational requirement of finetuning on huge multimodal\nmodels. In this paper, we leverage the parameter-efficient tuning scheme to\nreduce the burden of fine-tuning and propose the exemplar masking framework to\nefficiently replay old knowledge. Specifically, the non-important tokens are\nmasked based on the attention weights and the correlation across different\nmodalities, significantly reducing the storage size of an exemplar and\nconsequently saving more exemplars under the same memory buffer. Moreover, we\ndesign a multimodal data augmentation technique to diversify exemplars for\nreplaying prior knowledge. In experiments, we not only evaluate our method in\nexisting multimodal datasets but also extend the ImageNet-R dataset to a\nmultimodal dataset as a real-world application, where captions are generated by\nquerying multimodal large language models (e.g., InstructBLIP). Extensive\nexperiments show that our exemplar masking framework is more efficient and\nrobust to catastrophic forgetting under the same limited memory buffer. Code is\navailable at https://github.com/YiLunLee/Exemplar_Masking_MCIL.\n","authors":["Yi-Lun Lee","Chen-Yu Lee","Wei-Chen Chiu","Yi-Hsuan Tsai"],"pdf_url":"https://arxiv.org/pdf/2412.09549v1.pdf","comment":"Project page: https://github.com/YiLunLee/Exemplar_Masking_MCIL"},{"id":"http://arxiv.org/abs/2412.09548v1","updated":"2024-12-12T18:38:42Z","published":"2024-12-12T18:38:42Z","title":"Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale","summary":" Meshes are fundamental representations of 3D surfaces. However, creating\nhigh-quality meshes is a labor-intensive task that requires significant time\nand expertise in 3D modeling. While a delicate object often requires over\n$10^4$ faces to be accurately modeled, recent attempts at generating\nartist-like meshes are limited to $1.6$K faces and heavy discretization of\nvertex coordinates. Hence, scaling both the maximum face count and vertex\ncoordinate resolution is crucial to producing high-quality meshes of realistic,\ncomplex 3D objects. 
We present Meshtron, a novel autoregressive mesh generation\nmodel able to generate meshes with up to 64K faces at 1024-level coordinate\nresolution --over an order of magnitude higher face count and $8{\\times}$\nhigher coordinate resolution than current state-of-the-art methods. Meshtron's\nscalability is driven by four key components: (1) an hourglass neural\narchitecture, (2) truncated sequence training, (3) sliding window inference,\n(4) a robust sampling strategy that enforces the order of mesh sequences. This\nresults in over $50{\\%}$ less training memory, $2.5{\\times}$ faster throughput,\nand better consistency than existing works. Meshtron generates meshes of\ndetailed, complex 3D objects at unprecedented levels of resolution and\nfidelity, closely resembling those created by professional artists, and opening\nthe door to more realistic generation of detailed 3D assets for animation,\ngaming, and virtual environments.\n","authors":["Zekun Hao","David W. Romero","Tsung-Yi Lin","Ming-Yu Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09548v1.pdf","comment":"Project page: https://research.nvidia.com/labs/dir/meshtron/"},{"id":"http://arxiv.org/abs/2412.09545v1","updated":"2024-12-12T18:35:26Z","published":"2024-12-12T18:35:26Z","title":"SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing","summary":" We introduce SimAvatar, a framework designed to generate simulation-ready\nclothed 3D human avatars from a text prompt. Current text-driven human avatar\ngeneration methods either model hair, clothing, and the human body using a\nunified geometry or produce hair and garments that are not easily adaptable for\nsimulation within existing simulation pipelines. The primary challenge lies in\nrepresenting the hair and garment geometry in a way that allows leveraging\nestablished prior knowledge from foundational image diffusion models (e.g.,\nStable Diffusion) while being simulation-ready using either physics or neural\nsimulators. To address this task, we propose a two-stage framework that\ncombines the flexibility of 3D Gaussians with simulation-ready hair strands and\ngarment meshes. Specifically, we first employ three text-conditioned 3D\ngenerative models to generate garment mesh, body shape and hair strands from\nthe given text prompt. To leverage prior knowledge from foundational diffusion\nmodels, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair\nstrands and learn the avatar appearance through optimization. To drive the\navatar given a pose sequence, we first apply physics simulators onto the\ngarment meshes and hair strands. We then transfer the motion onto 3D Gaussians\nthrough carefully designed mechanisms for each body part. As a result, our\nsynthesized avatars have vivid texture and realistic dynamic motion. To the\nbest of our knowledge, our method is the first to produce highly realistic,\nfully simulation-ready 3D avatars, surpassing the capabilities of current\napproaches.\n","authors":["Xueting Li","Ye Yuan","Shalini De Mello","Gilles Daviet","Jonathan Leaf","Miles Macklin","Jan Kautz","Umar Iqbal"],"pdf_url":"https://arxiv.org/pdf/2412.09545v1.pdf","comment":"Project website: https://nvlabs.github.io/SimAvatar/"},{"id":"http://arxiv.org/abs/2410.17251v2","updated":"2024-12-12T18:26:45Z","published":"2024-10-22T17:59:57Z","title":"Altogether: Image Captioning via Re-aligning Alt-text","summary":" This paper focuses on creating synthetic data to improve the quality of image\ncaptions. Existing works typically have two shortcomings. 
First, they caption\nimages from scratch, ignoring existing alt-text metadata, and second, lack\ntransparency if the captioners' training data (e.g. GPT) is unknown. In this\npaper, we study a principled approach Altogether based on the key idea to edit\nand re-align existing alt-texts associated with the images. To generate\ntraining data, we perform human annotation where annotators start with the\nexisting alt-text and re-align it to the image content in multiple rounds,\nconsequently constructing captions with rich visual concepts. This differs from\nprior work that carries out human annotation as a one-time description task\nsolely based on images and annotator knowledge. We train a captioner on this\ndata that generalizes the process of re-aligning alt-texts at scale. Our\nresults show our Altogether approach leads to richer image captions that also\nimprove text-to-image generation and zero-shot image classification tasks.\n","authors":["Hu Xu","Po-Yao Huang","Xiaoqing Ellen Tan","Ching-Feng Yeh","Jacob Kahn","Christine Jou","Gargi Ghosh","Omer Levy","Luke Zettlemoyer","Wen-tau Yih","Shang-Wen Li","Saining Xie","Christoph Feichtenhofer"],"pdf_url":"https://arxiv.org/pdf/2410.17251v2.pdf","comment":"accepted by EMNLP 2024; Meta CLIP 1.2 Data Engine"},{"id":"http://arxiv.org/abs/2409.01314v2","updated":"2024-12-12T18:21:03Z","published":"2024-09-02T15:16:07Z","title":"Disentangling Mean Embeddings for Better Diagnostics of Image Generators","summary":" The evaluation of image generators remains a challenge due to the limitations\nof traditional metrics in providing nuanced insights into specific image\nregions. This is a critical problem as not all regions of an image may be\nlearned with similar ease. In this work, we propose a novel approach to\ndisentangle the cosine similarity of mean embeddings into the product of cosine\nsimilarities for individual pixel clusters via central kernel alignment.\nConsequently, we can quantify the contribution of the cluster-wise performance\nto the overall image generation performance. We demonstrate how this enhances\nthe explainability and the likelihood of identifying pixel regions of model\nmisbehavior across various real-world use cases.\n","authors":["Sebastian G. Gruber","Pascal Tobias Ziegler","Florian Buettner"],"pdf_url":"https://arxiv.org/pdf/2409.01314v2.pdf","comment":"Published at Interpretable AI: Past, Present and Future Workshop at\n NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.09530v1","updated":"2024-12-12T18:20:41Z","published":"2024-12-12T18:20:41Z","title":"Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM","summary":" The application of Large Vision-Language Models (LVLMs) for analyzing images\nand videos is an exciting and rapidly evolving field. In recent years, we've\nseen significant growth in high-quality image-text datasets for fine-tuning\nimage understanding, but there is still a lack of comparable datasets for\nvideos. Additionally, many VideoLLMs are extensions of single-image VLMs, which\nmay not efficiently handle the complexities of longer videos. In this study, we\nintroduce a large-scale synthetic dataset created from proprietary models,\nusing carefully designed prompts to tackle a wide range of questions. We also\nexplore a dynamic visual token compression architecture that strikes a balance\nbetween computational efficiency and performance. 
Our proposed \\model{}\nachieves state-of-the-art results across various video tasks and shows\nimpressive generalization, setting new baselines in multi-image understanding.\nNotably, \\model{} delivers an absolute improvement of 2.7\\% over\nLLaVA-OneVision on VideoMME and 10.7\\% on MuirBench. Codes are available at\nhttps://github.com/Hon-Wong/ByteVideoLLM\n","authors":["Han Wang","Yuxiang Nie","Yongjie Ye","Deng GuanYu","Yanjie Wang","Shuai Li","Haiyang Yu","Jinghui Lu","Can Huang"],"pdf_url":"https://arxiv.org/pdf/2412.09530v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09529v1","updated":"2024-12-12T18:20:16Z","published":"2024-12-12T18:20:16Z","title":"Can Modern LLMs Act as Agent Cores in Radiology~Environments?","summary":" Advancements in large language models (LLMs) have paved the way for LLM-based\nagent systems that offer enhanced accuracy and interpretability across various\ndomains. Radiology, with its complex analytical requirements, is an ideal field\nfor the application of these agents. This paper aims to investigate the\npre-requisite question for building concrete radiology agents which is, `Can\nmodern LLMs act as agent cores in radiology environments?' To investigate it,\nwe introduce RadABench with three-fold contributions: First, we present\nRadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based\nagents, generated from an extensive taxonomy encompassing 6 anatomies, 5\nimaging modalities, 10 tool categories, and 11 radiology tasks. Second, we\npropose RadABench-EvalPlat, a novel evaluation platform for agents featuring a\nprompt-driven workflow and the capability to simulate a wide range of radiology\ntoolsets. Third, we assess the performance of 7 leading LLMs on our benchmark\nfrom 5 perspectives with multiple metrics. Our findings indicate that while\ncurrent LLMs demonstrate strong capabilities in many areas, they are still not\nsufficiently advanced to serve as the central agent core in a fully operational\nradiology agent system. Additionally, we identify key factors influencing the\nperformance of LLM-based agent cores, offering insights for clinicians on how\nto apply agent systems in real-world radiology practices effectively. All of\nour code and data are open-sourced in\nhttps://github.com/MAGIC-AI4Med/RadABench.\n","authors":["Qiaoyu Zheng","Chaoyi Wu","Pengcheng Qiu","Lisong Dai","Ya Zhang","Yanfeng Wang","Weidi Xie"],"pdf_url":"https://arxiv.org/pdf/2412.09529v1.pdf","comment":"22 pages,7 figures"},{"id":"http://arxiv.org/abs/2412.04332v2","updated":"2024-12-12T18:08:56Z","published":"2024-12-05T16:48:16Z","title":"Liquid: Language Models are Scalable Multi-modal Generators","summary":" We present Liquid, an auto-regressive generation paradigm that seamlessly\nintegrates visual comprehension and generation by tokenizing images into\ndiscrete codes and learning these code embeddings alongside text tokens within\na shared feature space for both vision and language. Unlike previous multimodal\nlarge language model (MLLM), Liquid achieves this integration using a single\nlarge language model (LLM), eliminating the need for external pretrained visual\nembeddings such as CLIP. For the first time, Liquid uncovers a scaling law that\nperformance drop unavoidably brought by the unified training of visual and\nlanguage tasks diminishes as the model size increases. 
Furthermore, the unified\ntoken space enables visual generation and comprehension tasks to mutually\nenhance each other, effectively removing the typical interference seen in\nearlier models. We show that existing LLMs can serve as strong foundations for\nLiquid, saving 100x in training costs while outperforming Chameleon in\nmultimodal capabilities and maintaining language performance comparable to\nmainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and\nSD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and\ntext-only tasks. This work demonstrates that LLMs such as LLAMA3.2 and GEMMA2\nare powerful multimodal generators, offering a scalable solution for enhancing\nboth vision-language understanding and generation. The code and models will be\nreleased at https://github.com/FoundationVision/Liquid.\n","authors":["Junfeng Wu","Yi Jiang","Chuofan Ma","Yuliang Liu","Hengshuang Zhao","Zehuan Yuan","Song Bai","Xiang Bai"],"pdf_url":"https://arxiv.org/pdf/2412.04332v2.pdf","comment":"Technical report. Project page:\n https://github.com/FoundationVision/Liquid"},{"id":"http://arxiv.org/abs/2412.09521v1","updated":"2024-12-12T18:07:23Z","published":"2024-12-12T18:07:23Z","title":"Efficient and Comprehensive Feature Extraction in Large Vision-Language\n Model for Clinical Pathology Analysis","summary":" Pathological diagnosis is vital for determining disease characteristics,\nguiding treatment, and assessing prognosis, relying heavily on detailed,\nmulti-scale analysis of high-resolution whole slide images (WSI). However,\ntraditional pure vision models face challenges of redundant feature extraction,\nwhereas existing large vision-language models (LVLMs) are limited by input\nresolution constraints, hindering their efficiency and accuracy. To overcome\nthese issues, we propose two innovative strategies: the mixed task-guided\nfeature enhancement, which directs feature extraction toward lesion-related\ndetails across scales, and the prompt-guided detail feature completion, which\nintegrates coarse- and fine-grained features from WSI based on specific prompts\nwithout compromising inference speed. Leveraging a comprehensive dataset of\n490,000 samples from diverse pathology tasks-including cancer detection,\ngrading, vascular and neural invasion identification, and so on-we trained the\npathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that\nthis model significantly outperforms existing methods in diagnostic accuracy\nand efficiency, offering an interactive, clinically aligned approach for\nauxiliary diagnosis in a wide range of pathology applications.\n","authors":["Shengxuming Zhang","Weihan Li","Tianhong Gao","Jiacong Hu","Haoming Luo","Mingli Song","Xiuming Zhang","Zunlei Feng"],"pdf_url":"https://arxiv.org/pdf/2412.09521v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09513v1","updated":"2024-12-12T17:59:28Z","published":"2024-12-12T17:59:28Z","title":"Agent-based Video Trimming","summary":" As information becomes more accessible, user-generated videos are increasing\nin length, placing a burden on viewers to sift through vast content for\nvaluable insights. This trend underscores the need for an algorithm to extract\nkey video information efficiently. Despite significant advancements in\nhighlight detection, moment retrieval, and video summarization, current\napproaches primarily focus on selecting specific time intervals, often\noverlooking the relevance between segments and the potential for segment\narranging. 
In this paper, we introduce a novel task called Video Trimming (VT),\nwhich focuses on detecting wasted footage, selecting valuable segments, and\ncomposing them into a final video with a coherent story. To address this task,\nwe propose Agent-based Video Trimming (AVT), structured into three phases:\nVideo Structuring, Clip Filtering, and Story Composition. Specifically, we\nemploy a Video Captioning Agent to convert video slices into structured textual\ndescriptions, a Filtering Module to dynamically discard low-quality footage\nbased on the structured information of each clip, and a Video Arrangement Agent\nto select and compile valid clips into a coherent final narrative. For\nevaluation, we develop a Video Evaluation Agent to assess trimmed videos,\nconducting assessments in parallel with human evaluations. Additionally, we\ncurate a new benchmark dataset for video trimming using raw user videos from\nthe internet. As a result, AVT received more favorable evaluations in user\nstudies and demonstrated superior mAP and precision on the YouTube Highlights,\nTVSum, and our own dataset for the highlight detection task. The code and\nmodels are available at https://ylingfeng.github.io/AVT.\n","authors":["Lingfeng Yang","Zhenyuan Chen","Xiang Li","Peiyang Jia","Liangqu Long","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09513v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09511v1","updated":"2024-12-12T17:59:03Z","published":"2024-12-12T17:59:03Z","title":"GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency","summary":" Identifying affordance regions on 3D objects from semantic cues is essential\nfor robotics and human-machine interaction. However, existing 3D affordance\nlearning methods struggle with generalization and robustness due to limited\nannotated data and a reliance on 3D backbones focused on geometric encoding,\nwhich often lack resilience to real-world noise and data corruption. We propose\nGEAL, a novel framework designed to enhance the generalization and robustness\nof 3D affordance learning by leveraging large-scale pre-trained 2D models. We\nemploy a dual-branch architecture with Gaussian splatting to establish\nconsistent mappings between 3D point clouds and 2D representations, enabling\nrealistic 2D renderings from sparse point clouds. A granularity-adaptive fusion\nmodule and a 2D-3D consistency alignment module further strengthen cross-modal\nalignment and knowledge transfer, allowing the 3D branch to benefit from the\nrich semantics and generalization capacity of 2D models. To holistically assess\nthe robustness, we introduce two new corruption-based benchmarks: PIAD-C and\nLASO-C. Extensive experiments on public datasets and our benchmarks show that\nGEAL consistently outperforms existing methods across seen and novel object\ncategories, as well as corrupted data, demonstrating robust and adaptable\naffordance prediction under diverse conditions. 
Code and corruption datasets\nhave been made publicly available.\n","authors":["Dongyue Lu","Lingdong Kong","Tianxin Huang","Gim Hee Lee"],"pdf_url":"https://arxiv.org/pdf/2412.09511v1.pdf","comment":"22 pages, 8 figures, 12 tables; Project Page at\n https://dylanorange.github.io/projects/geal"},{"id":"http://arxiv.org/abs/2412.09507v1","updated":"2024-12-12T17:55:00Z","published":"2024-12-12T17:55:00Z","title":"Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction","summary":" Vision Transformers (ViTs) have demonstrated remarkable success in achieving\nstate-of-the-art performance across various image-based tasks and beyond. In\nthis study, we employ a ViT-based neural network to address the problem of\nindoor pathloss radio map prediction. The network's generalization ability is\nevaluated across diverse settings, including unseen buildings, frequencies, and\nantennas with varying radiation patterns. By leveraging extensive data\naugmentation techniques and pretrained DINOv2 weights, we achieve promising\nresults, even under the most challenging scenarios.\n","authors":["Edvard Ghukasyan","Hrant Khachatrian","Rafayel Mkrtchyan","Theofanis P. Raptis"],"pdf_url":"https://arxiv.org/pdf/2412.09507v1.pdf","comment":"Work partly supported by the RA Science Committee grant No. 22rl-052\n (DISTAL) and the EU under Italian National Recovery and Resilience Plan of\n NextGenerationEU on \"Telecommunications of the Future\" (PE00000001 - program\n \"RESTART\")"},{"id":"http://arxiv.org/abs/2412.09501v1","updated":"2024-12-12T17:50:39Z","published":"2024-12-12T17:50:39Z","title":"Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition","summary":" As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond\nsingle-domain capabilities is essential to meet the demands for more versatile\nand efficient AI. However, previous omni-models have insufficiently explored\nspeech, neglecting its integration with multi-modality. We introduce Lyra, an\nefficient MLLM that enhances multimodal abilities, including advanced\nlong-speech comprehension, sound understanding, cross-modality efficiency, and\nseamless speech interaction. To achieve efficiency and speech-centric\ncapabilities, Lyra employs three strategies: (1) leveraging existing\nopen-source large models and a proposed multi-modality LoRA to reduce training\ncosts and data requirements; (2) using a latent multi-modality regularizer and\nextractor to strengthen the relationship between speech and other modalities,\nthereby enhancing model performance; and (3) constructing a high-quality,\nextensive dataset that includes 1.5M multi-modal (language, vision, audio) data\nsamples and 12K long speech samples, enabling Lyra to handle complex long\nspeech inputs and achieve more robust omni-cognition. 
Compared to other\nomni-methods, Lyra achieves state-of-the-art performance on various\nvision-language, vision-speech, and speech-language benchmarks, while also\nusing fewer computational resources and less training data.\n","authors":["Zhisheng Zhong","Chengyao Wang","Yuqi Liu","Senqiao Yang","Longxiang Tang","Yuechen Zhang","Jingyao Li","Tianyuan Qu","Yanwei Li","Yukang Chen","Shaozuo Yu","Sitong Wu","Eric Lo","Shu Liu","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2412.09501v1.pdf","comment":"Tech report"},{"id":"http://arxiv.org/abs/2412.09492v1","updated":"2024-12-12T17:41:49Z","published":"2024-12-12T17:41:49Z","title":"Video Seal: Open and Efficient Video Watermarking","summary":" The proliferation of AI-generated content and sophisticated video editing\ntools has made it both important and challenging to moderate digital platforms.\nVideo watermarking addresses these challenges by embedding imperceptible\nsignals into videos, allowing for identification. However, the rare open tools\nand methods often fall short on efficiency, robustness, and flexibility. To\nreduce these gaps, this paper introduces Video Seal, a comprehensive framework\nfor neural video watermarking and a competitive open-sourced model. Our\napproach jointly trains an embedder and an extractor, while ensuring the\nwatermark robustness by applying transformations in-between, e.g., video\ncodecs. This training is multistage and includes image pre-training, hybrid\npost-training and extractor fine-tuning. We also introduce temporal watermark\npropagation, a technique to convert any image watermarking model to an\nefficient video watermarking model without the need to watermark every\nhigh-resolution frame. We present experimental results demonstrating the\neffectiveness of the approach in terms of speed, imperceptibility, and\nrobustness. Video Seal achieves higher robustness compared to strong baselines\nespecially under challenging distortions combining geometric transformations\nand video compression. Additionally, we provide new insights such as the impact\nof video compression during training, and how to compare methods operating on\ndifferent payloads. Contributions in this work - including the codebase,\nmodels, and a public demo - are open-sourced under permissive licenses to\nfoster further research and development in the field.\n","authors":["Pierre Fernandez","Hady Elsahar","I. Zeki Yalniz","Alexandre Mourachko"],"pdf_url":"https://arxiv.org/pdf/2412.09492v1.pdf","comment":"Code available at https://github.com/facebookresearch/videoseal"},{"id":"http://arxiv.org/abs/2412.09475v1","updated":"2024-12-12T17:20:27Z","published":"2024-12-12T17:20:27Z","title":"New keypoint-based approach for recognising British Sign Language (BSL)\n from sequences","summary":" In this paper, we present a novel keypoint-based classification model\ndesigned to recognise British Sign Language (BSL) words within continuous\nsigning sequences. Our model's performance is assessed using the BOBSL dataset,\nrevealing that the keypoint-based approach surpasses its RGB-based counterpart\nin computational efficiency and memory usage. Furthermore, it offers expedited\ntraining times and demands fewer computational resources. 
To the best of our\nknowledge, this is the inaugural application of a keypoint-based model for BSL\nword classification, rendering direct comparisons with existing works\nunavailable.\n","authors":["Oishi Deb","KR Prajwal","Andrew Zisserman"],"pdf_url":"https://arxiv.org/pdf/2412.09475v1.pdf","comment":"International Conference on Computer Vision (ICCV) - HANDS Workshop"},{"id":"http://arxiv.org/abs/2412.09465v1","updated":"2024-12-12T17:14:58Z","published":"2024-12-12T17:14:58Z","title":"OFTSR: One-Step Flow for Image Super-Resolution with Tunable\n Fidelity-Realism Trade-offs","summary":" Recent advances in diffusion and flow-based generative models have\ndemonstrated remarkable success in image restoration tasks, achieving superior\nperceptual quality compared to traditional deep learning approaches. However,\nthese methods either require numerous sampling steps to generate high-quality\nimages, resulting in significant computational overhead, or rely on model\ndistillation, which usually imposes a fixed fidelity-realism trade-off and thus\nlacks flexibility. In this paper, we introduce OFTSR, a novel flow-based\nframework for one-step image super-resolution that can produce outputs with\ntunable levels of fidelity and realism. Our approach first trains a conditional\nflow-based super-resolution model to serve as a teacher model. We then distill\nthis teacher model by applying a specialized constraint. Specifically, we force\nthe predictions from our one-step student model for same input to lie on the\nsame sampling ODE trajectory of the teacher model. This alignment ensures that\nthe student model's single-step predictions from initial states match the\nteacher's predictions from a closer intermediate state. Through extensive\nexperiments on challenging datasets including FFHQ (256$\\times$256), DIV2K, and\nImageNet (256$\\times$256), we demonstrate that OFTSR achieves state-of-the-art\nperformance for one-step image super-resolution, while having the ability to\nflexibly tune the fidelity-realism trade-off. Code and pre-trained models are\navailable at https://github.com/yuanzhi-zhu/OFTSR and\nhttps://huggingface.co/Yuanzhi/OFTSR, respectively.\n","authors":["Yuanzhi Zhu","Ruiqing Wang","Shilin Lu","Junnan Li","Hanshu Yan","Kai Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09465v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09445v1","updated":"2024-12-12T16:59:37Z","published":"2024-12-12T16:59:37Z","title":"Embeddings are all you need! Achieving High Performance Medical Image\n Classification through Training-Free Embedding Analysis","summary":" Developing artificial intelligence (AI) and machine learning (ML) models for\nmedical imaging typically involves extensive training and testing on large\ndatasets, consuming significant computational time, energy, and resources.\nThere is a need for more efficient methods that can achieve comparable or\nsuperior diagnostic performance without the associated resource burden. We\ninvestigated the feasibility of replacing conventional training procedures with\nan embedding-based approach that leverages concise and semantically meaningful\nrepresentations of medical images. Using pre-trained foundational\nmodels-specifically, convolutional neural networks (CNN) like ResNet and\nmultimodal models like Contrastive Language-Image Pre-training (CLIP)-we\ngenerated image embeddings for multi-class classification tasks. Simple linear\nclassifiers were then applied to these embeddings. 
The approach was evaluated\nacross diverse medical imaging modalities, including retinal images,\nmammography, dermatoscopic images, and chest radiographs. Performance was\ncompared to benchmark models trained and tested using traditional methods. The\nembedding-based models surpassed the benchmark area under the receiver\noperating characteristic curve (AUC-ROC) scores by up to 87 percentage in\nmulti-class classification tasks across the various medical imaging modalities.\nNotably, CLIP embedding models achieved the highest AUC-ROC scores,\ndemonstrating superior classification performance while significantly reducing\ncomputational demands. Our study indicates that leveraging embeddings from\npre-trained foundational models can effectively replace conventional,\nresource-intensive training and testing procedures in medical image analysis.\nThis embedding-based approach offers a more efficient alternative for image\nsegmentation, classification, and prediction, potentially accelerating AI\ntechnology integration into clinical practice.\n","authors":["Raj Hansini Khoiwal","Alan B. McMillan"],"pdf_url":"https://arxiv.org/pdf/2412.09445v1.pdf","comment":"15 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.09441v1","updated":"2024-12-12T16:57:20Z","published":"2024-12-12T16:57:20Z","title":"MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental\n Learning","summary":" Class-Incremental Learning (CIL) requires models to continually acquire\nknowledge of new classes without forgetting old ones. Despite Pre-trained\nModels (PTMs) have shown excellent performance in CIL, catastrophic forgetting\nstill occurs as the model learns new concepts. Existing work seeks to utilize\nlightweight components to adjust the PTM, while the forgetting phenomenon still\ncomes from {\\em parameter and retrieval} levels. Specifically, iterative\nupdates of the model result in parameter drift, while mistakenly retrieving\nirrelevant modules leads to the mismatch during inference. To this end, we\npropose MOdel Surgery (MOS) to rescue the model from forgetting previous\nknowledge. By training task-specific adapters, we continually adjust the PTM to\ndownstream tasks. To mitigate parameter-level forgetting, we present an adapter\nmerging approach to learn task-specific adapters, which aims to bridge the gap\nbetween different components while reserve task-specific information. Besides,\nto address retrieval-level forgetting, we introduce a training-free\nself-refined adapter retrieval mechanism during inference, which leverages the\nmodel's inherent ability for better adapter retrieval. By jointly rectifying\nthe model with those steps, MOS can robustly resist catastrophic forgetting in\nthe learning process. Extensive experiments on seven benchmark datasets\nvalidate MOS's state-of-the-art performance. Code is available at:\nhttps://github.com/sun-hailong/AAAI25-MOS\n","authors":["Hai-Long Sun","Da-Wei Zhou","Hanbin Zhao","Le Gan","De-Chuan Zhan","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2412.09441v1.pdf","comment":"Accepted to AAAI 2025. 
Code is available at:\n https://github.com/sun-hailong/AAAI25-MOS"},{"id":"http://arxiv.org/abs/2412.09442v1","updated":"2024-12-12T16:57:20Z","published":"2024-12-12T16:57:20Z","title":"ATPrompt: Textual Prompt Learning with Embedded Attributes","summary":" Textual-based prompt learning methods primarily employ multiple learnable\nsoft prompts and hard class tokens in a cascading manner as text prompt inputs,\naiming to align image and text (category) spaces for downstream tasks. However,\ncurrent training is restricted to aligning images with predefined known\ncategories and cannot be associated with unknown categories. In this work, we\npropose utilizing universal attributes as a bridge to enhance the alignment\nbetween images and unknown categories. Specifically, we introduce an\nAttribute-embedded Textual Prompt learning method for vision-language models,\nnamed ATPrompt. This approach expands the learning space of soft prompts from\nthe original one-dimensional category level into the multi-dimensional\nattribute level by incorporating multiple universal attribute tokens into the\nlearnable soft prompts. Through this modification, we transform the text prompt\nfrom a category-centric form to an attribute-category hybrid form. To finalize\nthe attributes for downstream tasks, we propose a differentiable attribute\nsearch method that learns to identify representative and suitable attributes\nfrom a candidate pool summarized by a large language model. As an easy-to-use\nplug-in technique, ATPrompt can seamlessly replace the existing prompt format\nof textual-based methods, offering general improvements at a negligible\ncomputational cost. Extensive experiments on 11 datasets demonstrate the\neffectiveness of our method.\n","authors":["Zheng Li","Yibing Song","Penghai Zhao","Ming-Ming Cheng","Xiang Li","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09442v1.pdf","comment":"Technical Report. Project Page: https://zhengli97.github.io/ATPrompt/"},{"id":"http://arxiv.org/abs/2412.09439v1","updated":"2024-12-12T16:50:52Z","published":"2024-12-12T16:50:52Z","title":"Towards Robust and Fair Vision Learning in Open-World Environments","summary":" The dissertation presents four key contributions toward fairness and\nrobustness in vision learning. First, to address the problem of large-scale\ndata requirements, the dissertation presents a novel Fairness Domain Adaptation\napproach derived from two major novel research findings of Bijective Maximum\nLikelihood and Fairness Adaptation Learning. Second, to enable the capability\nof open-world modeling of vision learning, this dissertation presents a novel\nOpen-world Fairness Continual Learning Framework. The success of this research\ndirection is the result of two research lines, i.e., Fairness Continual\nLearning and Open-world Continual Learning. Third, since visual data are often\ncaptured from multiple camera views, robust vision learning methods should be\ncapable of modeling invariant features across views. To achieve this desired\ngoal, the research in this thesis will present a novel Geometry-based\nCross-view Adaptation framework to learn robust feature representations across\nviews. Finally, with the recent increase in large-scale videos and multimodal\ndata, understanding the feature representations and improving the robustness of\nlarge-scale visual foundation models is critical. Therefore, this thesis will\npresent novel Transformer-based approaches to improve the robust feature\nrepresentations against multimodal and temporal data. 
Then, a novel Domain\nGeneralization Approach will be presented to improve the robustness of visual\nfoundation models. The research's theoretical analysis and experimental results\nhave shown the effectiveness of the proposed approaches, demonstrating their\nsuperior performance compared to prior studies. The contributions in this\ndissertation have advanced the fairness and robustness of machine vision\nlearning.\n","authors":["Thanh-Dat Truong"],"pdf_url":"https://arxiv.org/pdf/2412.09439v1.pdf","comment":"PhD Dissertation"},{"id":"http://arxiv.org/abs/2410.22101v2","updated":"2024-12-12T16:46:41Z","published":"2024-10-29T14:54:13Z","title":"Hyperspectral Imaging-Based Perception in Autonomous Driving Scenarios:\n Benchmarking Baseline Semantic Segmentation Models","summary":" Hyperspectral Imaging (HSI) is known for its advantages over traditional RGB\nimaging in remote sensing, agriculture, and medicine. Recently, it has gained\nattention for enhancing Advanced Driving Assistance Systems (ADAS) perception.\nSeveral HSI datasets such as HyKo, HSI-Drive, HSI-Road, and Hyperspectral City\nhave been made available. However, a comprehensive evaluation of semantic\nsegmentation models (SSM) using these datasets is lacking. To address this gap,\nwe evaluated the available annotated HSI datasets on four deep learning-based\nbaseline SSMs: DeepLab v3+, HRNet, PSPNet, and U-Net, along with its two\nvariants: Coordinate Attention (UNet-CA) and Convolutional Block-Attention\nModule (UNet-CBAM). The original model architectures were adapted to handle the\nvarying spatial and spectral dimensions of the datasets. These baseline SSMs\nwere trained using a class-weighted loss function for individual HSI datasets\nand evaluated using mean-based metrics such as intersection over union (IoU),\nrecall, precision, F1 score, specificity, and accuracy. Our results indicate\nthat UNet-CBAM, which extracts channel-wise features, outperforms other SSMs\nand shows potential to leverage spectral information for enhanced semantic\nsegmentation. This study establishes a baseline SSM benchmark on available\nannotated datasets for future evaluation of HSI-based ADAS perception. However,\nlimitations of current HSI datasets, such as limited dataset size, high class\nimbalance, and lack of fine-grained annotations, remain significant constraints\nfor developing robust SSMs for ADAS applications.\n","authors":["Imad Ali Shah","Jiarong Li","Martin Glavin","Edward Jones","Enda Ward","Brian Deegan"],"pdf_url":"https://arxiv.org/pdf/2410.22101v2.pdf","comment":"Accepted and Presented at IEEE WHISPERS 2024"},{"id":"http://arxiv.org/abs/2412.09428v1","updated":"2024-12-12T16:33:21Z","published":"2024-12-12T16:33:21Z","title":"Multimodal Music Generation with Explicit Bridges and Retrieval\n Augmentation","summary":" Multimodal music generation aims to produce music from diverse input\nmodalities, including text, videos, and images. Existing methods use a common\nembedding space for multimodal fusion. Despite their effectiveness in other\nmodalities, their application in multimodal music generation faces challenges\nof data scarcity, weak cross-modal alignment, and limited controllability. This\npaper addresses these issues by using explicit bridges of text and music for\nmultimodal alignment. We introduce a novel method named Visuals Music Bridge\n(VMB). 
Specifically, a Multimodal Music Description Model converts visual\ninputs into detailed textual descriptions to provide the text bridge; a\nDual-track Music Retrieval module that combines broad and targeted retrieval\nstrategies to provide the music bridge and enable user control. Finally, we\ndesign an Explicitly Conditioned Music Generation framework to generate music\nbased on the two bridges. We conduct experiments on video-to-music,\nimage-to-music, text-to-music, and controllable music generation tasks, along\nwith experiments on controllability. The results demonstrate that VMB\nsignificantly enhances music quality, modality, and customization alignment\ncompared to previous methods. VMB sets a new standard for interpretable and\nexpressive multimodal music generation with applications in various multimedia\nfields. Demos and code are available at https://github.com/wbs2788/VMB.\n","authors":["Baisen Wang","Le Zhuo","Zhaokai Wang","Chenxi Bao","Wu Chengjing","Xuecheng Nie","Jiao Dai","Jizhong Han","Yue Liao","Si Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09428v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09427v1","updated":"2024-12-12T16:33:06Z","published":"2024-12-12T16:33:06Z","title":"A Plug-and-Play Algorithm for 3D Video Super-Resolution of Single-Photon\n LiDAR data","summary":" Single-photon avalanche diodes (SPADs) are advanced sensors capable of\ndetecting individual photons and recording their arrival times with picosecond\nresolution using time-correlated Single-Photon Counting detection techniques.\nThey are used in various applications, such as LiDAR, and can capture\nhigh-speed sequences of binary single-photon images, offering great potential\nfor reconstructing 3D environments with high motion dynamics. To complement\nsingle-photon data, they are often paired with conventional passive cameras,\nwhich capture high-resolution (HR) intensity images at a lower frame rate.\nHowever, 3D reconstruction from SPAD data faces challenges. Aggregating\nmultiple binary measurements improves precision and reduces noise but can cause\nmotion blur in dynamic scenes. Additionally, SPAD arrays often have lower\nresolution than passive cameras. To address these issues, we propose a novel\ncomputational imaging algorithm to improve the 3D reconstruction of moving\nscenes from SPAD data by addressing the motion blur and increasing the native\nspatial resolution. We adopt a plug-and-play approach within an optimization\nscheme alternating between guided video super-resolution of the 3D scene, and\nprecise image realignment using optical flow. Experiments on synthetic data\nshow significantly improved image resolutions across various signal-to-noise\nratios and photon levels. We validate our method using real-world SPAD\nmeasurements on three practical situations with dynamic objects. First on\nfast-moving scenes in laboratory conditions at short range; second very low\nresolution imaging of people with a consumer-grade SPAD sensor from\nSTMicroelectronics; and finally, HR imaging of people walking outdoors in\ndaylight at a range of 325 meters under eye-safe illumination conditions using\na short-wave infrared SPAD camera. These results demonstrate the robustness and\nversatility of our approach.\n","authors":["Alice Ruget","Lewis Wilson","Jonathan Leach","Rachael Tobin","Aongus Mccarthy","Gerald S. 
Buller","Steve Mclaughlin","Abderrahim Halimi"],"pdf_url":"https://arxiv.org/pdf/2412.09427v1.pdf","comment":"14 pages, 10 figures"},{"id":"http://arxiv.org/abs/2409.14747v4","updated":"2024-12-12T16:17:46Z","published":"2024-09-23T06:51:10Z","title":"Distribution-Level Feature Distancing for Machine Unlearning: Towards a\n Better Trade-off Between Model Utility and Forgetting","summary":" With the explosive growth of deep learning applications and increasing\nprivacy concerns, the right to be forgotten has become a critical requirement\nin various AI industries. For example, given a facial recognition system, some\nindividuals may wish to remove their personal data that might have been used in\nthe training phase. Unfortunately, deep neural networks sometimes unexpectedly\nleak personal identities, making this removal challenging. While recent machine\nunlearning algorithms aim to enable models to forget specific data, we identify\nan unintended utility drop-correlation collapse-in which the essential\ncorrelations between image features and true labels weaken during the\nforgetting process. To address this challenge, we propose Distribution-Level\nFeature Distancing (DLFD), a novel method that efficiently forgets instances\nwhile preserving task-relevant feature correlations. Our method synthesizes\ndata samples by optimizing the feature distribution to be distinctly different\nfrom that of forget samples, achieving effective results within a single\ntraining epoch. Through extensive experiments on facial recognition datasets,\nwe demonstrate that our approach significantly outperforms state-of-the-art\nmachine unlearning methods in both forgetting performance and model utility\npreservation.\n","authors":["Dasol Choi","Dongbin Na"],"pdf_url":"https://arxiv.org/pdf/2409.14747v4.pdf","comment":"10 pages, 6 figures, AAAI 2025 camera ready version"},{"id":"http://arxiv.org/abs/2412.09405v1","updated":"2024-12-12T16:09:57Z","published":"2024-12-12T16:09:57Z","title":"Learned Compression for Compressed Learning","summary":" Modern sensors produce increasingly rich streams of high-resolution data. Due\nto resource constraints, machine learning systems discard the vast majority of\nthis information via resolution reduction. Compressed-domain learning allows\nmodels to operate on compact latent representations, allowing higher effective\nresolution for the same budget. However, existing compression systems are not\nideal for compressed learning. Linear transform coding and end-to-end learned\ncompression systems reduce bitrate, but do not uniformly reduce dimensionality;\nthus, they do not meaningfully increase efficiency. Generative autoencoders\nreduce dimensionality, but their adversarial or perceptual objectives lead to\nsignificant information loss. To address these limitations, we introduce WaLLoC\n(Wavelet Learned Lossy Compression), a neural codec architecture that combines\nlinear transform coding with nonlinear dimensionality-reducing autoencoders.\nWaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck\nbetween an invertible wavelet packet transform. Across several key metrics,\nWaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion\nmodels. WaLLoC does not require perceptual or adversarial losses to represent\nhigh-frequency detail, providing compatibility with modalities beyond RGB\nimages and stereo audio. 
WaLLoC's encoder consists almost entirely of linear\noperations, making it exceptionally efficient and suitable for mobile\ncomputing, remote sensing, and learning directly from compressed data. We\ndemonstrate WaLLoC's capability for compressed-domain learning across several\ntasks, including image classification, colorization, document understanding,\nand music source separation. Our code, experiments, and pre-trained audio and\nimage codecs are available at https://ut-sysml.org/walloc\n","authors":["Dan Jacobellis","Neeraja J. Yadwadkar"],"pdf_url":"https://arxiv.org/pdf/2412.09405v1.pdf","comment":"Accepted as paper to 2025 IEEE Data Compression Conference"},{"id":"http://arxiv.org/abs/2412.09402v1","updated":"2024-12-12T16:08:43Z","published":"2024-12-12T16:08:43Z","title":"MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease\n Recognition from Fundus Images","summary":" Existing multi-modal learning methods on fundus and OCT images mostly require\nboth modalities to be available and strictly paired for training and testing,\nwhich appears less practical in clinical scenarios. To expand the scope of\nclinical applications, we formulate a novel setting, \"OCT-enhanced disease\nrecognition from fundus images\", that allows for the use of unpaired\nmulti-modal data during the training phase and relies on the widespread fundus\nphotographs for testing. To benchmark this setting, we present the first large\nmulti-modal multi-class dataset for eye disease diagnosis, MultiEYE, and\npropose an OCT-assisted Conceptual Distillation Approach (OCT-CoDA), which\nemploys semantically rich concepts to extract disease-related knowledge from\nOCT images and leverage them into the fundus model. Specifically, we regard the\nimage-concept relation as a link to distill useful knowledge from the OCT\nteacher model to the fundus student model, which considerably improves the\ndiagnostic performance based on fundus images and formulates the cross-modal\nknowledge transfer into an explainable process. Through extensive experiments\non the multi-disease classification task, our proposed OCT-CoDA demonstrates\nremarkable results and interpretability, showing great potential for clinical\napplication. Our dataset and code are available at\nhttps://github.com/xmed-lab/MultiEYE.\n","authors":["Lehan Wang","Chongchong Qi","Chubin Ou","Lin An","Mei Jin","Xiangbin Kong","Xiaomeng Li"],"pdf_url":"https://arxiv.org/pdf/2412.09402v1.pdf","comment":"Accepted at IEEE TMI"},{"id":"http://arxiv.org/abs/2412.09401v1","updated":"2024-12-12T16:08:03Z","published":"2024-12-12T16:08:03Z","title":"SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos","summary":" In this paper, we introduce \\textbf{SLAM3R}, a novel and effective monocular\nRGB SLAM system for real-time and high-quality dense 3D reconstruction. SLAM3R\nprovides an end-to-end solution by seamlessly integrating local 3D\nreconstruction and global coordinate registration through feed-forward neural\nnetworks. Given an input video, the system first converts it into overlapping\nclips using a sliding window mechanism. Unlike traditional pose\noptimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB\nimages in each window and progressively aligns and deforms these local\npointmaps to create a globally consistent scene reconstruction - all without\nexplicitly solving any camera parameters. 
Experiments across datasets\nconsistently show that SLAM3R achieves state-of-the-art reconstruction accuracy\nand completeness while maintaining real-time performance at 20+ FPS. Code and\nweights at: \\url{https://github.com/PKU-VCL-3DV/SLAM3R}.\n","authors":["Yuzheng Liu","Siyan Dong","Shuzhe Wang","Yingda Yin","Yanchao Yang","Qingnan Fan","Baoquan Chen"],"pdf_url":"https://arxiv.org/pdf/2412.09401v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09389v1","updated":"2024-12-12T15:56:26Z","published":"2024-12-12T15:56:26Z","title":"UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame\n Organizer","summary":" Recently, diffusion-based video generation models have achieved significant\nsuccess. However, existing models often suffer from issues like weak\nconsistency and declining image quality over time. To overcome these\nchallenges, inspired by aesthetic principles, we propose a non-invasive plug-in\ncalled Uniform Frame Organizer (UFO), which is compatible with any\ndiffusion-based video generation model. The UFO comprises a series of adaptive\nadapters with adjustable intensities, which can significantly enhance the\nconsistency between the foreground and background of videos and improve image\nquality without altering the original model parameters when integrated. The\ntraining for UFO is simple, efficient, requires minimal resources, and supports\nstylized training. Its modular design allows for the combination of multiple\nUFOs, enabling the customization of personalized video generation models.\nFurthermore, the UFO also supports direct transferability across different\nmodels of the same specification without the need for specific retraining. The\nexperimental results indicate that UFO effectively enhances video generation\nquality and demonstrates its superiority in public video generation benchmarks.\nThe code will be publicly available at https://github.com/Delong-liu-bupt/UFO.\n","authors":["Delong Liu","Zhaohui Hou","Mingjie Zhan","Shihao Han","Zhicheng Zhao","Fei Su"],"pdf_url":"https://arxiv.org/pdf/2412.09389v1.pdf","comment":"Code:https://github.com/Delong-liu-bupt/UFO"},{"id":"http://arxiv.org/abs/2412.09388v1","updated":"2024-12-12T15:56:20Z","published":"2024-12-12T15:56:20Z","title":"All You Need in Knowledge Distillation Is a Tailored Coordinate System","summary":" Knowledge Distillation (KD) is essential in transferring dark knowledge from\na large teacher to a small student network, such that the student can be much\nmore efficient than the teacher but with comparable accuracy. Existing KD\nmethods, however, rely on a large teacher trained specifically for the target\ntask, which is both very inflexible and inefficient. In this paper, we argue\nthat a SSL-pretrained model can effectively act as the teacher and its dark\nknowledge can be captured by the coordinate system or linear subspace where the\nfeatures lie in. We then need only one forward pass of the teacher, and then\ntailor the coordinate system (TCS) for the student network. Our TCS method is\nteacher-free and applies to diverse architectures, works well for KD and\npractical few-shot learning, and allows cross-architecture distillation with\nlarge capacity gap. 
Experiments show that TCS achieves significantly higher\naccuracy than state-of-the-art KD methods, while only requiring roughly half of\ntheir training time and GPU memory costs.\n","authors":["Junjie Zhou","Ke Zhu","Jianxin Wu"],"pdf_url":"https://arxiv.org/pdf/2412.09388v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09386v1","updated":"2024-12-12T15:53:14Z","published":"2024-12-12T15:53:14Z","title":"Multi-Stage Segmentation and Cascade Classification Methods for\n Improving Cardiac MRI Analysis","summary":" The segmentation and classification of cardiac magnetic resonance imaging are\ncritical for diagnosing heart conditions, yet current approaches face\nchallenges in accuracy and generalizability. In this study, we aim to further\nadvance the segmentation and classification of cardiac magnetic resonance\nimages by introducing a novel deep learning-based approach. Using a multi-stage\nprocess with U-Net and ResNet models for segmentation, followed by Gaussian\nsmoothing, the method improved segmentation accuracy, achieving a Dice\ncoefficient of 0.974 for the left ventricle and 0.947 for the right ventricle.\nFor classification, a cascade of deep learning classifiers was employed to\ndistinguish heart conditions, including hypertrophic cardiomyopathy, myocardial\ninfarction, and dilated cardiomyopathy, achieving an average accuracy of 97.2%.\nThe proposed approach outperformed existing models, enhancing segmentation\naccuracy and classification precision. These advancements show promise for\nclinical applications, though further validation and interpretation across\ndiverse imaging protocols is necessary.\n","authors":["Vitalii Slobodzian","Pavlo Radiuk","Oleksander Barmak","Iurii Krak"],"pdf_url":"https://arxiv.org/pdf/2412.09386v1.pdf","comment":"Cardiac MRI, heart pathology, deep learning, segmentation, Gaussian\n smoothing, classification, cascade"},{"id":"http://arxiv.org/abs/2411.06908v2","updated":"2024-12-12T15:40:54Z","published":"2024-11-11T12:11:36Z","title":"EVQAScore: Efficient Video Question Answering Data Evaluation","summary":" Video question-answering (QA) is a core task in video understanding.\nEvaluating the quality of video QA and video caption data quality for training\nvideo large language models (VideoLLMs) is an essential challenge. Although\nvarious methods have been proposed for assessing video caption quality, there\nremains a lack of dedicated evaluation methods for Video QA. To address this\ngap, we introduce EVQAScore, a reference-free method that leverages keyword\nextraction to assess both video caption and video QA data quality.\nAdditionally, we incorporate frame sampling and rescaling techniques to enhance\nthe efficiency and robustness of our evaluation, this enables our score to\nevaluate the quality of extremely long videos. Our approach achieves\nstate-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for\nSpearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on\nthe VATEX-EVAL benchmark for video caption evaluation. 
Furthermore, by using\nEVQAScore for data selection, we achieved SOTA results with only 12.5\\% of the\noriginal data volume, outperforming the previous SOTA method PAC-S and 100\\% of\ndata.\n","authors":["Hao Liang","Zirong Chen","Wentao Zhang"],"pdf_url":"https://arxiv.org/pdf/2411.06908v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13809v2","updated":"2024-12-12T15:40:49Z","published":"2024-08-25T11:10:15Z","title":"On the Robustness of Kolmogorov-Arnold Networks: An Adversarial\n Perspective","summary":" Kolmogorov-Arnold Networks (KANs) have recently emerged as a novel approach\nto function approximation, demonstrating remarkable potential in various\ndomains. Despite their theoretical promise, the robustness of KANs under\nadversarial conditions has yet to be thoroughly examined. In this paper we\nexplore the adversarial robustness of KANs, with a particular focus on image\nclassification tasks. We assess the performance of KANs against standard white\nbox and black-box adversarial attacks, comparing their resilience to that of\nestablished neural network architectures. Our experimental evaluation\nencompasses a variety of standard image classification benchmark datasets and\ninvestigates both fully connected and convolutional neural network\narchitectures, of three sizes: small, medium, and large. We conclude that\nsmall- and medium-sized KANs (either fully connected or convolutional) are not\nconsistently more robust than their standard counterparts, but that large-sized\nKANs are, by and large, more robust. This comprehensive evaluation of KANs in\nadversarial scenarios offers the first in-depth analysis of KAN security,\nlaying the groundwork for future research in this emerging field.\n","authors":["Tal Alter","Raz Lapid","Moshe Sipper"],"pdf_url":"https://arxiv.org/pdf/2408.13809v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08357v2","updated":"2024-12-12T15:33:55Z","published":"2024-12-11T13:02:09Z","title":"Video Summarization using Denoising Diffusion Probabilistic Model","summary":" Video summarization aims to eliminate visual redundancy while retaining key\nparts of video to construct concise and comprehensive synopses. Most existing\nmethods use discriminative models to predict the importance scores of video\nframes. However, these methods are susceptible to annotation inconsistency\ncaused by the inherent subjectivity of different annotators when annotating the\nsame video. In this paper, we introduce a generative framework for video\nsummarization that learns how to generate summaries from a probability\ndistribution perspective, effectively reducing the interference of subjective\nannotation noise. Specifically, we propose a novel diffusion summarization\nmethod based on the Denoising Diffusion Probabilistic Model (DDPM), which\nlearns the probability distribution of training data through noise prediction,\nand generates summaries by iterative denoising. Our method is more resistant to\nsubjective annotation noise, and is less prone to overfitting the training data\nthan discriminative methods, with strong generalization ability. Moreover, to\nfacilitate training DDPM with limited data, we employ an unsupervised video\nsummarization model to implement the earlier denoising process. 
Extensive\nexperiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the\neffectiveness of our method.\n","authors":["Zirui Shang","Yubo Zhu","Hongxi Li","Shuo Yang","Xinxiao Wu"],"pdf_url":"https://arxiv.org/pdf/2412.08357v2.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.04464v2","updated":"2024-12-12T15:26:07Z","published":"2024-12-05T18:59:48Z","title":"DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose\n Reconstruction","summary":" The choice of data representation is a key factor in the success of deep\nlearning in geometric tasks. For instance, DUSt3R has recently introduced the\nconcept of viewpoint-invariant point maps, generalizing depth prediction, and\nshowing that one can reduce all the key problems in the 3D reconstruction of\nstatic scenes to predicting such point maps. In this paper, we develop an\nanalogous concept for a very different problem, namely, the reconstruction of\nthe 3D shape and pose of deformable objects. To this end, we introduce the Dual\nPoint Maps (DualPM), where a pair of point maps is extracted from the same\nimage, one associating pixels to their 3D locations on the object, and the\nother to a canonical version of the object at rest pose. We also extend point\nmaps to amodal reconstruction, seeing through self-occlusions to obtain the\ncomplete shape of the object. We show that 3D reconstruction and 3D pose\nestimation reduce to the prediction of the DualPMs. We demonstrate empirically\nthat this representation is a good target for a deep network to predict;\nspecifically, we consider modeling horses, showing that DualPMs can be trained\npurely on 3D synthetic data, consisting of a single model of a horse, while\ngeneralizing very well to real images. With this, we improve by a large margin\nprevious methods for the 3D analysis and reconstruction of this type of\nobjects.\n","authors":["Ben Kaye","Tomas Jakab","Shangzhe Wu","Christian Rupprecht","Andrea Vedaldi"],"pdf_url":"https://arxiv.org/pdf/2412.04464v2.pdf","comment":"First two authors contributed equally. Project page:\n https://dualpm.github.io"},{"id":"http://arxiv.org/abs/2412.09353v1","updated":"2024-12-12T15:22:03Z","published":"2024-12-12T15:22:03Z","title":"Causal Graphical Models for Vision-Language Compositional Understanding","summary":" Recent work has empirically shown that Vision-Language Models (VLMs) struggle\nto fully understand the compositional properties of the human language, usually\nmodeling an image caption as a \"bag of words\". As a result, they perform poorly\non compositional tasks, which require a deeper understanding of the different\nentities of a sentence (subject, verb, etc.) jointly with their mutual\nrelationships in order to be solved. In this paper, we model the dependency\nrelations among textual and visual tokens using a Causal Graphical Model (CGM),\nbuilt using a dependency parser, and we train a decoder conditioned by the VLM\nvisual encoder. Differently from standard autoregressive or parallel\npredictions, our decoder's generative process is partially-ordered following\nthe CGM structure. This structure encourages the decoder to learn only the main\ncausal dependencies in a sentence discarding spurious correlations. 
Using\nextensive experiments on five compositional benchmarks, we show that our method\nsignificantly outperforms all the state-of-the-art compositional approaches by\na large margin, and it also improves over methods trained using much larger\ndatasets.\n","authors":["Fiorenzo Parascandolo","Nicholas Moratelli","Enver Sangineto","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2412.09353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09349v1","updated":"2024-12-12T15:15:59Z","published":"2024-12-12T15:15:59Z","title":"DisPose: Disentangling Pose Guidance for Controllable Human Image\n Animation","summary":" Controllable human image animation aims to generate videos from reference\nimages using driving videos. Due to the limited control signals provided by\nsparse guidance (e.g., skeleton pose), recent works have attempted to introduce\nadditional dense conditions (e.g., depth map) to ensure motion alignment.\nHowever, such strict dense guidance impairs the quality of the generated video\nwhen the body shape of the reference character differs significantly from that\nof the driving video. In this paper, we present DisPose to mine more\ngeneralizable and effective control signals without additional dense input,\nwhich disentangles the sparse skeleton pose in human image animation into\nmotion field guidance and keypoint correspondence. Specifically, we generate a\ndense motion field from a sparse motion field and the reference image, which\nprovides region-level dense guidance while maintaining the generalization of\nthe sparse pose control. We also extract diffusion features corresponding to\npose keypoints from the reference image, and then these point features are\ntransferred to the target pose to provide distinct identity information. To\nseamlessly integrate into existing models, we propose a plug-and-play hybrid\nControlNet that improves the quality and consistency of generated videos while\nfreezing the existing model parameters. Extensive qualitative and quantitative\nexperiments demonstrate the superiority of DisPose compared to current methods.\nCode:\n\\hyperlink{https://github.com/lihxxx/DisPose}{https://github.com/lihxxx/DisPose}.\n","authors":["Hongxiang Li","Yaowei Li","Yuhang Yang","Junjie Cao","Zhihong Zhu","Xuxin Cheng","Long Chen"],"pdf_url":"https://arxiv.org/pdf/2412.09349v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09346v1","updated":"2024-12-12T15:13:34Z","published":"2024-12-12T15:13:34Z","title":"Quantitative Evaluation of Motif Sets in Time Series","summary":" Time Series Motif Discovery (TSMD), which aims at finding recurring patterns\nin time series, is an important task in numerous application domains, and many\nmethods for this task exist. These methods are usually evaluated qualitatively.\nA few metrics for quantitative evaluation, where discovered motifs are compared\nto some ground truth, have been proposed, but they typically make implicit\nassumptions that limit their applicability. This paper introduces PROM, a\nbroadly applicable metric that overcomes those limitations, and TSMD-Bench, a\nbenchmark for quantitative evaluation of time series motif discovery.\nExperiments with PROM and TSMD-Bench show that PROM provides a more\ncomprehensive evaluation than existing metrics, that TSMD-Bench is a more\nchallenging benchmark than earlier ones, and that the combination can help\nunderstand the relative performance of TSMD methods. 
More generally, the\nproposed approach enables large-scale, systematic performance comparisons in\nthis field.\n","authors":["Daan Van Wesenbeeck","Aras Yurtman","Wannes Meert","Hendrik Blockeel"],"pdf_url":"https://arxiv.org/pdf/2412.09346v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09333v1","updated":"2024-12-12T15:01:39Z","published":"2024-12-12T15:01:39Z","title":"MaskTerial: A Foundation Model for Automated 2D Material Flake Detection","summary":" The detection and classification of exfoliated two-dimensional (2D) material\nflakes from optical microscope images can be automated using computer vision\nalgorithms. This has the potential to increase the accuracy and objectivity of\nclassification and the efficiency of sample fabrication, and it allows for\nlarge-scale data collection. Existing algorithms often exhibit challenges in\nidentifying low-contrast materials and typically require large amounts of\ntraining data. Here, we present a deep learning model, called MaskTerial, that\nuses an instance segmentation network to reliably identify 2D material flakes.\nThe model is extensively pre-trained using a synthetic data generator, that\ngenerates realistic microscopy images from unlabeled data. This results in a\nmodel that can to quickly adapt to new materials with as little as 5 to 10\nimages. Furthermore, an uncertainty estimation model is used to finally\nclassify the predictions based on optical contrast. We evaluate our method on\neight different datasets comprising five different 2D materials and demonstrate\nsignificant improvements over existing techniques in the detection of\nlow-contrast materials such as hexagonal boron nitride.\n","authors":["Jan-Lucas Uslu","Alexey Nekrasov","Alexander Hermans","Bernd Beschoten","Bastian Leibe","Lutz Waldecker","Christoph Stampfer"],"pdf_url":"https://arxiv.org/pdf/2412.09333v1.pdf","comment":"9 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.09331v1","updated":"2024-12-12T14:59:56Z","published":"2024-12-12T14:59:56Z","title":"Physics-Driven Autoregressive State Space Models for Medical Image\n Reconstruction","summary":" Medical image reconstruction from undersampled acquisitions is an ill-posed\nproblem that involves inversion of the imaging operator linking measurement and\nimage domains. In recent years, physics-driven (PD) models have gained\nprominence in learning-based reconstruction given their enhanced balance\nbetween efficiency and performance. For reconstruction, PD models cascade\ndata-consistency modules that enforce fidelity to acquired data based on the\nimaging operator, with network modules that process feature maps to alleviate\nimage artifacts due to undersampling. Success in artifact suppression\ninevitably depends on the ability of the network modules to tease apart\nartifacts from underlying tissue structures, both of which can manifest\ncontextual relations over broad spatial scales. Convolutional modules that\nexcel at capturing local correlations are relatively insensitive to non-local\ncontext. While transformers promise elevated sensitivity to non-local context,\npractical implementations often suffer from a suboptimal trade-off between\nlocal and non-local sensitivity due to intrinsic model complexity. Here, we\nintroduce a novel physics-driven autoregressive state space model (MambaRoll)\nfor enhanced fidelity in medical image reconstruction. 
In each cascade of an\nunrolled architecture, MambaRoll employs an autoregressive framework based on\nphysics-driven state space modules (PSSM), where PSSMs efficiently aggregate\ncontextual features at a given spatial scale while maintaining fidelity to\nacquired data, and autoregressive prediction of next-scale feature maps from\nearlier spatial scales enhance capture of multi-scale contextual features.\nDemonstrations on accelerated MRI and sparse-view CT reconstructions indicate\nthat MambaRoll outperforms state-of-the-art PD methods based on convolutional,\ntransformer and conventional SSM modules.\n","authors":["Bilal Kabas","Fuat Arslan","Valiyeh A. Nezhad","Saban Ozturk","Emine U. Saritas","Tolga Çukur"],"pdf_url":"https://arxiv.org/pdf/2412.09331v1.pdf","comment":"10 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.09330v1","updated":"2024-12-12T14:59:10Z","published":"2024-12-12T14:59:10Z","title":"Computer-Aided Osteoporosis Diagnosis Using Transfer Learning with\n Enhanced Features from Stacked Deep Learning Modules","summary":" Knee osteoporosis weakens the bone tissue in the knee joint, increasing\nfracture risk. Early detection through X-ray images enables timely intervention\nand improved patient outcomes. While some researchers have focused on\ndiagnosing knee osteoporosis through manual radiology evaluation and\ntraditional machine learning using hand-crafted features, these methods often\nstruggle with performance and efficiency due to reliance on manual feature\nextraction and subjective interpretation. In this study, we propose a\ncomputer-aided diagnosis (CAD) system for knee osteoporosis, combining transfer\nlearning with stacked feature enhancement deep learning blocks. Initially, knee\nX-ray images are preprocessed, and features are extracted using a pre-trained\nConvolutional Neural Network (CNN). These features are then enhanced through\nfive sequential Conv-RELU-MaxPooling blocks. The Conv2D layers detect low-level\nfeatures, while the ReLU activations introduce non-linearity, allowing the\nnetwork to learn complex patterns. MaxPooling layers down-sample the features,\nretaining the most important spatial information. This sequential processing\nenables the model to capture complex, high-level features related to bone\nstructure, joint deformation, and osteoporotic markers. The enhanced features\nare passed through a classification module to differentiate between healthy and\nosteoporotic knee conditions. Extensive experiments on three individual\ndatasets and a combined dataset demonstrate that our model achieves 97.32%,\n98.24%, 97.27%, and 98.00% accuracy for OKX Kaggle Binary, KXO-Mendeley\nMulti-Class, OKX Kaggle Multi-Class, and the combined dataset, respectively,\nshowing an improvement of around 2% over existing methods.\n","authors":["Ayesha Siddiqua","Rakibul Hasan","Anichur Rahman","Abu Saleh Musa Miah"],"pdf_url":"https://arxiv.org/pdf/2412.09330v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09324v1","updated":"2024-12-12T14:49:55Z","published":"2024-12-12T14:49:55Z","title":"Are Conditional Latent Diffusion Models Effective for Image Restoration?","summary":" Recent advancements in image restoration increasingly employ conditional\nlatent diffusion models (CLDMs). While these models have demonstrated notable\nperformance improvements in recent years, this work questions their suitability\nfor IR tasks. 
CLDMs excel in capturing high-level semantic correlations, making\nthem effective for tasks like text-to-image generation with spatial\nconditioning. However, in IR, where the goal is to enhance image perceptual\nquality, these models face difficulty of modeling the relationship between\ndegraded images and ground truth images using a low-level representation. To\nsupport our claims, we compare state-of-the-art CLDMs with traditional image\nrestoration models through extensive experiments. Results reveal that despite\nthe scaling advantages of CLDMs, they suffer from high distortion and semantic\ndeviation, especially in cases with minimal degradation, where traditional\nmethods outperform them. Additionally, we perform empirical studies to examine\nthe impact of various CLDM design elements on their restoration performance. We\nhope this finding inspires a reexamination of current CLDM-based IR solutions,\nopening up more opportunities in this field.\n","authors":["Yunchen Yuan","Junyuan Xiao","Xinjie Li"],"pdf_url":"https://arxiv.org/pdf/2412.09324v1.pdf","comment":"16 pages, 12 figures, submitted to IEEE / CVF Computer Vision and\n Pattern Recognition Conference (CVPR 2025)"},{"id":"http://arxiv.org/abs/2412.09323v1","updated":"2024-12-12T14:48:46Z","published":"2024-12-12T14:48:46Z","title":"T-SVG: Text-Driven Stereoscopic Video Generation","summary":" The advent of stereoscopic videos has opened new horizons in multimedia,\nparticularly in extended reality (XR) and virtual reality (VR) applications,\nwhere immersive content captivates audiences across various platforms. Despite\nits growing popularity, producing stereoscopic videos remains challenging due\nto the technical complexities involved in generating stereo parallax. This\nrefers to the positional differences of objects viewed from two distinct\nperspectives and is crucial for creating depth perception. This complex process\nposes significant challenges for creators aiming to deliver convincing and\nengaging presentations. To address these challenges, this paper introduces the\nText-driven Stereoscopic Video Generation (T-SVG) system. This innovative,\nmodel-agnostic, zero-shot approach streamlines video generation by using text\nprompts to create reference videos. These videos are transformed into 3D point\ncloud sequences, which are rendered from two perspectives with subtle parallax\ndifferences, achieving a natural stereoscopic effect. T-SVG represents a\nsignificant advancement in stereoscopic content creation by integrating\nstate-of-the-art, training-free techniques in text-to-video generation, depth\nestimation, and video inpainting. Its flexible architecture ensures high\nefficiency and user-friendliness, allowing seamless updates with newer models\nwithout retraining. 
By simplifying the production pipeline, T-SVG makes\nstereoscopic video generation accessible to a broader audience, demonstrating\nits potential to revolutionize the field.\n","authors":["Qiao Jin","Xiaodong Chen","Wu Liu","Tao Mei","Yongdong Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09323v1.pdf","comment":"5 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.09319v1","updated":"2024-12-12T14:44:05Z","published":"2024-12-12T14:44:05Z","title":"FAMNet: Frequency-aware Matching Network for Cross-domain Few-shot\n Medical Image Segmentation","summary":" Existing few-shot medical image segmentation (FSMIS) models fail to address a\npractical issue in medical imaging: the domain shift caused by different\nimaging techniques, which limits the applicability to current FSMIS tasks. To\novercome this limitation, we focus on the cross-domain few-shot medical image\nsegmentation (CD-FSMIS) task, aiming to develop a generalized model capable of\nadapting to a broader range of medical image segmentation scenarios with\nlimited labeled data from the novel target domain. Inspired by the\ncharacteristics of frequency domain similarity across different domains, we\npropose a Frequency-aware Matching Network (FAMNet), which includes two key\ncomponents: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion\n(MSF) module. The FAM module tackles two problems during the meta-learning\nphase: 1) intra-domain variance caused by the inherent support-query bias, due\nto the different appearances of organs and lesions, and 2) inter-domain\nvariance caused by different medical imaging techniques. Additionally, we\ndesign an MSF module to integrate the different frequency features decoupled by\nthe FAM module, and further mitigate the impact of inter-domain variance on the\nmodel's segmentation performance. Combining these two modules, our FAMNet\nsurpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation\nmodels on three cross-domain datasets, achieving state-of-the-art performance\nin the CD-FSMIS task.\n","authors":["Yuntian Bo","Yazhou Zhu","Lunbo Li","Haofeng Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09319v1.pdf","comment":"Accepted by the 39th Annual AAAI Conference on Artificial\n Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2408.11447v3","updated":"2024-12-12T14:42:30Z","published":"2024-08-21T09:06:30Z","title":"GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation\n with Gaussian Splatting","summary":" We introduce GaussianOcc, a systematic method that investigates the two\nusages of Gaussian splatting for fully self-supervised and efficient 3D\noccupancy estimation in surround views. First, traditional methods for\nself-supervised 3D occupancy estimation still require ground truth 6D poses\nfrom sensors during training. To address this limitation, we propose Gaussian\nSplatting for Projection (GSP) module to provide accurate scale information for\nfully self-supervised training from adjacent view projection. Additionally,\nexisting methods rely on volume rendering for final 3D voxel representation\nlearning using 2D signals (depth maps, semantic maps), which is both\ntime-consuming and less effective. We propose Gaussian Splatting from Voxel\nspace (GSV) to leverage the fast rendering properties of Gaussian splatting. 
As\na result, the proposed GaussianOcc method enables fully self-supervised (no\nground truth pose) 3D occupancy estimation in competitive performance with low\ncomputational cost (2.7 times faster in training and 5 times faster in\nrendering). The relevant code is available in\nhttps://github.com/GANWANSHUI/GaussianOcc.git.\n","authors":["Wanshui Gan","Fang Liu","Hongbin Xu","Ningkai Mo","Naoto Yokoya"],"pdf_url":"https://arxiv.org/pdf/2408.11447v3.pdf","comment":"Project page: https://ganwanshui.github.io/GaussianOcc/"},{"id":"http://arxiv.org/abs/2412.09317v1","updated":"2024-12-12T14:42:10Z","published":"2024-12-12T14:42:10Z","title":"Multimodal Sentiment Analysis based on Video and Audio Inputs","summary":" Despite the abundance of current researches working on the sentiment analysis\nfrom videos and audios, finding the best model that gives the highest accuracy\nrate is still considered a challenge for researchers in this field. The main\nobjective of this paper is to prove the usability of emotion recognition models\nthat take video and audio inputs. The datasets used to train the models are the\nCREMA-D dataset for audio and the RAVDESS dataset for video. The fine-tuned\nmodels that been used are: Facebook/wav2vec2-large for audio and the\nGoogle/vivit-b-16x2-kinetics400 for video. The avarage of the probabilities for\neach emotion generated by the two previous models is utilized in the decision\nmaking framework. After disparity in the results, if one of the models gets\nmuch higher accuracy, another test framework is created. The methods used are\nthe Weighted Average method, the Confidence Level Threshold method, the Dynamic\nWeighting Based on Confidence method, and the Rule-Based Logic method. This\nlimited approach gives encouraging results that make future research into these\nmethods viable.\n","authors":["Antonio Fernandez","Suzan Awinat"],"pdf_url":"https://arxiv.org/pdf/2412.09317v1.pdf","comment":"Presented as a full paper in the 15th International Conference on\n Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2024) October\n 28-30, 2024, Leuven, Belgium"},{"id":"http://arxiv.org/abs/2408.08984v2","updated":"2024-12-12T14:37:28Z","published":"2024-08-16T19:25:19Z","title":"Fire Dynamic Vision: Image Segmentation and Tracking for Multi-Scale\n Fire and Plume Behavior","summary":" The increasing frequency and severity of wildfires highlight the need for\naccurate fire and plume spread models. We introduce an approach that\neffectively isolates and tracks fire and plume behavior across various spatial\nand temporal scales and image types, identifying physical phenomena in the\nsystem and providing insights useful for developing and validating models. Our\nmethod combines image segmentation and graph theory to delineate fire fronts\nand plume boundaries. We demonstrate that the method effectively distinguishes\nfires and plumes from visually similar objects. Results demonstrate the\nsuccessful isolation and tracking of fire and plume dynamics across various\nimage sources, ranging from synoptic-scale ($10^4$-$10^5$ m) satellite images\nto sub-microscale ($10^0$-$10^1$ m) images captured close to the fire\nenvironment. 
Furthermore, the methodology leverages image inpainting and\nspatio-temporal dataset generation for use in statistical and machine learning\nmodels.\n","authors":["Daryn Sagel","Bryan Quaife"],"pdf_url":"https://arxiv.org/pdf/2408.08984v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00727v2","updated":"2024-12-12T14:28:42Z","published":"2024-12-01T08:39:12Z","title":"Perturb and Recover: Fine-tuning for Effective Backdoor Removal from\n CLIP","summary":" Vision-Language models like CLIP have been shown to be highly effective at\nlinking visual perception and natural language understanding, enabling\nsophisticated image-text capabilities, including strong retrieval and zero-shot\nclassification performance. Their widespread use, as well as the fact that CLIP\nmodels are trained on image-text pairs from the web, make them both a\nworthwhile and relatively easy target for backdoor attacks. As training\nfoundational models, such as CLIP, from scratch is very expensive, this paper\nfocuses on cleaning potentially poisoned models via fine-tuning. We first show\nthat existing cleaning techniques are not effective against simple structured\ntriggers used in Blended or BadNet backdoor attacks, exposing a critical\nvulnerability for potential real-world deployment of these models. Then, we\nintroduce PAR, Perturb and Recover, a surprisingly simple yet effective\nmechanism to remove backdoors from CLIP models. Through extensive experiments\nacross different encoders and types of backdoor attacks, we show that PAR\nachieves high backdoor removal rate while preserving good standard performance.\nFinally, we illustrate that our approach is effective even only with synthetic\ntext-image pairs, i.e. without access to real training data. The code and\nmodels are available at https://github.com/nmndeep/PerturbAndRecover.\n","authors":["Naman Deep Singh","Francesco Croce","Matthias Hein"],"pdf_url":"https://arxiv.org/pdf/2412.00727v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09311v1","updated":"2024-12-12T14:25:56Z","published":"2024-12-12T14:25:56Z","title":"Advancing Attribution-Based Neural Network Explainability through\n Relative Absolute Magnitude Layer-Wise Relevance Propagation and\n Multi-Component Evaluation","summary":" Recent advancement in deep-neural network performance led to the development\nof new state-of-the-art approaches in numerous areas. However, the black-box\nnature of neural networks often prohibits their use in areas where model\nexplainability and model transparency are crucial. Over the years, researchers\nproposed many algorithms to aid neural network understanding and provide\nadditional information to the human expert. One of the most popular methods\nbeing Layer-Wise Relevance Propagation (LRP). This method assigns local\nrelevance based on the pixel-wise decomposition of nonlinear classifiers. With\nthe rise of attribution method research, there has emerged a pressing need to\nassess and evaluate their performance. Numerous metrics have been proposed,\neach assessing an individual property of attribution methods such as\nfaithfulness, robustness or localization. Unfortunately, no single metric is\ndeemed optimal for every case, and researchers often use several metrics to\ntest the quality of the attribution maps. In this work, we address the\nshortcomings of the current LRP formulations and introduce a novel method for\ndetermining the relevance of input neurons through layer-wise relevance\npropagation. 
Furthermore, we apply this approach to the recently developed\nVision Transformer architecture and evaluate its performance against existing\nmethods on two image classification datasets, namely ImageNet and PascalVOC.\nOur results clearly demonstrate the advantage of our proposed method.\nFurthermore, we discuss the insufficiencies of current evaluation metrics for\nattribution-based explainability and propose a new evaluation metric that\ncombines the notions of faithfulness, robustness and contrastiveness. We\nutilize this new metric to evaluate the performance of various\nattribution-based methods. Our code is available at:\nhttps://github.com/davor10105/relative-absolute-magnitude-propagation\n","authors":["Davor Vukadin","Petar Afrić","Marin Šilić","Goran Delač"],"pdf_url":"https://arxiv.org/pdf/2412.09311v1.pdf","comment":"30 pages, 16 figures, 13 tables, ACM Transactions on Intelligence\n Systems and Technology"},{"id":"http://arxiv.org/abs/2412.09296v1","updated":"2024-12-12T14:12:07Z","published":"2024-12-12T14:12:07Z","title":"GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with\n Rhythmic Poses and Realistic Expression","summary":" Audio-driven talking head generation necessitates seamless integration of\naudio and visual data amidst the challenges posed by diverse input portraits\nand intricate correlations between audio and facial motions. In response, we\npropose a robust framework GoHD designed to produce highly realistic,\nexpressive, and controllable portrait videos from any reference identity with\nany motion. GoHD innovates with three key modules: Firstly, an animation module\nutilizing latent navigation is introduced to improve the generalization ability\nacross unseen input styles. This module achieves high disentanglement of motion\nand identity, and it also incorporates gaze orientation to rectify unnatural\neye movements that were previously overlooked. Secondly, a conformer-structured\nconditional diffusion model is designed to guarantee head poses that are aware\nof prosody. Thirdly, to estimate lip-synchronized and realistic expressions\nfrom the input audio within limited training data, a two-stage training\nstrategy is devised to decouple frequent and frame-wise lip motion distillation\nfrom the generation of other more temporally dependent but less audio-related\nmotions, e.g., blinks and frowns. Extensive experiments validate GoHD's\nadvanced generalization capabilities, demonstrating its effectiveness in\ngenerating realistic talking face results on arbitrary subjects.\n","authors":["Ziqi Zhou","Weize Quan","Hailin Shi","Wei Li","Lili Wang","Dong-ming Yan"],"pdf_url":"https://arxiv.org/pdf/2412.09296v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.07767v2","updated":"2024-12-12T14:10:43Z","published":"2024-12-10T18:59:31Z","title":"Learning Visual Generative Priors without Text","summary":" Although text-to-image (T2I) models have recently thrived as visual\ngenerative priors, their reliance on high-quality text-image pairs makes\nscaling up expensive. We argue that grasping the cross-modality alignment is\nnot a necessity for a sound visual generative prior, whose focus should be on\ntexture modeling. Such a philosophy inspires us to study image-to-image (I2I)\ngeneration, where models can learn from in-the-wild images in a self-supervised\nmanner. We first develop a pure vision-based training framework, Lumos, and\nconfirm the feasibility and the scalability of learning I2I models. 
We then\nfind that, as an upstream task of T2I, our I2I model serves as a more\nfoundational visual prior and achieves on-par or better performance than\nexisting T2I models using only 1/10 text-image pairs for fine-tuning. We\nfurther demonstrate the superiority of I2I priors over T2I priors on some\ntext-irrelevant visual generative tasks, like image-to-3D and image-to-video.\nOur project page is available at https://xiaomabufei.github.io/lumos.\n","authors":["Shuailei Ma","Kecheng Zheng","Ying Wei","Wei Wu","Fan Lu","Yifei Zhang","Chen-Wei Xie","Biao Gong","Jiapeng Zhu","Yujun Shen"],"pdf_url":"https://arxiv.org/pdf/2412.07767v2.pdf","comment":"Project Page: https://xiaomabufei.github.io/lumos"},{"id":"http://arxiv.org/abs/2412.09283v1","updated":"2024-12-12T13:48:40Z","published":"2024-12-12T13:48:40Z","title":"InstanceCap: Improving Text-to-Video Generation via Instance-aware\n Structured Caption","summary":" Text-to-video generation has evolved rapidly in recent years, delivering\nremarkable results. Training typically relies on video-caption paired data,\nwhich plays a crucial role in enhancing generation performance. However,\ncurrent video captions often suffer from insufficient details, hallucinations\nand imprecise motion depiction, affecting the fidelity and consistency of\ngenerated videos. In this work, we propose a novel instance-aware structured\ncaption framework, termed InstanceCap, to achieve instance-level and\nfine-grained video caption for the first time. Based on this scheme, we design\nan auxiliary models cluster to convert original video into instances to enhance\ninstance fidelity. Video instances are further used to refine dense prompts\ninto structured phrases, achieving concise yet precise descriptions.\nFurthermore, a 22K InstanceVid dataset is curated for training, and an\nenhancement pipeline that tailored to InstanceCap structure is proposed for\ninference. Experimental results demonstrate that our proposed InstanceCap\nsignificantly outperform previous models, ensuring high fidelity between\ncaptions and videos while reducing hallucinations.\n","authors":["Tiehan Fan","Kepan Nan","Rui Xie","Penghao Zhou","Zhenheng Yang","Chaoyou Fu","Xiang Li","Jian Yang","Ying Tai"],"pdf_url":"https://arxiv.org/pdf/2412.09283v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09278v1","updated":"2024-12-12T13:41:35Z","published":"2024-12-12T13:41:35Z","title":"Towards a Multimodal Large Language Model with Pixel-Level Insight for\n Biomedicine","summary":" In recent years, Multimodal Large Language Models (MLLM) have achieved\nnotable advancements, demonstrating the feasibility of developing an\nintelligent biomedical assistant. However, current biomedical MLLMs\npredominantly focus on image-level understanding and restrict interactions to\ntextual commands, thus limiting their capability boundaries and the flexibility\nof usage. In this paper, we introduce a novel end-to-end multimodal large\nlanguage model for the biomedical domain, named MedPLIB, which possesses\npixel-level understanding. Excitingly, it supports visual question answering\n(VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form\nshapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE)\nmulti-stage training strategy, which divides MoE into separate training phases\nfor a visual-language expert model and a pixel-grounding expert model, followed\nby fine-tuning using MoE. 
This strategy effectively coordinates multitask\nlearning while maintaining the computational cost at inference equivalent to\nthat of a single expert model. To advance the research of biomedical MLLMs, we\nintroduce the Medical Complex Vision Question Answering Dataset (MeCoVQA),\nwhich comprises an array of 8 modalities for complex medical imaging question\nanswering and image region understanding. Experimental results indicate that\nMedPLIB has achieved state-of-the-art outcomes across multiple medical visual\nlanguage tasks. More importantly, in zero-shot evaluations for the pixel\ngrounding task, MedPLIB leads the best small and large models by margins of\n19.7 and 15.6 respectively on the mDice metric. The codes, data, and model\ncheckpoints will be made publicly available at\nhttps://github.com/ShawnHuang497/MedPLIB.\n","authors":["Xiaoshuang Huang","Lingdong Shen","Jia Liu","Fangxin Shang","Hongxiang Li","Haifeng Huang","Yehui Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09278v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.09276v1","updated":"2024-12-12T13:40:59Z","published":"2024-12-12T13:40:59Z","title":"Text-Video Multi-Grained Integration for Video Moment Montage","summary":" The proliferation of online short video platforms has driven a surge in user\ndemand for short video editing. However, manually selecting, cropping, and\nassembling raw footage into a coherent, high-quality video remains laborious\nand time-consuming. To accelerate this process, we focus on a user-friendly new\ntask called Video Moment Montage (VMM), which aims to accurately locate the\ncorresponding video segments based on a pre-provided narration text and then\narrange these video clips to create a complete video that aligns with the\ncorresponding descriptions. The challenge lies in extracting precise temporal\nsegments while ensuring intra-sentence and inter-sentence context consistency,\nas a single script sentence may require trimming and assembling multiple video\nclips. To address this problem, we present a novel \\textit{Text-Video\nMulti-Grained Integration} method (TV-MGI) that efficiently fuses text features\nfrom the script with both shot-level and frame-level video features, which\nenables the global and fine-grained alignment between the video content and the\ncorresponding textual descriptions in the script. To facilitate further\nresearch in this area, we introduce the Multiple Sentences with Shots Dataset\n(MSSD), a large-scale dataset designed explicitly for the VMM task. We conduct\nextensive experiments on the MSSD dataset to demonstrate the effectiveness of\nour framework compared to baseline methods.\n","authors":["Zhihui Yin","Ye Ma","Xipeng Cao","Bo Wang","Quan Chen","Peng Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.09276v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09262v1","updated":"2024-12-12T13:20:52Z","published":"2024-12-12T13:20:52Z","title":"LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync","summary":" We present LatentSync, an end-to-end lip sync framework based on audio\nconditioned latent diffusion models without any intermediate motion\nrepresentation, diverging from previous diffusion-based lip sync methods based\non pixel space diffusion or two-stage generation. Our framework can leverage\nthe powerful capabilities of Stable Diffusion to directly model complex\naudio-visual correlations. 
Additionally, we found that the diffusion-based lip\nsync methods exhibit inferior temporal consistency due to the inconsistency in\nthe diffusion process across different frames. We propose Temporal\nREPresentation Alignment (TREPA) to enhance temporal consistency while\npreserving lip-sync accuracy. TREPA uses temporal representations extracted by\nlarge-scale self-supervised video models to align the generated frames with the\nground truth frames. Furthermore, we observe the commonly encountered SyncNet\nconvergence issue and conduct comprehensive empirical studies, identifying key\nfactors affecting SyncNet convergence in terms of model architecture, training\nhyperparameters, and data preprocessing methods. We significantly improve the\naccuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not\nchange the overall training framework of SyncNet, our experience can also be\napplied to other lip sync and audio-driven portrait animation methods that\nutilize SyncNet. Based on the above innovations, our method outperforms\nstate-of-the-art lip sync methods across various metrics on the HDTF and\nVoxCeleb2 datasets.\n","authors":["Chunyu Li","Chao Zhang","Weikai Xu","Jinghui Xie","Weiguo Feng","Bingyue Peng","Weiwei Xing"],"pdf_url":"https://arxiv.org/pdf/2412.09262v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09258v1","updated":"2024-12-12T13:19:05Z","published":"2024-12-12T13:19:05Z","title":"FD2-Net: Frequency-Driven Feature Decomposition Network for\n Infrared-Visible Object Detection","summary":" Infrared-visible object detection (IVOD) seeks to harness the complementary\ninformation in infrared and visible images, thereby enhancing the performance\nof detectors in complex environments. However, existing methods often neglect\nthe frequency characteristics of complementary information, such as the\nabundant high-frequency details in visible images and the valuable\nlow-frequency thermal information in infrared images, thus constraining\ndetection performance. To solve this problem, we introduce a novel\nFrequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which\neffectively captures the unique frequency representations of complementary\ninformation across multimodal visual spaces. Specifically, we propose a feature\ndecomposition encoder, wherein the high-frequency unit (HFU) utilizes discrete\ncosine transform to capture representative high-frequency features, while the\nlow-frequency unit (LFU) employs dynamic receptive fields to model the\nmulti-scale context of diverse objects. Next, we adopt a parameter-free\ncomplementary strengths strategy to enhance multimodal features through\nseamless inter-frequency recoupling. Furthermore, we innovatively design a\nmultimodal reconstruction mechanism that recovers image details lost during\nfeature extraction, further leveraging the complementary information from\ninfrared and visible images to enhance overall representational capacity.\nExtensive experiments demonstrate that FD2-Net outperforms state-of-the-art\n(SOTA) models across various IVOD benchmarks, i.e. 
LLVIP (96.2% mAP), FLIR\n(82.9% mAP), and M3FD (83.5% mAP).\n","authors":["Ke Li","Di Wang","Zhangyuan Hu","Shaofeng Li","Weiping Ni","Lin Zhao","Quan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09258v1.pdf","comment":"This work is accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2407.10485v3","updated":"2024-12-12T13:11:11Z","published":"2024-07-15T07:13:27Z","title":"MM-Tracker: Motion Mamba with Margin Loss for UAV-platform Multiple\n Object Tracking","summary":" Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms\nrequires efficient motion modeling. This is because UAV-MOT faces both local\nobject motion and global camera motion. Motion blur also increases the\ndifficulty of detecting large moving objects. Previous UAV motion modeling\napproaches either focus only on local motion or ignore motion blurring effects,\nthus limiting their tracking performance and speed. To address these issues, we\npropose the Motion Mamba Module, which explores both local and global motion\nfeatures through cross-correlation and bi-directional Mamba Modules for better\nmotion modeling. To address the detection difficulties caused by motion blur,\nwe also design motion margin loss to effectively improve the detection accuracy\nof motion blurred objects. Based on the Motion Mamba module and motion margin\nloss, our proposed MM-Tracker surpasses the state-of-the-art in two widely\nopen-source UAV-MOT datasets. Code will be available.\n","authors":["Mufeng Yao","Jinlong Peng","Qingdong He","Bo Peng","Hao Chen","Mingmin Chi","Chao Liu","Jon Atli Benediktsson"],"pdf_url":"https://arxiv.org/pdf/2407.10485v3.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2403.16970v4","updated":"2024-12-12T13:06:47Z","published":"2024-03-25T17:31:12Z","title":"A Multi-Stage Framework for Joint Chest X-Ray Diagnosis and Visual\n Attention Prediction Using Deep Learning","summary":" Purpose: As visual inspection is an inherent process during radiological\nscreening, the associated eye gaze data can provide valuable insights into\nrelevant clinical decisions. As deep learning has become the state-of-the-art\nfor computer-assisted diagnosis, integrating human behavior, such as eye gaze\ndata, into these systems is instrumental to help align machine predictions with\nclinical diagnostic criteria, thus enhancing the quality of automatic\nradiological diagnosis. Methods: We propose a novel deep learning framework for\njoint disease diagnosis and prediction of corresponding clinical visual\nattention maps for chest X-ray scans. Specifically, we introduce a new\ndual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a\nResidual and Squeeze-and-Excitation block-based encoder to extract diverse\nfeatures for visual attention map prediction, and a multi-scale feature-fusion\nclassifier to perform disease classification. To tackle the issue of\nasynchronous training schedules of individual tasks in multi-task learning, we\nproposed a multi-stage cooperative learning strategy, with contrastive learning\nfor feature encoder pretraining to boost performance. Results: Our proposed\nmethod is shown to significantly outperform existing techniques for chest X-ray\ndiagnosis (AUC=0.93) and the quality of visual attention map prediction\n(Correlation coefficient=0.58). 
Conclusion: Benefiting from the proposed\nmulti-task multi-stage cooperative learning, our technique demonstrates the\nbenefit of integrating clinicians' eye gaze into clinical AI systems to boost\nperformance and potentially explainability.\n","authors":["Zirui Qiu","Hassan Rivaz","Yiming Xiao"],"pdf_url":"https://arxiv.org/pdf/2403.16970v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.13082v3","updated":"2024-12-12T13:02:17Z","published":"2024-05-21T06:44:40Z","title":"A Survey of Artificial Intelligence in Gait-Based Neurodegenerative\n Disease Diagnosis","summary":" Recent years have witnessed an increasing global population affected by\nneurodegenerative diseases (NDs), which traditionally require extensive\nhealthcare resources and human effort for medical diagnosis and monitoring. As\na crucial disease-related motor symptom, human gait can be exploited to\ncharacterize different NDs. The current advances in artificial intelligence\n(AI) models enable automatic gait analysis for NDs identification and\nclassification, opening a new avenue to facilitate faster and more\ncost-effective diagnosis of NDs. In this paper, we provide a comprehensive\nsurvey on recent progress of machine learning and deep learning based AI\ntechniques applied to diagnosis of five typical NDs through gait. We provide an\noverview of the process of AI-assisted NDs diagnosis, and present a systematic\ntaxonomy of existing gait data and AI models. Meanwhile, a novel quality\nevaluation criterion is proposed to quantitatively assess the quality of\nexisting studies. Through an extensive review and analysis of 169 studies, we\npresent recent technical advancements, discuss existing challenges, potential\nsolutions, and future directions in this field. Finally, we envision the\nprospective utilization of 3D skeleton data for human gait representation and\nthe development of more efficient AI models for NDs diagnosis.\n","authors":["Haocong Rao","Minlin Zeng","Xuejiao Zhao","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2405.13082v3.pdf","comment":"Article: 57 pages, citing 290 papers. Appendix: 30 pages. A\n up-to-date resource (papers, data, etc.) of this survey (AI4NDD) is provided\n at https://github.com/minlinzeng/AI4NDD-Survey"},{"id":"http://arxiv.org/abs/2412.09240v1","updated":"2024-12-12T12:49:42Z","published":"2024-12-12T12:49:42Z","title":"VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation\n with Unsupervised Domain Adaptation","summary":" Segmentation models are typically constrained by the categories defined\nduring training. To address this, researchers have explored two independent\napproaches: adapting Vision-Language Models (VLMs) and leveraging synthetic\ndata. However, VLMs often struggle with granularity, failing to disentangle\nfine-grained concepts, while synthetic data-based methods remain limited by the\nscope of available datasets.\n This paper proposes enhancing segmentation accuracy across diverse domains by\nintegrating Vision-Language reasoning with key strategies for Unsupervised\nDomain Adaptation (UDA). First, we improve the fine-grained segmentation\ncapabilities of VLMs through multi-scale contextual data, robust text\nembeddings with prompt augmentation, and layer-wise fine-tuning in our proposed\nFoundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS)\nframework. 
Next, we incorporate these enhancements into a UDA framework by\nemploying distillation to stabilize training and cross-domain mixed sampling to\nboost adaptability without compromising generalization. The resulting\nUDA-FROVSS framework is the first UDA approach to effectively adapt across\ndomains without requiring shared categories.\n","authors":["Roberto Alcover-Couso","Marcos Escudero-Viñolo","Juan C. SanMiguel","Jesus Bescos"],"pdf_url":"https://arxiv.org/pdf/2412.09240v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09230v1","updated":"2024-12-12T12:39:07Z","published":"2024-12-12T12:39:07Z","title":"Foundation Models and Adaptive Feature Selection: A Synergistic Approach\n to Video Question Answering","summary":" This paper tackles the intricate challenge of video question-answering\n(VideoQA). Despite notable progress, current methods fall short of effectively\nintegrating questions with video frames and semantic object-level abstractions\nto create question-aware video representations. We introduce Local-Global\nQuestion Aware Video Embedding (LGQAVE), which incorporates three major\ninnovations to integrate multi-modal knowledge better and emphasize semantic\nvisual concepts relevant to specific questions. LGQAVE moves beyond traditional\nad-hoc frame sampling by utilizing a cross-attention mechanism that precisely\nidentifies the most relevant frames concerning the questions. It captures the\ndynamics of objects within these frames using distinct graphs, grounding them\nin question semantics with the miniGPT model. These graphs are processed by a\nquestion-aware dynamic graph transformer (Q-DGT), which refines the outputs to\ndevelop nuanced global and local video representations. An additional\ncross-attention module integrates these local and global embeddings to generate\nthe final video embeddings, which a language model uses to generate answers.\nExtensive evaluations across multiple benchmarks demonstrate that LGQAVE\nsignificantly outperforms existing models in delivering accurate multi-choice\nand open-ended answers.\n","authors":["Sai Bhargav Rongali","Mohamad Hassan N C","Ankit Jha","Neha Bhargava","Saurabh Prasad","Biplab Banerjee"],"pdf_url":"https://arxiv.org/pdf/2412.09230v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09229v1","updated":"2024-12-12T12:38:33Z","published":"2024-12-12T12:38:33Z","title":"UADet: A Remarkably Simple Yet Effective Uncertainty-Aware Open-Set\n Object Detection Framework","summary":" We tackle the challenging problem of Open-Set Object Detection (OSOD), which\naims to detect both known and unknown objects in unlabelled images. The main\ndifficulty arises from the absence of supervision for these unknown classes,\nmaking it challenging to distinguish them from the background. Existing OSOD\ndetectors either fail to properly exploit or inadequately leverage the abundant\nunlabeled unknown objects in training data, restricting their performance. To\naddress these limitations, we propose UADet, an Uncertainty-Aware Open-Set\nObject Detector that considers appearance and geometric uncertainty. By\nintegrating these uncertainty measures, UADet effectively reduces the number of\nunannotated instances incorrectly utilized or omitted by previous methods.\nExtensive experiments on OSOD benchmarks demonstrate that UADet substantially\noutperforms previous state-of-the-art (SOTA) methods in detecting both known\nand unknown objects, achieving a 1.8x improvement in unknown recall while\nmaintaining high performance on known classes. 
When extended to Open World\nObject Detection (OWOD), our method shows significant advantages over the\ncurrent SOTA method, with average improvements of 13.8% and 6.9% in unknown\nrecall on M-OWODB and S-OWODB benchmarks, respectively. Extensive results\nvalidate the effectiveness of our uncertainty-aware approach across different\nopen-set scenarios.\n","authors":["Silin Cheng","Yuanpei Liu","Kai Han"],"pdf_url":"https://arxiv.org/pdf/2412.09229v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2407.00574v2","updated":"2024-12-12T12:37:32Z","published":"2024-06-30T03:31:21Z","title":"Humans as Checkerboards: Calibrating Camera Motion Scale for\n World-Coordinate Human Mesh Recovery","summary":" Accurate camera motion estimation is essential for recovering global human\nmotion in world coordinates from RGB video inputs. SLAM is widely used for\nestimating camera trajectory and point cloud, but monocular SLAM does so only\nup to an unknown scale factor. Previous works estimate the scale factor through\noptimization, but this is unreliable and time-consuming. This paper presents an\noptimization-free scale calibration framework, Human as Checkerboard (HAC). HAC\ninnovatively leverages the human body predicted by human mesh recovery model as\na calibration reference. Specifically, it uses the absolute depth of\nhuman-scene contact joints as references to calibrate the corresponding\nrelative scene depth from SLAM. HAC benefits from geometric priors encoded in\nhuman mesh recovery models to estimate the SLAM scale and achieves precise\nglobal human motion estimation. Simple yet powerful, our method sets a new\nstate-of-the-art performance for global human mesh estimation tasks, reducing\nmotion errors by 50% over prior local-to-global methods while using 100$\\times$\nless inference time than optimization-based methods. Project page:\nhttps://martayang.github.io/HAC.\n","authors":["Fengyuan Yang","Kerui Gu","Ha Linh Nguyen","Tze Ho Elden Tse","Angela Yao"],"pdf_url":"https://arxiv.org/pdf/2407.00574v2.pdf","comment":"13 pages, 11 figures, 6 tables"},{"id":"http://arxiv.org/abs/2108.10201v4","updated":"2024-12-12T12:28:39Z","published":"2021-08-23T14:37:58Z","title":"Improving generative adversarial network inversion via fine-tuning GAN\n encoders","summary":" Generative adversarial networks (GANs) can synthesize high-quality (HQ)\nimages, and GAN inversion is a technique that discovers how to invert given\nimages back to latent space. While existing methods perform on StyleGAN\ninversion, they have limited performance and are not generalized to different\nGANs. To address these issues, we proposed a self-supervised method to\npre-train and fine-tune GAN encoders. First, we designed an adaptive block to\nfit different encoder architectures for inverting diverse GANs. Then we\npre-train GAN encoders using synthesized images and emphasize local regions\nthrough cropping images. Finally, we fine-tune the pre-trained GAN encoder for\ninverting real images. Compared with state-of-the-art methods, our method\nachieved better results that reconstructed high-quality images on mainstream\nGANs. 
Our code and pre-trained models are available at:\nhttps://github.com/disanda/Deep-GAN-Encoders.\n","authors":["Cheng Yu","Wenmin Wang","Roberto Bugiolacchi"],"pdf_url":"https://arxiv.org/pdf/2108.10201v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09224v1","updated":"2024-12-12T12:26:08Z","published":"2024-12-12T12:26:08Z","title":"DASK: Distribution Rehearsing via Adaptive Style Kernel Learning for\n Exemplar-Free Lifelong Person Re-Identification","summary":" Lifelong person re-identification (LReID) is an important but challenging\ntask that suffers from catastrophic forgetting due to significant domain gaps\nbetween training steps. Existing LReID approaches typically rely on data replay\nand knowledge distillation to mitigate this issue. However, data replay methods\ncompromise data privacy by storing historical exemplars, while knowledge\ndistillation methods suffer from limited performance due to the cumulative\nforgetting of undistilled knowledge. To overcome these challenges, we propose a\nnovel paradigm that models and rehearses the distribution of the old domains to\nenhance knowledge consolidation during the new data learning, possessing a\nstrong anti-forgetting capacity without storing any exemplars. Specifically, we\nintroduce an exemplar-free LReID method called Distribution Rehearsing via\nAdaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser\nLearning mechanism that learns to transform arbitrary distribution data into\nthe current data style at each learning step. To enhance the style transfer\ncapacity of DRL, an Adaptive Kernel Prediction network is explored to achieve\nan instance-specific distribution adjustment. Additionally, we design a\nDistribution Rehearsing-driven LReID Training module, which rehearses old\ndistribution based on the new data via the old AKPNet model, achieving\neffective new-old knowledge accumulation under a joint knowledge consolidation\nscheme. Experimental results show our DASK outperforms the existing methods by\n3.6%-6.8% and 4.5%-6.5% on anti-forgetting and generalization capacity,\nrespectively. Our code is available at\nhttps://github.com/zhoujiahuan1991/AAAI2025-DASK\n","authors":["Kunlun Xu","Chenghao Jiang","Peixi Xiong","Yuxin Peng","Jiahuan Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.09224v1.pdf","comment":"in Proceedings of the 39th AAAI Conference on Artificial Intelligence\n (AAAI-25)"},{"id":"http://arxiv.org/abs/2412.09220v1","updated":"2024-12-12T12:20:27Z","published":"2024-12-12T12:20:27Z","title":"USDRL: Unified Skeleton-Based Dense Representation Learning with\n Multi-Grained Feature Decorrelation","summary":" Contrastive learning has achieved great success in skeleton-based\nrepresentation learning recently. However, the prevailing methods are\npredominantly negative-based, necessitating additional momentum encoder and\nmemory bank to get negative samples, which increases the difficulty of model\ntraining. 
Furthermore, these methods primarily concentrate on learning a global\nrepresentation for recognition and retrieval tasks, while overlooking the rich\nand detailed local representations that are crucial for dense prediction tasks.\nTo alleviate these issues, we introduce a Unified Skeleton-based Dense\nRepresentation Learning framework based on feature decorrelation, called USDRL,\nwhich employs feature decorrelation across temporal, spatial, and instance\ndomains in a multi-grained manner to reduce redundancy among dimensions of the\nrepresentations to maximize information extraction from features. Additionally,\nwe design a Dense Spatio-Temporal Encoder (DSTE) to capture fine-grained action\nrepresentations effectively, thereby enhancing the performance of dense\nprediction tasks. Comprehensive experiments, conducted on the benchmarks\nNTU-60, NTU-120, PKU-MMD I, and PKU-MMD II, across diverse downstream tasks\nincluding action recognition, action retrieval, and action detection,\nconclusively demonstrate that our approach significantly outperforms the\ncurrent state-of-the-art (SOTA) approaches. Our code and models are available\nat https://github.com/wengwanjiang/USDRL.\n","authors":["Wanjiang Weng","Hongsong Wang","Junbo He","Lei He","Guosen Xie"],"pdf_url":"https://arxiv.org/pdf/2412.09220v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2410.08926v2","updated":"2024-12-12T12:18:39Z","published":"2024-10-11T15:50:53Z","title":"Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million\n Images","summary":" We explore the transformative potential of SAM 2, a vision foundation model,\nin advancing gaze estimation and eye tracking technologies. By significantly\nreducing annotation time, lowering technical barriers through its ease of\ndeployment, and enhancing segmentation accuracy, SAM 2 addresses critical\nchallenges faced by researchers and practitioners. Utilizing its zero-shot\nsegmentation capabilities with minimal user input-a single click per video-we\ntested SAM 2 on over 14 million eye images from diverse datasets, including\nvirtual reality setups and the world's largest unified dataset recorded using\nwearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches\nthe performance of domain-specific models trained solely on eye images,\nachieving competitive mean Intersection over Union (mIoU) scores of up to 93%\nwithout fine-tuning. Additionally, we provide our code and segmentation masks\nfor these widely used datasets to promote further research.\n","authors":["Virmarie Maquiling","Sean Anthony Byrne","Diederick C. Niehorster","Marco Carminati","Enkelejda Kasneci"],"pdf_url":"https://arxiv.org/pdf/2410.08926v2.pdf","comment":"Virmarie Maquiling and Sean Anthony Byrne contributed equally to this\n paper, 8 pages, 3 figures, CHI Case Study, pre-print"},{"id":"http://arxiv.org/abs/2412.09213v1","updated":"2024-12-12T12:08:27Z","published":"2024-12-12T12:08:27Z","title":"Enhancing Implicit Neural Representations via Symmetric Power\n Transformation","summary":" We propose symmetric power transformation to enhance the capacity of Implicit\nNeural Representation~(INR) from the perspective of data transformation. Unlike\nprior work utilizing random permutation or index rearrangement, our method\nfeatures a reversible operation that does not require additional storage\nconsumption. 
Specifically, we first investigate the characteristics of data\nthat can benefit the training of INR, proposing the Range-Defined Symmetric\nHypothesis, which posits that specific range and symmetry can improve the\nexpressive ability of INR. Based on this hypothesis, we propose a nonlinear\nsymmetric power transformation to achieve both range-defined and symmetric\nproperties simultaneously. We use the power coefficient to redistribute data to\napproximate symmetry within the target range. To improve the robustness of the\ntransformation, we further design deviation-aware calibration and adaptive soft\nboundary to address issues of extreme deviation boosting and continuity\nbreaking. Extensive experiments are conducted to verify the performance of the\nproposed method, demonstrating that our transformation can reliably improve INR\ncompared with other data transformations. We also conduct 1D audio, 2D image\nand 3D video fitting tasks to demonstrate the effectiveness and applicability\nof our method.\n","authors":["Weixiang Zhang","Shuzhao Xie","Chengwei Ren","Shijia Ge","Mingzi Wang","Zhi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09213v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2307.03270v2","updated":"2024-12-12T12:05:25Z","published":"2023-07-04T08:29:59Z","title":"A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony\n in Talking Head Generation","summary":" Animating still face images with deep generative models using a speech input\nsignal is an active research topic and has seen important recent\nprogress.However, much of the effort has been put into lip syncing and\nrendering quality while the generation of natural head motion, let alone the\naudio-visual correlation between head motion and speech, has often been\nneglected.In this work, we propose a multi-scale audio-visual synchrony loss\nand a multi-scale autoregressive GAN to better handle short and long-term\ncorrelation between speech and the dynamics of the head and lips.In particular,\nwe train a stack of syncer models on multimodal input pyramids and use these\nmodels as guidance in a multi-scale generator network to produce audio-aligned\nmotion unfolding over diverse time scales.Both the pyramid of audio-visual\nsyncers and the generative models are trained in a low-dimensional space that\nfully preserves dynamics cues.The experiments show significant improvements\nover the state-of-the-art in head motion dynamics quality and especially in\nmulti-scale audio-visual synchrony on a collection of benchmark datasets.\n","authors":["Louis Airale","Dominique Vaufreydaz","Xavier Alameda-Pineda"],"pdf_url":"https://arxiv.org/pdf/2307.03270v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09209v1","updated":"2024-12-12T12:02:23Z","published":"2024-12-12T12:02:23Z","title":"eCARLA-scenes: A synthetically generated dataset for event-based optical\n flow prediction","summary":" The joint use of event-based vision and Spiking Neural Networks (SNNs) is\nexpected to have a large impact in robotics in the near future, in tasks such\nas, visual odometry and obstacle avoidance. While researchers have used\nreal-world event datasets for optical flow prediction (mostly captured with\nUnmanned Aerial Vehicles (UAVs)), these datasets are limited in diversity,\nscalability, and are challenging to collect. Thus, synthetic datasets offer a\nscalable alternative by bridging the gap between reality and simulation. 
In\nthis work, we address the lack of datasets by introducing eWiz, a comprehensive\nlibrary for processing event-based data. It includes tools for data loading,\naugmentation, visualization, encoding, and generation of training data, along\nwith loss functions and performance metrics. We further present a synthetic\nevent-based datasets and data generation pipelines for optical flow prediction\ntasks. Built on top of eWiz, eCARLA-scenes makes use of the CARLA simulator to\nsimulate self-driving car scenarios. The ultimate goal of this dataset is the\ndepiction of diverse environments while laying a foundation for advancing\nevent-based camera applications in autonomous field vehicle navigation, paving\nthe way for using SNNs on neuromorphic hardware such as the Intel Loihi.\n","authors":["Jad Mansour","Hayat Rajani","Rafael Garcia","Nuno Gracias"],"pdf_url":"https://arxiv.org/pdf/2412.09209v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09202v1","updated":"2024-12-12T11:56:24Z","published":"2024-12-12T11:56:24Z","title":"Temporal Action Localization with Cross Layer Task Decoupling and\n Refinement","summary":" Temporal action localization (TAL) involves dual tasks to classify and\nlocalize actions within untrimmed videos. However, the two tasks often have\nconflicting requirements for features. Existing methods typically employ\nseparate heads for classification and localization tasks but share the same\ninput feature, leading to suboptimal performance. To address this issue, we\npropose a novel TAL method with Cross Layer Task Decoupling and Refinement\n(CLTDR). Based on the feature pyramid of video, CLTDR strategy integrates\nsemantically strong features from higher pyramid layers and detailed\nboundary-aware boundary features from lower pyramid layers to effectively\ndisentangle the action classification and localization tasks. Moreover, the\nmultiple features from cross layers are also employed to refine and align the\ndisentangled classification and regression results. At last, a lightweight\nGated Multi-Granularity (GMG) module is proposed to comprehensively extract and\naggregate video features at instant, local, and global temporal granularities.\nBenefiting from the CLTDR and GMG modules, our method achieves state-of-the-art\nperformance on five challenging benchmarks: THUMOS14, MultiTHUMOS,\nEPIC-KITCHENS-100, ActivityNet-1.3, and HACS. 
Our code and pre-trained models\nare publicly available at: https://github.com/LiQiang0307/CLTDR-GMG.\n","authors":["Qiang Li","Di Liu","Jun Kong","Sen Li","Hui Xu","Jianzhong Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09202v1.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2412.09200v1","updated":"2024-12-12T11:53:19Z","published":"2024-12-12T11:53:19Z","title":"Accuracy Improvements for Convolutional and Differential Distance\n Function Approximations","summary":" Given a bounded domain, we deal with the problem of estimating the distance\nfunction from the internal points of the domain to the boundary of the domain.\nConvolutional and differential distance estimation schemes are considered and,\nfor both the schemes, accuracy improvements are proposed and evaluated.\nAsymptotics of Laplace integrals and Taylor series extrapolations are used to\nachieve the improvements.\n","authors":["Alexander Belyaev","Pierre-Alain Fayolle"],"pdf_url":"https://arxiv.org/pdf/2412.09200v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09199v1","updated":"2024-12-12T11:49:18Z","published":"2024-12-12T11:49:18Z","title":"MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place\n Recognition","summary":" Visual Place Recognition (VPR) aims to robustly identify locations by\nleveraging image retrieval based on descriptors encoded from environmental\nimages. However, drastic appearance changes of images captured from different\nviewpoints at the same location pose incoherent supervision signals for\ndescriptor learning, which severely hinder the performance of VPR. Previous\nwork proposes classifying images based on manually defined rules or ground\ntruth labels for viewpoints, followed by descriptor training based on the\nclassification results. However, not all datasets have ground truth labels of\nviewpoints and manually defined rules may be suboptimal, leading to degraded\ndescriptor performance.To address these challenges, we introduce the mutual\nlearning of viewpoint self-classification and VPR. Starting from coarse\nclassification based on geographical coordinates, we progress to finer\nclassification of viewpoints using simple clustering techniques. The dataset is\npartitioned in an unsupervised manner while simultaneously training a\ndescriptor extractor for place recognition. Experimental results show that this\napproach almost perfectly partitions the dataset based on viewpoints, thus\nachieving mutually reinforcing effects. Our method even excels state-of-the-art\n(SOTA) methods that partition datasets using ground truth labels.\n","authors":["Qiwen Gu","Xufei Wang","Fenglin Zhang","Junqiao Zhao","Siyue Tao","Chen Ye","Tiantian Feng","Changjun Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.09199v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2410.08490v2","updated":"2024-12-12T11:45:30Z","published":"2024-10-11T03:31:40Z","title":"CAS-GAN for Contrast-free Angiography Synthesis","summary":" Iodinated contrast agents are widely utilized in numerous interventional\nprocedures, yet posing substantial health risks to patients. This paper\npresents CAS-GAN, a novel GAN framework that serves as a \"virtual contrast\nagent\" to synthesize X-ray angiographies via disentanglement representation\nlearning and vessel semantic guidance, thereby reducing the reliance on\niodinated contrast agents during interventional procedures. Specifically, our\napproach disentangles X-ray angiographies into background and vessel\ncomponents, leveraging medical prior knowledge. 
A specialized predictor then\nlearns to map the interrelationships between these components. Additionally, a\nvessel semantic-guided generator and a corresponding loss function are\nintroduced to enhance the visual fidelity of generated images. Experimental\nresults on the XCAD dataset demonstrate the state-of-the-art performance of our\nCAS-GAN, achieving a FID of 5.87 and a MMD of 0.016. These promising results\nhighlight {\\tt CAS-GAN}'s potential for clinical applications.\n","authors":["De-Xing Huang","Xiao-Hu Zhou","Mei-Jiang Gui","Xiao-Liang Xie","Shi-Qi Liu","Shuang-Yi Wang","Hao Li","Tian-Yu Xiang","Zeng-Guang Hou"],"pdf_url":"https://arxiv.org/pdf/2410.08490v2.pdf","comment":"IEEE Symposium Series on Computational Intelligence (SSCI 2025)"},{"id":"http://arxiv.org/abs/2412.09193v1","updated":"2024-12-12T11:42:39Z","published":"2024-12-12T11:42:39Z","title":"ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local\n Motion Deblurring","summary":" Removing blur caused by moving objects is challenging, as the moving objects\nare usually significantly blurry while the static background remains clear.\nExisting methods that rely on local blur detection often suffer from\ninaccuracies and cannot generate satisfactory results when focusing solely on\nblurred regions. To overcome these problems, we first design a context-based\nlocal blur detection module that incorporates additional contextual information\nto improve the identification of blurry regions. Considering that modern\nsmartphones are equipped with cameras capable of providing short-exposure\nimages, we develop a blur-aware guided image restoration method that utilizes\nsharp structural details from short-exposure images, facilitating accurate\nreconstruction of heavily blurred regions. Furthermore, to restore images\nrealistically and visually-pleasant, we develop a short-exposure guided\ndiffusion model that explores useful features from short-exposure images and\nblurred regions to better constrain the diffusion process. Finally, we\nformulate the above components into a simple yet effective network, named\nExpRDiff. Experimental results show that ExpRDiff performs favorably against\nstate-of-the-art methods.\n","authors":["Zhongbao Yang","Jiangxin Dong","Jinhui Tang","Jinshan Pan"],"pdf_url":"https://arxiv.org/pdf/2412.09193v1.pdf","comment":"Project website: https://github.com/yzb1997/ExpRDiff"},{"id":"http://arxiv.org/abs/2412.09191v1","updated":"2024-12-12T11:38:46Z","published":"2024-12-12T11:38:46Z","title":"RAD: Region-Aware Diffusion Models for Image Inpainting","summary":" Diffusion models have achieved remarkable success in image generation, with\napplications broadening across various domains. Inpainting is one such\napplication that can benefit significantly from diffusion models. Existing\nmethods either hijack the reverse process of a pretrained diffusion model or\ncast the problem into a larger framework, \\ie, conditioned generation. However,\nthese approaches often require nested loops in the generation process or\nadditional components for conditioning. In this paper, we present region-aware\ndiffusion models (RAD) for inpainting with a simple yet effective reformulation\nof the vanilla diffusion models. RAD utilizes a different noise schedule for\neach pixel, which allows local regions to be generated asynchronously while\nconsidering the global image context. 
A plain reverse process requires no\nadditional components, enabling RAD to achieve inference time up to 100 times\nfaster than the state-of-the-art approaches. Moreover, we employ low-rank\nadaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models,\nreducing computational burdens in training as well. Experiments demonstrated\nthat RAD provides state-of-the-art results both qualitatively and\nquantitatively, on the FFHQ, LSUN Bedroom, and ImageNet datasets.\n","authors":["Sora Kim","Sungho Suh","Minsik Lee"],"pdf_url":"https://arxiv.org/pdf/2412.09191v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09182v1","updated":"2024-12-12T11:25:32Z","published":"2024-12-12T11:25:32Z","title":"On the effectiveness of Rotation-Equivariance in U-Net: A Benchmark for\n Image Segmentation","summary":" Numerous studies have recently focused on incorporating different variations\nof equivariance in Convolutional Neural Networks (CNNs). In particular,\nrotation-equivariance has gathered significant attention due to its relevance\nin many applications related to medical imaging, microscopic imaging, satellite\nimaging, industrial tasks, etc. While prior research has primarily focused on\nenhancing classification tasks with rotation equivariant CNNs, their impact on\nmore complex architectures, such as U-Net for image segmentation, remains\nscarcely explored. Indeed, previous work interested in integrating\nrotation-equivariance into U-Net architecture have focused on solving specific\napplications with a limited scope. In contrast, this paper aims to provide a\nmore exhaustive evaluation of rotation equivariant U-Net for image segmentation\nacross a broader range of tasks. We benchmark their effectiveness against\nstandard U-Net architectures, assessing improvements in terms of performance\nand sustainability (i.e., computational cost). Our evaluation focuses on\ndatasets whose orientation of objects of interest is arbitrary in the image\n(e.g., Kvasir-SEG), but also on more standard segmentation datasets (such as\nCOCO-Stuff) as to explore the wider applicability of rotation equivariance\nbeyond tasks undoubtedly concerned by rotation equivariance. The main\ncontribution of this work is to provide insights into the trade-offs and\nadvantages of integrating rotation equivariance for segmentation tasks.\n","authors":["Robin Ghyselinck","Valentin Delchevalerie","Bruno Dumas","Benoît Frénay"],"pdf_url":"https://arxiv.org/pdf/2412.09182v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.13712v2","updated":"2024-12-12T11:18:51Z","published":"2024-08-25T03:21:48Z","title":"Riemann-based Multi-scale Attention Reasoning Network for Text-3D\n Retrieval","summary":" Due to the challenges in acquiring paired Text-3D data and the inherent\nirregularity of 3D data structures, combined representation learning of 3D\npoint clouds and text remains unexplored. In this paper, we propose a novel\nRiemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D\nretrieval. Specifically, the extracted text and point cloud features are\nrefined by their respective Adaptive Feature Refiner (AFR). Furthermore, we\nintroduce the innovative Riemann Local Similarity (RLS) module and the Global\nPooling Similarity (GPS) module. However, as 3D point cloud data and text data\noften possess complex geometric structures in high-dimensional space, the\nproposed RLS employs a novel Riemann Attention Mechanism to reflect the\nintrinsic geometric relationships of the data. 
Without explicitly defining the\nmanifold, RMARN learns the manifold parameters to better represent the\ndistances between text-point cloud samples. To address the challenges of\nlacking paired text-3D data, we have created the large-scale Text-3D Retrieval\ndataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud\ndata. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained\nChinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs,\nrespectively. Experiments on our custom datasets demonstrate the superior\nperformance of the proposed method. Our code and proposed datasets are\navailable at \\url{https://github.com/liwrui/RMARN}.\n","authors":["Wenrui Li","Wei Han","Yandu Chen","Yeyu Chai","Yidan Lu","Xingtao Wang","Xiaopeng Fan"],"pdf_url":"https://arxiv.org/pdf/2408.13712v2.pdf","comment":"Accepted by AAAI25"},{"id":"http://arxiv.org/abs/2412.09177v1","updated":"2024-12-12T11:09:56Z","published":"2024-12-12T11:09:56Z","title":"Weighted Poisson-disk Resampling on Large-Scale Point Clouds","summary":" For large-scale point cloud processing, resampling takes the important role\nof controlling the point number and density while keeping the geometric\nconsistency. % in related tasks. However, current methods cannot balance such\ndifferent requirements. Particularly with large-scale point clouds, classical\nmethods often struggle with decreased efficiency and accuracy. To address such\nissues, we propose a weighted Poisson-disk (WPD) resampling method to improve\nthe usability and efficiency for the processing. We first design an initial\nPoisson resampling with a voxel-based estimation strategy. It is able to\nestimate a more accurate radius of the Poisson-disk while maintaining high\nefficiency. Then, we design a weighted tangent smoothing step to further\noptimize the Voronoi diagram for each point. At the same time, sharp features\nare detected and kept in the optimized results with isotropic property.\nFinally, we achieve a resampling copy from the original point cloud with the\nspecified point number, uniform density, and high-quality geometric\nconsistency. Experiments show that our method significantly improves the\nperformance of large-scale point cloud resampling for different applications,\nand provides a highly practical solution.\n","authors":["Xianhe Jiao","Chenlei Lv","Junli Zhao","Ran Yi","Yu-Hui Wen","Zhenkuan Pan","Zhongke Wu","Yong-jin Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09177v1.pdf","comment":"Accepted to AAAI 2025"},{"id":"http://arxiv.org/abs/2409.16925v2","updated":"2024-12-12T11:06:40Z","published":"2024-09-25T13:33:28Z","title":"Game4Loc: A UAV Geo-Localization Benchmark from Game Data","summary":" The vision-based geo-localization technology for UAV, serving as a secondary\nsource of GPS information in addition to the global navigation satellite\nsystems (GNSS), can still operate independently in the GPS-denied environment.\nRecent deep learning based methods attribute this as the task of image matching\nand retrieval. By retrieving drone-view images in geo-tagged satellite image\ndatabase, approximate localization information can be obtained. However, due to\nhigh costs and privacy concerns, it is usually difficult to obtain large\nquantities of drone-view images from a continuous area. 
Existing drone-view\ndatasets are mostly composed of small-scale aerial photography with a strong\nassumption that there exists a perfect one-to-one aligned reference image for\nany query, leaving a significant gap from the practical localization scenario.\nIn this work, we construct a large-range contiguous area UAV geo-localization\ndataset named GTA-UAV, featuring multiple flight altitudes, attitudes, scenes,\nand targets using modern computer games. Based on this dataset, we introduce a\nmore practical UAV geo-localization task including partial matches of\ncross-view paired data, and expand the image-level retrieval to the actual\nlocalization in terms of distance (meters). For the construction of drone-view\nand satellite-view pairs, we adopt a weight-based contrastive learning\napproach, which allows for effective learning while avoiding additional\npost-processing matching steps. Experiments demonstrate the effectiveness of\nour data and training method for UAV geo-localization, as well as the\ngeneralization capabilities to real-world scenarios.\n","authors":["Yuxiang Ji","Boyong He","Zhuoyue Tan","Liaoni Wu"],"pdf_url":"https://arxiv.org/pdf/2409.16925v2.pdf","comment":"AAAI 2025, Project page: https://yux1angji.github.io/game4loc/"},{"id":"http://arxiv.org/abs/2407.09552v2","updated":"2024-12-12T11:03:20Z","published":"2024-06-28T01:31:37Z","title":"Optimized 3D Point Labeling with Leaders Using the Beams Displacement\n Method","summary":" In three-dimensional geographical scenes, adding labels with leader lines to\npoint features can significantly improve their visibility. Leadered labels have\na large degree of freedom in position con-figuration, but existing methods are\nmostly based on limited position candidate models, which not only fail to\neffectively utilize the map space but also make it difficult to consider the\nrelative relationships between labels. Therefore, we conceptualize the dynamic\nconfiguration process of computing label positions as akin to solving a map\ndisplacement problem. We use a triangulated graph to delineate spatial\nrelationships among labels and calculate the forces exerted on labels\nconsidering the constraints associated with point feature labels. Then we use\nthe Beams Displacement Method to iteratively calculate new positions for the\nlabels. Our experimental outcomes demonstrate that this method effectively\nmitigates label overlay issues while maintaining minimal average directional\ndeviation between adjacent labels. Furthermore, this method is adaptable to\nvarious types of leader line labels. Meanwhile, we also discuss the block\nprocessing strategy to improve the efficiency of label configuration and\nanalyze the impact of different proximity graphs.\n","authors":["Zhiwei Wei","Nai Yang","Wenjia Xu","Su Ding","Li Minmin","Li You","Guo Renzhong"],"pdf_url":"https://arxiv.org/pdf/2407.09552v2.pdf","comment":"12 pages, in Chinese language, 10 figures"},{"id":"http://arxiv.org/abs/2412.09169v1","updated":"2024-12-12T10:59:44Z","published":"2024-12-12T10:59:44Z","title":"DECOR:Decomposition and Projection of Text Embeddings for Text-to-Image\n Customization","summary":" Text-to-image (T2I) models can effectively capture the content or style of\nreference images to perform high-quality customization. A representative\ntechnique for this is fine-tuning using low-rank adaptations (LoRA), which\nenables efficient model customization with reference images. 
However,\nfine-tuning with a limited number of reference images often leads to\noverfitting, resulting in issues such as prompt misalignment or content\nleakage. These issues prevent the model from accurately following the input\nprompt or generating undesired objects during inference. To address this\nproblem, we examine the text embeddings that guide the diffusion model during\ninference. This study decomposes the text embedding matrix and conducts a\ncomponent analysis to understand the embedding space geometry and identify the\ncause of overfitting. Based on this, we propose DECOR, which projects text\nembeddings onto a vector space orthogonal to undesired token vectors, thereby\nreducing the influence of unwanted semantics in the text embeddings.\nExperimental results demonstrate that DECOR outperforms state-of-the-art\ncustomization models and achieves Pareto frontier performance across text and\nvisual alignment evaluation metrics. Furthermore, it generates images more\nfaithful to the input prompts, showcasing its effectiveness in addressing\noverfitting and enhancing text-to-image customization.\n","authors":["Geonhui Jang","Jin-Hwa Kim","Yong-Hyun Park","Junho Kim","Gayoung Lee","Yonghyun Jeong"],"pdf_url":"https://arxiv.org/pdf/2412.09169v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09168v1","updated":"2024-12-12T10:55:57Z","published":"2024-12-12T10:55:57Z","title":"YingSound: Video-Guided Sound Effects Generation with Multi-modal\n Chain-of-Thought Controls","summary":" Generating sound effects for product-level videos, where only a small amount\nof labeled data is available for diverse scenes, requires the production of\nhigh-quality sounds in few-shot settings. To tackle the challenge of limited\nlabeled data in real-world scenes, we introduce YingSound, a foundation model\ndesigned for video-guided sound generation that supports high-quality audio\ngeneration in few-shot settings. Specifically, YingSound consists of two major\nmodules. The first module uses a conditional flow matching transformer to\nachieve effective semantic alignment in sound generation across audio and\nvisual modalities. This module aims to build a learnable audio-visual\naggregator (AVA) that integrates high-resolution visual features with\ncorresponding audio features at multiple stages. The second module is developed\nwith a proposed multi-modal visual-audio chain-of-thought (CoT) approach to\ngenerate finer sound effects in few-shot settings. Finally, an\nindustry-standard video-to-audio (V2A) dataset that encompasses various\nreal-world scenarios is presented. We show that YingSound effectively generates\nhigh-quality synchronized sounds across diverse conditional inputs through\nautomated evaluations and human studies. Project Page:\n\\url{https://giantailab.github.io/yingsound/}\n","authors":["Zihao Chen","Haomin Zhang","Xinhan Di","Haoyu Wang","Sizhe Shan","Junjie Zheng","Yunming Liang","Yihan Fan","Xinfa Zhu","Wenjie Tian","Yihua Wang","Chaofan Ding","Lei Xie"],"pdf_url":"https://arxiv.org/pdf/2412.09168v1.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.09160v1","updated":"2024-12-12T10:46:14Z","published":"2024-12-12T10:46:14Z","title":"Pinpoint Counterfactuals: Reducing social bias in foundation models via\n localized counterfactual generation","summary":" Foundation models trained on web-scraped datasets propagate societal biases\nto downstream tasks. 
While counterfactual generation enables bias analysis,\nexisting methods introduce artifacts by modifying contextual elements like\nclothing and background. We present a localized counterfactual generation\nmethod that preserves image context by constraining counterfactual\nmodifications to specific attribute-relevant regions through automated masking\nand guided inpainting. When applied to the Conceptual Captions dataset for\ncreating gender counterfactuals, our method results in higher visual and\nsemantic fidelity than state-of-the-art alternatives, while maintaining the\nperformance of models trained using only real data on non-human-centric tasks.\nModels fine-tuned with our counterfactuals demonstrate measurable bias\nreduction across multiple metrics, including a decrease in gender\nclassification disparity and balanced person preference scores, while\npreserving ImageNet zero-shot performance. The results establish a framework\nfor creating balanced datasets that enable both accurate bias profiling and\neffective mitigation.\n","authors":["Kirill Sirotkin","Marcos Escudero-Viñolo","Pablo Carballeira","Mayug Maniparambil","Catarina Barata","Noel E. O'Connor"],"pdf_url":"https://arxiv.org/pdf/2412.09160v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.09502v2","updated":"2024-12-12T10:39:16Z","published":"2024-11-14T15:13:13Z","title":"Golden Noise for Diffusion Models: A Learning Framework","summary":" Text-to-image diffusion model is a popular paradigm that synthesizes\npersonalized images by providing a text prompt and a random Gaussian noise.\nWhile people observe that some noises are ``golden noises'' that can achieve\nbetter text-image alignment and higher human preference than others, we still\nlack a machine learning framework to obtain those golden noises. To learn\ngolden noises for diffusion sampling, we mainly make three contributions in\nthis paper. First, we identify a new concept termed the \\textit{noise prompt},\nwhich aims at turning a random Gaussian noise into a golden noise by adding a\nsmall desirable perturbation derived from the text prompt. Following the\nconcept, we first formulate the \\textit{noise prompt learning} framework that\nsystematically learns ``prompted'' golden noise associated with a text prompt\nfor diffusion models. Second, we design a noise prompt data collection pipeline\nand collect a large-scale \\textit{noise prompt dataset}~(NPD) that contains\n100k pairs of random noises and golden noises with the associated text prompts.\nWith the prepared NPD as the training dataset, we trained a small \\textit{noise\nprompt network}~(NPNet) that can directly learn to transform a random noise\ninto a golden noise. The learned golden noise perturbation can be considered as\na kind of prompt for noise, as it is rich in semantic information and tailored\nto the given text prompt. Third, our extensive experiments demonstrate the\nimpressive effectiveness and generalization of NPNet on improving the quality\nof synthesized images across various diffusion models, including SDXL,\nDreamShaper-xl-v2-turbo, and Hunyuan-DiT. 
Moreover, NPNet is a small and\nefficient controller that acts as a plug-and-play module with very limited\nadditional inference and computational costs, as it just provides a golden\nnoise instead of a random noise without accessing the original pipeline.\n","authors":["Zikai Zhou","Shitong Shao","Lichen Bai","Zhiqiang Xu","Bo Han","Zeke Xie"],"pdf_url":"https://arxiv.org/pdf/2411.09502v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09150v1","updated":"2024-12-12T10:36:26Z","published":"2024-12-12T10:36:26Z","title":"Evaluating Adversarial Attacks on Traffic Sign Classifiers beyond\n Standard Baselines","summary":" Adversarial attacks on traffic sign classification models were among the\nfirst successfully tried in the real world. Since then, the research in this\narea has been mainly restricted to repeating baseline models, such as LISA-CNN\nor GTSRB-CNN, and similar experiment settings, including white and black\npatches on traffic signs. In this work, we decouple model architectures from\nthe datasets and evaluate on further generic models to make a fair comparison.\nFurthermore, we compare two attack settings, inconspicuous and visible, which\nare usually regarded without direct comparison. Our results show that standard\nbaselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than\nthe generic ones. We, therefore, suggest evaluating new attacks on a broader\nspectrum of baselines in the future. Our code is available at\n\\url{https://github.com/KASTEL-MobilityLab/attacks-on-traffic-sign-recognition/}.\n","authors":["Svetlana Pavlitska","Leopold Müller","J. Marius Zöllner"],"pdf_url":"https://arxiv.org/pdf/2412.09150v1.pdf","comment":"Accepted for publication at ICMLA 2024"},{"id":"http://arxiv.org/abs/2401.07450v4","updated":"2024-12-12T10:36:14Z","published":"2024-01-15T03:38:57Z","title":"HieraFashDiff: Hierarchical Fashion Design with Multi-stage Diffusion\n Models","summary":" Fashion design is a challenging and complex process.Recent works on fashion\ngeneration and editing are all agnostic of the actual fashion design process,\nwhich limits their usage in practice.In this paper, we propose a novel\nhierarchical diffusion-based framework tailored for fashion design, coined as\nHieraFashDiff. Our model is designed to mimic the practical fashion design\nworkflow, by unraveling the denosing process into two successive stages: 1) an\nideation stage that generates design proposals given high-level concepts and 2)\nan iteration stage that continuously refines the proposals using low-level\nattributes. Our model supports fashion design generation and fine-grained local\nediting in a single framework. To train our model, we contribute a new dataset\nof full-body fashion images annotated with hierarchical text descriptions.\nExtensive evaluations show that, as compared to prior approaches, our method\ncan generate fashion designs and edited results with higher fidelity and better\nprompt adherence, showing its promising potential to augment the practical\nfashion design workflow. 
Code and Dataset are available at\nhttps://github.com/haoli-zbdbc/hierafashdiff.\n","authors":["Zhifeng Xie","Hao Li","Huiming Ding","Mengtian Li","Xinhan Di","Ying Cao"],"pdf_url":"https://arxiv.org/pdf/2401.07450v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.06864v2","updated":"2024-12-12T10:21:16Z","published":"2024-11-11T10:56:40Z","title":"Veri-Car: Towards Open-world Vehicle Information Retrieval","summary":" Many industrial and service sectors require tools to extract vehicle\ncharacteristics from images. This is a complex task not only by the variety of\nnoise, and large number of classes, but also by the constant introduction of\nnew vehicle models to the market. In this paper, we present Veri-Car, an\ninformation retrieval integrated approach designed to help on this task. It\nleverages supervised learning techniques to accurately identify the make, type,\nmodel, year, color, and license plate of cars. The approach also addresses the\nchallenge of handling open-world problems, where new car models and variations\nfrequently emerge, by employing a sophisticated combination of pre-trained\nmodels, and a hierarchical multi-similarity loss. Veri-Car demonstrates robust\nperformance, achieving high precision and accuracy in classifying both seen and\nunseen data. Additionally, it integrates an ensemble license plate detection,\nand an OCR model to extract license plate numbers with impressive accuracy.\n","authors":["Andrés Muñoz","Nancy Thomas","Annita Vapsi","Daniel Borrajo"],"pdf_url":"https://arxiv.org/pdf/2411.06864v2.pdf","comment":"33 pages, 12 figures"},{"id":"http://arxiv.org/abs/2403.03551v2","updated":"2024-12-12T10:15:41Z","published":"2024-03-06T08:51:09Z","title":"Enhanced Low-Dose CT Image Reconstruction by Domain and Task Shifting\n Gaussian Denoisers","summary":" Computed tomography from a low radiation dose (LDCT) is challenging due to\nhigh noise in the projection data. Popular approaches for LDCT image\nreconstruction are two-stage methods, typically consisting of the filtered\nbackprojection (FBP) algorithm followed by a neural network for LDCT image\nenhancement. Two-stage methods are attractive for their simplicity and\npotential for computational efficiency, typically requiring only a single FBP\nand a neural network forward pass for inference. However, the best\nreconstruction quality is currently achieved by unrolled iterative methods\n(Learned Primal-Dual and ItNet), which are more complex and thus have a higher\ncomputational cost for training and inference. We propose a method combining\nthe simplicity and efficiency of two-stage methods with state-of-the-art\nreconstruction quality. Our strategy utilizes a neural network pretrained for\nGaussian noise removal from natural grayscale images, fine-tuned for LDCT image\nenhancement. We call this method FBP-DTSGD (Domain and Task Shifted Gaussian\nDenoisers) as the fine-tuning is a task shift from Gaussian denoising to\nenhancing LDCT images and a domain shift from natural grayscale to LDCT images.\nAn ablation study with three different pretrained Gaussian denoisers indicates\nthat the performance of FBP-DTSGD does not depend on a specific denoising\narchitecture, suggesting future advancements in Gaussian denoising could\nbenefit the method. The study also shows that pretraining on natural images\nenhances LDCT reconstruction quality, especially with limited training data.\nNotably, pretraining involves no additional cost, as existing pretrained models\nare used. 
The proposed method currently holds the top mean position in the\nLoDoPaB-CT challenge.\n","authors":["Tim Selig","Thomas März","Martin Storath","Andreas Weinmann"],"pdf_url":"https://arxiv.org/pdf/2403.03551v2.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2406.11210v2","updated":"2024-12-12T10:12:13Z","published":"2024-06-17T05:03:44Z","title":"Zero-Shot Scene Change Detection","summary":" We present a novel, training-free approach to scene change detection. Our\nmethod leverages tracking models, which inherently perform change detection\nbetween consecutive frames of video by identifying common objects and detecting\nnew or missing objects. Specifically, our method takes advantage of the change\ndetection effect of the tracking model by inputting reference and query images\ninstead of consecutive frames. Furthermore, we focus on the content gap and\nstyle gap between two input images in change detection, and address both issues\nby proposing adaptive content threshold and style bridging layers,\nrespectively. Finally, we extend our approach to video, leveraging rich\ntemporal information to enhance the performance of scene change detection. We\ncompare our approach and baseline through various experiments. While existing\ntrain-based baseline tend to specialize only in the trained domain, our method\nshows consistent performance across various domains, proving the\ncompetitiveness of our approach.\n","authors":["Kyusik Cho","Dong Yeop Kim","Euntai Kim"],"pdf_url":"https://arxiv.org/pdf/2406.11210v2.pdf","comment":"AAAI 2025. Code available at: https://github.com/kyusik-cho/ZSSCD"},{"id":"http://arxiv.org/abs/2411.04956v2","updated":"2024-12-12T10:10:19Z","published":"2024-11-07T18:32:00Z","title":"Uncovering Hidden Subspaces in Video Diffusion Models Using\n Re-Identification","summary":" Latent Video Diffusion Models can easily deceive casual observers and domain\nexperts alike thanks to the produced image quality and temporal consistency.\nBeyond entertainment, this creates opportunities around safe data sharing of\nfully synthetic datasets, which are crucial in healthcare, as well as other\ndomains relying on sensitive personal information. However, privacy concerns\nwith this approach have not fully been addressed yet, and models trained on\nsynthetic data for specific downstream tasks still perform worse than those\ntrained on real data. This discrepancy may be partly due to the sampling space\nbeing a subspace of the training videos, effectively reducing the training data\nsize for downstream models. Additionally, the reduced temporal consistency when\ngenerating long videos could be a contributing factor.\n In this paper, we first show that training privacy-preserving models in\nlatent space is computationally more efficient and generalize better.\nFurthermore, to investigate downstream degradation factors, we propose to use a\nre-identification model, previously employed as a privacy preservation filter.\nWe demonstrate that it is sufficient to train this model on the latent space of\nthe video generator. Subsequently, we use these models to evaluate the subspace\ncovered by synthetic video datasets and thus introduce a new way to measure the\nfaithfulness of generative machine learning models. We focus on a specific\napplication in healthcare echocardiography to illustrate the effectiveness of\nour novel methods. 
Our findings indicate that only up to 30.8% of the training\nvideos are learned in latent video diffusion models, which could explain the\nlack of performance when training downstream tasks on synthetic data.\n","authors":["Mischa Dombrowski","Hadrien Reynaud","Bernhard Kainz"],"pdf_url":"https://arxiv.org/pdf/2411.04956v2.pdf","comment":"8 pages, 5 tables, 6 figures; v2 Acknowledgements added"},{"id":"http://arxiv.org/abs/2411.16171v2","updated":"2024-12-12T10:04:33Z","published":"2024-11-25T08:00:21Z","title":"Image Generation Diversity Issues and How to Tame Them","summary":" Generative methods now produce outputs nearly indistinguishable from real\ndata but often fail to fully capture the data distribution. Unlike quality\nissues, diversity limitations in generative models are hard to detect visually,\nrequiring specific metrics for assessment. In this paper, we draw attention to\nthe current lack of diversity in generative models and the inability of common\nmetrics to measure this. We achieve this by framing diversity as an image\nretrieval problem, where we measure how many real images can be retrieved using\nsynthetic data as queries. This yields the Image Retrieval Score (IRS), an\ninterpretable, hyperparameter-free metric that quantifies the diversity of a\ngenerative model's output. IRS requires only a subset of synthetic samples and\nprovides a statistical measure of confidence. Our experiments indicate that\ncurrent feature extractors commonly used in generative model assessment are\ninadequate for evaluating diversity effectively. Consequently, we perform an\nextensive search for the best feature extractors to assess diversity.\nEvaluation reveals that current diffusion models converge to limited subsets of\nthe real distribution, with no current state-of-the-art models superpassing 77%\nof the diversity of the training data. To address this limitation, we introduce\nDiversity-Aware Diffusion Models (DiADM), a novel approach that improves\ndiversity of unconditional diffusion models without loss of image quality. We\ndo this by disentangling diversity from image quality by using a diversity\naware module that uses pseudo-unconditional features as input. We provide a\nPython package offering unified feature extraction and metric computation to\nfurther facilitate the evaluation of generative models\nhttps://github.com/MischaD/beyondfid.\n","authors":["Mischa Dombrowski","Weitong Zhang","Sarah Cechnicka","Hadrien Reynaud","Bernhard Kainz"],"pdf_url":"https://arxiv.org/pdf/2411.16171v2.pdf","comment":"17 pages, 6 tables, 12 figures; v2 added acknowledgment"},{"id":"http://arxiv.org/abs/2412.09122v1","updated":"2024-12-12T09:57:20Z","published":"2024-12-12T09:57:20Z","title":"LVMark: Robust Watermark for latent video diffusion models","summary":" Rapid advancements in generative models have made it possible to create\nhyper-realistic videos. As their applicability increases, their unauthorized\nuse has raised significant concerns, leading to the growing demand for\ntechniques to protect the ownership of the generative model itself. While\nexisting watermarking methods effectively embed watermarks into\nimage-generative models, they fail to account for temporal information,\nresulting in poor performance when applied to video-generative models. To\naddress this issue, we introduce a novel watermarking method called LVMark,\nwhich embeds watermarks into video diffusion models. 
A key component of LVMark\nis a selective weight modulation strategy that efficiently embeds watermark\nmessages into the video diffusion model while preserving the quality of the\ngenerated videos. To accurately decode messages in the presence of malicious\nattacks, we design a watermark decoder that leverages spatio-temporal\ninformation in the 3D wavelet domain through a cross-attention module. To the\nbest of our knowledge, our approach is the first to highlight the potential of\nvideo-generative model watermarking as a valuable tool for enhancing the\neffectiveness of ownership protection in video-generative models.\n","authors":["MinHyuk Jang","Youngdong Jang","JaeHyeok Lee","Kodai Kawamura","Feng Yang","Sangpil Kim"],"pdf_url":"https://arxiv.org/pdf/2412.09122v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08344v2","updated":"2024-12-12T09:52:55Z","published":"2024-12-11T12:34:37Z","title":"CoDTS: Enhancing Sparsely Supervised Collaborative Perception with a\n Dual Teacher-Student Framework","summary":" Current collaborative perception methods often rely on fully annotated\ndatasets, which can be expensive to obtain in practical situations. To reduce\nannotation costs, some works adopt sparsely supervised learning techniques and\ngenerate pseudo labels for the missing instances. However, these methods fail\nto achieve an optimal confidence threshold that harmonizes the quality and\nquantity of pseudo labels. To address this issue, we propose an end-to-end\nCollaborative perception Dual Teacher-Student framework (CoDTS), which employs\nadaptive complementary learning to produce both high-quality and high-quantity\npseudo labels. Specifically, the Main Foreground Mining (MFM) module generates\nhigh-quality pseudo labels based on the prediction of the static teacher.\nSubsequently, the Supplement Foreground Mining (SFM) module ensures a balance\nbetween the quality and quantity of pseudo labels by adaptively identifying\nmissing instances based on the prediction of the dynamic teacher. Additionally,\nthe Neighbor Anchor Sampling (NAS) module is incorporated to enhance the\nrepresentation of pseudo labels. To promote the adaptive complementary\nlearning, we implement a staged training strategy that trains the student and\ndynamic teacher in a mutually beneficial manner. Extensive experiments\ndemonstrate that the CoDTS effectively ensures an optimal balance of pseudo\nlabels in both quality and quantity, establishing a new state-of-the-art in\nsparsely supervised collaborative perception.\n","authors":["Yushan Han","Hui Zhang","Honglei Zhang","Jing Wang","Yidong Li"],"pdf_url":"https://arxiv.org/pdf/2412.08344v2.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2408.13499v2","updated":"2024-12-12T09:50:24Z","published":"2024-08-24T06:52:14Z","title":"R2G: Reasoning to Ground in 3D Scenes","summary":" We propose Reasoning to Ground (R2G), a neural symbolic model that grounds\nthe target objects within 3D scenes in a reasoning manner. In contrast to prior\nworks, R2G explicitly models the 3D scene with a semantic concept-based scene\ngraph; recurrently simulates the attention transferring across object entities;\nthus makes the process of grounding the target objects with the highest\nprobability interpretable. Specifically, we respectively embed multiple object\nproperties within the graph nodes and spatial relations among entities within\nthe edges, utilizing a predefined semantic vocabulary. 
To guide attention\ntransferring, we employ learning or prompting-based methods to analyze the\nreferential utterance and convert it into reasoning instructions within the\nsame semantic space. In each reasoning round, R2G either (1) merges current\nattention distribution with the similarity between the instruction and embedded\nentity properties or (2) shifts the attention across the scene graph based on\nthe similarity between the instruction and embedded spatial relations. The\nexperiments on Sr3D/Nr3D benchmarks show that R2G achieves a comparable result\nwith the prior works while maintaining improved interpretability, breaking a\nnew path for 3D language grounding.\n","authors":["Yixuan Li","Zan Wang","Wei Liang"],"pdf_url":"https://arxiv.org/pdf/2408.13499v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09115v1","updated":"2024-12-12T09:49:16Z","published":"2024-12-12T09:49:16Z","title":"Vision CNNs trained to estimate spatial latents learned similar\n ventral-stream-aligned representations","summary":" Studies of the functional role of the primate ventral visual stream have\ntraditionally focused on object categorization, often ignoring -- despite much\nprior evidence -- its role in estimating \"spatial\" latents such as object\nposition and pose. Most leading ventral stream models are derived by optimizing\nnetworks for object categorization, which seems to imply that the ventral\nstream is also derived under such an objective. Here, we explore an alternative\nhypothesis: Might the ventral stream be optimized for estimating spatial\nlatents? And a closely related question: How different -- if at all -- are\nrepresentations learned from spatial latent estimation compared to\ncategorization? To ask these questions, we leveraged synthetic image datasets\ngenerated by a 3D graphic engine and trained convolutional neural networks\n(CNNs) to estimate different combinations of spatial and category latents. We\nfound that models trained to estimate just a few spatial latents achieve neural\nalignment scores comparable to those trained on hundreds of categories, and the\nspatial latent performance of models strongly correlates with their neural\nalignment. Spatial latent and category-trained models have very similar -- but\nnot identical -- internal representations, especially in their early and middle\nlayers. We provide evidence that this convergence is partly driven by\nnon-target latent variability in the training data, which facilitates the\nimplicit learning of representations of those non-target latents. Taken\ntogether, these results suggest that many training objectives, such as spatial\nlatents, can lead to similar models aligned neurally with the ventral stream.\nThus, one should not assume that the ventral stream is optimized for object\ncategorization only. As a field, we need to continue to sharpen our measures of\ncomparing models to brains to better understand the functional roles of the\nventral stream.\n","authors":["Yudi Xie","Weichen Huang","Esther Alter","Jeremy Schwartz","Joshua B. Tenenbaum","James J. 
DiCarlo"],"pdf_url":"https://arxiv.org/pdf/2412.09115v1.pdf","comment":"29 pages, 20 figures, ICLR 2025"},{"id":"http://arxiv.org/abs/2412.06257v2","updated":"2024-12-12T09:38:22Z","published":"2024-12-09T07:14:58Z","title":"Advancing Extended Reality with 3D Gaussian Splatting: Innovations and\n Prospects","summary":" 3D Gaussian Splatting (3DGS) has attracted significant attention for its\npotential to revolutionize 3D representation, rendering, and interaction.\nDespite the rapid growth of 3DGS research, its direct application to Extended\nReality (XR) remains underexplored. Although many studies recognize the\npotential of 3DGS for XR, few have explicitly focused on or demonstrated its\neffectiveness within XR environments. In this paper, we aim to synthesize\ninnovations in 3DGS that show specific potential for advancing XR research and\ndevelopment. We conduct a comprehensive review of publicly available 3DGS\npapers, with a focus on those referencing XR-related concepts. Additionally, we\nperform an in-depth analysis of innovations explicitly relevant to XR and\npropose a taxonomy to highlight their significance. Building on these insights,\nwe propose several prospective XR research areas where 3DGS can make promising\ncontributions, yet remain rarely touched. By investigating the intersection of\n3DGS and XR, this paper provides a roadmap to push the boundaries of XR using\ncutting-edge 3DGS techniques.\n","authors":["Shi Qiu","Binzhu Xie","Qixuan Liu","Pheng-Ann Heng"],"pdf_url":"https://arxiv.org/pdf/2412.06257v2.pdf","comment":"IEEE AIxVR 2025"},{"id":"http://arxiv.org/abs/2412.09105v1","updated":"2024-12-12T09:35:47Z","published":"2024-12-12T09:35:47Z","title":"ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal\n Resolution Motion Estimation","summary":" Event cameras hold significant promise for high-temporal-resolution (HTR)\nmotion estimation. However, estimating event-based HTR optical flow faces two\nkey challenges: the absence of HTR ground-truth data and the intrinsic sparsity\nof event data. Most existing approaches rely on the flow accumulation paradigms\nto indirectly supervise intermediate flows, often resulting in accumulation\nerrors and optimization difficulties. To address these challenges, we propose a\nresidual-based paradigm for estimating HTR optical flow with event data. Our\napproach separates HTR flow estimation into two stages: global linear motion\nestimation and HTR residual flow refinement. The residual paradigm effectively\nmitigates the impacts of event sparsity on optimization and is compatible with\nany LTR algorithm. Next, to address the challenge posed by the absence of HTR\nground truth, we incorporate novel learning strategies. Specifically, we\ninitially employ a shared refiner to estimate the residual flows, enabling both\nLTR supervision and HTR inference. Subsequently, we introduce regional noise to\nsimulate the residual patterns of intermediate flows, facilitating the\nadaptation from LTR supervision to HTR inference. 
Additionally, we show that\nthe noise-based strategy supports in-domain self-supervised training.\nComprehensive experimental results demonstrate that our approach achieves\nstate-of-the-art accuracy in both LTR and HTR metrics, highlighting its\neffectiveness and superiority.\n","authors":["Qianang Zhou","Zhiyu Zhu","Junhui Hou","Yongjian Deng","Youfu Li","Junlin Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.09105v1.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.09082v1","updated":"2024-12-12T09:08:13Z","published":"2024-12-12T09:08:13Z","title":"Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and\n Method","summary":" Existing Vision-Language Navigation (VLN) methods primarily focus on\nsingle-stage navigation, limiting their effectiveness in multi-stage and\nlong-horizon tasks within complex and dynamic environments. To address these\nlimitations, we propose a novel VLN task, named Long-Horizon Vision-Language\nNavigation (LH-VLN), which emphasizes long-term planning and decision\nconsistency across consecutive subtasks. Furthermore, to support LH-VLN, we\ndevelop an automated data generation platform NavGen, which constructs datasets\nwith complex task structures and improves data utility through a bidirectional,\nmulti-granularity generation approach. To accurately evaluate complex tasks, we\nconstruct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark\nconsisting of 3,260 tasks with an average of 150 task steps, serving as the\nfirst dataset specifically designed for the long-horizon vision-language\nnavigation task. Furthermore, we propose Independent Success Rate (ISR),\nConditional Success Rate (CSR), and CSR weight by Ground Truth (CGT) metrics,\nto provide fine-grained assessments of task completion. To improve model\nadaptability in complex tasks, we propose a novel Multi-Granularity Dynamic\nMemory (MGDM) module that integrates short-term memory blurring with long-term\nmemory retrieval to enable flexible navigation in dynamic environments. Our\nplatform, benchmark and method supply LH-VLN with a robust data generation\npipeline, comprehensive model evaluation dataset, reasonable metrics, and a\nnovel VLN model, establishing a foundational framework for advancing LH-VLN.\n","authors":["Xinshuai Song","Weixing Chen","Yang Liu","Weikai Chen","Guanbin Li","Liang Lin"],"pdf_url":"https://arxiv.org/pdf/2412.09082v1.pdf","comment":"A novel Vision-Language Navigation task: Long-Horizon Vision-Language\n Navigation"},{"id":"http://arxiv.org/abs/2303.15361v2","updated":"2024-12-12T09:06:56Z","published":"2023-03-27T16:32:21Z","title":"A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts","summary":" Machine learning methods strive to acquire a robust model during the training\nprocess that can effectively generalize to test samples, even in the presence\nof distribution shifts. However, these methods often suffer from performance\ndegradation due to unknown test distributions. Test-time adaptation (TTA), an\nemerging paradigm, has the potential to adapt a pre-trained model to unlabeled\ndata during testing, before making predictions. Recent progress in this\nparadigm has highlighted the significant benefits of using unlabeled data to\ntrain self-adapted models prior to inference. 
In this survey, we categorize TTA\ninto several distinct groups based on the form of test data, namely, test-time\ndomain adaptation, test-time batch adaptation, and online test-time adaptation.\nFor each category, we provide a comprehensive taxonomy of advanced algorithms\nand discuss various learning scenarios. Furthermore, we analyze relevant\napplications of TTA and discuss open challenges and promising areas for future\nresearch. For a comprehensive list of TTA methods, kindly refer to\n\\url{https://github.com/tim-learn/awesome-test-time-adaptation}.\n","authors":["Jian Liang","Ran He","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2303.15361v2.pdf","comment":"Discussions, comments, and questions are all welcomed in\n \\url{https://github.com/tim-learn/awesome-test-time-adaptation}"},{"id":"http://arxiv.org/abs/2404.18924v2","updated":"2024-12-12T09:06:01Z","published":"2024-04-29T17:59:02Z","title":"Swin2-MoSE: A New Single Image Super-Resolution Model for Remote Sensing","summary":" Due to the limitations of current optical and sensor technologies and the\nhigh cost of updating them, the spectral and spatial resolution of satellites\nmay not always meet desired requirements. For these reasons, Remote-Sensing\nSingle-Image Super-Resolution (RS-SISR) techniques have gained significant\ninterest. In this paper, we propose Swin2-MoSE model, an enhanced version of\nSwin2SR. Our model introduces MoE-SM, an enhanced Mixture-of-Experts (MoE) to\nreplace the Feed-Forward inside all Transformer block. MoE-SM is designed with\nSmart-Merger, and new layer for merging the output of individual experts, and\nwith a new way to split the work between experts, defining a new per-example\nstrategy instead of the commonly used per-token one. Furthermore, we analyze\nhow positional encodings interact with each other, demonstrating that\nper-channel bias and per-head bias can positively cooperate. Finally, we\npropose to use a combination of Normalized-Cross-Correlation (NCC) and\nStructural Similarity Index Measure (SSIM) losses, to avoid typical MSE loss\nlimitations. Experimental results demonstrate that Swin2-MoSE outperforms any\nSwin derived models by up to 0.377 - 0.958 dB (PSNR) on task of 2x, 3x and 4x\nresolution-upscaling (Sen2Venus and OLI2MSI datasets). It also outperforms SOTA\nmodels by a good margin, proving to be competitive and with excellent\npotential, especially for complex tasks. Additionally, an analysis of\ncomputational costs is also performed. Finally, we show the efficacy of\nSwin2-MoSE, applying it to a semantic segmentation task (SeasoNet dataset).\nCode and pretrained are available on\nhttps://github.com/IMPLabUniPr/swin2-mose/tree/official_code\n","authors":["Leonardo Rossi","Vittorio Bernuzzi","Tomaso Fontanini","Massimo Bertozzi","Andrea Prati"],"pdf_url":"https://arxiv.org/pdf/2404.18924v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11512v3","updated":"2024-12-12T08:59:33Z","published":"2024-09-17T19:26:21Z","title":"Good Grasps Only: A data engine for self-supervised fine-tuning of pose\n estimation using grasp poses for verification","summary":" In this paper, we present a novel method for self-supervised fine-tuning of\npose estimation. Leveraging zero-shot pose estimation, our approach enables the\nrobot to automatically obtain training data without manual labeling. After pose\nestimation the object is grasped, and in-hand pose estimation is used for data\nvalidation. 
Our pipeline allows the system to fine-tune while the process is\nrunning, removing the need for a learning phase. The motivation behind our work\nlies in the need for rapid setup of pose estimation solutions. Specifically, we\naddress the challenging task of bin picking, which plays a pivotal role in\nflexible robotic setups. Our method is implemented on a robotics work-cell, and\ntested with four different objects. For all objects, our method increases the\nperformance and outperforms a state-of-the-art method trained on the CAD model\nof the objects. Project page available at gogoengine.github.io\n","authors":["Frederik Hagelskjær"],"pdf_url":"https://arxiv.org/pdf/2409.11512v3.pdf","comment":"8 pages, 7 figures, 3 tables"},{"id":"http://arxiv.org/abs/2404.11824v4","updated":"2024-12-12T08:59:22Z","published":"2024-04-18T01:10:24Z","title":"TextCenGen: Attention-Guided Text-Centric Background Adaptation for\n Text-to-Image Generation","summary":" Recent advancements in Text-to-image (T2I) generation have witnessed a shift\nfrom adapting text to fixed backgrounds to creating images around text.\nTraditional approaches are often limited to generate layouts within static\nimages for effective text placement. Our proposed approach, TextCenGen,\nintroduces a dynamic adaptation of the blank region for text-friendly image\ngeneration, emphasizing text-centric design and visual harmony generation. Our\nmethod employs force-directed attention guidance in T2I models to generate\nimages that strategically reserve whitespace for pre-defined text areas, even\nfor text or icons at the golden ratio. Observing how cross-attention maps\naffect object placement, we detect and repel conflicting objects using a\nforce-directed graph approach, combined with a Spatial Excluding\nCross-Attention Constraint for smooth attention in whitespace areas. As a novel\ntask in graphic design, experiments indicate that TextCenGen outperforms\nexisting methods with more harmonious compositions. Furthermore, our method\nsignificantly enhances T2I model outcomes on our specially collected prompt\ndatasets, catering to varied text positions. These results demonstrate the\nefficacy of TextCenGen in creating more harmonious and integrated text-image\ncompositions.\n","authors":["Tianyi Liang","Jiangqi Liu","Yifei Huang","Shiqi Jiang","Sicheng Song","Jianshen Shi","Changbo Wang","Chenhui Li"],"pdf_url":"https://arxiv.org/pdf/2404.11824v4.pdf","comment":"7 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.09074v1","updated":"2024-12-12T08:59:08Z","published":"2024-12-12T08:59:08Z","title":"DomCLP: Domain-wise Contrastive Learning with Prototype Mixup for\n Unsupervised Domain Generalization","summary":" Self-supervised learning (SSL) methods based on the instance discrimination\ntasks with InfoNCE have achieved remarkable success. Despite their success, SSL\nmodels often struggle to generate effective representations for unseen-domain\ndata. To address this issue, research on unsupervised domain generalization\n(UDG), which aims to develop SSL models that can generate domain-irrelevant\nfeatures, has been conducted. Most UDG approaches utilize contrastive learning\nwith InfoNCE to generate representations, and perform feature alignment based\non strong assumptions to generalize domain-irrelevant common features from\nmulti-source domains. However, existing methods that rely on instance\ndiscrimination tasks are not effective at extracting domain-irrelevant common\nfeatures. 
This leads to the suppression of domain-irrelevant common features\nand the amplification of domain-relevant features, thereby hindering domain\ngeneralization. Furthermore, strong assumptions underlying feature alignment\ncan lead to biased feature learning, reducing the diversity of common features.\nIn this paper, we propose a novel approach, DomCLP, Domain-wise Contrastive\nLearning with Prototype Mixup. We explore how InfoNCE suppresses\ndomain-irrelevant common features and amplifies domain-relevant features. Based\non this analysis, we propose Domain-wise Contrastive Learning (DCon) to enhance\ndomain-irrelevant common features. We also propose Prototype Mixup Learning\n(PMix) to generalize domain-irrelevant common features across multiple domains\nwithout relying on strong assumptions. The proposed method consistently\noutperforms state-of-the-art methods on the PACS and DomainNet datasets across\nvarious label fractions, showing significant improvements. Our code will be\nreleased. Our project page is available at https://github.com/jinsuby/DomCLP.\n","authors":["Jin-Seop Lee","Noo-ri Kim","Jee-Hyong Lee"],"pdf_url":"https://arxiv.org/pdf/2412.09074v1.pdf","comment":"Code page: https://github.com/jinsuby/DomCLP"},{"id":"http://arxiv.org/abs/2412.09073v1","updated":"2024-12-12T08:58:42Z","published":"2024-12-12T08:58:42Z","title":"SVasP: Self-Versatility Adversarial Style Perturbation for Cross-Domain\n Few-Shot Learning","summary":" Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen\nsource domains to unseen target domains, which is crucial for evaluating the\ngeneralization and robustness of models. Recent studies focus on utilizing\nvisual styles to bridge the domain gap between different domains. However, the\nserious dilemma of gradient instability and local optimization problem occurs\nin those style-based CD-FSL methods. This paper addresses these issues and\nproposes a novel crop-global style perturbation method, called\n\\underline{\\textbf{S}}elf-\\underline{\\textbf{V}}ersatility\n\\underline{\\textbf{A}}dversarial \\underline{\\textbf{S}}tyle\n\\underline{\\textbf{P}}erturbation (\\textbf{SVasP}), which enhances the gradient\nstability and escapes from poor sharp minima jointly. Specifically, SVasP\nsimulates more diverse potential target domain adversarial styles via\ndiversifying input patterns and aggregating localized crop style gradients, to\nserve as global style perturbation stabilizers within one image, a concept we\nrefer to as self-versatility. Then a novel objective function is proposed to\nmaximize visual discrepancy while maintaining semantic consistency between\nglobal, crop, and adversarial features. Having the stabilized global style\nperturbation in the training phase, one can obtain a flattened minima in the\nloss landscape, boosting the transferability of the model to the target\ndomains. Extensive experiments on multiple benchmark datasets demonstrate that\nour method significantly outperforms existing state-of-the-art methods. 
Our\ncodes are available at https://github.com/liwenqianSEU/SVasP.\n","authors":["Wenqian Li","Pengfei Fang","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2412.09073v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09072v1","updated":"2024-12-12T08:58:20Z","published":"2024-12-12T08:58:20Z","title":"Cross-View Completion Models are Zero-shot Correspondence Estimators","summary":" In this work, we explore new perspectives on cross-view completion learning\nby drawing an analogy to self-supervised correspondence learning. Through our\nanalysis, we demonstrate that the cross-attention map within cross-view\ncompletion models captures correspondence more effectively than other\ncorrelations derived from encoder or decoder features. We verify the\neffectiveness of the cross-attention map by evaluating on both zero-shot\nmatching and learning-based geometric matching and multi-frame depth\nestimation. Project page is available at https://cvlab-kaist.github.io/ZeroCo/.\n","authors":["Honggyu An","Jinhyeon Kim","Seonghoon Park","Jaewoo Jung","Jisang Han","Sunghwan Hong","Seungryong Kim"],"pdf_url":"https://arxiv.org/pdf/2412.09072v1.pdf","comment":"Project Page: https://cvlab-kaist.github.io/ZeroCo/"},{"id":"http://arxiv.org/abs/2404.06442v2","updated":"2024-12-12T08:48:02Z","published":"2024-04-09T16:42:54Z","title":"QueSTMaps: Queryable Semantic Topological Maps for 3D Scene\n Understanding","summary":" Robotic tasks such as planning and navigation require a hierarchical semantic\nunderstanding of a scene, which could include multiple floors and rooms.\nCurrent methods primarily focus on object segmentation for 3D scene\nunderstanding. However, such methods struggle to segment out topological\nregions like \"kitchen\" in the scene. In this work, we introduce a two-step\npipeline to solve this problem. First, we extract a topological map, i.e.,\nfloorplan of the indoor scene using a novel multi-channel occupancy\nrepresentation. Then, we generate CLIP-aligned features and semantic labels for\nevery room instance based on the objects it contains using a self-attention\ntransformer. Our language-topology alignment supports natural language\nquerying, e.g., a \"place to cook\" locates the \"kitchen\". We outperform the\ncurrent state-of-the-art on room segmentation by ~20% and room classification\nby ~12%. Our detailed qualitative analysis and ablation studies provide\ninsights into the problem of joint structural and semantic 3D scene\nunderstanding. Project Page: quest-maps.github.io\n","authors":["Yash Mehan","Kumaraditya Gupta","Rohit Jayanti","Anirudh Govil","Sourav Garg","Madhava Krishna"],"pdf_url":"https://arxiv.org/pdf/2404.06442v2.pdf","comment":"Accepted at 2024 IEEE/RSJ International Conference on Intelligent\n Robots and Systems (IROS) as Oral Presentation. Also presented at the 2nd\n Workshop on Open-Vocabulary 3D Scene Understanding (OpenSUN3D) at CVPR 2024"},{"id":"http://arxiv.org/abs/2412.09063v1","updated":"2024-12-12T08:46:22Z","published":"2024-12-12T08:46:22Z","title":"An Efficient Framework for Enhancing Discriminative Models via Diffusion\n Techniques","summary":" Image classification serves as the cornerstone of computer vision,\ntraditionally achieved through discriminative models based on deep neural\nnetworks. 
Recent advancements have introduced classification methods derived\nfrom generative models, which offer the advantage of zero-shot classification.\nHowever, these methods suffer from two main drawbacks: high computational\noverhead and inferior performance compared to discriminative models. Inspired\nby the coordinated cognitive processes of rapid-slow pathway interactions in\nthe human brain during visual signal recognition, we propose the\nDiffusion-Based Discriminative Model Enhancement Framework (DBMEF). This\nframework seamlessly integrates discriminative and generative models in a\ntraining-free manner, leveraging discriminative models for initial predictions\nand endowing deep neural networks with rethinking capabilities via diffusion\nmodels. Consequently, DBMEF can effectively enhance the classification accuracy\nand generalization capability of discriminative models in a plug-and-play\nmanner. We have conducted extensive experiments across 17 prevalent deep model\narchitectures with different training methods, including both CNN-based models\nsuch as ResNet and Transformer-based models like ViT, to demonstrate the\neffectiveness of the proposed DBMEF. Specifically, the framework yields a\n1.51\\% performance improvement for ResNet-50 on the ImageNet dataset and 3.02\\%\non the ImageNet-A dataset. In conclusion, our research introduces a novel\nparadigm for image classification, demonstrating stable improvements across\ndifferent datasets and neural networks.\n","authors":["Chunxiao Li","Xiaoxiao Wang","Boming Miao","Chuanlong Xie","Zizhe Wang","Yao Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.09063v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2310.00919v2","updated":"2024-12-12T08:42:35Z","published":"2023-10-02T06:15:50Z","title":"A simple thinking about the application of the attention mechanism in\n medical ultrasound image segmentation task","summary":" The AI-based assisted diagnosis programs have been widely investigated on\nmedical ultrasound images. The complex scenario of ultrasound images, in which the\ncoupled interference of internal and external factors is severe, brings a\nunique challenge for localizing the object region automatically and precisely in\nultrasound images. In this study, we propose a more general and robust\nBenchmark Attention Adaptive Framework (BAAF) to assist doctors in segmenting or\ndiagnosing lesions and tissues in ultrasound images more quickly and accurately.\nDifferent from existing attention schemes, the BAAF consists of a parallel\nhybrid attention module (PHAM) and an adaptive calibration mechanism (ACM).\nSpecifically, BAAF first coarsely calibrates the input features from the\nchannel and spatial dimensions, and then adaptively selects more robust lesion\nor tissue characterizations from the coarse-calibrated feature maps. The design\nof BAAF further optimizes the \"what\" and \"where\" focus and selection problems\nin CNNs and seeks to improve the segmentation accuracy of lesions or tissues in\nmedical ultrasound images. The method is evaluated on four medical ultrasound\nsegmentation tasks, and the experimental results demonstrate the\nremarkable performance improvement over existing state-of-the-art methods. In\naddition, the comparison with existing attention mechanisms also demonstrates\nthe superiority of BAAF. 
This work provides the possibility for automated\nmedical ultrasound assisted diagnosis and reduces reliance on human accuracy\nand precision.\n","authors":["Gongping Chen","Rui Wang","Xiaotao Yin","Liang Cui","Yu Dai"],"pdf_url":"https://arxiv.org/pdf/2310.00919v2.pdf","comment":"10 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.05203v2","updated":"2024-12-12T08:37:20Z","published":"2024-12-06T17:32:53Z","title":"Archaeoscape: Bringing Aerial Laser Scanning Archaeology to the Deep\n Learning Era","summary":" Airborne Laser Scanning (ALS) technology has transformed modern archaeology\nby unveiling hidden landscapes beneath dense vegetation. However, the lack of\nexpert-annotated, open-access resources has hindered the analysis of ALS data\nusing advanced deep learning techniques. We address this limitation with\nArchaeoscape (available at https://archaeoscape.ai/data/2024/), a novel\nlarge-scale archaeological ALS dataset spanning 888 km$^2$ in Cambodia with\n31,141 annotated archaeological features from the Angkorian period.\nArchaeoscape is over four times larger than comparable datasets, and the first\nALS archaeology resource with open-access data, annotations, and models.\n We benchmark several recent segmentation models to demonstrate the benefits\nof modern vision techniques for this problem and highlight the unique\nchallenges of discovering subtle human-made structures under dense jungle\ncanopies. By making Archaeoscape available in open access, we hope to bridge\nthe gap between traditional archaeology and modern computer vision methods.\n","authors":["Yohann Perron","Vladyslav Sydorov","Adam P. Wijker","Damian Evans","Christophe Pottier","Loic Landrieu"],"pdf_url":"https://arxiv.org/pdf/2412.05203v2.pdf","comment":"NeurIPS 2024 - Datasets & Benchmarks Track (spotlight)"},{"id":"http://arxiv.org/abs/2407.17847v2","updated":"2024-12-12T08:28:20Z","published":"2024-07-25T08:00:49Z","title":"Move and Act: Enhanced Object Manipulation and Background Integrity for\n Image Editing","summary":" Current methods commonly utilize three-branch structures of inversion,\nreconstruction, and editing, to tackle consistent image editing task. However,\nthese methods lack control over the generation position of the edited object\nand have issues with background preservation. To overcome these limitations, we\npropose a tuning-free method with only two branches: inversion and editing.\nThis approach allows users to simultaneously edit the object's action and\ncontrol the generation position of the edited object. Additionally, it achieves\nimproved background preservation. Specifically, we transfer the edited object\ninformation to the target area and repair or preserve the background of other\nareas during the inversion process at a specific time step. In the editing\nstage, we use the image features in self-attention to query the key and value\nof the corresponding time step in the inversion to achieve consistent image\nediting. Impressive image editing results and quantitative evaluation\ndemonstrate the effectiveness of our method. 
The code is available at\nhttps://github.com/mobiushy/move-act.\n","authors":["Pengfei Jiang","Mingbao Lin","Fei Chao"],"pdf_url":"https://arxiv.org/pdf/2407.17847v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.09055v1","updated":"2024-12-12T08:27:39Z","published":"2024-12-12T08:27:39Z","title":"Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D\n Images","summary":" Reconstructing desired objects and scenes has long been a primary goal in 3D\ncomputer vision. Single-view point cloud reconstruction has become a popular\ntechnique due to its low cost and accurate results. However, single-view\nreconstruction methods often rely on expensive CAD models and complex geometric\npriors. Effectively utilizing prior knowledge about the data remains a\nchallenge. In this paper, we introduce hyperbolic space to 3D point cloud\nreconstruction, enabling the model to represent and understand complex\nhierarchical structures in point clouds with low distortion. We build upon\nprevious methods by proposing a hyperbolic Chamfer distance and a regularized\ntriplet loss to enhance the relationship between partial and complete point\nclouds. Additionally, we design adaptive boundary conditions to improve the\nmodel's understanding and reconstruction of 3D structures. Our model\noutperforms most existing models, and ablation studies demonstrate the\nsignificance of our model and its components. Experimental results show that\nour method significantly improves feature extraction capabilities. Our model\nachieves outstanding performance in 3D reconstruction tasks.\n","authors":["Wenrui Li","Zhe Yang","Wei Han","Hengyu Man","Xingtao Wang","Xiaopeng Fan"],"pdf_url":"https://arxiv.org/pdf/2412.09055v1.pdf","comment":"Accepted by AAAI25"},{"id":"http://arxiv.org/abs/2412.09050v1","updated":"2024-12-12T08:21:19Z","published":"2024-12-12T08:21:19Z","title":"ContextHOI: Spatial Context Learning for Human-Object Interaction\n Detection","summary":" Spatial contexts, such as the backgrounds and surroundings, are considered\ncritical in Human-Object Interaction (HOI) recognition, especially when the\ninstance-centric foreground is blurred or occluded. Recent advancements in HOI\ndetectors are usually built upon detection transformer pipelines. While such an\nobject-detection-oriented paradigm shows promise in localizing objects, its\nexploration of spatial context is often insufficient for accurately recognizing\nhuman actions. To enhance the capabilities of object detectors for HOI\ndetection, we present a dual-branch framework named ContextHOI, which\nefficiently captures both object detection features and spatial contexts. In\nthe context branch, we train the model to extract informative spatial context\nwithout requiring additional hand-craft background labels. Furthermore, we\nintroduce context-aware spatial and semantic supervision to the context branch\nto filter out irrelevant noise and capture informative contexts. ContextHOI\nachieves state-of-the-art performance on the HICO-DET and v-coco benchmarks.\nFor further validation, we construct a novel benchmark, HICO-ambiguous, which\nis a subset of HICO-DET that contains images with occluded or impaired instance\ncues. 
Extensive experiments across all benchmarks, complemented by\nvisualizations, underscore the enhancements provided by ContextHOI, especially\nin recognizing interactions involving occluded or blurred instances.\n","authors":["Mingda Jia","Liming Zhao","Ge Li","Yun Zheng"],"pdf_url":"https://arxiv.org/pdf/2412.09050v1.pdf","comment":"in proceedings of the 39th AAAI Conference on Artificial Intelligence\n (AAAI-25)"},{"id":"http://arxiv.org/abs/2412.09044v1","updated":"2024-12-12T08:13:29Z","published":"2024-12-12T08:13:29Z","title":"Motif Guided Graph Transformer with Combinatorial Skeleton Prototype\n Learning for Skeleton-Based Person Re-Identification","summary":" Person re-identification (re-ID) via 3D skeleton data is a challenging task\nwith significant value in many scenarios. Existing skeleton-based methods\ntypically assume virtual motion relations between all joints, and adopt average\njoint or sequence representations for learning. However, they rarely explore\nkey body structure and motion such as gait to focus on more important body\njoints or limbs, while lacking the ability to fully mine valuable\nspatial-temporal sub-patterns of skeletons to enhance model learning. This\npaper presents a generic Motif guided graph transformer with Combinatorial\nskeleton prototype learning (MoCos) that exploits structure-specific and\ngait-related body relations as well as combinatorial features of skeleton\ngraphs to learn effective skeleton representations for person re-ID. In\nparticular, motivated by the locality within joints' structure and the\nbody-component collaboration in gait, we first propose the motif guided graph\ntransformer (MGT) that incorporates hierarchical structural motifs and gait\ncollaborative motifs, which simultaneously focuses on multi-order local joint\ncorrelations and key cooperative body parts to enhance skeleton relation\nlearning. Then, we devise the combinatorial skeleton prototype learning (CSP)\nthat leverages random spatial-temporal combinations of joint nodes and skeleton\ngraphs to generate diverse sub-skeleton and sub-tracklet representations, which\nare contrasted with the most representative features (prototypes) of each\nidentity to learn class-related semantics and discriminative skeleton\nrepresentations. Extensive experiments validate the superior performance of\nMoCos over existing state-of-the-art models. We further show its generality\nunder RGB-estimated skeletons, different graph modeling, and unsupervised\nscenarios.\n","authors":["Haocong Rao","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2412.09044v1.pdf","comment":"Accepted by AAAI 2025. Codes are available at\n https://github.com/Kali-Hac/MoCos"},{"id":"http://arxiv.org/abs/2412.09043v1","updated":"2024-12-12T08:10:31Z","published":"2024-12-12T08:10:31Z","title":"DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous\n Driving","summary":" Photorealistic 4D reconstruction of street scenes is essential for developing\nreal-world simulators in autonomous driving. However, most existing methods\nperform this task offline and rely on time-consuming iterative processes,\nlimiting their practical applications. To this end, we introduce the Large 4D\nGaussian Reconstruction Model (DrivingRecon), a generalizable driving scene\nreconstruction model, which directly predicts 4D Gaussian from surround view\nvideos. 
To better integrate the surround-view images, the Prune and Dilate\nBlock (PD-Block) is proposed to eliminate overlapping Gaussian points between\nadjacent views and remove redundant background points. To enhance\ncross-temporal information, dynamic and static decoupling is tailored to better\nlearn geometry and motion features. Experimental results demonstrate that\nDrivingRecon significantly improves scene reconstruction quality and novel view\nsynthesis compared to existing methods. Furthermore, we explore applications of\nDrivingRecon in model pre-training, vehicle adaptation, and scene editing. Our\ncode is available at https://github.com/EnVision-Research/DriveRecon.\n","authors":["Hao Lu","Tianshuo Xu","Wenzhao Zheng","Yunpeng Zhang","Wei Zhan","Dalong Du","Masayoshi Tomizuka","Kurt Keutzer","Yingcong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.09043v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.09560v1","updated":"2024-12-12T18:46:38Z","published":"2024-12-12T18:46:38Z","title":"Foundational Large Language Models for Materials Research","summary":" Materials discovery and development are critical for addressing global\nchallenges. Yet, the exponential growth in materials science literature\ncomprising vast amounts of textual data has created significant bottlenecks in\nknowledge extraction, synthesis, and scientific reasoning. Large Language\nModels (LLMs) offer unprecedented opportunities to accelerate materials\nresearch through automated analysis and prediction. Still, their effective\ndeployment requires domain-specific adaptation for understanding and solving\ndomain-relevant tasks. Here, we present LLaMat, a family of foundational models\nfor materials science developed through continued pretraining of LLaMA models\non an extensive corpus of materials literature and crystallographic data.\nThrough systematic evaluation, we demonstrate that LLaMat excels in\nmaterials-specific NLP and structured information extraction while maintaining\ngeneral linguistic capabilities. The specialized LLaMat-CIF variant\ndemonstrates unprecedented capabilities in crystal structure generation,\npredicting stable crystals with high coverage across the periodic table.\nIntriguingly, despite LLaMA-3's superior performance in comparison to LLaMA-2,\nwe observe that LLaMat-2 demonstrates unexpectedly enhanced domain-specific\nperformance across diverse materials science tasks, including structured\ninformation extraction from text and tables, more particularly in crystal\nstructure generation, a potential adaptation rigidity in overtrained LLMs.\nAltogether, the present work demonstrates the effectiveness of domain\nadaptation towards developing practically deployable LLM copilots for materials\nresearch. Beyond materials science, our findings reveal important\nconsiderations for domain adaptation of LLMs, such as model selection, training\nmethodology, and domain-specific performance, which may influence the\ndevelopment of specialized scientific AI systems.\n","authors":["Vaibhav Mishra","Somaditya Singh","Dhruv Ahlawat","Mohd Zaki","Vaibhav Bihani","Hargun Singh Grover","Biswajit Mishra","Santiago Miret"," Mausam","N. M. 
Anoop Krishnan"],"pdf_url":"https://arxiv.org/pdf/2412.09560v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.02961v2","updated":"2024-12-12T15:21:44Z","published":"2023-04-06T09:38:54Z","title":"HGCH: A Hyperbolic Graph Convolution Network Model for Heterogeneous\n Collaborative Graph Recommendation","summary":" User-item interaction data in collaborative filtering and graph modeling\ntasks often exhibit power-law characteristics, which suggest the suitability of\nhyperbolic space modeling. Hyperbolic Graph Convolution Neural Networks (HGCNs)\nare a novel technique that leverages the advantages of GCN and hyperbolic\nspace, and then achieves remarkable results. However, existing HGCN methods\nhave several drawbacks: they fail to fully leverage hyperbolic space properties\ndue to arbitrary embedding initialization and imprecise tangent space\naggregation; they overlook auxiliary information that could enrich the\ncollaborative graph; and their training convergence is slow due to margin\nranking loss and random negative sampling. To overcome these challenges, we\npropose Hyperbolic Graph Collaborative for Heterogeneous Recommendation (HGCH),\nan enhanced HGCN-based model for collaborative filtering that integrates\ndiverse side information into a heterogeneous collaborative graph and improves\ntraining convergence speed. HGCH first preserves the long-tailed nature of the\ngraph by initializing node embeddings with power law prior; then it aggregates\nneighbors in hyperbolic space using the gyromidpoint method for accurate\ncomputation; finally, it fuses multiple embeddings from different hyperbolic\nspaces by the gate fusion with prior. Moreover, HGCH employs a hyperbolic\nuser-specific negative sampling to speed up convergence. We evaluate HGCH on\nfour real datasets, and the results show that HGCH achieves competitive results\nand outperforms leading baselines, including HGCNs. Extensive ablation studies\nfurther confirm its effectiveness.\n","authors":["Lu Zhang","Ning Wu"],"pdf_url":"https://arxiv.org/pdf/2304.02961v2.pdf","comment":"Proceedings of the 33rd ACM International Conference on Information\n and Knowledge Management (CIKM '24)"},{"id":"http://arxiv.org/abs/2412.09243v1","updated":"2024-12-12T12:53:30Z","published":"2024-12-12T12:53:30Z","title":"SPRec: Leveraging Self-Play to Debias Preference Alignment for Large\n Language Model-based Recommendations","summary":" Large language models (LLMs) have attracted significant attention in\nrecommendation systems. Current LLM-based recommender systems primarily rely on\nsupervised fine-tuning (SFT) to train the model for recommendation tasks.\nHowever, relying solely on positive samples limits the model's ability to align\nwith user satisfaction and expectations. To address this, researchers have\nintroduced Direct Preference Optimization (DPO), which explicitly aligns\nrecommendations with user preferences using offline preference ranking data.\nDespite its advantages, our theoretical analysis reveals that DPO inherently\nbiases the model towards a few items, exacerbating the filter bubble issue and\nultimately degrading user experience. In this paper, we propose SPRec, a novel\nself-play recommendation framework designed to mitigate over-recommendation and\nimprove fairness without requiring additional data or manual intervention. 
In\neach self-play iteration, the model undergoes an SFT step followed by a DPO\nstep, treating offline interaction data as positive samples and the predicted\noutputs from the previous iteration as negative samples. This effectively\nre-weights the DPO loss function using the model's logits, adaptively\nsuppressing biased items. Extensive experiments on multiple real-world datasets\ndemonstrate SPRec's effectiveness in enhancing recommendation accuracy and\naddressing fairness concerns.\n","authors":["Chongming Gao","Ruijun Chen","Shuai Yuan","Kexin Huang","Yuanqing Yu","Xiangnan He"],"pdf_url":"https://arxiv.org/pdf/2412.09243v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.04108v2","updated":"2024-12-12T10:56:35Z","published":"2024-04-05T14:04:07Z","title":"Large language models as oracles for instantiating ontologies with\n domain-specific knowledge","summary":" Background. Endowing intelligent systems with semantic data commonly requires\ndesigning and instantiating ontologies with domain-specific knowledge.\nEspecially in the early phases, those activities are typically performed\nmanually by human experts possibly leveraging on their own experience. The\nresulting process is therefore time-consuming, error-prone, and often biased by\nthe personal background of the ontology designer. Objective. To mitigate that\nissue, we propose a novel domain-independent approach to automatically\ninstantiate ontologies with domain-specific knowledge, by leveraging on large\nlanguage models (LLMs) as oracles. Method. Starting from (i) an initial schema\ncomposed by inter-related classes and properties and (ii) a set of query\ntemplates, our method queries the LLM multiple times, and generates instances\nfor both classes and properties from its replies. Thus, the ontology is\nautomatically filled with domain-specific knowledge, compliant to the initial\nschema. As a result, the ontology is quickly and automatically enriched with\nmanifold instances, which experts may consider to keep, adjust, discard, or\ncomplement according to their own needs and expertise. Contribution. We\nformalise our method in general way and instantiate it over various LLMs, as\nwell as on a concrete case study. We report experiments rooted in the\nnutritional domain where an ontology of food meals and their ingredients is\nautomatically instantiated from scratch, starting from a categorisation of\nmeals and their relationships. There, we analyse the quality of the generated\nontologies and compare ontologies attained by exploiting different LLMs.\nExperimentally, our approach achieves a quality metric that is up to five times\nhigher than the state-of-the-art, while reducing erroneous entities and\nrelations by up to ten times. Finally, we provide a SWOT analysis of the\nproposed method.\n","authors":["Giovanni Ciatto","Andrea Agiollo","Matteo Magnini","Andrea Omicini"],"pdf_url":"https://arxiv.org/pdf/2404.04108v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09165v1","updated":"2024-12-12T10:50:26Z","published":"2024-12-12T10:50:26Z","title":"When Text Embedding Meets Large Language Model: A Comprehensive Survey","summary":" Text embedding has become a foundational technology in natural language\nprocessing (NLP) during the deep learning era, driving advancements across a\nwide array of downstream tasks. 
While many natural language understanding\nchallenges can now be modeled using generative paradigms and leverage the\nrobust generative and comprehension capabilities of large language models\n(LLMs), numerous practical applications, such as semantic matching, clustering,\nand information retrieval, continue to rely on text embeddings for their\nefficiency and effectiveness. In this survey, we categorize the interplay\nbetween LLMs and text embeddings into three overarching themes: (1)\nLLM-augmented text embedding, enhancing traditional embedding methods with\nLLMs; (2) LLMs as text embedders, utilizing their innate capabilities for\nembedding generation; and (3) Text embedding understanding with LLMs,\nleveraging LLMs to analyze and interpret embeddings. By organizing these\nefforts based on interaction patterns rather than specific downstream\napplications, we offer a novel and systematic overview of contributions from\nvarious research and application domains in the era of LLMs. Furthermore, we\nhighlight the unresolved challenges that persisted in the pre-LLM era with\npre-trained language models (PLMs) and explore the emerging obstacles brought\nforth by LLMs. Building on this analysis, we outline prospective directions for\nthe evolution of text embedding, addressing both theoretical and practical\nopportunities in the rapidly advancing landscape of NLP.\n","authors":["Zhijie Nie","Zhangchi Feng","Mingxin Li","Cunwang Zhang","Yanzhao Zhang","Dingkun Long","Richong Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09165v1.pdf","comment":"Work in progress"},{"id":"http://arxiv.org/abs/2411.13173v2","updated":"2024-12-12T10:22:37Z","published":"2024-11-20T10:17:09Z","title":"Writing Style Matters: An Examination of Bias and Fairness in\n Information Retrieval Systems","summary":" The rapid advancement of Language Model technologies has opened new\nopportunities, but also introduced new challenges related to bias and fairness.\nThis paper explores the uncharted territory of potential biases in\nstate-of-the-art universal text embedding models towards specific document and\nquery writing styles within Information Retrieval (IR) systems. Our\ninvestigation reveals that different embedding models exhibit different\npreferences of document writing style, while more informal and emotive styles\nare less favored by most embedding models. In terms of query writing styles,\nmany embedding models tend to match the style of the query with the style of\nthe retrieved documents, but some show a consistent preference for specific\nstyles. Text embedding models fine-tuned on synthetic data generated by LLMs\ndisplay a consistent preference for certain style of generated data. These\nbiases in text embedding based IR systems can inadvertently silence or\nmarginalize certain communication styles, thereby posing a significant threat\nto fairness in information retrieval. Finally, we also compare the answer\nstyles of Retrieval Augmented Generation (RAG) systems based on different LLMs\nand find out that most text embedding models are biased towards LLM's answer\nstyles when used as evaluation metrics for answer correctness. 
This study sheds\nlight on the critical issue of writing style based bias in IR systems, offering\nvaluable insights for the development of more fair and robust models.\n","authors":["Hongliu Cao"],"pdf_url":"https://arxiv.org/pdf/2411.13173v2.pdf","comment":"In Proceedings of the Eighteenth ACM International Conference on Web\n Search and Data Mining (WSDM 25)"},{"id":"http://arxiv.org/abs/2408.09671v2","updated":"2024-12-12T08:48:27Z","published":"2024-08-19T03:13:20Z","title":"GANPrompt: Enhancing Robustness in LLM-Based Recommendations with\n GAN-Enhanced Diversity Prompts","summary":" In recent years, Large Language Models (LLMs) have demonstrated remarkable\nproficiency in comprehending and generating natural language, with a growing\nprevalence in the domain of recommendation systems. However, LLMs still face a\nsignificant challenge called prompt sensitivity, meaning that they are\nhighly susceptible to the influence of prompt words. This inconsistency in\nresponse to minor alterations in prompt input may compromise the accuracy and\nresilience of recommendation models. To address this issue, this paper proposes\nGANPrompt, a multi-dimensional LLM prompt diversity framework based on\nGenerative Adversarial Networks (GANs). The framework enhances the model's\nadaptability and stability to diverse prompts by integrating GAN generation\ntechniques with the deep semantic understanding capabilities of LLMs. GANPrompt\nfirst trains a generator capable of producing diverse prompts by analysing\nmultidimensional user behavioural data. These diverse prompts are then used to\ntrain the LLMs to improve their performance in the face of unseen prompts.\nFurthermore, to ensure a high degree of diversity and relevance of the prompts,\nthis study introduces a mathematical theory-based diversity constraint\nmechanism that optimises the generated prompts to ensure that they are not only\nsuperficially distinct, but also semantically cover a wide range of user\nintentions. Through extensive experiments on multiple datasets, we demonstrate\nthe effectiveness of the proposed framework, especially in improving the\nadaptability and robustness of recommendation systems in complex and dynamic\nenvironments. The experimental results demonstrate that GANPrompt yields\nsubstantial enhancements in accuracy and robustness relative to existing\nstate-of-the-art methodologies.\n","authors":["Xinyu Li","Chuang Zhao","Hongke Zhao","Likang Wu","Ming HE"],"pdf_url":"https://arxiv.org/pdf/2408.09671v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08950v1","updated":"2024-12-12T05:28:34Z","published":"2024-12-12T05:28:34Z","title":"Predicting Quality of Video Gaming Experience Using Global-Scale\n Telemetry Data and Federated Learning","summary":" Frames Per Second (FPS) significantly affects the gaming experience.\nProviding players with accurate FPS estimates prior to purchase benefits both\nplayers and game developers. However, we have a limited understanding of how to\npredict a game's FPS performance on a specific device. In this paper, we first\nconduct a comprehensive analysis of a wide range of factors that may affect\ngame FPS on a global-scale dataset to identify the determinants of FPS. This\nincludes player-side and game-side characteristics, as well as country-level\nsocio-economic statistics. Furthermore, recognizing that accurate FPS\npredictions require extensive user data, which raises privacy concerns, we\npropose a federated learning-based model to ensure user privacy. 
Each player\nand game is assigned a unique learnable knowledge kernel that gradually\nextracts latent features for improved accuracy. We also introduce a novel\ntraining and prediction scheme that allows these kernels to be dynamically\nplug-and-play, effectively addressing cold start issues. To train this model\nwith minimal bias, we collected a large telemetry dataset from 224 countries\nand regions, 100,000 users, and 835 games. Our model achieved a mean\nWasserstein distance of 0.469 between predicted and ground truth FPS\ndistributions, outperforming all baseline methods.\n","authors":["Zhongyang Zhang","Jinhe Wen","Zixi Chen","Dara Arbab","Sruti Sahani","Bijan Arbab","Haojian Jin","Tauhidur Rahman"],"pdf_url":"https://arxiv.org/pdf/2412.08950v1.pdf","comment":"22 pages, 11 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.08922v1","updated":"2024-12-12T04:13:09Z","published":"2024-12-12T04:13:09Z","title":"A Flexible Plug-and-Play Module for Generating Variable-Length","summary":" Deep supervised hashing has become a pivotal technique in large-scale image\nretrieval, offering significant benefits in terms of storage and search\nefficiency. However, existing deep supervised hashing models predominantly\nfocus on generating fixed-length hash codes. This approach fails to address the\ninherent trade-off between efficiency and effectiveness when using hash codes\nof varying lengths. To determine the optimal hash code length for a specific\ntask, multiple models must be trained for different lengths, leading to\nincreased training time and computational overhead. Furthermore, the current\nparadigm overlooks the potential relationships between hash codes of different\nlengths, limiting the overall effectiveness of the models. To address these\nchallenges, we propose the Nested Hash Layer (NHL), a plug-and-play module\ndesigned for existing deep supervised hashing models. The NHL framework\nintroduces a novel mechanism to simultaneously generate hash codes of varying\nlengths in a nested manner. To tackle the optimization conflicts arising from\nthe multiple learning objectives associated with different code lengths, we\nfurther propose an adaptive weights strategy that dynamically monitors and\nadjusts gradients during training. Additionally, recognizing that the\nstructural information in longer hash codes can provide valuable guidance for\nshorter hash codes, we develop a long-short cascade self-distillation method\nwithin the NHL to enhance the overall quality of the generated hash codes.\nExtensive experiments demonstrate that NHL not only accelerates the training\nprocess but also achieves superior retrieval performance across various deep\nhashing models. Our code is publicly available at\nhttps://github.com/hly1998/NHL.\n","authors":["Liyang He","Yuren Zhang","Rui Li","Zhenya Huang","Runze Wu","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.08922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08911v1","updated":"2024-12-12T03:47:40Z","published":"2024-12-12T03:47:40Z","title":"Goal-Conditioned Supervised Learning for Multi-Objective Recommendation","summary":" Multi-objective learning endeavors to concurrently optimize multiple\nobjectives using a single model, aiming to achieve high and balanced\nperformance across these diverse objectives. However, it often involves a more\ncomplex optimization problem, particularly when navigating potential conflicts\nbetween objectives, leading to solutions with higher memory requirements and\ncomputational complexity. 
This paper introduces a Multi-Objective\nGoal-Conditioned Supervised Learning (MOGCSL) framework for automatically\nlearning to achieve multiple objectives from offline sequential data. MOGCSL\nextends the conventional Goal-Conditioned Supervised Learning (GCSL) method to\nmulti-objective scenarios by redefining goals from one-dimensional scalars to\nmulti-dimensional vectors. The need for complex architectures and optimization\nconstraints can be naturally eliminated. MOGCSL benefits from filtering out\nuninformative or noisy instances that do not achieve desirable long-term\nrewards. It also incorporates a novel goal-choosing algorithm to model and\nselect \"high\" achievable goals for inference.\n While MOGCSL is quite general, we focus on its application to the next action\nprediction problem in commercial-grade recommender systems. In this context,\nany viable solution needs to be reasonably scalable and also be robust to large\namounts of noisy data that is characteristic of this application space. We show\nthat MOGCSL performs admirably on both counts. Specifically, extensive\nexperiments conducted on real-world recommendation datasets validate its\nefficacy and efficiency. Also, analysis and experiments are included to explain\nits strength in discounting the noisier portions of training data in\nrecommender systems.\n","authors":["Shijun Li","Hilaf Hasson","Jing Hu","Joydeep Ghosh"],"pdf_url":"https://arxiv.org/pdf/2412.08911v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10613v2","updated":"2024-12-12T03:24:29Z","published":"2024-08-20T07:48:19Z","title":"Task-level Distributionally Robust Optimization for Large Language\n Model-based Dense Retrieval","summary":" Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous\nheterogeneous fine-tuning collections from different domains. However, the\ndiscussion about its training data distribution is still minimal. Previous\nstudies rely on empirically assigned dataset choices or sampling ratios, which\ninevitably lead to sub-optimal retrieval performances. In this paper, we\npropose a new task-level Distributionally Robust Optimization (tDRO) algorithm\nfor LLM-DR fine-tuning, targeted at improving the universal domain\ngeneralization ability by end-to-end reweighting the data distribution of each\ntask. The tDRO parameterizes the domain weights and updates them with scaled\ndomain gradients. The optimized weights are then transferred to the LLM-DR\nfine-tuning to train more robust retrievers. Experiments show optimal\nimprovements in large-scale retrieval benchmarks and reduce up to 30% dataset\nusage after applying our optimization algorithm with a series of\ndifferent-sized LLM-DR models.\n","authors":["Guangyuan Ma","Yongliang Ma","Xing Wu","Zhenpeng Su","Ming Zhou","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2408.10613v2.pdf","comment":"Accepted by AAAI25. Source code is available at\n https://github.com/tdro-llm/tdro"},{"id":"http://arxiv.org/abs/2211.14219v2","updated":"2024-12-12T01:11:06Z","published":"2022-11-25T16:31:10Z","title":"The Informational Role of Online Recommendations: Evidence from a Field\n Experiment","summary":" We conduct a field experiment on a movie-recommendation platform to\ninvestigate whether and how online recommendations influence consumption\nchoices. 
Using a within-subjects design, our experiment measures the causal\neffect of recommendations on consumption and decomposes the relative importance\nof two economic mechanisms: expanding consumers' consideration sets and\nproviding information about their idiosyncratic match value. We find that the\ninformational component exerts a stronger influence - recommendations shape\nconsumer beliefs, which in turn drive consumption, particularly among less\nexperienced consumers. Our findings and experimental design provide valuable\ninsights for the economic evaluation and optimisation of online recommendation\nsystems.\n","authors":["Guy Aridor","Duarte Goncalves","Daniel Kluver","Ruoyan Kong","Joseph Konstan"],"pdf_url":"https://arxiv.org/pdf/2211.14219v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08847v1","updated":"2024-12-12T01:02:09Z","published":"2024-12-12T01:02:09Z","title":"MOPI-HFRS: A Multi-objective Personalized Health-aware Food\n Recommendation System with LLM-enhanced Interpretation","summary":" The prevalence of unhealthy eating habits has become an increasingly\nconcerning issue in the United States. However, major food recommendation\nplatforms (e.g., Yelp) continue to prioritize users' dietary preferences over\nthe healthiness of their choices. Although efforts have been made to develop\nhealth-aware food recommendation systems, the personalization of such systems\nbased on users' specific health conditions remains under-explored. In addition,\nlittle research focuses on the interpretability of these systems, which hinders\nusers from assessing the reliability of recommendations and impedes the\npractical deployment of these systems. In response to this gap, we first\nestablish two large-scale personalized health-aware food recommendation\nbenchmarks, the first of their kind. We then develop a novel framework,\nMulti-Objective Personalized Interpretable Health-aware Food Recommendation\nSystem (MOPI-HFRS), which provides food recommendations by jointly optimizing\nthe three objectives: user preference, personalized healthiness and nutritional\ndiversity, along with a large language model (LLM)-enhanced reasoning module\nto promote healthy dietary knowledge through the interpretation of recommended\nresults. Specifically, this holistic graph learning framework first utilizes\ntwo structure learning modules and a structure pooling module to leverage both\ndescriptive features and health data. Then it employs Pareto optimization to\nachieve the designed multi-facet objectives. Finally, to further promote\nhealthy dietary knowledge and awareness, we exploit an LLM by utilizing\nknowledge-infusion, prompting it with knowledge obtained from the\nrecommendation model for interpretation.\n","authors":["Zheyuan Zhang","Zehong Wang","Tianyi Ma","Varun Sameer Taneja","Sofia Nelson","Nhi Ha Lan Le","Keerthiram Murugesan","Mingxuan Ju","Nitesh V Chawla","Chuxu Zhang","Yanfang Ye"],"pdf_url":"https://arxiv.org/pdf/2412.08847v1.pdf","comment":null}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.09627v1","updated":"2024-12-12T18:59:59Z","published":"2024-12-12T18:59:59Z","title":"Doe-1: Closed-Loop Autonomous Driving with Large World Model","summary":" End-to-end autonomous driving has received increasing attention due to its\npotential to learn from large amounts of data. However, most existing methods\nare still open-loop and suffer from weak scalability, lack of high-order\ninteractions, and inefficient decision-making. 
In this paper, we explore a\nclosed-loop framework for autonomous driving and propose a large Driving wOrld\nmodEl (Doe-1) for unified perception, prediction, and planning. We formulate\nautonomous driving as a next-token generation problem and use multi-modal\ntokens to accomplish different tasks. Specifically, we use free-form texts\n(i.e., scene descriptions) for perception and generate future predictions\ndirectly in the RGB space with image tokens. For planning, we employ a\nposition-aware tokenizer to effectively encode action into discrete tokens. We\ntrain a multi-modal transformer to autoregressively generate perception,\nprediction, and planning tokens in an end-to-end and unified manner.\nExperiments on the widely used nuScenes dataset demonstrate the effectiveness\nof Doe-1 in various tasks including visual question-answering,\naction-conditioned video generation, and motion planning. Code:\nhttps://github.com/wzzheng/Doe.\n","authors":["Wenzhao Zheng","Zetian Xia","Yuanhui Huang","Sicheng Zuo","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.09627v1.pdf","comment":"Code is available at: https://github.com/wzzheng/Doe"},{"id":"http://arxiv.org/abs/2412.09607v1","updated":"2024-12-12T18:59:31Z","published":"2024-12-12T18:59:31Z","title":"Spectral Image Tokenizer","summary":" Image tokenizers map images to sequences of discrete tokens, and are a\ncrucial component of autoregressive transformer-based image generation. The\ntokens are typically associated with spatial locations in the input image,\narranged in raster scan order, which is not ideal for autoregressive modeling.\nIn this paper, we propose to tokenize the image spectrum instead, obtained from\na discrete wavelet transform (DWT), such that the sequence of tokens represents\nthe image in a coarse-to-fine fashion. Our tokenizer brings several advantages:\n1) it leverages that natural images are more compressible at high frequencies,\n2) it can take and reconstruct images of different resolutions without\nretraining, 3) it improves the conditioning for next-token prediction --\ninstead of conditioning on a partial line-by-line reconstruction of the image,\nit takes a coarse reconstruction of the full image, 4) it enables partial\ndecoding where the first few generated tokens can reconstruct a coarse version\nof the image, 5) it enables autoregressive models to be used for image\nupsampling. We evaluate the tokenizer reconstruction metrics as well as\nmultiscale image generation, text-guided image upsampling and editing.\n","authors":["Carlos Esteves","Mohammed Suhail","Ameesh Makadia"],"pdf_url":"https://arxiv.org/pdf/2412.09607v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09602v1","updated":"2024-12-12T18:59:13Z","published":"2024-12-12T18:59:13Z","title":"Hidden Biases of End-to-End Driving Datasets","summary":" End-to-end driving systems have made rapid progress, but have so far not been\napplied to the challenging new CARLA Leaderboard 2.0. Further, while there is a\nlarge body of literature on end-to-end architectures and training strategies,\nthe impact of the training dataset is often overlooked. In this work, we make a\nfirst attempt at end-to-end driving for Leaderboard 2.0. Instead of\ninvestigating architectures, we systematically analyze the training dataset,\nleading to new insights: (1) Expert style significantly affects downstream\npolicy performance. (2) In complex data sets, the frames should not be weighted\non the basis of simplistic criteria such as class frequencies. 
(3) Instead,\nestimating whether a frame changes the target labels compared to previous\nframes can reduce the size of the dataset without removing important\ninformation. By incorporating these findings, our model ranks first and second\nrespectively on the map and sensors tracks of the 2024 CARLA Challenge, and\nsets a new state-of-the-art on the Bench2Drive test routes. Finally, we uncover\na design flaw in the current evaluation metrics and propose a modification for\nfuture challenges. Our dataset, code, and pre-trained models are publicly\navailable at https://github.com/autonomousvision/carla_garage.\n","authors":["Julian Zimmerlin","Jens Beißwenger","Bernhard Jaeger","Andreas Geiger","Kashyap Chitta"],"pdf_url":"https://arxiv.org/pdf/2412.09602v1.pdf","comment":"Technical report for the CVPR 2024 Workshop on Foundation Models for\n Autonomous Systems. Runner-up of the track 'CARLA Autonomous Driving\n Challenge' in the 2024 Autonomous Grand Challenge\n (https://opendrivelab.com/challenge2024/)"},{"id":"http://arxiv.org/abs/2412.09600v1","updated":"2024-12-12T18:59:01Z","published":"2024-12-12T18:59:01Z","title":"Owl-1: Omni World Model for Consistent Long Video Generation","summary":" Video generation models (VGMs) have received extensive attention recently and\nserve as promising candidates for general-purpose large vision models. While\nthey can only generate short videos each time, existing methods achieve long\nvideo generation by iteratively calling the VGMs, using the last-frame output\nas the condition for the next-round generation. However, the last frame only\ncontains short-term fine-grained information about the scene, resulting in\ninconsistency in the long horizon. To address this, we propose an Omni World\nmodeL (Owl-1) to produce long-term coherent and comprehensive conditions for\nconsistent long video generation. As videos are observations of the underlying\nevolving world, we propose to model the long-term developments in a latent\nspace and use VGMs to film them into videos. Specifically, we represent the\nworld with a latent state variable which can be decoded into explicit video\nobservations. These observations serve as a basis for anticipating temporal\ndynamics which in turn update the state variable. The interaction between\nevolving dynamics and persistent state enhances the diversity and consistency\nof the long videos. Extensive experiments show that Owl-1 achieves comparable\nperformance with SOTA methods on VBench-I2V and VBench-Long, validating its\nability to generate high-quality video observations. Code:\nhttps://github.com/huang-yh/Owl.\n","authors":["Yuanhui Huang","Wenzhao Zheng","Yuan Gao","Xin Tao","Pengfei Wan","Di Zhang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.09600v1.pdf","comment":"Code is available at: https://github.com/huang-yh/Owl"},{"id":"http://arxiv.org/abs/2406.09390v2","updated":"2024-12-12T18:58:34Z","published":"2024-06-13T17:59:05Z","title":"LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living","summary":" Current Large Language Vision Models (LLVMs) trained on web videos perform\nwell in general video understanding but struggle with fine-grained details,\ncomplex human-object interactions (HOI), and view-invariant representation\nlearning essential for Activities of Daily Living (ADL). This limitation stems\nfrom a lack of specialized ADL video instruction-tuning datasets and\ninsufficient modality integration to capture discriminative action\nrepresentations. 
To address this, we propose a semi-automated framework for\ncurating ADL datasets, creating ADL-X, a multiview, multimodal RGBS\ninstruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM\nintegrating videos, 3D skeletons, and HOIs to model ADL's complex\nspatiotemporal relationships. For training LLAVIDAL a simple joint alignment of\nall modalities yields suboptimal results; thus, we propose a Multimodal\nProgressive (MMPro) training strategy, incorporating modalities in stages\nfollowing a curriculum. We also establish ADL MCQ and video description\nbenchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL\nachieves state-of-the-art performance across ADL benchmarks. Code and data will\nbe made publicly available at: https://adl-x.github.io/.\n","authors":["Dominick Reilly","Rajatsubhra Chakraborty","Arkaprava Sinha","Manish Kumar Govind","Pu Wang","Francois Bremond","Le Xue","Srijan Das"],"pdf_url":"https://arxiv.org/pdf/2406.09390v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09594v1","updated":"2024-12-12T18:58:14Z","published":"2024-12-12T18:58:14Z","title":"Wait-Less Offline Tuning and Re-solving for Online Decision Making","summary":" Online linear programming (OLP) has found broad applications in revenue\nmanagement and resource allocation. State-of-the-art OLP algorithms achieve low\nregret by repeatedly solving linear programming (LP) subproblems that\nincorporate updated resource information. However, LP-based methods are\ncomputationally expensive and often inefficient for large-scale applications.\nIn contrast, recent first-order OLP algorithms are more computationally\nefficient but typically suffer from worse regret guarantees. To address these\nshortcomings, we propose a new algorithm that combines the strengths of\nLP-based and first-order OLP methods. The algorithm re-solves the LP\nsubproblems periodically at a predefined frequency $f$ and uses the latest dual\nprices to guide online decision-making. In addition, a first-order method runs\nin parallel during each interval between LP re-solves, smoothing resource\nconsumption. Our algorithm achieves $\\mathscr{O}(\\log (T/f) + \\sqrt{f})$\nregret, delivering a \"wait-less\" online decision-making process that balances\nthe computational efficiency of first-order methods and the superior regret\nguarantee of LP-based methods.\n","authors":["Jingruo Sun","Wenzhi Gao","Ellen Vitercik","Yinyu Ye"],"pdf_url":"https://arxiv.org/pdf/2412.09594v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09582v1","updated":"2024-12-12T18:54:48Z","published":"2024-12-12T18:54:48Z","title":"Neptune: The Long Orbit to Benchmarking Long Video Understanding","summary":" This paper describes a semi-automatic pipeline to generate challenging\nquestion-answer-decoy sets for understanding long videos. Many existing video\ndatasets and models are focused on short clips (10s-30s). While some long video\ndatasets do exist, they can often be solved by powerful image models applied\nper frame (and often to very few frames) in a video, and are usually manually\nannotated at high cost. In order to mitigate both these problems, we propose a\nscalable dataset creation pipeline which leverages large models (VLMs and\nLLMs), to automatically generate dense, time-aligned video captions, as well as\ntough question answer decoy sets for video segments (up to 15 minutes in\nlength). 
Our dataset Neptune covers a broad range of long video reasoning\nabilities and consists of a subset that emphasizes multimodal reasoning. Since\nexisting metrics for open-ended question answering are either rule-based or may\nrely on proprietary models, we provide a new open source model-based metric GEM\nto score open-ended responses on Neptune. Benchmark evaluations reveal that\nmost current open-source long video models perform poorly on Neptune,\nparticularly on questions testing temporal ordering, counting and state\nchanges. Through Neptune, we aim to spur the development of more advanced\nmodels capable of understanding long videos. The dataset is available at\nhttps://github.com/google-deepmind/neptune\n","authors":["Arsha Nagrani","Mingda Zhang","Ramin Mehran","Rachel Hornung","Nitesh Bharadwaj Gundavarapu","Nilpa Jha","Austin Myers","Xingyi Zhou","Boqing Gong","Cordelia Schmid","Mikhail Sirotenko","Yukun Zhu","Tobias Weyand"],"pdf_url":"https://arxiv.org/pdf/2412.09582v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09579v1","updated":"2024-12-12T18:54:07Z","published":"2024-12-12T18:54:07Z","title":"A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural\n Networks","summary":" Knowledge distillation, where a small student model learns from a pre-trained\nlarge teacher model, has achieved substantial empirical success since the\nseminal work of \\citep{hinton2015distilling}. Despite prior theoretical studies\nexploring the benefits of knowledge distillation, an important question remains\nunanswered: why does soft-label training from the teacher require significantly\nfewer neurons than directly training a small neural network with hard labels?\nTo address this, we first present motivating experimental results using simple\nneural network models on a binary classification problem. These results\ndemonstrate that soft-label training consistently outperforms hard-label\ntraining in accuracy, with the performance gap becoming more pronounced as the\ndataset becomes increasingly difficult to classify. We then substantiate these\nobservations with a theoretical contribution based on two-layer neural network\nmodels. Specifically, we show that soft-label training using gradient descent\nrequires only $O\\left(\\frac{1}{\\gamma^2 \\epsilon}\\right)$ neurons to achieve a\nclassification loss averaged over epochs smaller than some $\\epsilon > 0$,\nwhere $\\gamma$ is the separation margin of the limiting kernel. In contrast,\nhard-label training requires $O\\left(\\frac{1}{\\gamma^4} \\cdot\n\\ln\\left(\\frac{1}{\\epsilon}\\right)\\right)$ neurons, as derived from an adapted\nversion of the gradient descent analysis in \\citep{ji2020polylogarithmic}. This\nimplies that when $\\gamma \\leq \\epsilon$, i.e., when the dataset is challenging\nto classify, the neuron requirement for soft-label training can be\nsignificantly lower than that for hard-label training. Finally, we present\nexperimental results on deep neural networks, further validating these\ntheoretical findings.\n","authors":["Saptarshi Mandal","Xiaojun Lin","R. 
Srikant"],"pdf_url":"https://arxiv.org/pdf/2412.09579v1.pdf","comment":"Main Body of the Paper is under Review at L4DC 2025"},{"id":"http://arxiv.org/abs/2412.09569v1","updated":"2024-12-12T18:51:13Z","published":"2024-12-12T18:51:13Z","title":"JuStRank: Benchmarking LLM Judges for System Ranking","summary":" Given the rapid progress of generative AI, there is a pressing need to\nsystematically compare and choose between the numerous models and\nconfigurations available. The scale and versatility of such evaluations make\nthe use of LLM-based judges a compelling solution for this challenge.\nCrucially, this approach requires first to validate the quality of the LLM\njudge itself. Previous work has focused on instance-based assessment of LLM\njudges, where a judge is evaluated over a set of responses, or response pairs,\nwhile being agnostic to their source systems. We argue that this setting\noverlooks critical factors affecting system-level ranking, such as a judge's\npositive or negative bias towards certain systems. To address this gap, we\nconduct the first large-scale study of LLM judges as system rankers. System\nscores are generated by aggregating judgment scores over multiple system\noutputs, and the judge's quality is assessed by comparing the resulting system\nranking to a human-based ranking. Beyond overall judge assessment, our analysis\nprovides a fine-grained characterization of judge behavior, including their\ndecisiveness and bias.\n","authors":["Ariel Gera","Odellia Boni","Yotam Perlitz","Roy Bar-Haim","Lilach Eden","Asaf Yehudai"],"pdf_url":"https://arxiv.org/pdf/2412.09569v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09565v1","updated":"2024-12-12T18:49:53Z","published":"2024-12-12T18:49:53Z","title":"Obfuscated Activations Bypass LLM Latent-Space Defenses","summary":" Recent latent-space monitoring techniques have shown promise as defenses\nagainst LLM attacks. These defenses act as scanners that seek to detect harmful\nactivations before they lead to undesirable actions. This prompts the question:\nCan models execute harmful behavior via inconspicuous latent states? Here, we\nstudy such obfuscated activations. We show that state-of-the-art latent-space\ndefenses -- including sparse autoencoders, representation probing, and latent\nOOD detection -- are all vulnerable to obfuscated activations. For example,\nagainst probes trained to classify harmfulness, our attacks can often reduce\nrecall from 100% to 0% while retaining a 90% jailbreaking rate. However,\nobfuscation has limits: we find that on a complex task (writing SQL code),\nobfuscation reduces model performance. Together, our results demonstrate that\nneural activations are highly malleable: we can reshape activation patterns in\na variety of ways, often while preserving a network's behavior. This poses a\nfundamental challenge to latent-space defenses.\n","authors":["Luke Bailey","Alex Serrano","Abhay Sheshadri","Mikhail Seleznyov","Jordan Taylor","Erik Jenner","Jacob Hilton","Stephen Casper","Carlos Guestrin","Scott Emmons"],"pdf_url":"https://arxiv.org/pdf/2412.09565v1.pdf","comment":"Project page: https://obfuscated-activations.github.io/"},{"id":"http://arxiv.org/abs/2412.09564v1","updated":"2024-12-12T18:49:11Z","published":"2024-12-12T18:49:11Z","title":"Improving the Reliability of Cable Broadband Networks via Proactive\n Network Maintenance","summary":" Cable broadband networks are one of the few \"last-mile\" broadband\ntechnologies widely available in the U.S. 
Unfortunately, they have poor\nreliability after decades of deployment. The cable industry proposed a\nframework called Proactive Network Maintenance (PNM) to diagnose the cable\nnetworks. However, there is little public knowledge or systematic study on how\nto use these data to detect and localize cable network problems. Existing tools\nin the public domain have prohibitive high false-positive rates. In this paper,\nwe propose CableMon, the first public-domain system that applies machine\nlearning techniques to PNM data to improve the reliability of cable broadband\nnetworks. CableMon tackles two key challenges faced by cable ISPs: accurately\ndetecting failures, and distinguishing whether a failure occurs within a\nnetwork or at a subscriber's premise. CableMon uses statistical models to\ngenerate features from time series data and uses customer trouble tickets as\nhints to infer abnormal/failure thresholds for these generated features.\nFurther, CableMon employs an unsupervised learning model to group cable devices\nsharing similar anomalous patterns and effectively identify impairments that\noccur inside a cable network and impairments occur at a subscriber's premise,\nas these two different faults require different types of technical personnel to\nrepair them. We use eight months of PNM data and customer trouble tickets from\nan ISP and experimental deployment to evaluate CableMon's performance. Our\nevaluation results show that CableMon can effectively detect and distinguish\nfailures from PNM data and outperforms existing public-domain tools.\n","authors":["Jiyao Hu","Zhenyu Zhou","Xiaowei Yang"],"pdf_url":"https://arxiv.org/pdf/2412.09564v1.pdf","comment":"15 pages including reference. Submitted to IEEE/ACM Transactions on\n Networking. Partly published in NSDI'20, this is the extended version"},{"id":"http://arxiv.org/abs/2412.09563v1","updated":"2024-12-12T18:48:51Z","published":"2024-12-12T18:48:51Z","title":"Does Representation Matter? Exploring Intermediate Layers in Large\n Language Models","summary":" Understanding what defines a good representation in large language models\n(LLMs) is fundamental to both theoretical understanding and practical\napplications. In this paper, we investigate the quality of intermediate\nrepresentations in various LLM architectures, including Transformers and State\nSpace Models (SSMs). We find that intermediate layers often yield more\ninformative representations for downstream tasks than the final layers. To\nmeasure the representation quality, we adapt and apply a suite of metrics -\nsuch as prompt entropy, curvature, and augmentation-invariance - originally\nproposed in other contexts. Our empirical study reveals significant\narchitectural differences, how representations evolve throughout training, and\nhow factors like input randomness and prompt length affect each layer. Notably,\nwe observe a bimodal pattern in the entropy of some intermediate layers and\nconsider potential explanations tied to training data. 
Overall, our results\nilluminate the internal mechanics of LLMs and guide strategies for\narchitectural optimization and training.\n","authors":["Oscar Skean","Md Rifat Arefin","Yann LeCun","Ravid Shwartz-Ziv"],"pdf_url":"https://arxiv.org/pdf/2412.09563v1.pdf","comment":"Accepted to 2024 NeurIPs Workshop on Machine Learning and Compression"},{"id":"http://arxiv.org/abs/2409.19069v3","updated":"2024-12-12T18:48:25Z","published":"2024-09-27T18:11:00Z","title":"Localizing Memorization in SSL Vision Encoders","summary":" Recent work on studying memorization in self-supervised learning (SSL)\nsuggests that even though SSL encoders are trained on millions of images, they\nstill memorize individual data points. While effort has been put into\ncharacterizing the memorized data and linking encoder memorization to\ndownstream utility, little is known about where the memorization happens inside\nSSL encoders. To close this gap, we propose two metrics for localizing\nmemorization in SSL encoders on a per-layer (layermem) and per-unit basis\n(unitmem). Our localization methods are independent of the downstream task, do\nnot require any label information, and can be performed in a forward pass. By\nlocalizing memorization in various encoder architectures (convolutional and\ntransformer-based) trained on diverse datasets with contrastive and\nnon-contrastive SSL frameworks, we find that (1) while SSL memorization\nincreases with layer depth, highly memorizing units are distributed across the\nentire encoder, (2) a significant fraction of units in SSL encoders experiences\nsurprisingly high memorization of individual data points, which is in contrast\nto models trained under supervision, (3) atypical (or outlier) data points\ncause much higher layer and unit memorization than standard data points, and\n(4) in vision transformers, most memorization happens in the fully-connected\nlayers. Finally, we show that localizing memorization in SSL has the potential\nto improve fine-tuning and to inform pruning strategies.\n","authors":["Wenhao Wang","Adam Dziedzic","Michael Backes","Franziska Boenisch"],"pdf_url":"https://arxiv.org/pdf/2409.19069v3.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.09557v1","updated":"2024-12-12T18:44:38Z","published":"2024-12-12T18:44:38Z","title":"Experimental Machine Learning with Classical and Quantum Data via NMR\n Quantum Kernels","summary":" Kernel methods map data into high-dimensional spaces, enabling linear\nalgorithms to learn nonlinear functions without explicitly storing the feature\nvectors. Quantum kernel methods promise efficient learning by encoding feature\nmaps into exponentially large Hilbert spaces inherent in quantum systems. In\nthis work we implement quantum kernels on a 10-qubit star-topology register in\na nuclear magnetic resonance (NMR) platform. We experimentally encode classical\ndata in the evolution of multiple quantum coherence orders using data-dependent\nunitary transformations and then demonstrate one-dimensional regression and\ntwo-dimensional classification tasks. By extending the register to a\ndouble-layered star configuration, we propose an extended quantum kernel to\nhandle non-parametrized operator inputs. By numerically simulating the extended\nquantum kernel, we show classification of entangling and nonentangling\nunitaries. These results confirm that quantum kernels exhibit strong\ncapabilities in classical as well as quantum machine learning tasks.\n","authors":["Vivek Sabarad","T. S. 
Mahesh"],"pdf_url":"https://arxiv.org/pdf/2412.09557v1.pdf","comment":"8 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.09556v1","updated":"2024-12-12T18:44:36Z","published":"2024-12-12T18:44:36Z","title":"Enhancing Convergence of Decentralized Gradient Tracking under the KL\n Property","summary":" We study decentralized multiagent optimization over networks, modeled as\nundirected graphs. The optimization problem consists of minimizing a nonconvex\nsmooth function plus a convex extended-value function, which enforces\nconstraints or extra structure on the solution (e.g., sparsity, low-rank). We\nfurther assume that the objective function satisfies the Kurdyka-{\\L}ojasiewicz\n(KL) property, with given exponent $\\theta\\in [0,1)$. The KL property is\nsatisfied by several (nonconvex) functions of practical interest, e.g., arising\nfrom machine learning applications; in the centralized setting, it permits to\nachieve strong convergence guarantees. Here we establish convergence of the\nsame type for the notorious decentralized gradient-tracking-based algorithm\nSONATA. Specifically, $\\textbf{(i)}$ when $\\theta\\in (0,1/2]$, the sequence\ngenerated by SONATA converges to a stationary solution of the problem at\nR-linear rate;$ \\textbf{(ii)} $when $\\theta\\in (1/2,1)$, sublinear rate is\ncertified; and finally $\\textbf{(iii)}$ when $\\theta=0$, the iterates will\neither converge in a finite number of steps or converges at R-linear rate. This\nmatches the convergence behavior of centralized proximal-gradient algorithms\nexcept when $\\theta=0$. Numerical results validate our theoretical findings.\n","authors":["Xiaokai Chen","Tianyu Cao","Gesualdo Scutari"],"pdf_url":"https://arxiv.org/pdf/2412.09556v1.pdf","comment":"25 pages, 4 figures"},{"id":"http://arxiv.org/abs/2407.16677v4","updated":"2024-12-12T18:40:16Z","published":"2024-07-23T17:44:54Z","title":"From Imitation to Refinement -- Residual RL for Precise Assembly","summary":" Recent advances in Behavior Cloning (BC) have made it easy to teach robots\nnew tasks. However, we find that the ease of teaching comes at the cost of\nunreliable performance that saturates with increasing data for tasks requiring\nprecision. The performance saturation can be attributed to two critical\nfactors: (a) distribution shift resulting from the use of offline data and (b)\nthe lack of closed-loop corrective control caused by action chucking\n(predicting a set of future actions executed open-loop) critical for BC\nperformance. Our key insight is that by predicting action chunks, BC policies\nfunction more like trajectory \"planners\" than closed-loop controllers necessary\nfor reliable execution. To address these challenges, we devise a simple yet\neffective method, ResiP (Residual for Precise Manipulation), that overcomes the\nreliability problem while retaining BC's ease of teaching and long-horizon\ncapabilities. 
ResiP augments a frozen, chunked BC model with a fully\nclosed-loop residual policy trained with reinforcement learning (RL) that\naddresses distribution shifts and introduces closed-loop corrections over\nopen-loop execution of action chunks predicted by the BC trajectory planner.\nVideos, code, and data: https://residual-assembly.github.io.\n","authors":["Lars Ankile","Anthony Simeonov","Idan Shenfeld","Marcel Torne","Pulkit Agrawal"],"pdf_url":"https://arxiv.org/pdf/2407.16677v4.pdf","comment":"Project website: https://residual-assembly.github.io"},{"id":"http://arxiv.org/abs/2412.09545v1","updated":"2024-12-12T18:35:26Z","published":"2024-12-12T18:35:26Z","title":"SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing","summary":" We introduce SimAvatar, a framework designed to generate simulation-ready\nclothed 3D human avatars from a text prompt. Current text-driven human avatar\ngeneration methods either model hair, clothing, and the human body using a\nunified geometry or produce hair and garments that are not easily adaptable for\nsimulation within existing simulation pipelines. The primary challenge lies in\nrepresenting the hair and garment geometry in a way that allows leveraging\nestablished prior knowledge from foundational image diffusion models (e.g.,\nStable Diffusion) while being simulation-ready using either physics or neural\nsimulators. To address this task, we propose a two-stage framework that\ncombines the flexibility of 3D Gaussians with simulation-ready hair strands and\ngarment meshes. Specifically, we first employ three text-conditioned 3D\ngenerative models to generate garment mesh, body shape and hair strands from\nthe given text prompt. To leverage prior knowledge from foundational diffusion\nmodels, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair\nstrands and learn the avatar appearance through optimization. To drive the\navatar given a pose sequence, we first apply physics simulators onto the\ngarment meshes and hair strands. We then transfer the motion onto 3D Gaussians\nthrough carefully designed mechanisms for each body part. As a result, our\nsynthesized avatars have vivid texture and realistic dynamic motion. To the\nbest of our knowledge, our method is the first to produce highly realistic,\nfully simulation-ready 3D avatars, surpassing the capabilities of current\napproaches.\n","authors":["Xueting Li","Ye Yuan","Shalini De Mello","Gilles Daviet","Jonathan Leaf","Miles Macklin","Jan Kautz","Umar Iqbal"],"pdf_url":"https://arxiv.org/pdf/2412.09545v1.pdf","comment":"Project website: https://nvlabs.github.io/SimAvatar/"},{"id":"http://arxiv.org/abs/2412.09544v1","updated":"2024-12-12T18:34:47Z","published":"2024-12-12T18:34:47Z","title":"Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels\n against Reward Hacking","summary":" Aligning AI systems with human preferences typically suffers from the\ninfamous reward hacking problem, where optimization of an imperfect reward\nmodel leads to undesired behaviors. In this paper, we investigate reward\nhacking in offline preference optimization, which aims to improve an initial\nmodel using a preference dataset. We identify two types of reward hacking\nstemming from statistical fluctuations in the dataset: Type I Reward Hacking\ndue to subpar choices appearing more favorable, and Type II Reward Hacking due\nto decent choices appearing less favorable. 
We prove that many (mainstream or\ntheoretical) preference optimization methods suffer from both types of reward\nhacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference\noptimization method that combines Guiasu's weighted entropy with a robust\nreward maximization objective. POWER enjoys finite-sample guarantees under\ngeneral function approximation, competing with the best covered policy in the\ndata. To mitigate Type II Reward Hacking, we analyze the learning dynamics of\npreference optimization and develop a novel technique that dynamically updates\npreference labels toward certain \"stationary labels\", resulting in diminishing\ngradients for untrustworthy samples. Empirically, POWER with dynamic labels\n(POWER-DL) consistently outperforms state-of-the-art methods on alignment\nbenchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and\n11.5 points on Arena-Hard over DPO, while also improving or maintaining\nperformance on downstream tasks such as mathematical reasoning. Strong\ntheoretical guarantees and empirical results demonstrate the promise of\nPOWER-DL in mitigating reward hacking.\n","authors":["Paria Rashidinejad","Yuandong Tian"],"pdf_url":"https://arxiv.org/pdf/2412.09544v1.pdf","comment":"46 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.09538v1","updated":"2024-12-12T18:28:55Z","published":"2024-12-12T18:28:55Z","title":"Capturing the Temporal Dependence of Training Data Influence","summary":" Traditional data influence estimation methods, like influence function,\nassume that learning algorithms are permutation-invariant with respect to\ntraining data. However, modern training paradigms, especially for foundation\nmodels using stochastic algorithms and multi-stage curricula, are sensitive to\ndata ordering, thus violating this assumption. This mismatch renders influence\nfunctions inadequate for answering a critical question in machine learning: How\ncan we capture the dependence of data influence on the optimization trajectory\nduring training? To address this gap, we formalize the concept of\ntrajectory-specific leave-one-out (LOO) influence, which quantifies the impact\nof removing a data point from a specific iteration during training, accounting\nfor the exact sequence of data encountered and the model's optimization\ntrajectory. However, exactly evaluating the trajectory-specific LOO presents a\nsignificant computational challenge. To address this, we propose data value\nembedding, a novel technique enabling efficient approximation of\ntrajectory-specific LOO. Specifically, we compute a training data embedding\nthat encapsulates the cumulative interactions between data and the evolving\nmodel parameters. The LOO can then be efficiently approximated through a simple\ndot-product between the data value embedding and the gradient of the given test\ndata. As data value embedding captures training data ordering, it offers\nvaluable insights into model training dynamics. In particular, we uncover\ndistinct phases of data influence, revealing that data points in the early and\nlate stages of training exert a greater impact on the final model. These\ninsights translate into actionable strategies for managing the computational\noverhead of data selection by strategically timing the selection process,\npotentially opening new avenues in data curation research.\n","authors":["Jiachen T. 
Wang","Dawn Song","James Zou","Prateek Mittal","Ruoxi Jia"],"pdf_url":"https://arxiv.org/pdf/2412.09538v1.pdf","comment":"Correspondence to Jiachen T. Wang and Ruoxi Jia"},{"id":"http://arxiv.org/abs/2409.01314v2","updated":"2024-12-12T18:21:03Z","published":"2024-09-02T15:16:07Z","title":"Disentangling Mean Embeddings for Better Diagnostics of Image Generators","summary":" The evaluation of image generators remains a challenge due to the limitations\nof traditional metrics in providing nuanced insights into specific image\nregions. This is a critical problem as not all regions of an image may be\nlearned with similar ease. In this work, we propose a novel approach to\ndisentangle the cosine similarity of mean embeddings into the product of cosine\nsimilarities for individual pixel clusters via central kernel alignment.\nConsequently, we can quantify the contribution of the cluster-wise performance\nto the overall image generation performance. We demonstrate how this enhances\nthe explainability and the likelihood of identifying pixel regions of model\nmisbehavior across various real-world use cases.\n","authors":["Sebastian G. Gruber","Pascal Tobias Ziegler","Florian Buettner"],"pdf_url":"https://arxiv.org/pdf/2409.01314v2.pdf","comment":"Published at Interpretable AI: Past, Present and Future Workshop at\n NeurIPS 2024"},{"id":"http://arxiv.org/abs/2408.16389v3","updated":"2024-12-12T18:19:40Z","published":"2024-08-29T09:50:31Z","title":"Addressing common misinterpretations of KART and UAT in neural network\n literature","summary":" This note addresses the Kolmogorov-Arnold Representation Theorem (KART) and\nthe Universal Approximation Theorem (UAT), focusing on their common\nmisinterpretations in some papers related to neural network approximation. Our\nremarks aim to support a more accurate understanding of KART and UAT among\nneural network specialists.\n","authors":["Vugar Ismailov"],"pdf_url":"https://arxiv.org/pdf/2408.16389v3.pdf","comment":"10 pages; a section, two theorems and several references added"},{"id":"http://arxiv.org/abs/2411.12377v2","updated":"2024-12-12T18:16:23Z","published":"2024-11-19T09:53:28Z","title":"Non-IID data in Federated Learning: A Survey with Taxonomy, Metrics,\n Methods, Frameworks and Future Directions","summary":" Recent advances in machine learning have highlighted Federated Learning (FL)\nas a promising approach that enables multiple distributed users (so-called\nclients) to collectively train ML models without sharing their private data.\nWhile this privacy-preserving method shows potential, it struggles when data\nacross clients is not independent and identically distributed (non-IID) data.\nThe latter remains an unsolved challenge that can result in poorer model\nperformance and slower training times. Despite the significance of non-IID data\nin FL, there is a lack of consensus among researchers about its classification\nand quantification. This technical survey aims to fill that gap by providing a\ndetailed taxonomy for non-IID data, partition protocols, and metrics to\nquantify data heterogeneity. Additionally, we describe popular solutions to\naddress non-IID data and standardized frameworks employed in FL with\nheterogeneous data. Based on our state-of-the-art survey, we present key\nlessons learned and suggest promising future research directions.\n","authors":["Daniel M. 
Jimenez G.","David Solans","Mikko Heikkila","Andrea Vitaletti","Nicolas Kourtellis","Aris Anagnostopoulos","Ioannis Chatzigiannakis"],"pdf_url":"https://arxiv.org/pdf/2411.12377v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09520v1","updated":"2024-12-12T18:06:22Z","published":"2024-12-12T18:06:22Z","title":"GainAdaptor: Learning Quadrupedal Locomotion with Dual Actors for\n Adaptable and Energy-Efficient Walking on Various Terrains","summary":" Deep reinforcement learning (DRL) has emerged as an innovative solution for\ncontrolling legged robots in challenging environments using minimalist\narchitectures. Traditional control methods for legged robots, such as inverse\ndynamics, either directly manage joint torques or use proportional-derivative\n(PD) controllers to regulate joint positions at a higher level. In case of DRL,\ndirect torque control presents significant challenges, leading to a preference\nfor joint position control. However, this approach necessitates careful\nadjustment of joint PD gains, which can limit both adaptability and efficiency.\nIn this paper, we propose GainAdaptor, an adaptive gain control framework that\nautonomously tunes joint PD gains to enhance terrain adaptability and energy\nefficiency. The framework employs a dual-actor algorithm to dynamically adjust\nthe PD gains based on varying ground conditions. By utilizing a divided action\nspace, GainAdaptor efficiently learns stable and energy-efficient locomotion.\nWe validate the effectiveness of the proposed method through experiments\nconducted on a Unitree Go1 robot, demonstrating improved locomotion performance\nacross diverse terrains.\n","authors":["Mincheol Kim","Nahyun Kwon","Jung-Yup Kim"],"pdf_url":"https://arxiv.org/pdf/2412.09520v1.pdf","comment":"8 pages, 6 figures"},{"id":"http://arxiv.org/abs/2406.10391v2","updated":"2024-12-12T18:00:59Z","published":"2024-06-14T19:39:19Z","title":"BEACON: Benchmark for Comprehensive RNA Tasks and Language Models","summary":" RNA plays a pivotal role in translating genetic instructions into functional\noutcomes, underscoring its importance in biological processes and disease\nmechanisms. Despite the emergence of numerous deep learning approaches for RNA,\nparticularly universal RNA language models, there remains a significant lack of\nstandardized benchmarks to assess the effectiveness of these methods. In this\nstudy, we introduce the first comprehensive RNA benchmark BEACON\n(\\textbf{BE}nchm\\textbf{A}rk for \\textbf{CO}mprehensive R\\textbf{N}A Task and\nLanguage Models). First, BEACON comprises 13 distinct tasks derived from\nextensive previous work covering structural analysis, functional studies, and\nengineering applications, enabling a comprehensive assessment of the\nperformance of methods on various RNA understanding tasks. Second, we examine a\nrange of models, including traditional approaches like CNNs, as well as\nadvanced RNA foundation models based on language models, offering valuable\ninsights into the task-specific performances of these models. Third, we\ninvestigate the vital RNA language model components from the tokenizer and\npositional encoding aspects. Notably, our findings emphasize the superiority of\nsingle nucleotide tokenization and the effectiveness of Attention with Linear\nBiases (ALiBi) over traditional positional encoding methods. 
Based on these\ninsights, a simple yet strong baseline called BEACON-B is proposed, which can\nachieve outstanding performance with limited data and computational resources.\nThe datasets and source code of our benchmark are available at\nhttps://github.com/terry-r123/RNABenchmark.\n","authors":["Yuchen Ren","Zhiyuan Chen","Lifeng Qiao","Hongtai Jing","Yuchen Cai","Sheng Xu","Peng Ye","Xinzhu Ma","Siqi Sun","Hongliang Yan","Dong Yuan","Wanli Ouyang","Xihui Liu"],"pdf_url":"https://arxiv.org/pdf/2406.10391v2.pdf","comment":"Accepted by NeurIPS 2024 Dataset and Benchmark Track"},{"id":"http://arxiv.org/abs/2412.09500v1","updated":"2024-12-12T17:48:57Z","published":"2024-12-12T17:48:57Z","title":"Loss function to optimise signal significance in particle physics","summary":" We construct a surrogate loss to directly optimise the significance metric\nused in particle physics. We evaluate our loss function for a simple event\nclassification task using a linear model and show that it produces decision\nboundaries that change according to the cross sections of the processes\ninvolved. We find that the models trained with the new loss have higher signal\nefficiency for similar values of estimated signal significance compared to ones\ntrained with a cross-entropy loss, showing promise to improve sensitivity of\nparticle physics searches at colliders.\n","authors":["Jai Bardhan","Cyrin Neeraj","Subhadip Mitra","Tanumoy Mandal"],"pdf_url":"https://arxiv.org/pdf/2412.09500v1.pdf","comment":"9 pages, 4 figures. Appeared in the Machine Learning for Physical\n Sciences (ML4PS) workshop in NeurIPS 2024 conference"},{"id":"http://arxiv.org/abs/2412.09499v1","updated":"2024-12-12T17:47:19Z","published":"2024-12-12T17:47:19Z","title":"A novel ML-fuzzy control system for optimizing PHEV fuel efficiency and\n extending electric range under diverse driving conditions","summary":" Aiming for a greener transportation future, this study introduces an\ninnovative control system for plug-in hybrid electric vehicles (PHEVs) that\nutilizes machine learning (ML) techniques to forecast energy usage in the pure\nelectric mode of the vehicle and optimize power allocation across different\noperational modes, including pure electric, series hybrid, parallel hybrid, and\ninternal combustion operation. The fuzzy logic decision-making process governs\nthe vehicle control system. The performance was assessed under various driving\nconditions. Key findings include a significant enhancement in pure electric\nmode efficiency, achieving an extended full-electric range of approximately 84\nkilometers on an 80% utilization of a 20-kWh battery pack. During the WLTC\ndriving cycle, the control system reduced fuel consumption to 2.86 L/100km,\nrepresenting a 20% reduction in gasoline-equivalent fuel consumption.\nEvaluations of vehicle performance at discrete driving speeds, highlighted\neffective energy management, with the vehicle battery charging at lower speeds\nand discharging at higher speeds, showing optimized energy recovery and\nconsumption strategies. Initial battery charge levels notably influenced\nvehicle performance. A 90% initial charge enabled prolonged all-electric\noperation, minimizing fuel consumption to 2 L/100km less than that of the base\ncontrol system. Real-world driving pattern analysis revealed significant\nvariations, with shorter, slower cycles requiring lower fuel consumption due to\nprioritized electric propulsion, while longer, faster cycles increased internal\ncombustion engine usage. 
The control system also adapted to different battery\nstate of health (SOH) conditions, with higher SOH facilitating extended\nelectric mode usage, reducing total fuel consumption by up to 2.87 L/100km.\n","authors":["Mehrdad Raeesi","Saba Mansour","Sina Changizian"],"pdf_url":"https://arxiv.org/pdf/2412.09499v1.pdf","comment":"29 pages, 13 figures"},{"id":"http://arxiv.org/abs/2404.10745v2","updated":"2024-12-12T17:42:52Z","published":"2024-04-16T17:23:19Z","title":"Achieving Constant Regret in Linear Markov Decision Processes","summary":" We study the constant regret guarantees in reinforcement learning (RL). Our\nobjective is to design an algorithm that incurs only finite regret over\ninfinite episodes with high probability. We introduce an algorithm,\nCert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs) where\nboth the transition kernel and the reward function can be approximated by some\nlinear function up to misspecification level $\\zeta$. At the core of\nCert-LSVI-UCB is an innovative \\method, which facilitates a fine-grained\nconcentration analysis for multi-phase value-targeted regression, enabling us\nto establish an instance-dependent regret bound that is constant w.r.t. the\nnumber of episodes. Specifically, we demonstrate that for a linear MDP\ncharacterized by a minimal suboptimality gap $\\Delta$, Cert-LSVI-UCB has a\ncumulative regret of $\\tilde{\\mathcal{O}}(d^3H^5/\\Delta)$ with high\nprobability, provided that the misspecification level $\\zeta$ is below\n$\\tilde{\\mathcal{O}}(\\Delta / (\\sqrt{d}H^2))$. Here $d$ is the dimension of the\nfeature space and $H$ is the horizon. Remarkably, this regret bound is\nindependent of the number of episodes $K$. To the best of our knowledge,\nCert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent,\nhigh-probability regret bound in RL with linear function approximation without\nrelying on prior distribution assumptions.\n","authors":["Weitong Zhang","Zhiyuan Fan","Jiafan He","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2404.10745v2.pdf","comment":"45 pages, 3 tables, 2 figures, in 38th Conference on Neural\n Information Processing Systems (NeurIPS 2024)"},{"id":"http://arxiv.org/abs/2211.08043v3","updated":"2024-12-12T17:37:59Z","published":"2022-11-15T10:49:04Z","title":"The rate of convergence of Bregman proximal methods: Local geometry vs.\n regularity vs. sharpness","summary":" We examine the last-iterate convergence rate of Bregman proximal methods -\nfrom mirror descent to mirror-prox and its optimistic variants - as a function\nof the local geometry induced by the prox-mapping defining the method. For\ngenerality, we focus on local solutions of constrained, non-monotone\nvariational inequalities, and we show that the convergence rate of a given\nmethod depends sharply on its associated Legendre exponent, a notion that\nmeasures the growth rate of the underlying Bregman function (Euclidean,\nentropic, or other) near a solution. In particular, we show that boundary\nsolutions exhibit a stark separation of regimes between methods with a zero and\nnon-zero Legendre exponent: the former converge at a linear rate, while the\nlatter converge, in general, sublinearly. 
This dichotomy becomes even more\npronounced in linearly constrained problems where methods with entropic\nregularization achieve a linear convergence rate along sharp directions,\ncompared to convergence in a finite number of steps under Euclidean\nregularization.\n","authors":["Waïss Azizian","Franck Iutzeler","Jérôme Malick","Panayotis Mertikopoulos"],"pdf_url":"https://arxiv.org/pdf/2211.08043v3.pdf","comment":"30 pages, 3 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.09486v1","updated":"2024-12-12T17:35:36Z","published":"2024-12-12T17:35:36Z","title":"Regression and Classification with Single-Qubit Quantum Neural Networks","summary":" Since classical machine learning has become a powerful tool for developing\ndata-driven algorithms, quantum machine learning is expected to similarly\nimpact the development of quantum algorithms. The literature reflects a\nmutually beneficial relationship between machine learning and quantum\ncomputing, where progress in one field frequently drives improvements in the\nother. Motivated by the fertile connection between machine learning and quantum\ncomputing enabled by parameterized quantum circuits, we use a\nresource-efficient and scalable Single-Qubit Quantum Neural Network (SQQNN) for\nboth regression and classification tasks. The SQQNN leverages parameterized\nsingle-qubit unitary operators and quantum measurements to achieve efficient\nlearning. To train the model, we use gradient descent for regression tasks. For\nclassification, we introduce a novel training method inspired by the Taylor\nseries, which can efficiently find a global minimum in a single step. This\napproach significantly accelerates training compared to iterative methods.\nEvaluated across various applications, the SQQNN exhibits virtually error-free\nand strong performance in regression and classification tasks, including the\nMNIST dataset. These results demonstrate the versatility, scalability, and\nsuitability of the SQQNN for deployment on near-term quantum devices.\n","authors":["Leandro C. Souza","Bruno C. Guingo","Gilson Giraldi","Renato Portugal"],"pdf_url":"https://arxiv.org/pdf/2412.09486v1.pdf","comment":"21 pages, 7 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.09483v1","updated":"2024-12-12T17:33:06Z","published":"2024-12-12T17:33:06Z","title":"Early Detection of At-Risk Students Using Machine Learning","summary":" This research presents preliminary work to address the challenge of\nidentifying at-risk students using supervised machine learning and three unique\ndata categories: engagement, demographics, and performance data collected from\nFall 2023 using Canvas and the California State University, Fullerton\ndashboard. We aim to tackle the persistent challenges of higher education\nretention and student dropout rates by screening for at-risk students and\nbuilding a high-risk identification system. By focusing on previously\noverlooked behavioral factors alongside traditional metrics, this work aims to\naddress educational gaps, enhance student outcomes, and significantly boost\nstudent success across disciplines at the University. Pre-processing steps take\nplace to establish a target variable, anonymize student information, manage\nmissing data, and identify the most significant features. 
Given the mixed data\ntypes in the datasets and the binary classification nature of this study, this\nwork considers several machine learning models, including Support Vector\nMachines (SVM), Naive Bayes, K-nearest neighbors (KNN), Decision Trees,\nLogistic Regression, and Random Forest. These models predict at-risk students\nand identify critical periods of the semester when student performance is most\nvulnerable. We will use validation techniques such as train test split and\nk-fold cross-validation to ensure the reliability of the models. Our analysis\nindicates that all algorithms generate an acceptable outcome for at-risk\nstudent predictions, while Naive Bayes performs best overall.\n","authors":["Azucena L. Jimenez Martinez","Kanika Sood","Rakeshkumar Mahto"],"pdf_url":"https://arxiv.org/pdf/2412.09483v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.18465v2","updated":"2024-12-12T17:24:47Z","published":"2023-10-27T20:19:03Z","title":"Nearly Minimax Optimal Submodular Maximization with Bandit Feedback","summary":" We consider maximizing an unknown monotonic, submodular set function $f:\n2^{[n]} \\rightarrow [0,1]$ with cardinality constraint under stochastic bandit\nfeedback. At each time $t=1,\\dots,T$ the learner chooses a set $S_t \\subset\n[n]$ with $|S_t| \\leq k$ and receives reward $f(S_t) + \\eta_t$ where $\\eta_t$\nis mean-zero sub-Gaussian noise. The objective is to minimize the learner's\nregret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| =\nk$, obtained through robust greedy maximization of $f$. To date, the best\nregret bound in the literature scales as $k n^{1/3} T^{2/3}$. And by trivially\ntreating every set as a unique arm one deduces that $\\sqrt{ {n \\choose k} T }$\nis also achievable using standard multi-armed bandit algorithms. In this work,\nwe establish the first minimax lower bound for this setting that scales like\n$\\tilde{\\Omega}(\\min_{L \\le k}(L^{1/3}n^{1/3}T^{2/3} + \\sqrt{{n \\choose k -\nL}T}))$. For a slightly restricted algorithm class, we prove a stronger regret\nlower bound of $\\tilde{\\Omega}(\\min_{L \\le k}(Ln^{1/3}T^{2/3} + \\sqrt{{n\n\\choose k - L}T}))$. Moreover, we propose an algorithm Sub-UCB that achieves\nregret $\\tilde{\\mathcal{O}}(\\min_{L \\le k}(Ln^{1/3}T^{2/3} + \\sqrt{{n \\choose k\n- L}T}))$ capable of matching the lower bound on regret for the restricted\nclass up to logarithmic factors.\n","authors":["Artin Tajdini","Lalit Jain","Kevin Jamieson"],"pdf_url":"https://arxiv.org/pdf/2310.18465v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18070v2","updated":"2024-12-12T17:23:31Z","published":"2024-10-23T17:53:11Z","title":"Training Free Guided Flow Matching with Optimal Control","summary":" Controlled generation with pre-trained Diffusion and Flow Matching models has\nvast applications. One strategy for guiding ODE-based generative models is\nthrough optimizing a target loss $R(x_1)$ while staying close to the prior\ndistribution. Along this line, some recent work showed the effectiveness of\nguiding flow model by differentiating through its ODE sampling process. Despite\nthe superior performance, the theoretical understanding of this line of methods\nis still preliminary, leaving space for algorithm improvement. Moreover,\nexisting methods predominately focus on Euclidean data manifold, and there is a\ncompelling need for guided flow methods on complex geometries such as SO(3),\nwhich prevails in high-stake scientific applications like protein design. 
We\npresent OC-Flow, a general and theoretically grounded training-free framework\nfor guided flow matching using optimal control. Building upon advances in\noptimal control theory, we develop effective and practical algorithms for\nsolving optimal control in guided ODE-based generation and provide a systematic\ntheoretical analysis of the convergence guarantee in both Euclidean and SO(3).\nWe show that existing backprop-through-ODE methods can be interpreted as\nspecial cases of Euclidean OC-Flow. OC-Flow achieved superior performance in\nextensive experiments on text-guided image manipulation, conditional molecule\ngeneration, and all-atom peptide design.\n","authors":["Luran Wang","Chaoran Cheng","Yizhen Liao","Yanru Qu","Ge Liu"],"pdf_url":"https://arxiv.org/pdf/2410.18070v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09477v1","updated":"2024-12-12T17:21:50Z","published":"2024-12-12T17:21:50Z","title":"Bayesian Optimization via Continual Variational Last Layer Training","summary":" Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate\nmodels for Bayesian optimization (BO) due to their ability to model uncertainty\nand their performance on tasks where correlations are easily captured (such as\nthose defined by Euclidean metrics) and their ability to be efficiently updated\nonline. However, the performance of GPs depends on the choice of kernel, and\nkernel selection for complex correlation structures is often difficult or must\nbe made bespoke. While Bayesian neural networks (BNNs) are a promising\ndirection for higher capacity surrogate models, they have so far seen limited\nuse due to poor performance on some problem types. In this paper, we propose an\napproach which shows competitive performance on many problem types, including\nsome that BNNs typically struggle with. We build on variational Bayesian last\nlayers (VBLLs), and connect training of these models to exact conditioning in\nGPs. We exploit this connection to develop an efficient online training\nalgorithm that interleaves conditioning and optimization. Our findings suggest\nthat VBLL networks significantly outperform GPs and other BNN architectures on\ntasks with complex input correlations, and match the performance of well-tuned\nGPs on established benchmark tasks.\n","authors":["Paul Brunzema","Mikkel Jordahn","John Willes","Sebastian Trimpe","Jasper Snoek","James Harrison"],"pdf_url":"https://arxiv.org/pdf/2412.09477v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09472v1","updated":"2024-12-12T17:18:49Z","published":"2024-12-12T17:18:49Z","title":"A Novel Ensemble-Based Deep Learning Model with Explainable AI for\n Accurate Kidney Disease Diagnosis","summary":" Chronic Kidney Disease (CKD) represents a significant global health\nchallenge, characterized by the progressive decline in renal function, leading\nto the accumulation of waste products and disruptions in fluid balance within\nthe body. Given its pervasive impact on public health, there is a pressing need\nfor effective diagnostic tools to enable timely intervention. Our study delves\ninto the application of cutting-edge transfer learning models for the early\ndetection of CKD. Leveraging a comprehensive and publicly available dataset, we\nmeticulously evaluate the performance of several state-of-the-art models,\nincluding EfficientNetV2, InceptionNetV2, MobileNetV2, and the Vision\nTransformer (ViT) technique. 
Remarkably, our analysis demonstrates superior\naccuracy rates, surpassing the 90% threshold with MobileNetV2 and achieving\n91.5% accuracy with ViT. Moreover, to enhance predictive capabilities further,\nwe integrate these individual methodologies through ensemble modeling,\nresulting in our ensemble model exhibiting a remarkable 96% accuracy in the\nearly detection of CKD. This significant advancement holds immense promise for\nimproving clinical outcomes and underscores the critical role of machine\nlearning in addressing complex medical challenges.\n","authors":["Md. Arifuzzaman","Iftekhar Ahmed","Md. Jalal Uddin Chowdhury","Shadman Sakib","Mohammad Shoaib Rahman","Md. Ebrahim Hossain","Shakib Absar"],"pdf_url":"https://arxiv.org/pdf/2412.09472v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09469v1","updated":"2024-12-12T17:16:41Z","published":"2024-12-12T17:16:41Z","title":"Neural Network Symmetrisation in Concrete Settings","summary":" Cornish (2024) recently gave a general theory of neural network\nsymmetrisation in the abstract context of Markov categories. We give a\nhigh-level overview of these results, and their concrete implications for the\nsymmetrisation of deterministic functions and of Markov kernels.\n","authors":["Rob Cornish"],"pdf_url":"https://arxiv.org/pdf/2412.09469v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09468v1","updated":"2024-12-12T17:15:49Z","published":"2024-12-12T17:15:49Z","title":"STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized\n Variational Autoencoders for Financial Trading","summary":" In financial trading, factor models are widely used to price assets and\ncapture excess returns from mispricing. Recently, we have witnessed the rise of\nvariational autoencoder-based latent factor models, which learn latent factors\nself-adaptively. While these models focus on modeling overall market\nconditions, they often fail to effectively capture the temporal patterns of\nindividual stocks. Additionally, representing multiple factors as single values\nsimplifies the model but limits its ability to capture complex relationships\nand dependencies. As a result, the learned factors are of low quality and lack\ndiversity, reducing their effectiveness and robustness across different trading\nperiods. To address these issues, we propose a Spatio-Temporal factOR Model\nbased on dual vector quantized variational autoencoders, named STORM, which\nextracts features of stocks from temporal and spatial perspectives, then fuses\nand aligns these features at the fine-grained and semantic level, and\nrepresents the factors as multi-dimensional embeddings. The discrete codebooks\ncluster similar factor embeddings, ensuring orthogonality and diversity, which\nhelps distinguish between different factors and enables factor selection in\nfinancial trading. To show the performance of the proposed factor model, we\napply it to two downstream experiments: portfolio management on two stock\ndatasets and individual trading tasks on six specific stocks. 
The extensive\nexperiments demonstrate STORM's flexibility in adapting to downstream tasks and\nsuperior performance over baseline models.\n","authors":["Yilei Zhao","Wentao Zhang","Tingran Yang","Yong Jiang","Fei Huang","Wei Yang Bryan Lim"],"pdf_url":"https://arxiv.org/pdf/2412.09468v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.09541v3","updated":"2024-12-12T17:12:06Z","published":"2024-09-14T21:42:17Z","title":"Autonomous Goal Detection and Cessation in Reinforcement Learning: A\n Case Study on Source Term Estimation","summary":" Reinforcement Learning has revolutionized decision-making processes in\ndynamic environments, yet it often struggles with autonomously detecting and\nachieving goals without clear feedback signals. For example, in a Source Term\nEstimation problem, the lack of precise environmental information makes it\nchallenging to provide clear feedback signals and to define and evaluate how\nthe source's location is determined. To address this challenge, the Autonomous\nGoal Detection and Cessation (AGDC) module was developed, enhancing various RL\nalgorithms by incorporating a self-feedback mechanism for autonomous goal\ndetection and cessation upon task completion. Our method effectively identifies\nand ceases undefined goals by approximating the agent's belief, significantly\nenhancing the capabilities of RL algorithms in environments with limited\nfeedback. To validate effectiveness of our approach, we integrated AGDC with\ndeep Q-Network, proximal policy optimization, and deep deterministic policy\ngradient algorithms, and evaluated its performance on the Source Term\nEstimation problem. The experimental results showed that AGDC-enhanced RL\nalgorithms significantly outperformed traditional statistical methods such as\ninfotaxis, entrotaxis, and dual control for exploitation and exploration, as\nwell as a non-statistical random action selection method. These improvements\nwere evident in terms of success rate, mean traveled distance, and search time,\nhighlighting AGDC's effectiveness and efficiency in complex, real-world\nscenarios.\n","authors":["Yiwei Shi","Muning Wen","Qi Zhang","Weinan Zhang","Cunjia Liu","Weiru Liu"],"pdf_url":"https://arxiv.org/pdf/2409.09541v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09453v1","updated":"2024-12-12T17:06:21Z","published":"2024-12-12T17:06:21Z","title":"Finite-PINN: A Physics-Informed Neural Network Architecture for Solving\n Solid Mechanics Problems with General Geometries","summary":" PINN models have demonstrated impressive capabilities in addressing fluid PDE\nproblems, and their potential in solid mechanics is beginning to emerge. This\nstudy identifies two key challenges when using PINN to solve general solid\nmechanics problems. These challenges become evident when comparing the\nlimitations of PINN with the well-established numerical methods commonly used\nin solid mechanics, such as the finite element method (FEM). Specifically: a)\nPINN models generate solutions over an infinite domain, which conflicts with\nthe finite boundaries typical of most solid structures; and b) the solution\nspace utilised by PINN is Euclidean, which is inadequate for addressing the\ncomplex geometries often present in solid structures.\n This work proposes a PINN architecture used for general solid mechanics\nproblems, termed the Finite-PINN model. The proposed model aims to effectively\naddress these two challenges while preserving as much of the original\nimplementation of PINN as possible. 
The unique architecture of the Finite-PINN\nmodel addresses these challenges by separating the approximation of stress and\ndisplacement fields, and by transforming the solution space from the\ntraditional Euclidean space to a Euclidean-topological joint space. Several\ncase studies presented in this paper demonstrate that the Finite-PINN model\nprovides satisfactory results for a variety of problem types, including both\nforward and inverse problems, in both 2D and 3D contexts. The developed\nFinite-PINN model offers a promising tool for addressing general solid\nmechanics problems, particularly those not yet well-explored in current\nresearch.\n","authors":["Haolin Li","Yuyang Miao","Zahra Sharif Khodaei","M. H. Aliabadi"],"pdf_url":"https://arxiv.org/pdf/2412.09453v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09444v1","updated":"2024-12-12T16:57:46Z","published":"2024-12-12T16:57:46Z","title":"Search Strategy Generation for Branch and Bound Using Genetic\n Programming","summary":" Branch-and-Bound (B\\&B) is an exact method in integer programming that\nrecursively divides the search space into a tree. During the resolution\nprocess, determining the next subproblem to explore within the tree-known as\nthe search strategy-is crucial. Hand-crafted heuristics are commonly used, but\nnone are effective over all problem classes. Recent approaches utilizing neural\nnetworks claim to make more intelligent decisions but are computationally\nexpensive. In this paper, we introduce GP2S (Genetic Programming for Search\nStrategy), a novel machine learning approach that automatically generates a\nB\\&B search strategy heuristic, aiming to make intelligent decisions while\nbeing computationally lightweight. We define a policy as a function that\nevaluates the quality of a B\\&B node by combining features from the node and\nthe problem; the search strategy policy is then defined by a best-first search\nbased on this node ranking. The policy space is explored using a genetic\nprogramming algorithm, and the policy that achieves the best performance on a\ntraining set is selected. We compare our approach with the standard method of\nthe SCIP solver, a recent graph neural network-based method, and handcrafted\nheuristics. Our first evaluation includes three types of primal hard problems,\ntested on instances similar to the training set and on larger instances. Our\nmethod is at most 2\\% slower than the best baseline and consistently\noutperforms SCIP, achieving an average speedup of 11.3\\%. Additionally, GP2S is\ntested on the MIPLIB 2017 dataset, generating multiple heuristics from\ndifferent subsets of instances. It exceeds SCIP's average performance in 7 out\nof 10 cases across 15 times more instances and under a time limit 15 times\nlonger, with some GP2S methods leading on most experiments in terms of the\nnumber of feasible solutions or optimality gap.\n","authors":["Gwen Maudet","Grégoire Danoy"],"pdf_url":"https://arxiv.org/pdf/2412.09444v1.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2412.09441v1","updated":"2024-12-12T16:57:20Z","published":"2024-12-12T16:57:20Z","title":"MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental\n Learning","summary":" Class-Incremental Learning (CIL) requires models to continually acquire\nknowledge of new classes without forgetting old ones. Despite Pre-trained\nModels (PTMs) have shown excellent performance in CIL, catastrophic forgetting\nstill occurs as the model learns new concepts. 
Existing work seeks to utilize\nlightweight components to adjust the PTM, while the forgetting phenomenon still\ncomes from {\\em parameter and retrieval} levels. Specifically, iterative\nupdates of the model result in parameter drift, while mistakenly retrieving\nirrelevant modules leads to the mismatch during inference. To this end, we\npropose MOdel Surgery (MOS) to rescue the model from forgetting previous\nknowledge. By training task-specific adapters, we continually adjust the PTM to\ndownstream tasks. To mitigate parameter-level forgetting, we present an adapter\nmerging approach to learn task-specific adapters, which aims to bridge the gap\nbetween different components while reserve task-specific information. Besides,\nto address retrieval-level forgetting, we introduce a training-free\nself-refined adapter retrieval mechanism during inference, which leverages the\nmodel's inherent ability for better adapter retrieval. By jointly rectifying\nthe model with those steps, MOS can robustly resist catastrophic forgetting in\nthe learning process. Extensive experiments on seven benchmark datasets\nvalidate MOS's state-of-the-art performance. Code is available at:\nhttps://github.com/sun-hailong/AAAI25-MOS\n","authors":["Hai-Long Sun","Da-Wei Zhou","Hanbin Zhao","Le Gan","De-Chuan Zhan","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2412.09441v1.pdf","comment":"Accepted to AAAI 2025. Code is available at:\n https://github.com/sun-hailong/AAAI25-MOS"},{"id":"http://arxiv.org/abs/2411.05231v2","updated":"2024-12-12T16:40:18Z","published":"2024-11-07T22:51:47Z","title":"Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams","summary":" Recent advances in generative artificial intelligence (AI) have shown promise\nin accurately grading open-ended student responses. However, few prior works\nhave explored grading handwritten responses due to a lack of data and the\nchallenge of combining visual and textual information. In this work, we\nleverage state-of-the-art multi-modal AI models, in particular GPT-4o, to\nautomatically grade handwritten responses to college-level math exams. Using\nreal student responses to questions in a probability theory exam, we evaluate\nGPT-4o's alignment with ground-truth scores from human graders using various\nprompting techniques. We find that while providing rubrics improves alignment,\nthe model's overall accuracy is still too low for real-world settings, showing\nthere is significant room for growth in this task.\n","authors":["Adriana Caraeni","Alexander Scarlatos","Andrew Lan"],"pdf_url":"https://arxiv.org/pdf/2411.05231v2.pdf","comment":"Published in LAK 2025: The 15th International Learning Analytics and\n Knowledge Conference"},{"id":"http://arxiv.org/abs/2412.09423v1","updated":"2024-12-12T16:30:23Z","published":"2024-12-12T16:30:23Z","title":"Data Efficient Prediction of excited-state properties using Quantum\n Neural Networks","summary":" Understanding the properties of excited states of complex molecules is\ncrucial for many chemical and physical processes. Calculating these properties\nis often significantly more resource-intensive than calculating their ground\nstate counterparts. We present a quantum machine learning model that predicts\nexcited-state properties from the molecular ground state for different\ngeometric configurations. The model comprises a symmetry-invariant quantum\nneural network and a conventional neural network and is able to provide\naccurate predictions with only a few training data points. 
The proposed\nprocedure is fully NISQ compatible. This is achieved by using a quantum circuit\nthat requires a number of parameters linearly proportional to the number of\nmolecular orbitals, along with a parameterized measurement observable, thereby\nreducing the number of necessary measurements. We benchmark the algorithm on\nthree different molecules by evaluating its performance in predicting excited\nstate transition energies and transition dipole moments. We show that, in many\ninstances, the procedure is able to outperform various classical models that\nrely solely on classical features.\n","authors":["Manuel Hagelüken","Marco F. Huber","Marco Roth"],"pdf_url":"https://arxiv.org/pdf/2412.09423v1.pdf","comment":"10 + 4 pages, 7 + 3 figures"},{"id":"http://arxiv.org/abs/2412.09420v1","updated":"2024-12-12T16:26:38Z","published":"2024-12-12T16:26:38Z","title":"Mixture of neural fields for heterogeneous reconstruction in cryo-EM","summary":" Cryo-electron microscopy (cryo-EM) is an experimental technique for protein\nstructure determination that images an ensemble of macromolecules in\nnear-physiological contexts. While recent advances enable the reconstruction of\ndynamic conformations of a single biomolecular complex, current methods do not\nadequately model samples with mixed conformational and compositional\nheterogeneity. In particular, datasets containing mixtures of multiple proteins\nrequire the joint inference of structure, pose, compositional class, and\nconformational states for 3D reconstruction. Here, we present Hydra, an\napproach that models both conformational and compositional heterogeneity fully\nab initio by parameterizing structures as arising from one of K neural fields.\nWe employ a new likelihood-based loss function and demonstrate the\neffectiveness of our approach on synthetic datasets composed of mixtures of\nproteins with large degrees of conformational variability. We additionally\ndemonstrate Hydra on an experimental dataset of a cellular lysate containing a\nmixture of different protein complexes. Hydra expands the expressivity of\nheterogeneous reconstruction methods and thus broadens the scope of cryo-EM to\nincreasingly complex samples.\n","authors":["Axel Levy","Rishwanth Raghu","David Shustin","Adele Rui-Yang Peng","Huan Li","Oliver Biggs Clarke","Gordon Wetzstein","Ellen D. Zhong"],"pdf_url":"https://arxiv.org/pdf/2412.09420v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09417v1","updated":"2024-12-12T16:25:10Z","published":"2024-12-12T16:25:10Z","title":"Reinforcement Learning Within the Classical Robotics Stack: A Case Study\n in Robot Soccer","summary":" Robot decision-making in partially observable, real-time, dynamic, and\nmulti-agent environments remains a difficult and unsolved challenge. Model-free\nreinforcement learning (RL) is a promising approach to learning decision-making\nin such domains, however, end-to-end RL in complex environments is often\nintractable. To address this challenge in the RoboCup Standard Platform League\n(SPL) domain, we developed a novel architecture integrating RL within a\nclassical robotics stack, while employing a multi-fidelity sim2real approach\nand decomposing behavior into learned sub-behaviors with heuristic selection.\nOur architecture led to victory in the 2024 RoboCup SPL Challenge Shield\nDivision. In this work, we fully describe our system's architecture and\nempirically analyze key design decisions that contributed to its success. 
Our\napproach demonstrates how RL-based behaviors can be integrated into complete\nrobot behavior architectures.\n","authors":["Adam Labiosa","Zhihan Wang","Siddhant Agarwal","William Cong","Geethika Hemkumar","Abhinav Narayan Harish","Benjamin Hong","Josh Kelle","Chen Li","Yuhao Li","Zisen Shao","Peter Stone","Josiah P. Hanna"],"pdf_url":"https://arxiv.org/pdf/2412.09417v1.pdf","comment":"Submitted to ICRA 2025"},{"id":"http://arxiv.org/abs/2412.00104v2","updated":"2024-12-12T16:10:51Z","published":"2024-11-27T22:12:29Z","title":"Differential learning kinetics govern the transition from memorization\n to generalization during in-context learning","summary":" Transformers exhibit in-context learning (ICL): the ability to use novel\ninformation presented in the context without additional weight updates. Recent\nwork shows that ICL emerges when models are trained on a sufficiently diverse\nset of tasks and the transition from memorization to generalization is sharp\nwith increasing task diversity. One interpretation is that a network's limited\ncapacity to memorize favors generalization. Here, we examine the mechanistic\nunderpinnings of this transition using a small transformer applied to a\nsynthetic ICL task. Using theory and experiment, we show that the sub-circuits\nthat memorize and generalize can be viewed as largely independent. The relative\nrates at which these sub-circuits learn explains the transition from\nmemorization to generalization, rather than capacity constraints. We uncover a\nmemorization scaling law, which determines the task diversity threshold at\nwhich the network generalizes. The theory quantitatively explains a variety of\nother ICL-related phenomena, including the long-tailed distribution of when ICL\nis acquired, the bimodal behavior of solutions close to the task diversity\nthreshold, the influence of contextual and data distributional statistics on\nICL, and the transient nature of ICL.\n","authors":["Alex Nguyen","Gautam Reddy"],"pdf_url":"https://arxiv.org/pdf/2412.00104v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09405v1","updated":"2024-12-12T16:09:57Z","published":"2024-12-12T16:09:57Z","title":"Learned Compression for Compressed Learning","summary":" Modern sensors produce increasingly rich streams of high-resolution data. Due\nto resource constraints, machine learning systems discard the vast majority of\nthis information via resolution reduction. Compressed-domain learning allows\nmodels to operate on compact latent representations, allowing higher effective\nresolution for the same budget. However, existing compression systems are not\nideal for compressed learning. Linear transform coding and end-to-end learned\ncompression systems reduce bitrate, but do not uniformly reduce dimensionality;\nthus, they do not meaningfully increase efficiency. Generative autoencoders\nreduce dimensionality, but their adversarial or perceptual objectives lead to\nsignificant information loss. To address these limitations, we introduce WaLLoC\n(Wavelet Learned Lossy Compression), a neural codec architecture that combines\nlinear transform coding with nonlinear dimensionality-reducing autoencoders.\nWaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck\nbetween an invertible wavelet packet transform. Across several key metrics,\nWaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion\nmodels. 
WaLLoC does not require perceptual or adversarial losses to represent\nhigh-frequency detail, providing compatibility with modalities beyond RGB\nimages and stereo audio. WaLLoC's encoder consists almost entirely of linear\noperations, making it exceptionally efficient and suitable for mobile\ncomputing, remote sensing, and learning directly from compressed data. We\ndemonstrate WaLLoC's capability for compressed-domain learning across several\ntasks, including image classification, colorization, document understanding,\nand music source separation. Our code, experiments, and pre-trained audio and\nimage codecs are available at https://ut-sysml.org/walloc\n","authors":["Dan Jacobellis","Neeraja J. Yadwadkar"],"pdf_url":"https://arxiv.org/pdf/2412.09405v1.pdf","comment":"Accepted as paper to 2025 IEEE Data Compression Conference"},{"id":"http://arxiv.org/abs/2412.09404v1","updated":"2024-12-12T16:09:50Z","published":"2024-12-12T16:09:50Z","title":"Opinion de-polarization of social networks with GNNs","summary":" Nowadays, social media is the ground for political debate and exchange of\nopinions. There is a significant amount of research that suggests that social\nmedia are highly polarized. A phenomenon that is commonly observed is the echo\nchamber structure, where users are organized in polarized communities and form\nconnections only with similar-minded individuals, limiting themselves to\nconsume specific content. In this paper we explore a way to decrease the\npolarization of networks with two echo chambers. Particularly, we observe that\nif some users adopt a moderate opinion about a topic, the polarization of the\nnetwork decreases. Based on this observation, we propose an efficient algorithm\nto identify a good set of K users, such that if they adopt a moderate stance\naround a topic, the polarization is minimized. Our algorithm employs a Graph\nNeural Network and thus it can handle large graphs more effectively than other\napproaches\n","authors":["Konstantinos Mylonas","Thrasyvoulos Spyropoulos"],"pdf_url":"https://arxiv.org/pdf/2412.09404v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09399v1","updated":"2024-12-12T16:05:39Z","published":"2024-12-12T16:05:39Z","title":"A Geometry-Aware Message Passing Neural Network for Modeling\n Aerodynamics over Airfoils","summary":" Computational modeling of aerodynamics is a key problem in aerospace\nengineering, often involving flows interacting with solid objects such as\nairfoils. Deep surrogate models have emerged as purely data-driven approaches\nthat learn direct mappings from simulation conditions to solutions based on\neither simulation or experimental data. Here, we consider modeling of\nincompressible flows over solid objects, wherein geometric structures are a key\nfactor in determining aerodynamics. To effectively incorporate geometries, we\npropose a message passing scheme that efficiently and expressively integrates\nthe airfoil shape with the mesh representation. Under this framework, we first\nobtain a representation of the geometry in the form of a latent graph on the\nairfoil surface. We subsequently propagate this representation to all\ncollocation points through message passing on a directed, bipartite graph. We\ndemonstrate that this framework supports efficient training by downsampling the\nsolution mesh while avoiding distribution shifts at test time when evaluated on\nthe full mesh. 
To enable our model to be able to distinguish between distinct\nspatial regimes of dynamics relative to the airfoil, we represent mesh points\nin both a leading edge and trailing edge coordinate system. We further enhance\nthe expressiveness of our coordinate system representations by embedding our\nhybrid Polar-Cartesian coordinates using sinusoidal and spherical harmonics\nbases. We additionally find that a change of basis to canonicalize input\nrepresentations with respect to inlet velocity substantially improves\ngeneralization. Altogether, these design choices lead to a purely data-driven\nmachine learning framework known as GeoMPNN, which won the Best Student\nSubmission award at the NeurIPS 2024 ML4CFD Competition, placing 4th overall.\nOur code is publicly available as part of the AIRS library\n(https://github.com/divelab/AIRS).\n","authors":["Jacob Helwig","Xuan Zhang","Haiyang Yu","Shuiwang Ji"],"pdf_url":"https://arxiv.org/pdf/2412.09399v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09386v1","updated":"2024-12-12T15:53:14Z","published":"2024-12-12T15:53:14Z","title":"Multi-Stage Segmentation and Cascade Classification Methods for\n Improving Cardiac MRI Analysis","summary":" The segmentation and classification of cardiac magnetic resonance imaging are\ncritical for diagnosing heart conditions, yet current approaches face\nchallenges in accuracy and generalizability. In this study, we aim to further\nadvance the segmentation and classification of cardiac magnetic resonance\nimages by introducing a novel deep learning-based approach. Using a multi-stage\nprocess with U-Net and ResNet models for segmentation, followed by Gaussian\nsmoothing, the method improved segmentation accuracy, achieving a Dice\ncoefficient of 0.974 for the left ventricle and 0.947 for the right ventricle.\nFor classification, a cascade of deep learning classifiers was employed to\ndistinguish heart conditions, including hypertrophic cardiomyopathy, myocardial\ninfarction, and dilated cardiomyopathy, achieving an average accuracy of 97.2%.\nThe proposed approach outperformed existing models, enhancing segmentation\naccuracy and classification precision. These advancements show promise for\nclinical applications, though further validation and interpretation across\ndiverse imaging protocols is necessary.\n","authors":["Vitalii Slobodzian","Pavlo Radiuk","Oleksander Barmak","Iurii Krak"],"pdf_url":"https://arxiv.org/pdf/2412.09386v1.pdf","comment":"Cardiac MRI, heart pathology, deep learning, segmentation, Gaussian\n smoothing, classification, cascade"},{"id":"http://arxiv.org/abs/2410.22296v3","updated":"2024-12-12T15:48:47Z","published":"2024-10-29T17:45:57Z","title":"LLMs are Highly-Constrained Biophysical Sequence Optimizers","summary":" Large language models (LLMs) have recently shown significant potential in\nvarious biological tasks such as protein engineering and molecule design. These\ntasks typically involve black-box discrete sequence optimization, where the\nchallenge lies in generating sequences that are not only biologically feasible\nbut also adhere to hard fine-grained constraints. However, LLMs often struggle\nwith such constraints, especially in biological contexts where verifying\ncandidate solutions is costly and time-consuming. In this study, we explore the\npossibility of employing LLMs as highly-constrained bilevel optimizers through\na methodology we refer to as Language Model Optimization with Margin\nExpectation (LLOME). 
This approach combines both offline and online\noptimization, utilizing limited oracle evaluations to iteratively enhance the\nsequences generated by the LLM. We additionally propose a novel training\nobjective -- Margin-Aligned Expectation (MargE) -- that trains the LLM to\nsmoothly interpolate between the reward and reference distributions. Lastly, we\nintroduce a synthetic test suite that bears strong geometric similarity to real\nbiophysical problems and enables rapid evaluation of LLM optimizers without\ntime-consuming lab validation. Our findings reveal that, in comparison to\ngenetic algorithm baselines, LLMs achieve significantly lower regret solutions\nwhile requiring fewer test function evaluations. However, we also observe that\nLLMs exhibit moderate miscalibration, are susceptible to generator collapse,\nand have difficulty finding the optimal solution when no explicit ground truth\nrewards are available.\n","authors":["Angelica Chen","Samuel D. Stanton","Robert G. Alberstein","Andrew M. Watkins","Richard Bonneau","Vladimir Gligorijević","Kyunghyun Cho","Nathan C. Frey"],"pdf_url":"https://arxiv.org/pdf/2410.22296v3.pdf","comment":"Supercedes arXiv:2407.00236v1"},{"id":"http://arxiv.org/abs/2412.09380v1","updated":"2024-12-12T15:47:59Z","published":"2024-12-12T15:47:59Z","title":"Diffusion Model with Representation Alignment for Protein Inverse\n Folding","summary":" Protein inverse folding is a fundamental problem in bioinformatics, aiming to\nrecover the amino acid sequences from a given protein backbone structure.\nDespite the success of existing methods, they struggle to fully capture the\nintricate inter-residue relationships critical for accurate sequence\nprediction. We propose a novel method that leverages diffusion models with\nrepresentation alignment (DMRA), which enhances diffusion-based inverse folding\nby (1) proposing a shared center that aggregates contextual information from\nthe entire protein structure and selectively distributes it to each residue;\nand (2) aligning noisy hidden representations with clean semantic\nrepresentations during the denoising process. This is achieved by predefined\nsemantic representations for amino acid types and a representation alignment\nmethod that utilizes type embeddings as semantic feedback to normalize each\nresidue. In experiments, we conduct extensive evaluations on the CATH4.2\ndataset to demonstrate that DMRA outperforms leading methods, achieving\nstate-of-the-art performance and exhibiting strong generalization capabilities\non the TS50 and TS500 datasets.\n","authors":["Chenglin Wang","Yucheng Zhou","Zijie Zhai","Jianbing Shen","Kai Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09380v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09379v1","updated":"2024-12-12T15:47:17Z","published":"2024-12-12T15:47:17Z","title":"Hybrid variable spiking graph neural networks for energy-efficient\n scientific machine learning","summary":" Graph-based representations for samples of computational mechanics-related\ndatasets can prove instrumental when dealing with problems like irregular\ndomains or molecular structures of materials, etc. To effectively analyze and\nprocess such datasets, deep learning offers Graph Neural Networks (GNNs) that\nutilize techniques like message-passing within their architecture. 
The issue,\nhowever, is that as the individual graph scales and/ or GNN architecture\nbecomes increasingly complex, the increased energy budget of the overall deep\nlearning model makes it unsustainable and restricts its applications in\napplications like edge computing. To overcome this, we propose in this paper\nHybrid Variable Spiking Graph Neural Networks (HVS-GNNs) that utilize Variable\nSpiking Neurons (VSNs) within their architecture to promote sparse\ncommunication and hence reduce the overall energy budget. VSNs, while promoting\nsparse event-driven computations, also perform well for regression tasks, which\nare often encountered in computational mechanics applications and are the main\ntarget of this paper. Three examples dealing with prediction of mechanical\nproperties of material based on microscale/ mesoscale structures are shown to\ntest the performance of the proposed HVS-GNNs in regression tasks. We have also\ncompared the performance of HVS-GNN architectures with the performance of\nvanilla GNNs and GNNs utilizing leaky integrate and fire neurons. The results\nproduced show that HVS-GNNs perform well for regression tasks, all while\npromoting sparse communication and, hence, energy efficiency.\n","authors":["Isha Jain","Shailesh Garg","Shaurya Shriyam","Souvik Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2412.09379v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09376v1","updated":"2024-12-12T15:45:21Z","published":"2024-12-12T15:45:21Z","title":"A comprehensive interpretable machine learning framework for Mild\n Cognitive Impairment and Alzheimer's disease diagnosis","summary":" An interpretable machine learning (ML) framework is introduced to enhance the\ndiagnosis of Mild Cognitive Impairment (MCI) and Alzheimer's disease (AD) by\nensuring robustness of the ML models' interpretations. The dataset used\ncomprises volumetric measurements from brain MRI and genetic data from healthy\nindividuals and patients with MCI/AD, obtained through the Alzheimer's Disease\nNeuroimaging Initiative. The existing class imbalance is addressed by an\nensemble learning approach, while various attribution-based and\ncounterfactual-based interpretability methods are leveraged towards producing\ndiverse explanations related to the pathophysiology of MCI/AD. A unification\nmethod combining SHAP with counterfactual explanations assesses the\ninterpretability techniques' robustness. The best performing model yielded\n87.5% balanced accuracy and 90.8% F1-score. The attribution-based\ninterpretability methods highlighted significant volumetric and genetic\nfeatures related to MCI/AD risk. The unification method provided useful\ninsights regarding those features' necessity and sufficiency, further\nshowcasing their significance in MCI/AD diagnosis.\n","authors":["Maria Eleftheria Vlontzou","Maria Athanasiou","Kalliopi Dalakleidi","Ioanna Skampardoni","Christos Davatzikos","Konstantina Nikita"],"pdf_url":"https://arxiv.org/pdf/2412.09376v1.pdf","comment":"This preprint has not been peer-reviewed yet but has been submitted\n to a journal"},{"id":"http://arxiv.org/abs/2410.03955v3","updated":"2024-12-12T15:43:14Z","published":"2024-10-04T22:34:58Z","title":"Model Developmental Safety: A Retention-Centric Method and Applications\n in Vision-Language Models","summary":" In the real world, a learning-enabled system usually undergoes multiple\ncycles of model development to enhance the system's ability to handle difficult\nor emerging tasks. 
This continual model development process raises a\nsignificant issue that the model development for acquiring new or improving\nexisting capabilities may inadvertently lose capabilities of the old model,\nalso known as catastrophic forgetting. Existing continual learning studies\nfocus on mitigating catastrophic forgetting by trading off performance on\nprevious tasks and new tasks to ensure good average performance. However, they\nare inadequate for many applications especially in safety-critical domains, as\nfailure to strictly preserve the good performance of the old model not only\nintroduces safety risks and uncertainties but also imposes substantial expenses\nin the re-improving and re-validation of existing properties. To address this\nissue, we introduce model developmental safety as a guarantee of a learning\nsystem such that in the model development process the new model should strictly\npreserve the existing protected capabilities of the old model while improving\nits performance on target tasks. To ensure the model developmental safety, we\npresent a retention-centric framework by formulating the model developmental\nsafety as data-dependent constraints. Under this framework, we study how to\ndevelop a pretrained vision-language model, specifically the CLIP model, for\nacquiring new capabilities or improving existing capabilities of image\nclassification. We propose an efficient constrained optimization algorithm with\ntheoretical guarantee and use its insights to finetune a CLIP model with\ntask-dependent heads for promoting the model developmental safety. Our\nexperiments on improving vision perception capabilities on autonomous driving\nand scene recognition datasets demonstrate the efficacy of the proposed\napproach.\n","authors":["Gang Li","Wendi Yu","Yao Yao","Wei Tong","Yingbin Liang","Qihang Lin","Tianbao Yang"],"pdf_url":"https://arxiv.org/pdf/2410.03955v3.pdf","comment":"43 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.09369v1","updated":"2024-12-12T15:37:02Z","published":"2024-12-12T15:37:02Z","title":"Distribution free uncertainty quantification in neuroscience-inspired\n deep operators","summary":" Energy-efficient deep learning algorithms are essential for a sustainable\nfuture and feasible edge computing setups. Spiking neural networks (SNNs),\ninspired from neuroscience, are a positive step in the direction of achieving\nthe required energy efficiency. However, in a bid to lower the energy\nrequirements, accuracy is marginally sacrificed. Hence, predictions of such\ndeep learning algorithms require an uncertainty measure that can inform users\nregarding the bounds of a certain output. In this paper, we introduce the\nConformalized Randomized Prior Operator (CRP-O) framework that leverages\nRandomized Prior (RP) networks and Split Conformal Prediction (SCP) to quantify\nuncertainty in both conventional and spiking neural operators. To further\nenable zero-shot super-resolution in UQ, we propose an extension incorporating\nGaussian Process Regression. This enhanced super-resolution-enabled CRP-O\nframework is integrated with the recently developed Variable Spiking Wavelet\nNeural Operator (VSWNO). To test the performance of the obtained calibrated\nuncertainty bounds, we discuss four different examples covering both\none-dimensional and two-dimensional partial differential equations. 
Results\ndemonstrate that the uncertainty bounds produced by the conformalized RP-VSWNO\nsignificantly enhance UQ estimates compared to vanilla RP-VSWNO, Quantile WNO\n(Q-WNO), and Conformalized Quantile WNO (CQ-WNO). These findings underscore the\npotential of the proposed approach for practical applications.\n","authors":["Shailesh Garg","Souvik Chakraborty"],"pdf_url":"https://arxiv.org/pdf/2412.09369v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2309.15257v3","updated":"2024-12-12T15:29:41Z","published":"2023-09-26T20:31:19Z","title":"STARC: A General Framework For Quantifying Differences Between Reward\n Functions","summary":" In order to solve a task using reinforcement learning, it is necessary to\nfirst formalise the goal of that task as a reward function. However, for many\nreal-world tasks, it is very difficult to manually specify a reward function\nthat never incentivises undesirable behaviour. As a result, it is increasingly\npopular to use reward learning algorithms, which attempt to learn a reward\nfunction from data. However, the theoretical foundations of reward learning are\nnot yet well-developed. In particular, it is typically not known when a given\nreward learning algorithm with high probability will learn a reward function\nthat is safe to optimise. This means that reward learning algorithms generally\nmust be evaluated empirically, which is expensive, and that their failure modes\nare difficult to anticipate in advance. One of the roadblocks to deriving\nbetter theoretical guarantees is the lack of good methods for quantifying the\ndifference between reward functions. In this paper we provide a solution to\nthis problem, in the form of a class of pseudometrics on the space of all\nreward functions that we call STARC (STAndardised Reward Comparison) metrics.\nWe show that STARC metrics induce both an upper and a lower bound on worst-case\nregret, which implies that our metrics are tight, and that any metric with the\nsame properties must be bilipschitz equivalent to ours. Moreover, we also\nidentify a number of issues with reward metrics proposed by earlier works.\nFinally, we evaluate our metrics empirically, to demonstrate their practical\nefficacy. STARC metrics can be used to make both theoretical and empirical\nanalysis of reward learning algorithms both easier and more principled.\n","authors":["Joar Skalse","Lucy Farnik","Sumeet Ramesh Motwani","Erik Jenner","Adam Gleave","Alessandro Abate"],"pdf_url":"https://arxiv.org/pdf/2309.15257v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09346v1","updated":"2024-12-12T15:13:34Z","published":"2024-12-12T15:13:34Z","title":"Quantitative Evaluation of Motif Sets in Time Series","summary":" Time Series Motif Discovery (TSMD), which aims at finding recurring patterns\nin time series, is an important task in numerous application domains, and many\nmethods for this task exist. These methods are usually evaluated qualitatively.\nA few metrics for quantitative evaluation, where discovered motifs are compared\nto some ground truth, have been proposed, but they typically make implicit\nassumptions that limit their applicability. 
This paper introduces PROM, a\nbroadly applicable metric that overcomes those limitations, and TSMD-Bench, a\nbenchmark for quantitative evaluation of time series motif discovery.\nExperiments with PROM and TSMD-Bench show that PROM provides a more\ncomprehensive evaluation than existing metrics, that TSMD-Bench is a more\nchallenging benchmark than earlier ones, and that the combination can help\nunderstand the relative performance of TSMD methods. More generally, the\nproposed approach enables large-scale, systematic performance comparisons in\nthis field.\n","authors":["Daan Van Wesenbeeck","Aras Yurtman","Wannes Meert","Hendrik Blockeel"],"pdf_url":"https://arxiv.org/pdf/2412.09346v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09342v1","updated":"2024-12-12T15:10:22Z","published":"2024-12-12T15:10:22Z","title":"Diffusion Predictive Control with Constraints","summary":" Diffusion models have recently gained popularity for policy learning in\nrobotics due to their ability to capture high-dimensional and multimodal\ndistributions. However, diffusion policies are inherently stochastic and\ntypically trained offline, limiting their ability to handle unseen and dynamic\nconditions where novel constraints not represented in the training data must be\nsatisfied. To overcome this limitation, we propose diffusion predictive control\nwith constraints (DPCC), an algorithm for diffusion-based control with explicit\nstate and action constraints that can deviate from those in the training data.\nDPCC uses constraint tightening and incorporates model-based projections into\nthe denoising process of a trained trajectory diffusion model. This allows us\nto generate constraint-satisfying, dynamically feasible, and goal-reaching\ntrajectories for predictive control. We show through simulations of a robot\nmanipulator that DPCC outperforms existing methods in satisfying novel\ntest-time constraints while maintaining performance on the learned control\ntask.\n","authors":["Ralf Römer","Alexander von Rohr","Angela P. Schoellig"],"pdf_url":"https://arxiv.org/pdf/2412.09342v1.pdf","comment":"Code: https://github.com/ralfroemer99/dpcc. 14 pages, 3 figures, 3\n tables"},{"id":"http://arxiv.org/abs/2412.09328v1","updated":"2024-12-12T14:51:48Z","published":"2024-12-12T14:51:48Z","title":"Auto-Regressive Moving Diffusion Models for Time Series Forecasting","summary":" Time series forecasting (TSF) is essential in various domains, and recent\nadvancements in diffusion-based TSF models have shown considerable promise.\nHowever, these models typically adopt traditional diffusion patterns, treating\nTSF as a noise-based conditional generation task. This approach neglects the\ninherent continuous sequential nature of time series, leading to a fundamental\nmisalignment between diffusion mechanisms and the TSF objective, thereby\nseverely impairing performance. To bridge this misalignment, and inspired by\nthe classic Auto-Regressive Moving Average (ARMA) theory, which views time\nseries as continuous sequential progressions evolving from previous data\npoints, we propose a novel Auto-Regressive Moving Diffusion (ARMD) model to\nfirst achieve the continuous sequential diffusion-based TSF. Unlike previous\nmethods that start from white Gaussian noise, our model employs chain-based\ndiffusion with priors, accurately modeling the evolution of time series and\nleveraging intermediate state information to improve forecasting accuracy and\nstability. 
Specifically, our approach reinterprets the diffusion process by\nconsidering future series as the initial state and historical series as the\nfinal state, with intermediate series generated using a sliding-based technique\nduring the forward process. This design aligns the diffusion model's sampling\nprocedure with the forecasting objective, resulting in an unconditional,\ncontinuous sequential diffusion TSF model. Extensive experiments conducted on\nseven widely used datasets demonstrate that our model achieves state-of-the-art\nperformance, significantly outperforming existing diffusion-based TSF models.\nOur code is available on GitHub: https://github.com/daxin007/ARMD.\n","authors":["Jiaxin Gao","Qinglong Cao","Yuntian Chen"],"pdf_url":"https://arxiv.org/pdf/2412.09328v1.pdf","comment":"no comment"},{"id":"http://arxiv.org/abs/2410.08760v2","updated":"2024-12-12T14:43:48Z","published":"2024-10-11T12:19:18Z","title":"Unlocking FedNL: Self-Contained Compute-Optimized Implementation","summary":" Federated Learning (FL) is an emerging paradigm that enables intelligent\nagents to collaboratively train Machine Learning (ML) models in a distributed\nmanner, eliminating the need for sharing their local data. The recent work\n(arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL)\nalgorithms, marking a significant step towards applying second-order methods to\nFL and large-scale optimization. However, the reference FedNL prototype\nexhibits three serious practical drawbacks: (i) It requires 4.8 hours to launch\na single experiment in a sever-grade workstation; (ii) The prototype only\nsimulates multi-node setting; (iii) Prototype integration into\nresource-constrained applications is challenging. To bridge the gap between\ntheory and practice, we present a self-contained implementation of FedNL,\nFedNL-LS, FedNL-PP for single-node and multi-node settings. Our work resolves\nthe aforementioned issues and reduces the wall clock time by x1000. With this\nFedNL outperforms alternatives for training logistic regression in a\nsingle-node -- CVXPY (arXiv:1603.00943), and in a multi-node -- Apache Spark\n(arXiv:1505.06807), Ray/Scikit-Learn (arXiv:1712.05889). Finally, we propose\ntwo practical-orientated compressors for FedNL - adaptive TopLEK and\ncache-aware RandSeqK, which fulfill the theory of FedNL.\n","authors":["Konstantin Burlachenko","Peter Richtárik"],"pdf_url":"https://arxiv.org/pdf/2410.08760v2.pdf","comment":"55 pages, 12 figures, 12 tables"},{"id":"http://arxiv.org/abs/2412.00727v2","updated":"2024-12-12T14:28:42Z","published":"2024-12-01T08:39:12Z","title":"Perturb and Recover: Fine-tuning for Effective Backdoor Removal from\n CLIP","summary":" Vision-Language models like CLIP have been shown to be highly effective at\nlinking visual perception and natural language understanding, enabling\nsophisticated image-text capabilities, including strong retrieval and zero-shot\nclassification performance. Their widespread use, as well as the fact that CLIP\nmodels are trained on image-text pairs from the web, make them both a\nworthwhile and relatively easy target for backdoor attacks. As training\nfoundational models, such as CLIP, from scratch is very expensive, this paper\nfocuses on cleaning potentially poisoned models via fine-tuning. We first show\nthat existing cleaning techniques are not effective against simple structured\ntriggers used in Blended or BadNet backdoor attacks, exposing a critical\nvulnerability for potential real-world deployment of these models. 
Then, we\nintroduce PAR, Perturb and Recover, a surprisingly simple yet effective\nmechanism to remove backdoors from CLIP models. Through extensive experiments\nacross different encoders and types of backdoor attacks, we show that PAR\nachieves high backdoor removal rate while preserving good standard performance.\nFinally, we illustrate that our approach is effective even only with synthetic\ntext-image pairs, i.e. without access to real training data. The code and\nmodels are available at https://github.com/nmndeep/PerturbAndRecover.\n","authors":["Naman Deep Singh","Francesco Croce","Matthias Hein"],"pdf_url":"https://arxiv.org/pdf/2412.00727v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09308v1","updated":"2024-12-12T14:24:04Z","published":"2024-12-12T14:24:04Z","title":"Dynamic Prompt Allocation and Tuning for Continual Test-Time Adaptation","summary":" Continual test-time adaptation (CTTA) has recently emerged to adapt a\npre-trained source model to continuously evolving target distributions, which\naccommodates the dynamic nature of real-world environments. To mitigate the\nrisk of catastrophic forgetting in CTTA, existing methods typically incorporate\nexplicit regularization terms to constrain the variation of model parameters.\nHowever, they cannot fundamentally resolve catastrophic forgetting because they\nrely on a single shared model to adapt across all target domains, which\ninevitably leads to severe inter-domain interference. In this paper, we\nintroduce learnable domain-specific prompts that guide the model to adapt to\ncorresponding target domains, thereby partially disentangling the parameter\nspace of different domains. In the absence of domain identity for target\nsamples, we propose a novel dynamic Prompt AllocatIon aNd Tuning (PAINT)\nmethod, which utilizes a query mechanism to dynamically determine whether the\ncurrent samples come from a known domain or an unexplored one. For known\ndomains, the corresponding domain-specific prompt is directly selected, while\nfor previously unseen domains, a new prompt is allocated. Prompt tuning is\nsubsequently performed using mutual information maximization along with\nstructural regularization. Extensive experiments on three benchmark datasets\ndemonstrate the effectiveness of our PAINT method for CTTA. We have released\nour code at https://github.com/Cadezzyr/PAINT.\n","authors":["Chaoran Cui","Yongrui Zhen","Shuai Gong","Chunyun Zhang","Hui Liu","Yilong Yin"],"pdf_url":"https://arxiv.org/pdf/2412.09308v1.pdf","comment":"21 pages, 5 figures, and 6 tables"},{"id":"http://arxiv.org/abs/2412.09292v1","updated":"2024-12-12T14:06:37Z","published":"2024-12-12T14:06:37Z","title":"Transfer Learning of RSSI to Improve Indoor Localisation Performance","summary":" With the growing demand for health monitoring systems, in-home localisation\nis essential for tracking patient conditions. The unique spatial\ncharacteristics of each house required annotated data for Bluetooth Low Energy\n(BLE) Received Signal Strength Indicator (RSSI)-based monitoring system.\nHowever, collecting annotated training data is time-consuming, particularly for\npatients with limited health conditions. To address this, we propose\nConditional Generative Adversarial Networks (ConGAN)-based augmentation,\ncombined with our transfer learning framework (T-ConGAN), to enable the\ntransfer of generic RSSI information between different homes, even when data is\ncollected using different experimental protocols. 
This enhances the performance\nand scalability of such intelligent systems by reducing the need for annotation\nin each home. We are the first to demonstrate that BLE RSSI data can be shared\nacross different homes, and that shared information can improve the indoor\nlocalisation performance. Our T-ConGAN enhances the macro F1 score of\nroom-level indoor localisation by up to 12.2%, with a remarkable 51%\nimprovement in challenging areas such as stairways or outside spaces. This\nstate-of-the-art RSSI augmentation model significantly enhances the robustness\nof in-home health monitoring systems.\n","authors":["Thanaphon Suwannaphong","Ryan McConville","Ian Craddock"],"pdf_url":"https://arxiv.org/pdf/2412.09292v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09289v1","updated":"2024-12-12T13:59:21Z","published":"2024-12-12T13:59:21Z","title":"Optimising TinyML with Quantization and Distillation of Transformer and\n Mamba Models for Indoor Localisation on Edge Devices","summary":" This paper proposes small and efficient machine learning models (TinyML) for\nresource-constrained edge devices, specifically for on-device indoor\nlocalisation. Typical approaches for indoor localisation rely on centralised\nremote processing of data transmitted from lower powered devices such as\nwearables. However, there are several benefits for moving this to the edge\ndevice itself, including increased battery life, enhanced privacy, reduced\nlatency and lowered operational costs, all of which are key for common\napplications such as health monitoring. The work focuses on model compression\ntechniques, including quantization and knowledge distillation, to significantly\nreduce the model size while maintaining high predictive performance. We base\nour work on a large state-of-the-art transformer-based model and seek to deploy\nit within low-power MCUs. We also propose a state-space-based architecture\nusing Mamba as a more compact alternative to the transformer. Our results show\nthat the quantized transformer model performs well within a 64 KB RAM\nconstraint, achieving an effective balance between model size and localisation\nprecision. Additionally, the compact Mamba model has strong performance under\neven tighter constraints, such as a 32 KB of RAM, without the need for model\ncompression, making it a viable option for more resource-limited environments.\nWe demonstrate that, through our framework, it is feasible to deploy advanced\nindoor localisation models onto low-power MCUs with restricted memory\nlimitations. The application of these TinyML models in healthcare has the\npotential to revolutionize patient monitoring by providing accurate, real-time\nlocation data while minimizing power consumption, increasing data privacy,\nimproving latency and reducing infrastructure costs.\n","authors":["Thanaphon Suwannaphong","Ferdian Jovan","Ian Craddock","Ryan McConville"],"pdf_url":"https://arxiv.org/pdf/2412.09289v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09286v1","updated":"2024-12-12T13:56:36Z","published":"2024-12-12T13:56:36Z","title":"Learning Novel Skills from Language-Generated Demonstrations","summary":" Current robot learning algorithms for acquiring novel skills often rely on\ndemonstration datasets or environment interactions, resulting in high labor\ncosts and potential safety risks. To address these challenges, this study\nproposes a skill-learning framework that enables robots to acquire novel skills\nfrom natural language instructions. 
The proposed pipeline leverages\nvision-language models to generate demonstration videos of novel skills, which\nare processed by an inverse dynamics model to extract actions from the\nunlabeled demonstrations. These actions are subsequently mapped to\nenvironmental contexts via imitation learning, enabling robots to learn new\nskills effectively. Experimental evaluations in the MetaWorld simulation\nenvironments demonstrate the pipeline's capability to generate high-fidelity\nand reliable demonstrations. Using the generated demonstrations, various skill\nlearning algorithms achieve an accomplishment rate three times the original on\nnovel tasks. These results highlight a novel approach to robot learning,\noffering a foundation for the intuitive and intelligent acquisition of novel\nrobotic skills.\n","authors":["Ao-Qun Jin","Tian-Yu Xiang","Xiao-Hu Zhou","Mei-Jiang Gui","Xiao-Liang Xie","Shi-Qi Liu","Shuang-Yi Wang","Yue Cao","Sheng-Bin Duan","Fu-Chao Xie","Zeng-Guang Hou"],"pdf_url":"https://arxiv.org/pdf/2412.09286v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07435v2","updated":"2024-12-12T13:48:18Z","published":"2024-12-10T11:50:46Z","title":"Parallel simulation for sampling under isoperimetry and score-based\n diffusion models","summary":" In recent years, there has been a surge of interest in proving discretization\nbounds for sampling under isoperimetry and for diffusion models. As data size\ngrows, reducing the iteration cost becomes an important goal. Inspired by the\ngreat success of the parallel simulation of the initial value problem in\nscientific computation, we propose parallel Picard methods for sampling tasks.\nRigorous theoretical analysis reveals that our algorithm achieves better\ndependence on dimension $d$ than prior works in iteration complexity (i.e.,\nreduced from $\\widetilde{O}(\\log^2 d)$ to $\\widetilde{O}(\\log d)$), which is\neven optimal for sampling under isoperimetry with specific iteration\ncomplexity. Our work highlights the potential advantages of simulation methods\nin scientific computation for dynamics-based sampling and diffusion models.\n","authors":["Huanjian Zhou","Masashi Sugiyama"],"pdf_url":"https://arxiv.org/pdf/2412.07435v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09282v1","updated":"2024-12-12T13:45:11Z","published":"2024-12-12T13:45:11Z","title":"CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of\n LLMs","summary":" Powerful large language models (LLMs) are increasingly expected to be\ndeployed with lower computational costs, enabling their capabilities on\nresource-constrained devices. Post-training quantization (PTQ) has emerged as a\nstar approach to achieve this ambition, with best methods compressing weights\nto less than 2 bit on average. In this paper, we propose Channel-Relaxed Vector\nQuantization (CRVQ), a novel technique that significantly improves the\nperformance of PTQ baselines at the cost of only minimal additional bits. This\nstate-of-the-art extreme compression method achieves its results through two\nkey innovations: (1) carefully selecting and reordering a very small subset of\ncritical weight channels, and (2) leveraging multiple codebooks to relax the\nconstraint of critical channels. With our method, we demonstrate a 38.9%\nimprovement over the current strongest sub-2-bit PTQ baseline, enabling nearer\nlossless 1-bit compression. 
Furthermore, our approach offers flexible\ncustomization of quantization bit-width and performance, providing a wider\nrange of deployment options for diverse hardware platforms.\n","authors":["Yuzhuang Xu","Shiyu Ji","Qingfu Zhu","Wanxiang Che"],"pdf_url":"https://arxiv.org/pdf/2412.09282v1.pdf","comment":"5 figures, 4 tables"},{"id":"http://arxiv.org/abs/2402.05541v2","updated":"2024-12-12T13:40:37Z","published":"2024-02-08T10:22:12Z","title":"FedAA: A Reinforcement Learning Perspective on Adaptive Aggregation for\n Fair and Robust Federated Learning","summary":" Federated Learning (FL) has emerged as a promising approach for\nprivacy-preserving model training across decentralized devices. However, it\nfaces challenges such as statistical heterogeneity and susceptibility to\nadversarial attacks, which can impact model robustness and fairness.\nPersonalized FL attempts to provide some relief by customizing models for\nindividual clients. However, it falls short in addressing server-side\naggregation vulnerabilities. We introduce a novel method called \\textbf{FedAA},\nwhich optimizes client contributions via \\textbf{A}daptive \\textbf{A}ggregation\nto enhance model robustness against malicious clients and ensure fairness\nacross participants in non-identically distributed settings. To achieve this\ngoal, we propose an approach involving a Deep Deterministic Policy\nGradient-based algorithm for continuous control of aggregation weights, an\ninnovative client selection method based on model parameter distances, and a\nreward mechanism guided by validation set performance. Empirically, extensive\nexperiments demonstrate that, in terms of robustness, \\textbf{FedAA}\noutperforms the state-of-the-art methods, while maintaining comparable levels\nof fairness, offering a promising solution to build resilient and fair\nfederated systems. Our code is available at https://github.com/Gp1g/FedAA.\n","authors":["Jialuo He","Wei Chen","Xiaojin Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.05541v2.pdf","comment":"AAAI 2025"},{"id":"http://arxiv.org/abs/2407.13291v4","updated":"2024-12-12T13:35:22Z","published":"2024-07-18T08:45:14Z","title":"Scikit-fingerprints: easy and efficient computation of molecular\n fingerprints in Python","summary":" In this work, we present scikit-fingerprints, a Python package for\ncomputation of molecular fingerprints for applications in chemoinformatics. Our\nlibrary offers an industry-standard scikit-learn interface, allowing intuitive\nusage and easy integration with machine learning pipelines. It is also highly\noptimized, featuring parallel computation that enables efficient processing of\nlarge molecular datasets. Currently, scikit-fingerprints stands as the most\nfeature-rich library in the open source Python ecosystem, offering over 30\nmolecular fingerprints. Our library simplifies chemoinformatics tasks based on\nmolecular fingerprints, including molecular property prediction and virtual\nscreening. It is also flexible, highly efficient, and fully open source.\n","authors":["Jakub Adamczyk","Piotr Ludynia"],"pdf_url":"https://arxiv.org/pdf/2407.13291v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.13094v2","updated":"2024-12-12T13:33:57Z","published":"2023-11-22T01:50:43Z","title":"Newton-CG methods for nonconvex unconstrained optimization with Hölder\n continuous Hessian","summary":" In this paper we consider a nonconvex unconstrained optimization problem\nminimizing a twice differentiable objective function with H\\\"older continuous\nHessian. 
Specifically, we first propose a Newton-conjugate gradient (Newton-CG)\nmethod for finding an approximate first- and second-order stationary point of\nthis problem, assuming the associated the H\\\"older parameters are explicitly\nknown. Then we develop a parameter-free Newton-CG method without requiring any\nprior knowledge of these parameters. To the best of our knowledge, this method\nis the first parameter-free second-order method achieving the best-known\niteration and operation complexity for finding an approximate first- and\nsecond-order stationary point of this problem. Finally, we present preliminary\nnumerical results to demonstrate the superior practical performance of our\nparameter-free Newton-CG method over a well-known regularized Newton method.\n","authors":["Chuan He","Heng Huang","Zhaosong Lu"],"pdf_url":"https://arxiv.org/pdf/2311.13094v2.pdf","comment":"arXiv admin note: text overlap with arXiv:2301.03139"},{"id":"http://arxiv.org/abs/2412.09265v1","updated":"2024-12-12T13:22:02Z","published":"2024-12-12T13:22:02Z","title":"Score and Distribution Matching Policy: Advanced Accelerated Visuomotor\n Policies via Matched Distillation","summary":" Visual-motor policy learning has advanced with architectures like\ndiffusion-based policies, known for modeling complex robotic trajectories.\nHowever, their prolonged inference times hinder high-frequency control tasks\nrequiring real-time feedback. While consistency distillation (CD) accelerates\ninference, it introduces errors that compromise action quality. To address\nthese limitations, we propose the Score and Distribution Matching Policy (SDM\nPolicy), which transforms diffusion-based policies into single-step generators\nthrough a two-stage optimization process: score matching ensures alignment with\ntrue action distributions, and distribution matching minimizes KL divergence\nfor consistency. A dual-teacher mechanism integrates a frozen teacher for\nstability and an unfrozen teacher for adversarial training, enhancing\nrobustness and alignment with target distributions. Evaluated on a 57-task\nsimulation benchmark, SDM Policy achieves a 6x inference speedup while having\nstate-of-the-art action quality, providing an efficient and reliable framework\nfor high-frequency robotic tasks.\n","authors":["Bofang Jia","Pengxiang Ding","Can Cui","Mingyang Sun","Pengfang Qian","Zhaoxin Fan","Donglin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09265v1.pdf","comment":"17 pages"},{"id":"http://arxiv.org/abs/2412.09261v1","updated":"2024-12-12T13:20:23Z","published":"2024-12-12T13:20:23Z","title":"Single-View Graph Contrastive Learning with Soft Neighborhood Awareness","summary":" Most graph contrastive learning (GCL) methods heavily rely on cross-view\ncontrast, thus facing several concomitant challenges, such as the complexity of\ndesigning effective augmentations, the potential for information loss between\nviews, and increased computational costs. To mitigate reliance on cross-view\ncontrasts, we propose \\ttt{SIGNA}, a novel single-view graph contrastive\nlearning framework. Regarding the inconsistency between structural connection\nand semantic similarity of neighborhoods, we resort to soft neighborhood\nawareness for GCL. Specifically, we leverage dropout to obtain\nstructurally-related yet randomly-noised embedding pairs for neighbors, which\nserve as potential positive samples. 
At each epoch, the role of partial\nneighbors is switched from positive to negative, leading to probabilistic\nneighborhood contrastive learning effect. Furthermore, we propose a normalized\nJensen-Shannon divergence estimator for a better effect of contrastive\nlearning. Surprisingly, experiments on diverse node-level tasks demonstrate\nthat our simple single-view GCL framework consistently outperforms existing\nmethods by margins of up to 21.74% (PPI). In particular, with soft neighborhood\nawareness, SIGNA can adopt MLPs instead of complicated GCNs as the encoder to\ngenerate representations in transductive learning tasks, thus speeding up its\ninference process by 109 times to 331 times. The source code is available at\nhttps://github.com/sunisfighting/SIGNA.\n","authors":["Qingqiang Sun","Chaoqi Chen","Ziyue Qiao","Xubin Zheng","Kai Wang"],"pdf_url":"https://arxiv.org/pdf/2412.09261v1.pdf","comment":"Accepted by AAAI2025; full version including appendix"},{"id":"http://arxiv.org/abs/2302.14112v2","updated":"2024-12-12T13:15:42Z","published":"2023-02-27T19:51:42Z","title":"Injectivity of ReLU networks: perspectives from statistical physics","summary":" When can the input of a ReLU neural network be inferred from its output? In\nother words, when is the network injective? We consider a single layer, $x\n\\mapsto \\mathrm{ReLU}(Wx)$, with a random Gaussian $m \\times n$ matrix $W$, in\na high-dimensional setting where $n, m \\to \\infty$. Recent work connects this\nproblem to spherical integral geometry giving rise to a conjectured sharp\ninjectivity threshold for $\\alpha = \\frac{m}{n}$ by studying the expected Euler\ncharacteristic of a certain random set. We adopt a different perspective and\nshow that injectivity is equivalent to a property of the ground state of the\nspherical perceptron, an important spin glass model in statistical physics. By\nleveraging the (non-rigorous) replica symmetry-breaking theory, we derive\nanalytical equations for the threshold whose solution is at odds with that from\nthe Euler characteristic. Furthermore, we use Gordon's min--max theorem to\nprove that a replica-symmetric upper bound refutes the Euler characteristic\nprediction. Along the way we aim to give a tutorial-style introduction to key\nideas from statistical physics in an effort to make the exposition accessible\nto a broad audience. Our analysis establishes a connection between spin glasses\nand integral geometry but leaves open the problem of explaining the\ndiscrepancies.\n","authors":["Antoine Maillard","Afonso S. Bandeira","David Belius","Ivan Dokmanić","Shuta Nakajima"],"pdf_url":"https://arxiv.org/pdf/2302.14112v2.pdf","comment":"62 pages ; Changes to match the published version (v2), in particular\n Appendix A.7 was added, and Appendix G was re-worked as an alternative proof\n of Theorem 1.8"},{"id":"http://arxiv.org/abs/2412.09254v1","updated":"2024-12-12T13:11:02Z","published":"2024-12-12T13:11:02Z","title":"When Can Memorization Improve Fairness?","summary":" We study to which extent additive fairness metrics (statistical parity, equal\nopportunity and equalized odds) can be influenced in a multi-class\nclassification problem by memorizing a subset of the population. We give\nexplicit expressions for the bias resulting from memorization in terms of the\nlabel and group membership distribution of the memorized dataset and the\nclassifier bias on the unmemorized dataset. We also characterize the memorized\ndatasets that eliminate the bias for all three metrics considered. 
Finally we\nprovide upper and lower bounds on the total probability mass in the memorized\ndataset that is necessary for the complete elimination of these biases.\n","authors":["Bob Pepin","Christian Igel","Raghavendra Selvan"],"pdf_url":"https://arxiv.org/pdf/2412.09254v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01102v2","updated":"2024-12-12T13:08:13Z","published":"2024-12-02T04:19:47Z","title":"Personalized Coupled Tensor Decomposition for Multimodal Data Fusion:\n Uniqueness and Algorithms","summary":" Coupled tensor decompositions (CTDs) perform data fusion by linking factors\nfrom different datasets. Although many CTDs have been already proposed, current\nworks do not address important challenges of data fusion, where: 1) the\ndatasets are often heterogeneous, constituting different \"views\" of a given\nphenomena (multimodality); and 2) each dataset can contain personalized or\ndataset-specific information, constituting distinct factors that are not\ncoupled with other datasets. In this work, we introduce a personalized CTD\nframework tackling these challenges. A flexible model is proposed where each\ndataset is represented as the sum of two components, one related to a common\ntensor through a multilinear measurement model, and another specific to each\ndataset. Both the common and distinct components are assumed to admit a\npolyadic decomposition. This generalizes several existing CTD models. We\nprovide conditions for specific and generic uniqueness of the decomposition\nthat are easy to interpret. These conditions employ uni-mode uniqueness of\ndifferent individual datasets and properties of the measurement model. Two\nalgorithms are proposed to compute the common and distinct components: a\nsemi-algebraic one and a coordinate-descent optimization method. Experimental\nresults illustrate the advantage of the proposed framework compared with the\nstate of the art approaches.\n","authors":["Ricardo Augusto Borsoi","Konstantin Usevich","David Brie","Tülay Adali"],"pdf_url":"https://arxiv.org/pdf/2412.01102v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.16970v4","updated":"2024-12-12T13:06:47Z","published":"2024-03-25T17:31:12Z","title":"A Multi-Stage Framework for Joint Chest X-Ray Diagnosis and Visual\n Attention Prediction Using Deep Learning","summary":" Purpose: As visual inspection is an inherent process during radiological\nscreening, the associated eye gaze data can provide valuable insights into\nrelevant clinical decisions. As deep learning has become the state-of-the-art\nfor computer-assisted diagnosis, integrating human behavior, such as eye gaze\ndata, into these systems is instrumental to help align machine predictions with\nclinical diagnostic criteria, thus enhancing the quality of automatic\nradiological diagnosis. Methods: We propose a novel deep learning framework for\njoint disease diagnosis and prediction of corresponding clinical visual\nattention maps for chest X-ray scans. Specifically, we introduce a new\ndual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a\nResidual and Squeeze-and-Excitation block-based encoder to extract diverse\nfeatures for visual attention map prediction, and a multi-scale feature-fusion\nclassifier to perform disease classification. To tackle the issue of\nasynchronous training schedules of individual tasks in multi-task learning, we\nproposed a multi-stage cooperative learning strategy, with contrastive learning\nfor feature encoder pretraining to boost performance. 
Results: Our proposed\nmethod is shown to significantly outperform existing techniques for chest X-ray\ndiagnosis (AUC=0.93) and the quality of visual attention map prediction\n(Correlation coefficient=0.58). Conclusion: Benefiting from the proposed\nmulti-task multi-stage cooperative learning, our technique demonstrates the\nbenefit of integrating clinicians' eye gaze into clinical AI systems to boost\nperformance and potentially explainability.\n","authors":["Zirui Qiu","Hassan Rivaz","Yiming Xiao"],"pdf_url":"https://arxiv.org/pdf/2403.16970v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09250v1","updated":"2024-12-12T13:04:54Z","published":"2024-12-12T13:04:54Z","title":"GeLoRA: Geometric Adaptive Ranks For Efficient LoRA Fine-tuning","summary":" Fine-tuning large language models (LLMs) is computationally intensive because\nit requires updating all parameters. Low-Rank Adaptation (LoRA) improves\nefficiency by modifying only a subset of weights but introduces a trade-off\nbetween expressivity and computational cost: lower ranks reduce resources but\nlimit expressiveness, while higher ranks enhance expressivity at increased\ncost. Despite recent advances in adaptive LoRA techniques, existing methods\nfail to provide a theoretical basis for optimizing the trade-off between model\nperformance and efficiency. We propose Geometric Low-Rank Adaptation (GeLoRA),\na novel framework that computes the intrinsic dimensionality of hidden state\nrepresentations to adaptively select LoRA ranks. We demonstrate that the\nintrinsic dimension provides a lower bound for the optimal rank of LoRA\nmatrices, allowing for a principled selection that balances efficiency and\nexpressivity. GeLoRA dynamically adjusts the rank for each layer based on the\nintrinsic dimensionality of its input and output representations, recognizing\nthat not all model parameters equally impact fine-tuning. Empirical validation\non multiple tasks shows that GeLoRA consistently outperforms recent baselines\nwithin the same parameter budget.\n","authors":["Abdessalam Ed-dib","Zhanibek Datbayev","Amine Mohamed Aboussalah"],"pdf_url":"https://arxiv.org/pdf/2412.09250v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04076v2","updated":"2024-12-12T13:04:39Z","published":"2024-12-05T11:17:03Z","title":"Distance-Adaptive Quaternion Knowledge Graph Embedding with\n Bidirectional Rotation","summary":" Quaternion contains one real part and three imaginary parts, which provides a\nmore expressive hypercomplex space for learning knowledge graphs. Existing\nquaternion embedding models measure the plausibility of a triplet either\nthrough semantic matching or geometric distance scoring functions. However, it\nappears that semantic matching diminishes the separability of entities, while\nthe distance scoring function weakens the semantics of entities. To address\nthis issue, we propose a novel quaternion knowledge graph embedding model. Our\nmodel combines semantic matching with entity's geometric distance to better\nmeasure the plausibility of triplets. Specifically, in the quaternion space, we\nperform a right rotation on the head entity and a reverse rotation on the tail entity\nto learn rich semantic features. Then, we utilize distance adaptive\ntranslations to learn geometric distance between entities. Furthermore, we\nprovide mathematical proofs to demonstrate our model can handle complex logical\nrelationships. 
Extensive experimental results and analyses show our model\nsignificantly outperforms previous models on well-known knowledge graph\ncompletion benchmark datasets. Our code is available at\nhttps://github.com/llqy123/DaBR.\n","authors":["Weihua Wang","Qiuyu Liang","Feilong Bao","Guanglai Gao"],"pdf_url":"https://arxiv.org/pdf/2412.04076v2.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2405.13082v3","updated":"2024-12-12T13:02:17Z","published":"2024-05-21T06:44:40Z","title":"A Survey of Artificial Intelligence in Gait-Based Neurodegenerative\n Disease Diagnosis","summary":" Recent years have witnessed an increasing global population affected by\nneurodegenerative diseases (NDs), which traditionally require extensive\nhealthcare resources and human effort for medical diagnosis and monitoring. As\na crucial disease-related motor symptom, human gait can be exploited to\ncharacterize different NDs. The current advances in artificial intelligence\n(AI) models enable automatic gait analysis for NDs identification and\nclassification, opening a new avenue to facilitate faster and more\ncost-effective diagnosis of NDs. In this paper, we provide a comprehensive\nsurvey on recent progress of machine learning and deep learning based AI\ntechniques applied to diagnosis of five typical NDs through gait. We provide an\noverview of the process of AI-assisted NDs diagnosis, and present a systematic\ntaxonomy of existing gait data and AI models. Meanwhile, a novel quality\nevaluation criterion is proposed to quantitatively assess the quality of\nexisting studies. Through an extensive review and analysis of 169 studies, we\npresent recent technical advancements, discuss existing challenges, potential\nsolutions, and future directions in this field. Finally, we envision the\nprospective utilization of 3D skeleton data for human gait representation and\nthe development of more efficient AI models for NDs diagnosis.\n","authors":["Haocong Rao","Minlin Zeng","Xuejiao Zhao","Chunyan Miao"],"pdf_url":"https://arxiv.org/pdf/2405.13082v3.pdf","comment":"Article: 57 pages, citing 290 papers. Appendix: 30 pages. A\n up-to-date resource (papers, data, etc.) of this survey (AI4NDD) is provided\n at https://github.com/minlinzeng/AI4NDD-Survey"},{"id":"http://arxiv.org/abs/2405.05097v5","updated":"2024-12-12T12:54:46Z","published":"2024-05-08T14:49:27Z","title":"Biology-inspired joint distribution neurons based on Hierarchical\n Correlation Reconstruction allowing for multidirectional neural networks","summary":" Biological neural networks seem qualitatively superior (e.g. in learning,\nflexibility, robustness) to current artificial like Multi-Layer Perceptron\n(MLP) or Kolmogorov-Arnold Network (KAN). Simultaneously, in contrast to them:\nbiological have fundamentally multidirectional signal propagation \\cite{axon},\nalso of probability distributions e.g. for uncertainty estimation, and are\nbelieved not being able to use standard backpropagation training\n\\cite{backprop}. There are proposed novel artificial neurons based on HCR\n(Hierarchical Correlation Reconstruction) allowing to remove the above low\nlevel differences: with neurons containing local joint distribution model (of\nits connections), representing joint density on normalized variables as just\nlinear combination of $(f_\\mathbf{j})$ orthonormal polynomials:\n$\\rho(\\mathbf{x})=\\sum_{\\mathbf{j}\\in B} a_\\mathbf{j} f_\\mathbf{j}(\\mathbf{x})$\nfor $\\mathbf{x} \\in [0,1]^d$ and $B\\subset \\mathbb{N}^d$ some chosen basis. 
By\nvarious index summations of such $(a_\\mathbf{j})_{\\mathbf{j}\\in B}$ tensor as\nneuron parameters, we get simple formulas for e.g. conditional expected values\nfor propagation in any direction, like $E[x|y,z]$, $E[y|x]$, which degenerate\nto KAN-like parametrization if restricting to pairwise dependencies. Such HCR\nnetwork can also propagate probability distributions (also joint) like\n$\\rho(y,z|x)$. It also allows for additional training approaches, like direct\n$(a_\\mathbf{j})$ estimation, through tensor decomposition, or more biologically\nplausible information bottleneck training: layers directly influencing only\nneighbors, optimizing content to maximize information about the next layer, and\nminimizing about the previous to remove noise, extract crucial information.\n","authors":["Jarek Duda"],"pdf_url":"https://arxiv.org/pdf/2405.05097v5.pdf","comment":"9 pages, 9 figures"},{"id":"http://arxiv.org/abs/2401.09274v5","updated":"2024-12-12T12:53:10Z","published":"2024-01-17T15:25:50Z","title":"Avoiding strict saddle points of nonconvex regularized problems","summary":" In this paper, we consider a class of non-convex and non-smooth sparse\noptimization problems, which encompass most existing nonconvex\nsparsity-inducing terms. We show the second-order optimality conditions only\ndepend on the nonzeros of the stationary points. We propose two damped\niterative reweighted algorithms including the iteratively reweighted $\\ell_1$\nalgorithm (DIRL$_1$) and the iteratively reweighted $\\ell_2$ (DIRL$_2$)\nalgorithm, to solve these problems. For DIRL$_1$, we show the reweighted\n$\\ell_1$ subproblem has support identification property so that DIRL$_1$\nlocally reverts to a gradient descent algorithm around a stationary point. For\nDIRL$_2$, we show the solution map of the reweighted $\\ell_2$ subproblem is\ndifferentiable and Lipschitz continuous everywhere. Therefore, the map of\nDIRL$_1$ and DIRL$_2$ and their inverse are Lipschitz continuous, and the\nstrict saddle points are their unstable fixed points. By applying the stable\nmanifold theorem, these algorithms are shown to converge only to local\nminimizers with randomly initialization when the strictly saddle point property\nis assumed.\n","authors":["Luwei Bai","Yaohua Hu","Hao Wang","Xiaoqi Yang"],"pdf_url":"https://arxiv.org/pdf/2401.09274v5.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2412.09232v1","updated":"2024-12-12T12:43:42Z","published":"2024-12-12T12:43:42Z","title":"Uplift modeling with continuous treatments: A predict-then-optimize\n approach","summary":" The goal of uplift modeling is to recommend actions that optimize specific\noutcomes by determining which entities should receive treatment. One common\napproach involves two steps: first, an inference step that estimates\nconditional average treatment effects (CATEs), and second, an optimization step\nthat ranks entities based on their CATE values and assigns treatment to the top\nk within a given budget. While uplift modeling typically focuses on binary\ntreatments, many real-world applications are characterized by continuous-valued\ntreatments, i.e., a treatment dose. This paper presents a predict-then-optimize\nframework to allow for continuous treatments in uplift modeling. First, in the\ninference step, conditional average dose responses (CADRs) are estimated from\ndata using causal machine learning techniques. 
Second, in the optimization\nstep, we frame the assignment task of continuous treatments as a\ndose-allocation problem and solve it using integer linear programming (ILP).\nThis approach allows decision-makers to efficiently and effectively allocate\ntreatment doses while balancing resource availability, with the possibility of\nadding extra constraints like fairness considerations or adapting the objective\nfunction to take into account instance-dependent costs and benefits to maximize\nutility. The experiments compare several CADR estimators and illustrate the\ntrade-offs between policy value and fairness, as well as the impact of an\nadapted objective function. This showcases the framework's advantages and\nflexibility across diverse applications in healthcare, lending, and human\nresource management. All code is available on github.com/SimonDeVos/UMCT.\n","authors":["Simon De Vos","Christopher Bockel-Rickermann","Stefan Lessmann","Wouter Verbeke"],"pdf_url":"https://arxiv.org/pdf/2412.09232v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2312.12456v2","updated":"2024-12-12T12:38:12Z","published":"2023-12-16T02:27:00Z","title":"PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU","summary":" This paper introduces PowerInfer, a high-speed Large Language Model (LLM)\ninference engine on a personal computer (PC) equipped with a single\nconsumer-grade GPU. The key principle underlying the design of PowerInfer is\nexploiting the high locality inherent in LLM inference, characterized by a\npower-law distribution in neuron activation. This distribution indicates that a\nsmall subset of neurons, termed hot neurons, are consistently activated across\ninputs, while the majority, cold neurons, vary based on specific inputs.\nPowerInfer exploits such an insight to design a GPU-CPU hybrid inference\nengine: hot-activated neurons are preloaded onto the GPU for fast access, while\ncold-activated neurons are computed on the CPU, thus significantly reducing GPU\nmemory demands and CPU-GPU data transfers. PowerInfer further integrates\nadaptive predictors and neuron-aware sparse operators, optimizing the\nefficiency of neuron activation and computational sparsity. The evaluation\nshows that PowerInfer significantly outperforms llama.cpp by up to 11.69x while\nretaining model accuracy across various LLMs (including OPT-175B) on a single\nNVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance\ncomparable to that of a high-end server-grade A100 GPU, reaching 82% of its\ntoken generation rate on a single consumer-grade RTX 4090 GPU.\n","authors":["Yixin Song","Zeyu Mi","Haotong Xie","Haibo Chen"],"pdf_url":"https://arxiv.org/pdf/2312.12456v2.pdf","comment":"SOSP 2024"},{"id":"http://arxiv.org/abs/2406.03231v3","updated":"2024-12-12T12:37:14Z","published":"2024-06-05T13:06:52Z","title":"CommonPower: A Framework for Safe Data-Driven Smart Grid Control","summary":" The growing complexity of power system management has led to an increased\ninterest in reinforcement learning (RL). However, vanilla RL controllers cannot\nthemselves ensure satisfaction of system constraints. Therefore, combining them\nwith formally correct safeguarding mechanisms is an important aspect when\nstudying RL for power system management. Integrating safeguarding into complex\nuse cases requires tool support. To address this need, we introduce the Python\ntool CommonPower. 
CommonPower's unique contribution lies in its symbolic\nmodeling approach, which enables flexible, model-based safeguarding of RL\ncontrollers. Moreover, CommonPower offers a unified interface for single-agent\nRL, multi-agent RL, and optimal control, with seamless integration of different\nforecasting methods. This allows users to validate the effectiveness of safe RL\ncontrollers across a large variety of case studies and investigate the\ninfluence of specific aspects on overall performance. We demonstrate\nCommonPower's versatility through a numerical case study that compares RL\nagents featuring different safeguards with a model predictive controller in the\ncontext of building energy management.\n","authors":["Michael Eichelbeck","Hannah Markgraf","Matthias Althoff"],"pdf_url":"https://arxiv.org/pdf/2406.03231v3.pdf","comment":"For the corresponding code repository, see\n https://github.com/TUMcps/commonpower"},{"id":"http://arxiv.org/abs/2406.06282v3","updated":"2024-12-12T12:24:18Z","published":"2024-06-10T14:01:21Z","title":"PowerInfer-2: Fast Large Language Model Inference on a Smartphone","summary":" Large language models (LLMs) on smartphones enable real-time AI assistance\nand privacy-preserving, offline operation. However, resource constraints of\nsmartphones limit current deployments to small language models (SLMs),\nsignificantly compromising their capabilities. This paper introduces\nPowerInfer-2, a smartphone-based framework that enables fast inference for LLMs\nexceeding the memory capacity. The key insight is decomposing matrix operations\ninto neuron clusters as the basic processing unit, which enables flexible\nscheduling and efficient I/O-computation pipelining. PowerInfer-2 leverages\nthis neuron-cluster-based design in both computation and storage. For\ncomputation, neuron clusters with dense activations are processed on NPU, while\nsparse clusters use CPU. The storage engine provides a fine-grained pipeline\nmechanism that coordinates cluster-level computation and I/O operations,\nenhanced by a segmented neuron cache to reduce I/O activities. PowerInfer-2\nachieves up to a 27.8x speed increase compared to state-of-the-art frameworks.\nPowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving\n11.68 tokens/s. Notably, these performance improvements preserve model quality\nwith negligible accuracy degradation.\n","authors":["Zhenliang Xue","Yixin Song","Zeyu Mi","Xinrui Zheng","Yubin Xia","Haibo Chen"],"pdf_url":"https://arxiv.org/pdf/2406.06282v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05871v2","updated":"2024-12-12T12:07:07Z","published":"2024-10-08T09:58:38Z","title":"A second-order-like optimizer with adaptive gradient scaling for deep\n learning","summary":" In this empirical article, we introduce INNAprop, an optimization algorithm\nthat combines the INNA method with the RMSprop adaptive gradient scaling. It\nleverages second-order information and rescaling while keeping the memory\nrequirements of standard DL methods as AdamW or SGD with momentum. After giving\ngeometrical insights, we evaluate INNAprop on CIFAR-10, Food101, and ImageNet\nwith ResNets, VGG, DenseNet, and ViT, and on GPT-2 (OpenWebText) train from\nscratch and with LoRA fine-tuning (E2E). INNAprop consistently matches or\noutperforms AdamW both in training speed and accuracy, with minimal\nhyperparameter tuning in large-scale settings. 
Our code is publicly available\nat \\url{https://github.com/innaprop/innaprop}.\n","authors":["Jérôme Bolte","Ryan Boustany","Edouard Pauwels","Andrei Purica"],"pdf_url":"https://arxiv.org/pdf/2410.05871v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2307.03270v2","updated":"2024-12-12T12:05:25Z","published":"2023-07-04T08:29:59Z","title":"A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony\n in Talking Head Generation","summary":" Animating still face images with deep generative models using a speech input\nsignal is an active research topic and has seen important recent\nprogress.However, much of the effort has been put into lip syncing and\nrendering quality while the generation of natural head motion, let alone the\naudio-visual correlation between head motion and speech, has often been\nneglected.In this work, we propose a multi-scale audio-visual synchrony loss\nand a multi-scale autoregressive GAN to better handle short and long-term\ncorrelation between speech and the dynamics of the head and lips.In particular,\nwe train a stack of syncer models on multimodal input pyramids and use these\nmodels as guidance in a multi-scale generator network to produce audio-aligned\nmotion unfolding over diverse time scales.Both the pyramid of audio-visual\nsyncers and the generative models are trained in a low-dimensional space that\nfully preserves dynamics cues.The experiments show significant improvements\nover the state-of-the-art in head motion dynamics quality and especially in\nmulti-scale audio-visual synchrony on a collection of benchmark datasets.\n","authors":["Louis Airale","Dominique Vaufreydaz","Xavier Alameda-Pineda"],"pdf_url":"https://arxiv.org/pdf/2307.03270v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.16048v3","updated":"2024-12-12T12:01:43Z","published":"2024-02-25T10:13:04Z","title":"How Likely Do LLMs with CoT Mimic Human Reasoning?","summary":" Chain-of-thought emerges as a promising technique for eliciting reasoning\ncapabilities from Large Language Models (LLMs). However, it does not always\nimprove task performance or accurately represent reasoning processes, leaving\nunresolved questions about its usage. In this paper, we diagnose the underlying\nmechanism by comparing the reasoning process of LLMs with humans, using causal\nanalysis to understand the relationships between the problem instruction,\nreasoning, and the answer in LLMs. Our empirical study reveals that LLMs often\ndeviate from the ideal causal chain, resulting in spurious correlations and\npotential consistency errors (inconsistent reasoning and answers). We also\nexamine various factors influencing the causal structure, finding that\nin-context learning with examples strengthens it, while post-training\ntechniques like supervised fine-tuning and reinforcement learning on human\nfeedback weaken it. To our surprise, the causal structure cannot be\nstrengthened by enlarging the model size only, urging research on new\ntechniques. 
We hope that this preliminary study will shed light on\nunderstanding and improving the reasoning process in LLM.\n","authors":["Guangsheng Bao","Hongbo Zhang","Cunxiang Wang","Linyi Yang","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2402.16048v3.pdf","comment":"COLING 2025 Camera Version (8 pages, 3 figures, 18 tables)"},{"id":"http://arxiv.org/abs/2412.09195v1","updated":"2024-12-12T11:46:07Z","published":"2024-12-12T11:46:07Z","title":"On the Generation and Removal of Speaker Adversarial Perturbation for\n Voice-Privacy Protection","summary":" Neural networks are commonly known to be vulnerable to adversarial attacks\nmounted through subtle perturbation on the input data. Recent development in\nvoice-privacy protection has shown the positive use cases of the same technique\nto conceal speaker's voice attribute with additive perturbation signal\ngenerated by an adversarial network. This paper examines the reversibility\nproperty where an entity generating the adversarial perturbations is authorized\nto remove them and restore original speech (e.g., the speaker him/herself). A\nsimilar technique could also be used by an investigator to deanonymize a\nvoice-protected speech to restore criminals' identities in security and\nforensic analysis. In this setting, the perturbation generative module is\nassumed to be known in the removal process. To this end, a joint training of\nperturbation generation and removal modules is proposed. Experimental results\non the LibriSpeech dataset demonstrated that the subtle perturbations added to\nthe original speech can be predicted from the anonymized speech while achieving\nthe goal of privacy protection. By removing these perturbations from the\nanonymized sample, the original speech can be restored. Audio samples can be\nfound in \\url{https://voiceprivacy.github.io/Perturbation-Generation-Removal/}.\n","authors":["Chenyang Guo","Liping Chen","Zhuhai Li","Kong Aik Lee","Zhen-Hua Ling","Wu Guo"],"pdf_url":"https://arxiv.org/pdf/2412.09195v1.pdf","comment":"6 pages, 3 figures, published to IEEE SLT Workshop 2024"},{"id":"http://arxiv.org/abs/2402.13516v5","updated":"2024-12-12T11:29:32Z","published":"2024-02-21T03:58:49Z","title":"ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity\n within Large Language Models","summary":" Activation sparsity refers to the existence of considerable\nweakly-contributed elements among activation outputs. As a prevalent property\nof the models using the ReLU activation function, activation sparsity has been\nproven a promising paradigm to boost model inference efficiency. Nevertheless,\nmost large language models (LLMs) adopt activation functions without intrinsic\nactivation sparsity (e.g., GELU and Swish). Some recent efforts have explored\nintroducing ReLU or its variants as the substitutive activation function to\nhelp LLMs achieve activation sparsity and inference acceleration, but few can\nsimultaneously obtain high sparsity and comparable model performance. This\npaper introduces a simple and effective sparsification method named \"ProSparse\"\nto push LLMs for higher activation sparsity while maintaining comparable\nperformance. Specifically, after substituting the activation function of LLMs\nwith ReLU, ProSparse adopts progressive sparsity regularization with a factor\nsmoothly increasing along the multi-stage sine curves. This can enhance\nactivation sparsity and mitigate performance degradation by avoiding radical\nshifts in activation distributions. 
With ProSparse, we obtain high sparsity of\n89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size\nMiniCPM-1B, respectively, achieving comparable performance to their original\nSwish-activated versions. These present the most sparsely activated models\namong open-source LLaMA versions and competitive end-size models, considerably\nsurpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference\nacceleration experiments further demonstrate the significant practical\nacceleration potential of LLMs with higher activation sparsity, obtaining up to\n4.52$\\times$ inference speedup.\n","authors":["Chenyang Song","Xu Han","Zhengyan Zhang","Shengding Hu","Xiyu Shi","Kuai Li","Chen Chen","Zhiyuan Liu","Guangli Li","Tao Yang","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2402.13516v5.pdf","comment":"19 pages, 4 figures, 9 tables"},{"id":"http://arxiv.org/abs/2412.09183v1","updated":"2024-12-12T11:27:27Z","published":"2024-12-12T11:27:27Z","title":"Dimensionality Reduction Techniques for Global Bayesian Optimisation","summary":" Bayesian Optimisation (BO) is a state-of-the-art global optimisation\ntechnique for black-box problems where derivative information is unavailable,\nand sample efficiency is crucial. However, improving the general scalability of\nBO has proved challenging. Here, we explore Latent Space Bayesian Optimisation\n(LSBO), that applies dimensionality reduction to perform BO in a\nreduced-dimensional subspace. While early LSBO methods used (linear) random\nprojections (Wang et al., 2013), we employ Variational Autoencoders (VAEs) to\nmanage more complex data structures and general DR tasks. Building on Grosnit\net. al. (2021), we analyse the VAE-based LSBO framework, focusing on VAE\nretraining and deep metric loss. We suggest a few key corrections in their\nimplementation, originally designed for tasks such as molecule generation, and\nreformulate the algorithm for broader optimisation purposes. Our numerical\nresults show that structured latent manifolds improve BO performance.\nAdditionally, we examine the use of the Mat\\'{e}rn-$\\frac{5}{2}$ kernel for\nGaussian Processes in this LSBO context. We also integrate Sequential Domain\nReduction (SDR), a standard global optimization efficiency strategy, into BO.\nSDR is included in a GPU-based environment using \\textit{BoTorch}, both in the\noriginal and VAE-generated latent spaces, marking the first application of SDR\nwithin LSBO.\n","authors":["Luo Long","Coralia Cartis","Paz Fink Shustin"],"pdf_url":"https://arxiv.org/pdf/2412.09183v1.pdf","comment":"Accepted at NeurIPS 2024 Workshop OPT for ML: Optimization for\n Machine Learning (Submission Number:67)"},{"id":"http://arxiv.org/abs/2412.04100v2","updated":"2024-12-12T11:12:03Z","published":"2024-12-05T12:10:42Z","title":"Missing Melodies: AI Music Generation and its \"Nearly\" Complete Omission\n of the Global South","summary":" Recent advances in generative AI have sparked renewed interest and expanded\npossibilities for music generation. However, the performance and versatility of\nthese systems across musical genres are heavily influenced by the availability\nof training data. 
We conducted an extensive analysis of over one million hours\nof audio datasets used in AI music generation research and manually reviewed\nmore than 200 papers from eleven prominent AI and music conferences and\norganizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,\nNeurIPS, NIME, SMC) to identify a critical gap in the fair representation and\ninclusion of the musical genres of the Global South in AI research. Our\nfindings reveal a stark imbalance: approximately 86% of the total dataset hours\nand over 93% of researchers focus primarily on music from the Global North.\nHowever, around 40% of these datasets include some form of non-Western music,\ngenres from the Global South account for only 14.6% of the data. Furthermore,\napproximately 51% of the papers surveyed concentrate on symbolic music\ngeneration, a method that often fails to capture the cultural nuances inherent\nin music from regions such as South Asia, the Middle East, and Africa. As AI\nincreasingly shapes the creation and dissemination of music, the significant\nunderrepresentation of music genres in datasets and research presents a serious\nthreat to global musical diversity. We also propose some important steps to\nmitigate these risks and foster a more inclusive future for AI-driven music\ngeneration.\n","authors":["Atharva Mehta","Shivam Chauhan","Monojit Choudhury"],"pdf_url":"https://arxiv.org/pdf/2412.04100v2.pdf","comment":"Submitted to CACM, 12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2404.04108v2","updated":"2024-12-12T10:56:35Z","published":"2024-04-05T14:04:07Z","title":"Large language models as oracles for instantiating ontologies with\n domain-specific knowledge","summary":" Background. Endowing intelligent systems with semantic data commonly requires\ndesigning and instantiating ontologies with domain-specific knowledge.\nEspecially in the early phases, those activities are typically performed\nmanually by human experts possibly leveraging on their own experience. The\nresulting process is therefore time-consuming, error-prone, and often biased by\nthe personal background of the ontology designer. Objective. To mitigate that\nissue, we propose a novel domain-independent approach to automatically\ninstantiate ontologies with domain-specific knowledge, by leveraging on large\nlanguage models (LLMs) as oracles. Method. Starting from (i) an initial schema\ncomposed by inter-related classes and properties and (ii) a set of query\ntemplates, our method queries the LLM multiple times, and generates instances\nfor both classes and properties from its replies. Thus, the ontology is\nautomatically filled with domain-specific knowledge, compliant to the initial\nschema. As a result, the ontology is quickly and automatically enriched with\nmanifold instances, which experts may consider to keep, adjust, discard, or\ncomplement according to their own needs and expertise. Contribution. We\nformalise our method in general way and instantiate it over various LLMs, as\nwell as on a concrete case study. We report experiments rooted in the\nnutritional domain where an ontology of food meals and their ingredients is\nautomatically instantiated from scratch, starting from a categorisation of\nmeals and their relationships. 
There, we analyse the quality of the generated\nontologies and compare ontologies attained by exploiting different LLMs.\nExperimentally, our approach achieves a quality metric that is up to five times\nhigher than the state-of-the-art, while reducing erroneous entities and\nrelations by up to ten times. Finally, we provide a SWOT analysis of the\nproposed method.\n","authors":["Giovanni Ciatto","Andrea Agiollo","Matteo Magnini","Andrea Omicini"],"pdf_url":"https://arxiv.org/pdf/2404.04108v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09164v1","updated":"2024-12-12T10:49:55Z","published":"2024-12-12T10:49:55Z","title":"$(ε, δ)$-Differentially Private Partial Least Squares\n Regression","summary":" As data-privacy requirements are becoming increasingly stringent and\nstatistical models based on sensitive data are being deployed and used more\nroutinely, protecting data-privacy becomes pivotal. Partial Least Squares (PLS)\nregression is the premier tool for building such models in analytical\nchemistry, yet it does not inherently provide privacy guarantees, leaving\nsensitive (training) data vulnerable to privacy attacks. To address this gap,\nwe propose an $(\\epsilon, \\delta)$-differentially private PLS (edPLS)\nalgorithm, which integrates well-studied and theoretically motivated Gaussian\nnoise-adding mechanisms into the PLS algorithm to ensure the privacy of the\ndata underlying the model. Our approach involves adding carefully calibrated\nGaussian noise to the outputs of four key functions in the PLS algorithm: the\nweights, scores, $X$-loadings, and $Y$-loadings. The noise variance is\ndetermined based on the global sensitivity of each function, ensuring that the\nprivacy loss is controlled according to the $(\\epsilon, \\delta)$-differential\nprivacy framework. Specifically, we derive the sensitivity bounds for each\nfunction and use these bounds to calibrate the noise added to the model\ncomponents. Experimental results demonstrate that edPLS effectively renders\nprivacy attacks, aimed at recovering unique sources of variability in the\ntraining data, ineffective. Application of edPLS to the NIR corn benchmark\ndataset shows that the root mean squared error of prediction (RMSEP) remains\ncompetitive even at strong privacy levels (i.e., $\\epsilon=1$), given proper\npre-processing of the corresponding spectra. These findings highlight the\npractical utility of edPLS in creating privacy-preserving multivariate\ncalibrations and for the analysis of their privacy-utility trade-offs.\n","authors":["Ramin Nikzad-Langerodi","Mohit Kumar","Du Nguyen Duy","Mahtab Alghasi"],"pdf_url":"https://arxiv.org/pdf/2412.09164v1.pdf","comment":"14 pages, 5 figure"},{"id":"http://arxiv.org/abs/2412.08549v2","updated":"2024-12-12T10:49:10Z","published":"2024-12-11T17:10:44Z","title":"Watermarking Training Data of Music Generation Models","summary":" Generative Artificial Intelligence (Gen-AI) models are increasingly used to\nproduce content across domains, including text, images, and audio. While these\nmodels represent a major technical breakthrough, they gain their generative\ncapabilities from being trained on enormous amounts of human-generated content,\nwhich often includes copyrighted material. In this work, we investigate whether\naudio watermarking techniques can be used to detect an unauthorized usage of\ncontent to train a music generation model. 
We compare outputs generated by a\nmodel trained on watermarked data to a model trained on non-watermarked data.\nWe study factors that impact the model's generation behaviour: the watermarking\ntechnique, the proportion of watermarked samples in the training set, and the\nrobustness of the watermarking technique against the model's tokenizer. Our\nresults show that audio watermarking techniques, including some that are\nimperceptible to humans, can lead to noticeable shifts in the model's outputs.\nWe also study the robustness of a state-of-the-art watermarking technique to\nremoval techniques.\n","authors":["Pascal Epple","Igor Shilov","Bozhidar Stevanoski","Yves-Alexandre de Montjoye"],"pdf_url":"https://arxiv.org/pdf/2412.08549v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.09502v2","updated":"2024-12-12T10:39:16Z","published":"2024-11-14T15:13:13Z","title":"Golden Noise for Diffusion Models: A Learning Framework","summary":" Text-to-image diffusion model is a popular paradigm that synthesizes\npersonalized images by providing a text prompt and a random Gaussian noise.\nWhile people observe that some noises are ``golden noises'' that can achieve\nbetter text-image alignment and higher human preference than others, we still\nlack a machine learning framework to obtain those golden noises. To learn\ngolden noises for diffusion sampling, we mainly make three contributions in\nthis paper. First, we identify a new concept termed the \\textit{noise prompt},\nwhich aims at turning a random Gaussian noise into a golden noise by adding a\nsmall desirable perturbation derived from the text prompt. Following the\nconcept, we first formulate the \\textit{noise prompt learning} framework that\nsystematically learns ``prompted'' golden noise associated with a text prompt\nfor diffusion models. Second, we design a noise prompt data collection pipeline\nand collect a large-scale \\textit{noise prompt dataset}~(NPD) that contains\n100k pairs of random noises and golden noises with the associated text prompts.\nWith the prepared NPD as the training dataset, we trained a small \\textit{noise\nprompt network}~(NPNet) that can directly learn to transform a random noise\ninto a golden noise. The learned golden noise perturbation can be considered as\na kind of prompt for noise, as it is rich in semantic information and tailored\nto the given text prompt. Third, our extensive experiments demonstrate the\nimpressive effectiveness and generalization of NPNet on improving the quality\nof synthesized images across various diffusion models, including SDXL,\nDreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and\nefficient controller that acts as a plug-and-play module with very limited\nadditional inference and computational costs, as it just provides a golden\nnoise instead of a random noise without accessing the original pipeline.\n","authors":["Zikai Zhou","Shitong Shao","Lichen Bai","Zhiqiang Xu","Bo Han","Zeke Xie"],"pdf_url":"https://arxiv.org/pdf/2411.09502v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18917v2","updated":"2024-12-12T10:39:06Z","published":"2024-05-29T09:19:50Z","title":"Causal Action Influence Aware Counterfactual Data Augmentation","summary":" Offline data are both valuable and practical resources for teaching robots\ncomplex behaviors. Ideally, learning agents should not be constrained by the\nscarcity of available demonstrations, but rather generalize beyond the training\ndistribution. 
However, the complexity of real-world scenarios typically\nrequires huge amounts of data to prevent neural network policies from picking\nup on spurious correlations and learning non-causal relationships. We propose\nCAIAC, a data augmentation method that can create feasible synthetic\ntransitions from a fixed dataset without having access to online environment\ninteractions. By utilizing principled methods for quantifying causal influence,\nwe are able to perform counterfactual reasoning by swapping\n$\\it{action}$-unaffected parts of the state-space between independent\ntrajectories in the dataset. We empirically show that this leads to a\nsubstantial increase in robustness of offline learning algorithms against\ndistributional shift.\n","authors":["Núria Armengol Urpí","Marco Bagatella","Marin Vlastelica","Georg Martius"],"pdf_url":"https://arxiv.org/pdf/2405.18917v2.pdf","comment":"Accepted in 41st International Conference on Machine Learning (ICML\n 2024)"},{"id":"http://arxiv.org/abs/2412.09150v1","updated":"2024-12-12T10:36:26Z","published":"2024-12-12T10:36:26Z","title":"Evaluating Adversarial Attacks on Traffic Sign Classifiers beyond\n Standard Baselines","summary":" Adversarial attacks on traffic sign classification models were among the\nfirst successfully tried in the real world. Since then, the research in this\narea has been mainly restricted to repeating baseline models, such as LISA-CNN\nor GTSRB-CNN, and similar experiment settings, including white and black\npatches on traffic signs. In this work, we decouple model architectures from\nthe datasets and evaluate on further generic models to make a fair comparison.\nFurthermore, we compare two attack settings, inconspicuous and visible, which\nare usually regarded without direct comparison. Our results show that standard\nbaselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than\nthe generic ones. We, therefore, suggest evaluating new attacks on a broader\nspectrum of baselines in the future. Our code is available at\n\\url{https://github.com/KASTEL-MobilityLab/attacks-on-traffic-sign-recognition/}.\n","authors":["Svetlana Pavlitska","Leopold Müller","J. Marius Zöllner"],"pdf_url":"https://arxiv.org/pdf/2412.09150v1.pdf","comment":"Accepted for publication at ICMLA 2024"},{"id":"http://arxiv.org/abs/2412.09149v1","updated":"2024-12-12T10:34:26Z","published":"2024-12-12T10:34:26Z","title":"Student-Informed Teacher Training","summary":" Imitation learning with a privileged teacher has proven effective for\nlearning complex control behaviors from high-dimensional inputs, such as\nimages. In this framework, a teacher is trained with privileged task\ninformation, while a student tries to predict the actions of the teacher with\nmore limited observations, e.g., in a robot navigation task, the teacher might\nhave access to distances to nearby obstacles, while the student only receives\nvisual observations of the scene. However, privileged imitation learning faces\na key challenge: the student might be unable to imitate the teacher's behavior\ndue to partial observability. This problem arises because the teacher is\ntrained without considering if the student is capable of imitating the learned\nbehavior. To address this teacher-student asymmetry, we propose a framework for\njoint training of the teacher and student policies, encouraging the teacher to\nlearn behaviors that can be imitated by the student despite the latters'\nlimited access to information and its partial observability. 
Based on the\nperformance bound in imitation learning, we add (i) the approximated action\ndifference between teacher and student as a penalty term to the reward function\nof the teacher, and (ii) a supervised teacher-student alignment step. We\nmotivate our method with a maze navigation task and demonstrate its\neffectiveness on complex vision-based quadrotor flight and manipulation tasks.\n","authors":["Nico Messikommer","Jiaxu Xing","Elie Aljalbout","Davide Scaramuzza"],"pdf_url":"https://arxiv.org/pdf/2412.09149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09142v1","updated":"2024-12-12T10:27:55Z","published":"2024-12-12T10:27:55Z","title":"A Brief Discussion on KPI Development in Public Administration","summary":" Efficient and effective service delivery in Public Administration (PA) relies\non the development and utilization of key performance indicators (KPIs) for\nevaluating and measuring performance. This paper presents an innovative\nframework for KPI construction within performance evaluation systems,\nleveraging Random Forest algorithms and variable importance analysis. The\nproposed approach identifies key variables that significantly influence PA\nperformance, offering valuable insights into the critical factors driving\norganizational success. By integrating variable importance analysis with expert\nconsultation, relevant KPIs can be systematically developed, ensuring that\nimprovement strategies address performance-critical areas. The framework\nincorporates continuous monitoring mechanisms and adaptive phases to refine\nKPIs in response to evolving administrative needs. This study aims to enhance\nPA performance through the application of machine learning techniques,\nfostering a more agile and results-driven approach to public administration.\n","authors":["Simona Fioretto","Elio Masciari","Enea Vincenzo Napolitano"],"pdf_url":"https://arxiv.org/pdf/2412.09142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.03551v2","updated":"2024-12-12T10:15:41Z","published":"2024-03-06T08:51:09Z","title":"Enhanced Low-Dose CT Image Reconstruction by Domain and Task Shifting\n Gaussian Denoisers","summary":" Computed tomography from a low radiation dose (LDCT) is challenging due to\nhigh noise in the projection data. Popular approaches for LDCT image\nreconstruction are two-stage methods, typically consisting of the filtered\nbackprojection (FBP) algorithm followed by a neural network for LDCT image\nenhancement. Two-stage methods are attractive for their simplicity and\npotential for computational efficiency, typically requiring only a single FBP\nand a neural network forward pass for inference. However, the best\nreconstruction quality is currently achieved by unrolled iterative methods\n(Learned Primal-Dual and ItNet), which are more complex and thus have a higher\ncomputational cost for training and inference. We propose a method combining\nthe simplicity and efficiency of two-stage methods with state-of-the-art\nreconstruction quality. Our strategy utilizes a neural network pretrained for\nGaussian noise removal from natural grayscale images, fine-tuned for LDCT image\nenhancement. 
We call this method FBP-DTSGD (Domain and Task Shifted Gaussian\nDenoisers) as the fine-tuning is a task shift from Gaussian denoising to\nenhancing LDCT images and a domain shift from natural grayscale to LDCT images.\nAn ablation study with three different pretrained Gaussian denoisers indicates\nthat the performance of FBP-DTSGD does not depend on a specific denoising\narchitecture, suggesting future advancements in Gaussian denoising could\nbenefit the method. The study also shows that pretraining on natural images\nenhances LDCT reconstruction quality, especially with limited training data.\nNotably, pretraining involves no additional cost, as existing pretrained models\nare used. The proposed method currently holds the top mean position in the\nLoDoPaB-CT challenge.\n","authors":["Tim Selig","Thomas März","Martin Storath","Andreas Weinmann"],"pdf_url":"https://arxiv.org/pdf/2403.03551v2.pdf","comment":"13 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.09126v1","updated":"2024-12-12T10:03:46Z","published":"2024-12-12T10:03:46Z","title":"Enhancing Modality Representation and Alignment for Multimodal\n Cold-start Active Learning","summary":" Training multimodal models requires a large amount of labeled data. Active\nlearning (AL) aims to reduce labeling costs. Most AL methods employ warm-start\napproaches, which rely on sufficient labeled data to train a well-calibrated\nmodel that can assess the uncertainty and diversity of unlabeled data. However,\nwhen assembling a dataset, labeled data are often scarce initially, leading to\na cold-start problem. Additionally, most AL methods seldom address multimodal\ndata, highlighting a research gap in this field. Our research addresses these\nissues by developing a two-stage method for Multi-Modal Cold-Start Active\nLearning (MMCSAL).\n Firstly, we observe the modality gap, a significant distance between the\ncentroids of representations from different modalities, when only using\ncross-modal pairing information as self-supervision signals. This modality gap\naffects the data selection process, as we calculate both uni-modal and cross-modal\ndistances. To address this, we introduce uni-modal prototypes to bridge the\nmodality gap. Secondly, conventional AL methods often falter in multimodal\nscenarios where alignment between modalities is overlooked. Therefore, we\npropose enhancing cross-modal alignment through regularization, thereby\nimproving the quality of selected multimodal data pairs in AL. Finally, our\nexperiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs\nacross three multimodal datasets.\n","authors":["Meng Shen","Yake Wei","Jianxiong Yin","Deepu Rajan","Di Hu","Simon See"],"pdf_url":"https://arxiv.org/pdf/2412.09126v1.pdf","comment":"11 pages, ACMMM Asia 2024, Oral Presentation"},{"id":"http://arxiv.org/abs/2412.09121v1","updated":"2024-12-12T09:57:10Z","published":"2024-12-12T09:57:10Z","title":"MMD-OPT : Maximum Mean Discrepancy Based Sample Efficient Collision Risk\n Minimization for Autonomous Driving","summary":" We propose MMD-OPT: a sample-efficient approach for minimizing the risk of\ncollision under arbitrary prediction distribution of the dynamic obstacles.\nMMD-OPT is based on embedding distribution in Reproducing Kernel Hilbert Space\n(RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how these two\nconcepts can be used to define a sample efficient surrogate for collision risk\nestimate. 
We perform extensive simulations to validate the effectiveness of\nMMD-OPT on both synthetic and real-world datasets. Importantly, we show that\ntrajectory optimization with our MMD-based collision risk surrogate leads to\nsafer trajectories at low sample regimes than popular alternatives based on\nConditional Value at Risk (CVaR).\n","authors":["Basant Sharma","Arun Kumar Singh"],"pdf_url":"https://arxiv.org/pdf/2412.09121v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09119v1","updated":"2024-12-12T09:54:38Z","published":"2024-12-12T09:54:38Z","title":"The Utility and Complexity of In- and Out-of-Distribution Machine\n Unlearning","summary":" Machine unlearning, the process of selectively removing data from trained\nmodels, is increasingly crucial for addressing privacy concerns and knowledge\ngaps post-deployment. Despite this importance, existing approaches are often\nheuristic and lack formal guarantees. In this paper, we analyze the fundamental\nutility, time, and space complexity trade-offs of approximate unlearning,\nproviding rigorous certification analogous to differential privacy. For\nin-distribution forget data -- data similar to the retain set -- we show that a\nsurprisingly simple and general procedure, empirical risk minimization with\noutput perturbation, achieves tight unlearning-utility-complexity trade-offs,\naddressing a previous theoretical gap on the separation from unlearning \"for\nfree\" via differential privacy, which inherently facilitates the removal of\nsuch data. However, such techniques fail with out-of-distribution forget data\n-- data significantly different from the retain set -- where unlearning time\ncomplexity can exceed that of retraining, even for a single sample. To address\nthis, we propose a new robust and noisy gradient descent variant that provably\namortizes unlearning time complexity without compromising utility.\n","authors":["Youssef Allouah","Joshua Kazdan","Rachid Guerraoui","Sanmi Koyejo"],"pdf_url":"https://arxiv.org/pdf/2412.09119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.02229v5","updated":"2024-12-12T09:53:38Z","published":"2024-02-03T18:19:46Z","title":"Vanilla Bayesian Optimization Performs Great in High Dimensions","summary":" High-dimensional problems have long been considered the Achilles' heel of\nBayesian optimization algorithms. Spurred by the curse of dimensionality, a\nlarge collection of algorithms aim to make it more performant in this setting,\ncommonly by imposing various simplifying assumptions on the objective. In this\npaper, we identify the degeneracies that make vanilla Bayesian optimization\npoorly suited to high-dimensional tasks, and further show how existing\nalgorithms address these degeneracies through the lens of lowering the model\ncomplexity. Moreover, we propose an enhancement to the prior assumptions that\nare typical to vanilla Bayesian optimization algorithms, which reduces the\ncomplexity to manageable levels without imposing structural restrictions on the\nobjective. 
Our modification - a simple scaling of the Gaussian process\nlengthscale prior with the dimensionality - reveals that standard Bayesian\noptimization works drastically better than previously thought in high\ndimensions, clearly outperforming existing state-of-the-art algorithms on\nmultiple commonly considered real-world high-dimensional tasks.\n","authors":["Carl Hvarfner","Erik Orm Hellsten","Luigi Nardi"],"pdf_url":"https://arxiv.org/pdf/2402.02229v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09118v1","updated":"2024-12-12T09:52:37Z","published":"2024-12-12T09:52:37Z","title":"An Algorithm-Centered Approach To Model Streaming Data","summary":" Besides the classical offline setup of machine learning, stream learning\nconstitutes a well-established setup where data arrives over time in\npotentially non-stationary environments. Concept drift, the phenomenon that the\nunderlying distribution changes over time poses a significant challenge. Yet,\ndespite high practical relevance, there is little to no foundational theory for\nlearning in the drifting setup comparable to classical statistical learning\ntheory in the offline setting. This can be attributed to the lack of an\nunderlying object comparable to a probability distribution as in the classical\nsetup. While there exist approaches to transfer ideas to the streaming setup,\nthese start from a data perspective rather than an algorithmic one. In this\nwork, we suggest a new model of data over time that is aimed at the algorithm's\nperspective. Instead of defining the setup using time points, we utilize a\nwindow-based approach that resembles the inner workings of most stream learning\nalgorithms. We compare our framework to others from the literature on a\ntheoretical basis, showing that in many cases both model the same situation.\nFurthermore, we perform a numerical evaluation and showcase an application in\nthe domain of critical infrastructure.\n","authors":["Fabian Hinder","Valerie Vaquet","David Komnick","Barbara Hammer"],"pdf_url":"https://arxiv.org/pdf/2412.09118v1.pdf","comment":"This manuscript is currently under review at the Symposium on\n Intelligent Data Analysis (IDA 2025)"},{"id":"http://arxiv.org/abs/2412.09116v1","updated":"2024-12-12T09:51:18Z","published":"2024-12-12T09:51:18Z","title":"How to Re-enable PDE Loss for Physical Systems Modeling Under Partial\n Observation","summary":" In science and engineering, machine learning techniques are increasingly\nsuccessful in physical systems modeling (predicting future states of physical\nsystems). Effectively integrating PDE loss as a constraint of system transition\ncan improve the model's prediction by overcoming generalization issues due to\ndata scarcity, especially when data acquisition is costly. However, in many\nreal-world scenarios, due to sensor limitations, the data we can obtain is\noften only partial observation, making the calculation of PDE loss seem to be\ninfeasible, as the PDE loss heavily relies on high-resolution states. We\ncarefully study this problem and propose a novel framework named Re-enable PDE\nLoss under Partial Observation (RPLPO). The key idea is that although enabling\nPDE loss to constrain system transition solely is infeasible, we can re-enable\nPDE loss by reconstructing the learnable high-resolution state and constraining\nsystem transition simultaneously. Specifically, RPLPO combines an encoding\nmodule for reconstructing learnable high-resolution states with a transition\nmodule for predicting future states. 
The two modules are jointly trained by\ndata and PDE loss. We conduct experiments in various physical systems to\ndemonstrate that RPLPO has significant improvement in generalization, even when\nobservation is sparse, irregular, noisy, and PDE is inaccurate. The code is\navailable on GitHub: RPLPO.\n","authors":["Haodong Feng","Yue Wang","Dixia Fan"],"pdf_url":"https://arxiv.org/pdf/2412.09116v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.09115v1","updated":"2024-12-12T09:49:16Z","published":"2024-12-12T09:49:16Z","title":"Vision CNNs trained to estimate spatial latents learned similar\n ventral-stream-aligned representations","summary":" Studies of the functional role of the primate ventral visual stream have\ntraditionally focused on object categorization, often ignoring -- despite much\nprior evidence -- its role in estimating \"spatial\" latents such as object\nposition and pose. Most leading ventral stream models are derived by optimizing\nnetworks for object categorization, which seems to imply that the ventral\nstream is also derived under such an objective. Here, we explore an alternative\nhypothesis: Might the ventral stream be optimized for estimating spatial\nlatents? And a closely related question: How different -- if at all -- are\nrepresentations learned from spatial latent estimation compared to\ncategorization? To ask these questions, we leveraged synthetic image datasets\ngenerated by a 3D graphic engine and trained convolutional neural networks\n(CNNs) to estimate different combinations of spatial and category latents. We\nfound that models trained to estimate just a few spatial latents achieve neural\nalignment scores comparable to those trained on hundreds of categories, and the\nspatial latent performance of models strongly correlates with their neural\nalignment. Spatial latent and category-trained models have very similar -- but\nnot identical -- internal representations, especially in their early and middle\nlayers. We provide evidence that this convergence is partly driven by\nnon-target latent variability in the training data, which facilitates the\nimplicit learning of representations of those non-target latents. Taken\ntogether, these results suggest that many training objectives, such as spatial\nlatents, can lead to similar models aligned neurally with the ventral stream.\nThus, one should not assume that the ventral stream is optimized for object\ncategorization only. As a field, we need to continue to sharpen our measures of\ncomparing models to brains to better understand the functional roles of the\nventral stream.\n","authors":["Yudi Xie","Weichen Huang","Esther Alter","Jeremy Schwartz","Joshua B. Tenenbaum","James J. DiCarlo"],"pdf_url":"https://arxiv.org/pdf/2412.09115v1.pdf","comment":"29 pages, 20 figures, ICLR 2025"},{"id":"http://arxiv.org/abs/2412.09104v1","updated":"2024-12-12T09:35:47Z","published":"2024-12-12T09:35:47Z","title":"In-Dataset Trajectory Return Regularization for Offline Preference-based\n Reinforcement Learning","summary":" Offline preference-based reinforcement learning (PbRL) typically operates in\ntwo phases: first, use human preferences to learn a reward model and annotate\nrewards for a reward-free offline dataset; second, learn a policy by optimizing\nthe learned reward via offline RL. 
However, accurately modeling step-wise\nrewards from trajectory-level preference feedback presents inherent challenges.\nThe reward bias introduced, particularly the overestimation of predicted\nrewards, leads to optimistic trajectory stitching, which undermines the\npessimism mechanism critical to the offline RL phase. To address this\nchallenge, we propose In-Dataset Trajectory Return Regularization (DTR) for\noffline PbRL, which leverages conditional sequence modeling to mitigate the\nrisk of learning inaccurate trajectory stitching under reward bias.\nSpecifically, DTR employs Decision Transformer and TD-Learning to strike a\nbalance between maintaining fidelity to the behavior policy with high\nin-dataset trajectory returns and selecting optimal actions based on high\nreward labels. Additionally, we introduce an ensemble normalization technique\nthat effectively integrates multiple reward models, balancing the tradeoff\nbetween reward differentiation and accuracy. Empirical evaluations on various\nbenchmarks demonstrate the superiority of DTR over other state-of-the-art\nbaselines.\n","authors":["Songjun Tu","Jingbo Sun","Qichao Zhang","Yaocheng Zhang","Jia Liu","Ke Chen","Dongbin Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.09104v1.pdf","comment":"7 pages, Proceedings of the 39th AAAI Conference on Artificial\n Intelligence (AAAI-25)"},{"id":"http://arxiv.org/abs/2412.09094v1","updated":"2024-12-12T09:22:04Z","published":"2024-12-12T09:22:04Z","title":"Filter-then-Generate: Large Language Models with Structure-Text Adapter\n for Knowledge Graph Completion","summary":" Large Language Models (LLMs) present massive inherent knowledge and superior\nsemantic comprehension capability, which have revolutionized various tasks in\nnatural language processing. Despite their success, a critical gap remains in\nenabling LLMs to perform knowledge graph completion (KGC). Empirical evidence\nsuggests that LLMs consistently perform worse than conventional KGC approaches,\neven with sophisticated prompt design or tailored instruction-tuning.\nFundamentally, applying LLMs on KGC introduces several critical challenges,\nincluding a vast set of entity candidates, the hallucination issue of LLMs, and\nunder-exploitation of the graph structure. To address these challenges, we\npropose a novel instruction-tuning-based method, namely FtG. Specifically, we\npresent a \\textit{filter-then-generate} paradigm and formulate the KGC task\ninto a multiple-choice question format. In this way, we can harness the\ncapability of LLMs while mitigating the issue caused by hallucinations.\nMoreover, we devise a flexible ego-graph serialization prompt and employ a\nstructure-text adapter to couple structure and text information in a\ncontextualized manner. Experimental results demonstrate that FtG achieves\nsubstantial performance gains compared to existing state-of-the-art methods. 
The\ninstruction dataset and code are available at\n\\url{https://github.com/LB0828/FtG}.\n","authors":["Ben Liu","Jihai Zhang","Fangquan Lin","Cheng Yang","Min Peng"],"pdf_url":"https://arxiv.org/pdf/2412.09094v1.pdf","comment":"COLING 2025 Main Conference"},{"id":"http://arxiv.org/abs/2412.09090v1","updated":"2024-12-12T09:17:35Z","published":"2024-12-12T09:17:35Z","title":"Integrated trucks assignment and scheduling problem with mixed service\n mode docks: A Q-learning based adaptive large neighborhood search algorithm","summary":" Mixed service mode docks enhance efficiency by flexibly handling both loading\nand unloading trucks in warehouses. However, existing research often\npredetermines the number and location of these docks prior to planning truck\nassignment and sequencing. This paper proposes a new model integrating dock\nmode decision, truck assignment, and scheduling, thus enabling adaptive dock\nmode arrangements. Specifically, we introduce a Q-learning-based adaptive large\nneighborhood search (Q-ALNS) algorithm to address the integrated problem. The\nalgorithm adjusts dock modes via perturbation operators, while truck assignment\nand scheduling are solved using destroy and repair local search operators.\nQ-learning adaptively selects these operators based on their performance\nhistory and future gains, employing the epsilon-greedy strategy. Extensive\nexperimental results and statistical analysis indicate that the Q-ALNS benefits\nfrom efficient operator combinations and its adaptive mechanism, consistently\noutperforming benchmark algorithms in terms of optimality gap and Pareto front\ndiscovery. In comparison to the predetermined service mode, our adaptive\nstrategy results in lower average tardiness and makespan, highlighting its\nsuperior adaptability to varying demands.\n","authors":["Yueyi Li","Mehrdad Mohammadi","Xiaodong Zhang","Yunxing Lan","Willem van Jaarsveld"],"pdf_url":"https://arxiv.org/pdf/2412.09090v1.pdf","comment":"29 pages, 12 figures, 15 tables"},{"id":"http://arxiv.org/abs/2303.15361v2","updated":"2024-12-12T09:06:56Z","published":"2023-03-27T16:32:21Z","title":"A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts","summary":" Machine learning methods strive to acquire a robust model during the training\nprocess that can effectively generalize to test samples, even in the presence\nof distribution shifts. However, these methods often suffer from performance\ndegradation due to unknown test distributions. Test-time adaptation (TTA), an\nemerging paradigm, has the potential to adapt a pre-trained model to unlabeled\ndata during testing, before making predictions. Recent progress in this\nparadigm has highlighted the significant benefits of using unlabeled data to\ntrain self-adapted models prior to inference. In this survey, we categorize TTA\ninto several distinct groups based on the form of test data, namely, test-time\ndomain adaptation, test-time batch adaptation, and online test-time adaptation.\nFor each category, we provide a comprehensive taxonomy of advanced algorithms\nand discuss various learning scenarios. Furthermore, we analyze relevant\napplications of TTA and discuss open challenges and promising areas for future\nresearch. 
For a comprehensive list of TTA methods, kindly refer to\n\\url{https://github.com/tim-learn/awesome-test-time-adaptation}.\n","authors":["Jian Liang","Ran He","Tieniu Tan"],"pdf_url":"https://arxiv.org/pdf/2303.15361v2.pdf","comment":"Discussions, comments, and questions are all welcomed in\n \\url{https://github.com/tim-learn/awesome-test-time-adaptation}"},{"id":"http://arxiv.org/abs/2306.10882v3","updated":"2024-12-12T09:03:39Z","published":"2023-06-19T12:22:56Z","title":"AdaStop: adaptive statistical testing for sound comparisons of Deep RL\n agents","summary":" Recently, the scientific community has questioned the statistical\nreproducibility of many empirical results, especially in the field of machine\nlearning. To contribute to the resolution of this reproducibility crisis, we\npropose a theoretically sound methodology for comparing the performance of a\nset of algorithms. We exemplify our methodology in Deep Reinforcement Learning\n(Deep RL). The performance of one execution of a Deep RL algorithm is a random\nvariable. Therefore, several independent executions are needed to evaluate its\nperformance. When comparing algorithms with random performance, a major\nquestion concerns the number of executions to perform to ensure that the result\nof the comparison is theoretically sound. Researchers in Deep RL often use less\nthan 5 independent executions to compare algorithms: we claim that this is not\nenough in general. Moreover, when comparing more than 2 algorithms at once, we\nhave to use a multiple tests procedure to preserve low error guarantees. We\nintroduce AdaStop, a new statistical test based on multiple group sequential\ntests. When used to compare algorithms, AdaStop adapts the number of executions\nto stop as early as possible while ensuring that enough information has been\ncollected to distinguish algorithms that have different score distributions. We\nprove theoretically that AdaStop has a low probability of making a\n(family-wise) error. We illustrate the effectiveness of AdaStop in various\nuse-cases, including toy examples and Deep RL algorithms on challenging Mujoco\nenvironments. AdaStop is the first statistical test fitted to this sort of\ncomparisons: it is both a significant contribution to statistics, and an\nimportant contribution to computational studies performed in reinforcement\nlearning and in other domains.\n","authors":["Timothée Mathieu","Riccardo Della Vecchia","Alena Shilova","Matheus Medeiros Centa","Hector Kohler","Odalric-Ambrym Maillard","Philippe Preux"],"pdf_url":"https://arxiv.org/pdf/2306.10882v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09079v1","updated":"2024-12-12T09:03:31Z","published":"2024-12-12T09:03:31Z","title":"Neural Networks for Threshold Dynamics Reconstruction","summary":" We introduce two convolutional neural network (CNN) architectures, inspired\nby the Merriman-Bence-Osher (MBO) algorithm and by cellular automatons, to\nmodel and learn threshold dynamics for front evolution from video data. The\nfirst model, termed the (single-dynamics) MBO network, learns a specific kernel\nand threshold for each input video without adapting to new dynamics, while the\nsecond, a meta-learning MBO network, generalizes across diverse threshold\ndynamics by adapting its parameters per input. Both models are evaluated on\nsynthetic and real-world videos (ice melting and fire front propagation), with\nperformance metrics indicating effective reconstruction and extrapolation of\nevolving boundaries, even under noisy conditions. 
Empirical results highlight\nthe robustness of both networks across varied synthetic and real-world\ndynamics.\n","authors":["Elisa Negrini","Almanzo Jiahe Gao","Abigail Bowering","Wei Zhu","Luca Capogna"],"pdf_url":"https://arxiv.org/pdf/2412.09079v1.pdf","comment":"Key words: threshold dynamics, cellular automaton, inverse problems,\n convolutional neural networks, deep learning"},{"id":"http://arxiv.org/abs/2412.09073v1","updated":"2024-12-12T08:58:42Z","published":"2024-12-12T08:58:42Z","title":"SVasP: Self-Versatility Adversarial Style Perturbation for Cross-Domain\n Few-Shot Learning","summary":" Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen\nsource domains to unseen target domains, which is crucial for evaluating the\ngeneralization and robustness of models. Recent studies focus on utilizing\nvisual styles to bridge the domain gap between different domains. However, the\nserious dilemma of gradient instability and local optimization problem occurs\nin those style-based CD-FSL methods. This paper addresses these issues and\nproposes a novel crop-global style perturbation method, called\n\\underline{\\textbf{S}}elf-\\underline{\\textbf{V}}ersatility\n\\underline{\\textbf{A}}dversarial \\underline{\\textbf{S}}tyle\n\\underline{\\textbf{P}}erturbation (\\textbf{SVasP}), which enhances the gradient\nstability and escapes from poor sharp minima jointly. Specifically, SVasP\nsimulates more diverse potential target domain adversarial styles via\ndiversifying input patterns and aggregating localized crop style gradients, to\nserve as global style perturbation stabilizers within one image, a concept we\nrefer to as self-versatility. Then a novel objective function is proposed to\nmaximize visual discrepancy while maintaining semantic consistency between\nglobal, crop, and adversarial features. Having the stabilized global style\nperturbation in the training phase, one can obtain a flattened minima in the\nloss landscape, boosting the transferability of the model to the target\ndomains. Extensive experiments on multiple benchmark datasets demonstrate that\nour method significantly outperforms existing state-of-the-art methods. Our\ncodes are available at https://github.com/liwenqianSEU/SVasP.\n","authors":["Wenqian Li","Pengfei Fang","Hui Xue"],"pdf_url":"https://arxiv.org/pdf/2412.09073v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.12509v4","updated":"2024-12-12T08:54:14Z","published":"2022-11-22T08:01:33Z","title":"SimVPv2: Towards Simple yet Powerful Spatiotemporal Predictive Learning","summary":" Recent years have witnessed remarkable advances in spatiotemporal predictive\nlearning, with methods incorporating auxiliary inputs, complex neural\narchitectures, and sophisticated training strategies. While SimVP has\nintroduced a simpler, CNN-based baseline for this task, it still relies on\nheavy Unet-like architectures for spatial and temporal modeling, which still\nsuffers from high complexity and computational overhead. In this paper, we\npropose SimVPv2, a streamlined model that eliminates the need for Unet\narchitectures and demonstrates that plain stacks of convolutional layers,\nenhanced with an efficient Gated Spatiotemporal Attention mechanism, can\ndeliver state-of-the-art performance. 
SimVPv2 not only simplifies the model\narchitecture but also improves both performance and computational efficiency.\nOn the standard Moving MNIST benchmark, SimVPv2 achieves superior performance\ncompared to SimVP, with fewer FLOPs, about half the training time, and 60%\nfaster inference efficiency. Extensive experiments across eight diverse\ndatasets, including real-world tasks such as traffic forecasting and climate\nprediction, further demonstrate that SimVPv2 offers a powerful yet\nstraightforward solution, achieving robust generalization across various\nspatiotemporal learning scenarios. We believe the proposed SimVPv2 can serve as\na solid baseline to benefit the spatiotemporal predictive learning community.\n","authors":["Cheng Tan","Zhangyang Gao","Siyuan Li","Stan Z. Li"],"pdf_url":"https://arxiv.org/pdf/2211.12509v4.pdf","comment":"Accepted by TMM"},{"id":"http://arxiv.org/abs/2412.09065v1","updated":"2024-12-12T08:52:27Z","published":"2024-12-12T08:52:27Z","title":"Multi-view Clustering via Unified Multi-kernel Learning and Matrix\n Factorization","summary":" Multi-view clustering has become increasingly important due to the\nmulti-source character of real-world data. Among existing multi-view clustering\nmethods, multi-kernel clustering and matrix factorization-based multi-view\nclustering have gained widespread attention as mainstream approaches. However,\nmulti-kernel clustering tends to learn an optimal kernel and then perform\neigenvalue decomposition on it, which leads to high computational complexity.\nMatrix factorization-based multi-view clustering methods impose orthogonal\nconstraints on individual views. This overly emphasizes the accuracy of\nclustering structures within single views and restricts the learning of\nindividual views. Based on this analysis, we propose a multi-view clustering\nmethod that integrates multi-kernel learning with matrix factorization. This\napproach combines the advantages of both multi-kernel learning and matrix\nfactorization. It removes the orthogonal constraints on individual views and\nimposes orthogonal constraints on the consensus matrix, resulting in an\naccurate final clustering structure. Ultimately, the method is unified into a\nsimple form of multi-kernel clustering, but avoids learning an optimal kernel,\nthus reducing the time complexity. Furthermore, we propose an efficient\nthree-step optimization algorithm to achieve a locally optimal solution.\nExperiments on widely-used real-world datasets demonstrate the effectiveness of\nour proposed method.\n","authors":["Chenxing Jia","Mingjie Cai","Hamido Fujita"],"pdf_url":"https://arxiv.org/pdf/2412.09065v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.00919v2","updated":"2024-12-12T08:42:35Z","published":"2023-10-02T06:15:50Z","title":"A simple thinking about the application of the attention mechanism in\n medical ultrasound image segmentation task","summary":" The AI-based assisted diagnosis programs have been widely investigated on\nmedical ultrasound images. Complex scenario of ultrasound image, in which the\ncoupled interference of internal and external factors is severe, brings a\nunique challenge for localize the object region automatically and precisely in\nultrasound images. 
In this study, we seek to propose a more general and robust\nBenchmark Attention Adaptive Framework (BAAF) to assist doctors segment or\ndiagnose lesions and tissues in ultrasound images more quickly and accurately.\nDifferent from existing attention schemes, the BAAF consists of a parallel\nhybrid attention module (PHAM) and an adaptive calibration mechanism (ACM).\nSpecifically, BAAF first coarsely calibrates the input features from the\nchannel and spatial dimensions, and then adaptively selects more robust lesion\nor tissue characterizations from the coarse-calibrated feature maps. The design\nof BAAF further optimizes the \"what\" and \"where\" focus and selection problems\nin CNNs and seeks to improve the segmentation accuracy of lesions or tissues in\nmedical ultrasound images. The method is evaluated on four medical ultrasound\nsegmentation tasks, and the adequate experimental results demonstrate the\nremarkable performance improvement over existing state-of-the-art methods. In\naddition, the comparison with existing attention mechanisms also demonstrates\nthe superiority of BAAF. This work provides the possibility for automated\nmedical ultrasound assisted diagnosis and reduces reliance on human accuracy\nand precision.\n","authors":["Gongping Chen","Rui Wang","Xiaotao Yin","Liang Cui","Yu Dai"],"pdf_url":"https://arxiv.org/pdf/2310.00919v2.pdf","comment":"10 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.09059v1","updated":"2024-12-12T08:40:22Z","published":"2024-12-12T08:40:22Z","title":"Go With the Flow: Fast Diffusion for Gaussian Mixture Models","summary":" Schr\\\"{o}dinger Bridges (SB) are diffusion processes that steer, in finite\ntime, a given initial distribution to another final one while minimizing a\nsuitable cost functional. Although various methods for computing SBs have\nrecently been proposed in the literature, most of these approaches require\ncomputationally expensive training schemes, even for solving low-dimensional\nproblems. In this work, we propose an analytic parametrization of a set of\nfeasible policies for steering the distribution of a dynamical system from one\nGaussian Mixture Model (GMM) to another. Instead of relying on standard\nnon-convex optimization techniques, the optimal policy within the set can be\napproximated as the solution of a low-dimensional linear program whose\ndimension scales linearly with the number of components in each mixture.\nFurthermore, our method generalizes naturally to more general classes of\ndynamical systems such as controllable Linear Time-Varying systems that cannot\ncurrently be solved using traditional neural SB approaches. We showcase the\npotential of this approach in low-to-moderate dimensional problems such as\nimage-to-image translation in the latent space of an autoencoder, and various\nother examples. 
We also benchmark our approach on an Entropic Optimal Transport\n(EOT) problem and show that it outperforms state-of-the-art methods in cases\nwhere the boundary distributions are mixture models while requiring virtually\nno training.\n","authors":["George Rapakoulias","Ali Reza Pedram","Panagiotis Tsiotras"],"pdf_url":"https://arxiv.org/pdf/2412.09059v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09053v1","updated":"2024-12-12T08:23:58Z","published":"2024-12-12T08:23:58Z","title":"Safe Active Learning for Gaussian Differential Equations","summary":" Gaussian Process differential equations (GPODE) have recently gained momentum\ndue to their ability to capture dynamics behavior of systems and also represent\nuncertainty in predictions. Prior work has described the process of training\nthe hyperparameters and, thereby, calibrating GPODE to data. How to design\nefficient algorithms to collect data for training GPODE models is still an open\nfield of research. Nevertheless high-quality training data is key for model\nperformance. Furthermore, data collection leads to time-cost and financial-cost\nand might in some areas even be safety critical to the system under test.\nTherefore, algorithms for safe and efficient data collection are central for\nbuilding high quality GPODE models. Our novel Safe Active Learning (SAL) for\nGPODE algorithm addresses this challenge by suggesting a mechanism to propose\nefficient and non-safety-critical data to collect. SAL GPODE does so by\nsequentially suggesting new data, measuring it and updating the GPODE model\nwith the new data. In this way, subsequent data points are iteratively\nsuggested. The core of our SAL GPODE algorithm is a constrained optimization\nproblem maximizing information of new data for GPODE model training constrained\nby the safety of the underlying system. We demonstrate our novel SAL GPODE's\nsuperiority compared to a standard, non-active way of measuring new data on two\nrelevant examples.\n","authors":["Leon Glass","Katharina Ensinger","Christoph Zimmer"],"pdf_url":"https://arxiv.org/pdf/2412.09053v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.03572v3","updated":"2024-12-12T08:22:48Z","published":"2023-08-07T13:24:50Z","title":"Transfer Learning with Partially Observable Offline Data via Causal\n Bounds","summary":" Transfer learning has emerged as an effective approach to accelerate learning\nby integrating knowledge from related source agents. However, challenges arise\ndue to data heterogeneity-such as differences in feature sets or incomplete\ndatasets-which often results in the nonidentifiability of causal effects. In\nthis paper, we investigate transfer learning in partially observable contextual\nbandits, where agents operate with incomplete information and limited access to\nhidden confounders. To address the challenges posed by unobserved confounders,\nwe formulate optimization problems to derive tight bounds on the\nnonidentifiable causal effects. We then propose an efficient method that\ndiscretizes the functional constraints of unknown distributions into linear\nconstraints, allowing us to sample compatible causal models through a\nsequential process of solving linear programs. This method takes into account\nestimation errors and exhibits strong convergence properties, ensuring robust\nand reliable causal bounds. Leveraging these causal bounds, we improve\nclassical bandit algorithms, achieving tighter regret upper and lower bounds\nrelative to the sizes of action sets and function spaces. 
In tasks involving\nfunction approximation, which are crucial for handling complex context spaces,\nour method significantly improves the dependence on function space size\ncompared to previous work. We formally prove that our causally enhanced\nalgorithms outperform classical bandit algorithms, achieving notably faster\nconvergence rates. The applicability of our approach is further illustrated\nthrough an example of offline pricing policy learning with censored\ndemand. Simulations confirm the superiority of our approach over\nstate-of-the-art methods, demonstrating its potential to enhance contextual\nbandit agents in real-world applications, especially when data is scarce,\ncostly, or restricted due to privacy concerns.\n","authors":["Xueping Gong","Wei You","Jiheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2308.03572v3.pdf","comment":"57 pages"},{"id":"http://arxiv.org/abs/2412.09049v1","updated":"2024-12-12T08:19:01Z","published":"2024-12-12T08:19:01Z","title":"Dial-In LLM: Human-Aligned Dialogue Intent Clustering with\n LLM-in-the-loop","summary":" The discovery of customer intention from dialogue plays an important role in\nautomated support systems. However, traditional text clustering methods are\npoorly aligned with human perceptions due to the shift from embedding distance\nto semantic distance, and existing quantitative metrics for text clustering may\nnot accurately reflect the true quality of intent clusters. In this paper, we\nleverage the superior language understanding capabilities of Large Language\nModels (LLMs) for designing better-calibrated intent clustering algorithms. We\nfirst establish the foundation by verifying the robustness of fine-tuned LLM\nutility in semantic coherence evaluation and cluster naming, resulting in an\naccuracy of 97.50% and 94.40%, respectively, when compared to the human-labeled\nground truth. Then, we propose an iterative clustering algorithm that\nfacilitates cluster-level refinement and the continuous discovery of\nhigh-quality intent clusters. Furthermore, we present several LLM-in-the-loop\nsemi-supervised clustering techniques tailored for intent discovery from\ncustomer service dialogue. Experiments on a large-scale industrial dataset\ncomprising 1,507 intent clusters demonstrate the effectiveness of the proposed\ntechniques. The methods outperformed existing counterparts, achieving a 6.25%\nimprovement in quantitative metrics and a 12% enhancement in application-level\nperformance when constructing an intent classifier.\n","authors":["Mengze Hong","Yuanfeng Song","Di Jiang","Wailing Ng","Yanjie Sun","Chen Jason Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09049v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.10286v2","updated":"2024-12-12T08:05:45Z","published":"2024-08-19T08:23:38Z","title":"GARLIC: GPT-Augmented Reinforcement Learning with Intelligent Control\n for Vehicle Dispatching","summary":" As urban residents demand higher travel quality, vehicle dispatch has become\na critical component of online ride-hailing services. However, current vehicle\ndispatch systems struggle to navigate the complexities of urban traffic\ndynamics, including unpredictable traffic conditions, diverse driver behaviors,\nand fluctuating supply and demand patterns. These challenges have resulted in\ntravel difficulties for passengers in certain areas, while many drivers in\nother areas are unable to secure orders, leading to a decline in the overall\nquality of urban transportation services. 
To address these issues, this paper\nintroduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with\nIntelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs\nto capture hierarchical traffic states, and learns a dynamic reward function\nthat accounts for individual driving behaviors. The framework further\nintegrates a GPT model trained with a custom loss function to enable\nhigh-precision predictions and optimize dispatching policies in real-world\nscenarios. Experiments conducted on two real-world datasets demonstrate that\nGARLIC effectively aligns with driver behaviors while reducing the empty load\nrate of vehicles.\n","authors":["Xiao Han","Zijian Zhang","Xiangyu Zhao","Guojiang Shen","Xiangjie Kong","Xuetao Wei","Liqiang Nie","Jieping Ye","Yuanshao Zhu"],"pdf_url":"https://arxiv.org/pdf/2408.10286v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2408.14404v2","updated":"2024-12-12T08:01:44Z","published":"2024-08-26T16:47:20Z","title":"Application of Neural Ordinary Differential Equations for ITER Burning\n Plasma Dynamics","summary":" The dynamics of burning plasmas in tokamaks are crucial for advancing\ncontrolled thermonuclear fusion. This study applies the NeuralPlasmaODE, a\nmulti-region multi-timescale transport model, to simulate the complex energy\ntransfer processes in ITER deuterium-tritium (D-T) plasmas. Our model captures\nthe interactions between energetic alpha particles, electrons, and ions, which\nare vital for understanding phenomena such as thermal runaway instability. We\nemploy neural ordinary differential equations (Neural ODEs) for the numerical\nderivation of diffusivity parameters, enabling precise modeling of energy\ninteractions between different plasma regions. By leveraging transfer learning,\nwe utilize model parameters derived from DIII-D experimental data, enhancing\nthe efficiency and accuracy of our simulations without training from scratch.\nApplying this model to ITER's inductive and non-inductive operational\nscenarios, our results demonstrate that radiation and transport processes\neffectively remove excess heat from the core plasma, preventing thermal runaway\ninstability. This study underscores the potential of machine learning in\nadvancing our understanding and control of burning plasma dynamics in fusion\nreactors.\n","authors":["Zefang Liu","Weston M. Stacey"],"pdf_url":"https://arxiv.org/pdf/2408.14404v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09037v1","updated":"2024-12-12T07:53:17Z","published":"2024-12-12T07:53:17Z","title":"Beyond Confusion: A Fine-grained Dialectical Examination of Human\n Activity Recognition Benchmark Datasets","summary":" The research of machine learning (ML) algorithms for human activity\nrecognition (HAR) has made significant progress with publicly available\ndatasets. However, most research prioritizes statistical metrics over examining\nnegative sample details. While recent models like transformers have been\napplied to HAR datasets with limited success from the benchmark metrics, their\ncounterparts have effectively solved problems on similar levels with near 100%\naccuracy. This raises questions about the limitations of current approaches.\nThis paper aims to address these open questions by conducting a fine-grained\ninspection of six popular HAR benchmark datasets. We identified for some parts\nof the data, none of the six chosen state-of-the-art ML methods can correctly\nclassify, denoted as the intersect of false classifications (IFC). 
Analysis of\nthe IFC reveals several underlying problems, including ambiguous annotations,\nirregularities during recording execution, and misaligned transition periods.\nWe contribute to the field by quantifying and characterizing annotated data\nambiguities, providing a trinary categorization mask for dataset patching, and\nstressing potential improvements for future data collections.\n","authors":["Daniel Geissler","Dominique Nshimyimana","Vitor Fortes Rey","Sungho Suh","Bo Zhou","Paul Lukowicz"],"pdf_url":"https://arxiv.org/pdf/2412.09037v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09035v1","updated":"2024-12-12T07:51:44Z","published":"2024-12-12T07:51:44Z","title":"Pulling the Carpet Below the Learner's Feet: Genetic Algorithm To Learn\n Ensemble Machine Learning Model During Concept Drift","summary":" Data-driven models, in general, and machine learning (ML) models, in\nparticular, have gained popularity over recent years with an increased usage of\nsuch models across the scientific and engineering domains. When using ML models\nin realistic and dynamic environments, users need to often handle the challenge\nof concept drift (CD). In this study, we explore the application of genetic\nalgorithms (GAs) to address the challenges posed by CD in such settings. We\npropose a novel two-level ensemble ML model, which combines a global ML model\nwith a CD detector, operating as an aggregator for a population of ML pipeline\nmodels, each one with an adjusted CD detector by itself responsible for\nre-training its ML model. In addition, we show one can further improve the\nproposed model by utilizing off-the-shelf automatic ML methods. Through\nextensive synthetic dataset analysis, we show that the proposed model\noutperforms a single ML pipeline with a CD algorithm, particularly in scenarios\nwith unknown CD characteristics. Overall, this study highlights the potential\nof ensemble ML and CD models obtained through a heuristic and adaptive\noptimization process such as the GA one to handle complex CD events.\n","authors":["Teddy Lazebnik"],"pdf_url":"https://arxiv.org/pdf/2412.09035v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.07890v2","updated":"2024-12-12T07:45:33Z","published":"2024-07-10T17:57:58Z","title":"Training on the Test Task Confounds Evaluation and Emergence","summary":" We study a fundamental problem in the evaluation of large language models\nthat we call training on the test task. Unlike wrongful practices like training\non the test data, leakage, or data contamination, training on the test task is\nnot a malpractice. Rather, the term describes a growing set of practices that\nutilize knowledge about evaluation tasks at training time. We demonstrate that\ntraining on the test task confounds both relative model evaluations and claims\nabout emergent capabilities. We argue that the seeming superiority of one model\nfamily over another may be explained by a different degree of training on the\ntest task. To this end, we propose an effective method to adjust for the effect\nof training on the test task on benchmark evaluations. Put simply, to fine-tune\neach model under comparison on the same task-relevant data before evaluation.\nWe then show that instances of emergent behavior disappear gradually as models\ntrain on the test task. Our work promotes a new perspective on the evaluation\nof large language models with broad implications for benchmarking and the study\nof emergent capabilities\n","authors":["Ricardo Dominguez-Olmedo","Florian E. 
Dorner","Moritz Hardt"],"pdf_url":"https://arxiv.org/pdf/2407.07890v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09030v1","updated":"2024-12-12T07:45:17Z","published":"2024-12-12T07:45:17Z","title":"RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell\n Property Prediction","summary":" Organic Solar Cells (OSCs) are a promising technology for sustainable energy\nproduction. However, the identification of molecules with desired OSC\nproperties typically involves laborious experimental research. To accelerate\nprogress in the field, it is crucial to develop machine learning models capable\nof accurately predicting the properties of OSC molecules. While graph\nrepresentation learning has demonstrated success in molecular property\nprediction, it remains underexplored for OSC-specific tasks. Existing methods\nfail to capture the unique structural features of OSC molecules, particularly\nthe intricate ring systems that critically influence OSC properties, leading to\nsuboptimal performance. To fill the gap, we present RingFormer, a novel graph\ntransformer framework specially designed to capture both atom and ring level\nstructural patterns in OSC molecules. RingFormer constructs a hierarchical\ngraph that integrates atomic and ring structures and employs a combination of\nlocal message passing and global attention mechanisms to generate expressive\ngraph representations for accurate OSC property prediction. We evaluate\nRingFormer's effectiveness on five curated OSC molecule datasets through\nextensive experiments. The results demonstrate that RingFormer consistently\noutperforms existing methods, achieving a 22.77% relative improvement over the\nnearest competitor on the CEPDB dataset.\n","authors":["Zhihao Ding","Ting Zhang","Yiran Li","Jieming Shi","Chen Jason Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.09030v1.pdf","comment":"12 pages, 4 figures. This is the extended version of the paper\n accepted at AAAI 2025, which includes all technical appendices and additional\n experimental details"},{"id":"http://arxiv.org/abs/2412.09028v1","updated":"2024-12-12T07:43:27Z","published":"2024-12-12T07:43:27Z","title":"Learning and Current Prediction of PMSM Drive via Differential Neural\n Networks","summary":" Learning models for dynamical systems in continuous time is significant for\nunderstanding complex phenomena and making accurate predictions. This study\npresents a novel approach utilizing differential neural networks (DNNs) to\nmodel nonlinear systems, specifically permanent magnet synchronous motors\n(PMSMs), and to predict their current trajectories. The efficacy of our\napproach is validated through experiments conducted under various load\ndisturbances and no-load conditions. The results demonstrate that our method\neffectively and accurately reconstructs the original systems, showcasing strong\nshort-term and long-term prediction capabilities and robustness. 
This study\nprovides valuable insights into learning the inherent dynamics of complex\ndynamical data and holds potential for further applications in fields such as\nweather forecasting, robotics, and collective behavior analysis.\n","authors":["Wenjie Mei","Xiaorui Wang","Yanrong Lu","Ke Yu","Shihua Li"],"pdf_url":"https://arxiv.org/pdf/2412.09028v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01268v2","updated":"2024-12-12T07:29:41Z","published":"2024-10-02T06:24:51Z","title":"Deep Learning and Machine Learning, Advancing Big Data Analytics and\n Management: Unveiling AI's Potential Through Tools, Techniques, and\n Applications","summary":" Artificial intelligence (AI), machine learning, and deep learning have become\ntransformative forces in big data analytics and management, enabling\ngroundbreaking advancements across diverse industries. This article delves into\nthe foundational concepts and cutting-edge developments in these fields, with a\nparticular focus on large language models (LLMs) and their role in natural\nlanguage processing, multimodal reasoning, and autonomous decision-making.\nHighlighting tools such as ChatGPT, Claude, and Gemini, the discussion explores\ntheir applications in data analysis, model design, and optimization.\n The integration of advanced algorithms like neural networks, reinforcement\nlearning, and generative models has enhanced the capabilities of AI systems to\nprocess, visualize, and interpret complex datasets. Additionally, the emergence\nof technologies like edge computing and automated machine learning (AutoML)\ndemocratizes access to AI, empowering users across skill levels to engage with\nintelligent systems. This work also underscores the importance of ethical\nconsiderations, transparency, and fairness in the deployment of AI\ntechnologies, paving the way for responsible innovation.\n Through practical insights into hardware configurations, software\nenvironments, and real-world applications, this article serves as a\ncomprehensive resource for researchers and practitioners. By bridging\ntheoretical underpinnings with actionable strategies, it showcases the\npotential of AI and LLMs to revolutionize big data management and drive\nmeaningful advancements across domains such as healthcare, finance, and\nautonomous systems.\n","authors":["Pohsun Feng","Ziqian Bi","Yizhu Wen","Xuanhe Pan","Benji Peng","Ming Liu","Jiawei Xu","Keyu Chen","Junyu Liu","Caitlyn Heqi Yin","Sen Zhang","Jinlang Wang","Qian Niu","Ming Li","Tianyang Wang"],"pdf_url":"https://arxiv.org/pdf/2410.01268v2.pdf","comment":"This book contains 155 pages and 9 figures"},{"id":"http://arxiv.org/abs/2412.09010v1","updated":"2024-12-12T07:22:23Z","published":"2024-12-12T07:22:23Z","title":"Training Physical Neural Networks for Analog In-Memory Computing","summary":" In-memory computing (IMC) architectures mitigate the von Neumann bottleneck\nencountered in traditional deep learning accelerators. Its energy efficiency\ncan realize deep learning-based edge applications. However, because IMC is\nimplemented using analog circuits, inherent non-idealities in the hardware pose\nsignificant challenges. This paper presents physical neural networks (PNNs) for\nconstructing physical models of IMC. PNNs can address the synaptic current's\ndependence on membrane potential, a challenge in charge-domain IMC systems. The\nproposed model is mathematically equivalent to spiking neural networks with\nreversal potentials. 
With a novel technique called differentiable spike-time\ndiscretization, the PNNs are efficiently trained. We show that hardware\nnon-idealities traditionally viewed as detrimental can enhance the model's\nlearning performance. This bottom-up methodology was validated by designing an\nIMC circuit with non-ideal characteristics using the sky130 process. When\nemploying this bottom-up approach, the modeling error reduced by an order of\nmagnitude compared to conventional top-down methods in post-layout simulations.\n","authors":["Yusuke Sakemi","Yuji Okamoto","Takashi Morie","Sou Nobukawa","Takeo Hosomi","Kazuyuki Aihara"],"pdf_url":"https://arxiv.org/pdf/2412.09010v1.pdf","comment":"53 pages, 20 figures"},{"id":"http://arxiv.org/abs/2412.09009v1","updated":"2024-12-12T07:22:02Z","published":"2024-12-12T07:22:02Z","title":"A physics-informed transformer neural operator for learning generalized\n solutions of initial boundary value problems","summary":" Initial boundary value problems arise commonly in applications with\nengineering and natural systems governed by nonlinear partial differential\nequations (PDEs). Operator learning is an emerging field for solving these\nequations by using a neural network to learn a map between infinite dimensional\ninput and output function spaces. These neural operators are trained using a\ncombination of data (observations or simulations) and PDE-residuals\n(physics-loss). A major drawback of existing neural approaches is the\nrequirement to retrain with new initial/boundary conditions, and the necessity\nfor a large amount of simulation data for training. We develop a\nphysics-informed transformer neural operator (named PINTO) that efficiently\ngeneralizes to unseen initial and boundary conditions, trained in a\nsimulation-free setting using only physics loss. The main innovation lies in\nour new iterative kernel integral operator units, implemented using\ncross-attention, to transform the PDE solution's domain points into an\ninitial/boundary condition-aware representation vector, enabling efficient\nlearning of the solution function for new scenarios. The PINTO architecture is\napplied to simulate the solutions of important equations used in engineering\napplications: advection, Burgers, and steady and unsteady Navier-Stokes\nequations (three flow scenarios). For these five test cases, we show that the\nrelative errors during testing under challenging conditions of unseen\ninitial/boundary conditions are only one-fifth to one-third of other leading\nphysics informed operator learning methods. Moreover, our PINTO model is able\nto accurately solve the advection and Burgers equations at time steps that are\nnot included in the training collocation points. 
The code is available at\n$\\texttt{https://github.com/quest-lab-iisc/PINTO}$\n","authors":["Sumanth Kumar Boya","Deepak Subramani"],"pdf_url":"https://arxiv.org/pdf/2412.09009v1.pdf","comment":"29 pages, 11 figures, 4 tables"},{"id":"http://arxiv.org/abs/2405.11911v2","updated":"2024-12-12T07:17:19Z","published":"2024-05-20T09:47:22Z","title":"Accurate Link Prediction for Edge-Incomplete Graphs via PU Learning","summary":" Given an edge-incomplete graph, how can we accurately find the missing links?\nThe link prediction in edge-incomplete graphs aims to discover the missing\nrelations between entities when their relationships are represented as a graph.\nEdge-incomplete graphs are prevalent in real-world due to practical\nlimitations, such as not checking all users when adding friends in a social\nnetwork. Addressing the problem is crucial for various tasks, including\nrecommending friends in social networks and finding references in citation\nnetworks. However, previous approaches rely heavily on the given\nedge-incomplete (observed) graph, making it challenging to consider the missing\n(unobserved) links during training. In this paper, we propose PULL\n(PU-Learning-based Link predictor), an accurate link prediction method based on\nthe positive-unlabeled (PU) learning. PULL treats the observed edges in the\ntraining graph as positive examples, and the unconnected node pairs as\nunlabeled ones. PULL effectively prevents the link predictor from overfitting\nto the observed graph by proposing latent variables for every edge, and\nleveraging the expected graph structure with respect to the variables.\nExtensive experiments on five real-world datasets show that PULL consistently\noutperforms the baselines for predicting links in edge-incomplete graphs.\n","authors":["Junghun Kim","Ka Hyun Park","Hoyoung Yoon","U Kang"],"pdf_url":"https://arxiv.org/pdf/2405.11911v2.pdf","comment":"AAAI'25"},{"id":"http://arxiv.org/abs/2412.09006v1","updated":"2024-12-12T07:15:01Z","published":"2024-12-12T07:15:01Z","title":"Motor Imagery Classification for Asynchronous EEG-Based Brain-Computer\n Interfaces","summary":" Motor imagery (MI) based brain-computer interfaces (BCIs) enable the direct\ncontrol of external devices through the imagined movements of various body\nparts. Unlike previous systems that used fixed-length EEG trials for MI\ndecoding, asynchronous BCIs aim to detect the user's MI without explicit\ntriggers. They are challenging to implement, because the algorithm needs to\nfirst distinguish between resting-states and MI trials, and then classify the\nMI trials into the correct task, all without any triggers. This paper proposes\na sliding window prescreening and classification (SWPC) approach for MI-based\nasynchronous BCIs, which consists of two modules: a prescreening module to\nscreen MI trials out of the resting-state, and a classification module for MI\nclassification. Both modules are trained with supervised learning followed by\nself-supervised learning, which refines the feature extractors. 
Within-subject\nand cross-subject asynchronous MI classifications on four different EEG\ndatasets validated the effectiveness of SWPC, i.e., it always achieved the\nhighest average classification accuracy, and outperformed the best\nstate-of-the-art baseline on each dataset by about 2%.\n","authors":["Huanyu Wu","Siyang Li","Dongrui Wu"],"pdf_url":"https://arxiv.org/pdf/2412.09006v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.05668v2","updated":"2024-12-12T07:13:19Z","published":"2024-09-09T14:38:31Z","title":"Unlearning or Concealment? A Critical Analysis and Evaluation Metrics\n for Unlearning in Diffusion Models","summary":" Recent research has seen significant interest in methods for concept removal\nand targeted forgetting in text-to-image diffusion models. In this paper, we\nconduct a comprehensive white-box analysis showing the vulnerabilities in\nexisting diffusion model unlearning methods. We show that existing unlearning\nmethods lead to decoupling of the targeted concepts (meant to be forgotten) for\nthe corresponding prompts. This is concealment and not actual forgetting, which\nwas the original goal. This paper presents a rigorous theoretical and empirical\nexamination of five commonly used techniques for unlearning in diffusion\nmodels, while showing their potential weaknesses. We introduce two new\nevaluation metrics: Concept Retrieval Score (\\textbf{CRS}) and Concept\nConfidence Score (\\textbf{CCS}). These metrics are based on a successful\nadversarial attack setup that can recover \\textit{forgotten} concepts from\nunlearned diffusion models. \\textbf{CRS} measures the similarity between the\nlatent representations of the unlearned and fully trained models after\nunlearning. It reports the extent of retrieval of the \\textit{forgotten}\nconcepts with increasing amount of guidance. CCS quantifies the confidence of\nthe model in assigning the target concept to the manipulated data. It reports\nthe probability of the \\textit{unlearned} model's generations to be aligned\nwith the original domain knowledge with increasing amount of guidance. The\n\\textbf{CCS} and \\textbf{CRS} enable a more robust evaluation of concept\nerasure methods. Evaluating existing five state-of-the-art methods with our\nmetrics, reveal significant shortcomings in their ability to truly\n\\textit{unlearn}. Source Code:\n\\color{blue}{https://respailab.github.io/unlearning-or-concealment}\n","authors":["Aakash Sen Sharma","Niladri Sarkar","Vikram Chundawat","Ankur A Mali","Murari Mandal"],"pdf_url":"https://arxiv.org/pdf/2409.05668v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19225v3","updated":"2024-12-12T07:12:17Z","published":"2024-05-29T16:05:57Z","title":"Synthetic Potential Outcomes and Causal Mixture Identifiability","summary":" Heterogeneous data from multiple populations, sub-groups, or sources is often\nrepresented as a ``mixture model'' with a single latent class influencing all\nof the observed covariates. Heterogeneity can be resolved at multiple levels by\ngrouping populations according to different notions of similarity. This paper\nproposes grouping with respect to the causal response of an intervention or\nperturbation on the system. This definition is distinct from previous notions,\nsuch as similar covariate values (e.g. clustering) or similar correlations\nbetween covariates (e.g. Gaussian mixture models). To solve the problem, we\n``synthetically sample'' from a counterfactual distribution using higher-order\nmulti-linear moments of the observable data. 
To understand how these ``causal\nmixtures'' fit in with more classical notions, we develop a hierarchy of\nmixture identifiability.\n","authors":["Bijan Mazaheri","Chandler Squires","Caroline Uhler"],"pdf_url":"https://arxiv.org/pdf/2405.19225v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09002v1","updated":"2024-12-12T07:09:42Z","published":"2024-12-12T07:09:42Z","title":"Stellar parameter prediction and spectral simulation using machine\n learning","summary":" We applied machine learning to the entire data history of ESO's High Accuracy\nRadial Velocity Planet Searcher (HARPS) instrument. Our primary goal was to\nrecover the physical properties of the observed objects, with a secondary\nemphasis on simulating spectra. We systematically investigated the impact of\nvarious factors on the accuracy and fidelity of the results, including the use\nof simulated data, the effect of varying amounts of real training data, network\narchitectures, and learning paradigms. Our approach integrates supervised and\nunsupervised learning techniques within autoencoder frameworks. Our methodology\nleverages an existing simulation model that utilizes a library of existing\nstellar spectra in which the emerging flux is computed from first principles\nrooted in physics and a HARPS instrument model to generate simulated spectra\ncomparable to observational data. We trained standard and variational\nautoencoders on HARPS data to predict spectral parameters and generate spectra.\nOur models excel at predicting spectral parameters and compressing real\nspectra, and they achieved a mean prediction error of approximately 50 K for\neffective temperatures, making them relevant for most astrophysical\napplications. Furthermore, the models predict metallicity ([M/H]) and surface\ngravity (log g) with an accuracy of approximately 0.03 dex and 0.04 dex,\nrespectively, underscoring their broad applicability in astrophysical research.\nThe models' computational efficiency, with processing times of 779.6 ms on CPU\nand 3.97 ms on GPU, makes them valuable for high-throughput applications like\nmassive spectroscopic surveys and large archival studies. By achieving accuracy\ncomparable to classical methods with significantly reduced computation time,\nour methodology enhances the scope and efficiency of spectroscopic analysis.\n","authors":["Vojtěch Cvrček","Martino Romaniello","Radim Šára","Wolfram Freudling","Pascal Ballester"],"pdf_url":"https://arxiv.org/pdf/2412.09002v1.pdf","comment":"Accepted for publication in Astronomy & Astrophysics"},{"id":"http://arxiv.org/abs/2404.16866v4","updated":"2024-12-12T07:05:53Z","published":"2024-04-18T09:37:54Z","title":"Annotation-guided Protein Design with Multi-Level Domain Alignment","summary":" The core challenge of de novo protein design lies in creating proteins with\nspecific functions or properties, guided by certain conditions. Current models\nexplore to generate protein using structural and evolutionary guidance, which\nonly provide indirect conditions concerning functions and properties. However,\ntextual annotations of proteins, especially the annotations for protein\ndomains, which directly describe the protein's high-level functionalities,\nproperties, and their correlation with target amino acid sequences, remain\nunexplored in the context of protein design tasks. 
In this paper, we propose\nProtein-Annotation Alignment Generation, PAAG, a multi-modality protein design\nframework that integrates the textual annotations extracted from protein\ndatabases for controllable generation in sequence space. Specifically, within a\nmulti-level alignment module, PAAG can explicitly generate proteins containing\nspecific domains conditioned on the corresponding domain annotations, and can\neven design novel proteins with flexible combinations of different kinds of\nannotations. Our experimental results underscore the superiority of the aligned\nprotein representations from PAAG over 7 prediction tasks. Furthermore, PAAG\ndemonstrates a significant increase in generation success rate (24.7% vs 4.7%\nin zinc finger, and 54.3% vs 22.0% in the immunoglobulin domain) in comparison\nto the existing model. We anticipate that PAAG will broaden the horizons of\nprotein design by leveraging the knowledge between textual annotations and\nproteins.\n","authors":["Chaohao Yuan","Songyou Li","Geyan Ye","Yikun Zhang","Long-Kai Huang","Wenbing Huang","Wei Liu","Jianhua Yao","Yu Rong"],"pdf_url":"https://arxiv.org/pdf/2404.16866v4.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2402.17363v5","updated":"2024-12-12T06:39:46Z","published":"2024-02-27T09:55:34Z","title":"CGGM: A conditional graph generation model with adaptive sparsity for\n node anomaly detection in IoT networks","summary":" Dynamic graphs are extensively employed for detecting anomalous behavior in\nnodes within the Internet of Things (IoT). Graph generative models are often\nused to address the issue of imbalanced node categories in dynamic graphs.\nNevertheless, the constraints they face include the monotonicity of adjacency\nrelationships, the difficulty in constructing multi-dimensional features for\nnodes, and the lack of a method for end-to-end generation of multiple\ncategories of nodes. In this paper, we propose a novel graph generation model,\ncalled CGGM, specifically for generating samples belonging to the minority\nclass. The framework consists of two core modules: a conditional graph generation\nmodule and a graph-based anomaly detection module. The generative module adapts\nto the sparsity of the matrix by downsampling a noise adjacency matrix, and\nincorporates a multi-dimensional feature encoder based on multi-head\nself-attention to capture latent dependencies among features. Additionally, a\nlatent space constraint is combined with the distribution distance to\napproximate the latent distribution of real data. The graph-based anomaly\ndetection module utilizes the generated balanced dataset to predict the node\nbehaviors. Extensive experiments have shown that CGGM outperforms the\nstate-of-the-art methods in terms of accuracy and divergence. The results also\ndemonstrate that CGGM can generate diverse data categories, enhancing the\nperformance of multi-category classification tasks.\n","authors":["Munan Li","Xianshi Su","Runze Ma","Tongbang Jiang","Zijian Li","Tony Q. S. Quek"],"pdf_url":"https://arxiv.org/pdf/2402.17363v5.pdf","comment":"10 pages, 19 figures"},{"id":"http://arxiv.org/abs/2412.08984v1","updated":"2024-12-12T06:37:32Z","published":"2024-12-12T06:37:32Z","title":"Predicting Emergency Department Visits for Patients with Type II\n Diabetes","summary":" Over 30 million Americans are affected by Type II diabetes (T2D), a treatable\ncondition with significant health risks. 
This study aims to develop and\nvalidate predictive models using machine learning (ML) techniques to estimate\nemergency department (ED) visits among patients with T2D. Data for these\npatients was obtained from the HealthShare Exchange (HSX), focusing on\ndemographic details, diagnoses, and vital signs. Our sample contained 34,151\npatients diagnosed with T2D which resulted in 703,065 visits overall between\n2017 and 2021. A workflow integrated EMR data with SDoH for ML predictions. A\ntotal of 87 out of 2,555 features were selected for model construction. Various\nmachine learning algorithms, including CatBoost, Ensemble Learning, K-nearest\nNeighbors (KNN), Support Vector Classification (SVC), Random Forest, and\nExtreme Gradient Boosting (XGBoost), were employed with tenfold\ncross-validation to predict whether a patient is at risk of an ED visit. The\nROC curves for Random Forest, XGBoost, Ensemble Learning, CatBoost, KNN, and\nSVC, were 0.82, 0.82, 0.82, 0.81, 0.72, 0.68, respectively. Ensemble Learning\nand Random Forest models demonstrated superior predictive performance in terms\nof discrimination, calibration, and clinical applicability. These models are\nreliable tools for predicting risk of ED visits among patients with T2D. They\ncan estimate future ED demand and assist clinicians in identifying critical\nfactors associated with ED utilization, enabling early interventions to reduce\nsuch visits. The top five important features were age, the difference between\nvisitation gaps, visitation gaps, R10 or abdominal and pelvic pain, and the\nIndex of Concentration at the Extremes (ICE) for income.\n","authors":["Javad M Alizadeh","Jay S Patel","Gabriel Tajeu","Yuzhou Chen","Ilene L Hollin","Mukesh K Patel","Junchao Fei","Huanmei Wu"],"pdf_url":"https://arxiv.org/pdf/2412.08984v1.pdf","comment":"This manuscript has been accepted and presented at AI-PHSS 2024: The\n 2024 International Workshop on AI Applications in Public Health and Social\n Services in conjunction with the 22nd International Conference of Artificial\n Intelligence in Medicine (AIME 2024)"},{"id":"http://arxiv.org/abs/2409.09269v3","updated":"2024-12-12T06:26:09Z","published":"2024-09-14T02:29:36Z","title":"Guiding Vision-Language Model Selection for Visual Question-Answering\n Across Tasks, Domains, and Knowledge Types","summary":" Visual Question-Answering (VQA) has become key to user experience,\nparticularly after improved generalization capabilities of Vision-Language\nModels (VLMs). But evaluating VLMs for an application requirement using a\nstandardized framework in practical settings is still challenging. This paper\naims to solve that using an end-to-end framework. We present VQA360 - a novel\ndataset derived from established VQA benchmarks, annotated with task types,\napplication domains, and knowledge types, for a comprehensive evaluation. We\nalso introduce GoEval, a multimodal evaluation metric developed using GPT-4o,\nachieving a correlation factor of 56.71% with human judgments. Our experiments\nwith state-of-the-art VLMs reveal that no single model excels universally,\nthus, making a right choice a key design decision. Proprietary models such as\nGemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source\nmodels like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive\nstrengths, while providing additional advantages. 
Our framework can also be\nextended to other tasks.\n","authors":["Neelabh Sinha","Vinija Jain","Aman Chadha"],"pdf_url":"https://arxiv.org/pdf/2409.09269v3.pdf","comment":"Accepted at The First Workshop of Evaluation of Multi-Modal\n Generation (EvalMG) in 31st International Conference on Computational\n Linguistics (COLING), 2025. 8 pages + references + 6 pages of Appendix"},{"id":"http://arxiv.org/abs/2412.08979v1","updated":"2024-12-12T06:26:02Z","published":"2024-12-12T06:26:02Z","title":"A Wander Through the Multimodal Landscape: Efficient Transfer Learning\n via Low-rank Sequence Multimodal Adapter","summary":" Efficient transfer learning methods such as adapter-based methods have shown\ngreat success in unimodal models and vision-language models. However, existing\nmethods have two main challenges in fine-tuning multimodal models. Firstly,\nthey are designed for vision-language tasks and fail to extend to situations\nwhere there are more than two modalities. Secondly, they exhibit limited\nexploitation of interactions between modalities and lack efficiency. To address\nthese issues, in this paper, we propose the loW-rank sequence multimodal\nadapter (Wander). We first use the outer product to fuse the information from\ndifferent modalities in an element-wise way effectively. For efficiency, we use\nCP decomposition to factorize tensors into rank-one components and achieve\nsubstantial parameter reduction. Furthermore, we implement a token-level\nlow-rank decomposition to extract more fine-grained features and sequence\nrelationships between modalities. With these designs, Wander enables\ntoken-level interactions between sequences of different modalities in a\nparameter-efficient way. We conduct extensive experiments on datasets with\ndifferent numbers of modalities, where Wander outperforms state-of-the-art\nefficient transfer learning methods consistently. The results fully demonstrate\nthe effectiveness, efficiency and universality of Wander.\n","authors":["Zirun Guo","Xize Cheng","Yangyang Wu","Tao Jin"],"pdf_url":"https://arxiv.org/pdf/2412.08979v1.pdf","comment":"Accepted at AAAI 2025"},{"id":"http://arxiv.org/abs/2409.18417v2","updated":"2024-12-12T06:18:36Z","published":"2024-09-27T03:15:07Z","title":"VickreyFeedback: Cost-efficient Data Construction for Reinforcement\n Learning from Human Feedback","summary":" This paper addresses the cost-efficiency aspect of Reinforcement Learning\nfrom Human Feedback (RLHF). RLHF leverages datasets of human preferences over\noutputs of large language models (LLM)s to instill human expectations into\nLLMs. Although preference annotation comes with a monetized cost, the economic\nutility of a preference dataset has not been considered by far. What\nexacerbates this situation is that, given complex intransitive or cyclic\nrelationships in preference datasets, existing algorithms for fine-tuning LLMs\nare still far from capturing comprehensive preferences. This raises severe\ncost-efficiency concerns in production environments, where preference data\naccumulate over time. In this paper, we discuss the fine-tuning of LLMs as a\nmonetized economy and introduce an auction mechanism to improve the efficiency\nof preference data collection in dollar terms. We show that introducing an\nauction mechanism can play an essential role in enhancing the cost-efficiency\nof RLHF, while maintaining satisfactory model performance. 
Experimental results\ndemonstrate that our proposed auction-based protocol is cost-effective for\nfine-tuning LLMs concentrating on high-quality feedback.\n","authors":["Guoxi Zhang","Jiuding Duan"],"pdf_url":"https://arxiv.org/pdf/2409.18417v2.pdf","comment":"16 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.08976v1","updated":"2024-12-12T06:13:32Z","published":"2024-12-12T06:13:32Z","title":"Enhancing Facial Consistency in Conditional Video Generation via Facial\n Landmark Transformation","summary":" Landmark-guided character animation generation is an important field.\nGenerating character animations with facial features consistent with a\nreference image remains a significant challenge in conditional video\ngeneration, especially involving complex motions like dancing. Existing methods\noften fail to maintain facial feature consistency due to mismatches between the\nfacial landmarks extracted from source videos and the target facial features in\nthe reference image. To address this problem, we propose a facial landmark\ntransformation method based on the 3D Morphable Model (3DMM). We obtain\ntransformed landmarks that align with the target facial features by\nreconstructing 3D faces from the source landmarks and adjusting the 3DMM\nparameters to match the reference image. Our method improves the facial\nconsistency between the generated videos and the reference images, effectively\nimproving the facial feature mismatch problem.\n","authors":["Lianrui Mu","Xingze Zhou","Wenjie Zheng","Jiangnan Ye","Xiaoyu Liang","Yuchen Yang","Jianhong Bai","Jiedong Zhuang","Haoji Hu"],"pdf_url":"https://arxiv.org/pdf/2412.08976v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08969v1","updated":"2024-12-12T06:04:20Z","published":"2024-12-12T06:04:20Z","title":"Deep Learning Model Security: Threats and Defenses","summary":" Deep learning has transformed AI applications but faces critical security\nchallenges, including adversarial attacks, data poisoning, model theft, and\nprivacy leakage. This survey examines these vulnerabilities, detailing their\nmechanisms and impact on model integrity and confidentiality. Practical\nimplementations, including adversarial examples, label flipping, and backdoor\nattacks, are explored alongside defenses such as adversarial training,\ndifferential privacy, and federated learning, highlighting their strengths and\nlimitations.\n Advanced methods like contrastive and self-supervised learning are presented\nfor enhancing robustness. The survey concludes with future directions,\nemphasizing automated defenses, zero-trust architectures, and the security\nchallenges of large AI models. A balanced approach to performance and security\nis essential for developing reliable deep learning systems.\n","authors":["Tianyang Wang","Ziqian Bi","Yichao Zhang","Ming Liu","Weiche Hsieh","Pohsun Feng","Lawrence K. Q. 
Yan","Yizhu Wen","Benji Peng","Junyu Liu","Keyu Chen","Sen Zhang","Ming Li","Chuanqi Jiang","Xinyuan Song","Junjie Yang","Bowen Jing","Jintao Ren","Junhao Song","Hong-Ming Tseng","Silin Chen","Yunze Wang","Chia Xin Liang","Jiawei Xu","Xuanhe Pan","Jinlang Wang","Qian Niu"],"pdf_url":"https://arxiv.org/pdf/2412.08969v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08961v1","updated":"2024-12-12T05:48:34Z","published":"2024-12-12T05:48:34Z","title":"Belted and Ensembled Neural Network for Linear and Nonlinear Sufficient\n Dimension Reduction","summary":" We introduce a unified, flexible, and easy-to-implement framework of\nsufficient dimension reduction that can accommodate both linear and nonlinear\ndimension reduction, and both the conditional distribution and the conditional\nmean as the targets of estimation. This unified framework is achieved by a\nspecially structured neural network -- the Belted and Ensembled Neural Network\n(BENN) -- that consists of a narrow latent layer, which we call the belt, and a\nfamily of transformations of the response, which we call the ensemble. By\nstrategically placing the belt at different layers of the neural network, we\ncan achieve linear or nonlinear sufficient dimension reduction, and by choosing\nthe appropriate transformation families, we can achieve dimension reduction for\nthe conditional distribution or the conditional mean. Moreover, thanks to the\nadvantage of the neural network, the method is very fast to compute, overcoming\na computation bottleneck of the traditional sufficient dimension reduction\nestimators, which involves the inversion of a matrix of dimension either p or\nn. We develop the algorithm and convergence rate of our method, compare it with\nexisting sufficient dimension reduction methods, and apply it to two data\nexamples.\n","authors":["Yin Tang","Bing Li"],"pdf_url":"https://arxiv.org/pdf/2412.08961v1.pdf","comment":"35 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.08951v1","updated":"2024-12-12T05:33:23Z","published":"2024-12-12T05:33:23Z","title":"Stochastic Learning of Non-Conjugate Variational Posterior for Image\n Classification","summary":" Large scale Bayesian nonparametrics (BNP) learner such as stochastic\nvariational inference (SVI) can handle datasets with large class number and\nlarge training size at fractional cost. Like its predecessor, SVI rely on the\nassumption of conjugate variational posterior to approximate the true\nposterior. A more challenging problem is to consider large scale learning on\nnon-conjugate posterior. Recent works in this direction are mostly associated\nwith using Monte Carlo methods for approximating the learner. However, these\nworks are usually demonstrated on non-BNP related task and less complex models\nsuch as logistic regression, due to higher computational complexity. In order\nto overcome the issue faced by SVI, we develop a novel approach based on the\nrecently proposed variational maximization-maximization (VMM) learner to allow\nlarge scale learning on non-conjugate posterior. Unlike SVI, our VMM learner\ndoes not require closed-form expression for the variational posterior\nexpectatations. Our only requirement is that the variational posterior is\ndifferentiable. In order to ensure convergence in stochastic settings, SVI rely\non decaying step-sizes to slow its learning. Inspired by SVI and Adam, we\npropose the novel use of decaying step-sizes on both gradient and ascent\ndirection in our VMM to significantly improve its learning. 
We show that our\nproposed methods is compatible with ResNet features when applied to large class\nnumber datasets such as MIT67 and SUN397. Finally, we compare our proposed\nlearner with several recent works such as deep clustering algorithms and showed\nwe were able to produce on par or outperform the state-of-the-art methods in\nterms of clustering measures.\n","authors":["Kart-Leong Lim"],"pdf_url":"https://arxiv.org/pdf/2412.08951v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.06126v4","updated":"2024-12-12T05:28:56Z","published":"2024-02-09T01:18:16Z","title":"Learn To be Efficient: Build Structured Sparsity in Large Language\n Models","summary":" Large Language Models (LLMs) have achieved remarkable success with their\nbillion-level parameters, yet they incur high inference overheads. The\nemergence of activation sparsity in LLMs provides a natural approach to reduce\nthis cost by involving only parts of the parameters for inference. However,\nexisting methods only focus on utilizing this naturally formed activation\nsparsity in a post-training setting, overlooking the potential for further\namplifying this inherent sparsity. In this paper, we hypothesize that LLMs can\nlearn to be efficient by achieving more structured activation sparsity. To\nachieve this, we introduce a novel training algorithm, Learn-To-be-Efficient\n(LTE), designed to train efficiency-aware LLMs to learn to activate fewer\nneurons and achieve a better trade-off between sparsity and performance.\nFurthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based\nmodels, LTE can also be applied to LLMs like LLaMA using non-ReLU activations.\nExtensive evaluation on language understanding, language generation, and\ninstruction tuning tasks show that LTE consistently outperforms SOTA baselines.\nAlong with our hardware-aware custom kernel implementation, LTE reduces\nLLaMA2-7B inference latency by 25% at 50% sparsity.\n","authors":["Haizhong Zheng","Xiaoyan Bai","Xueshen Liu","Z. Morley Mao","Beidi Chen","Fan Lai","Atul Prakash"],"pdf_url":"https://arxiv.org/pdf/2402.06126v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08946v1","updated":"2024-12-12T05:22:49Z","published":"2024-12-12T05:22:49Z","title":"MoSLD: An Extremely Parameter-Efficient Mixture-of-Shared LoRAs for\n Multi-Task Learning","summary":" Recently, LoRA has emerged as a crucial technique for fine-tuning large\npre-trained models, yet its performance in multi-task learning scenarios often\nfalls short. In contrast, the MoE architecture presents a natural solution to\nthis issue. However, it introduces challenges such as mutual interference of\ndata across multiple domains and knowledge forgetting of various tasks.\nAdditionally, MoE significantly increases the number of parameters, posing a\ncomputational cost challenge. Therefore, in this paper, we propose MoSLD, a\nmixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these\nchallenges by sharing the upper projection matrix in LoRA among different\nexperts, encouraging the model to learn general knowledge across tasks, while\nstill allowing the lower projection matrix to focus on the unique features of\neach task. The application of dropout alleviates the imbalanced update of\nparameter matrix and mitigates parameter overfitting in LoRA. 
Extensive\nexperiments demonstrate that our model exhibits excellent performance in both\nsingle-task and multi-task scenarios, with robust out-of-domain generalization\ncapabilities.\n","authors":["Lulu Zhao","Weihao Zeng","Xiaofeng Shi","Hua Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.08946v1.pdf","comment":"Accept by COLING 2025"},{"id":"http://arxiv.org/abs/2402.12683v2","updated":"2024-12-12T05:19:43Z","published":"2024-02-20T03:14:47Z","title":"TorchCP: A Python Library for Conformal Prediction","summary":" Conformal Prediction (CP) has attracted great attention from the research\ncommunity due to its strict theoretical guarantees. However, researchers and\ndevelopers still face challenges of applicability and efficiency when applying\nCP algorithms to deep learning models. In this paper, we introduce \\torchcp, a\ncomprehensive PyTorch-based toolkit to strengthen the usability of CP for deep\nlearning models. \\torchcp implements a wide range of post-hoc and training\nmethods of conformal prediction for various machine learning tasks, including\nclassification, regression, GNN, and LLM. Moreover, we provide user-friendly\ninterfaces and extensive evaluations to easily integrate CP algorithms into\nspecific tasks. Our \\torchcp toolkit, built entirely with PyTorch, enables\nhigh-performance GPU acceleration for deep learning models and mini-batch\ncomputation on large-scale datasets. With the LGPL license, the code is\nopen-sourced at \\url{https://github.com/ml-stat-Sustech/TorchCP} and will be\ncontinuously updated.\n","authors":["Jianguo Huang","Jianqing Song","Xuanning Zhou","Bingyi Jing","Hongxin Wei"],"pdf_url":"https://arxiv.org/pdf/2402.12683v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.09608v1","updated":"2024-12-12T18:59:34Z","published":"2024-12-12T18:59:34Z","title":"Representing Long Volumetric Video with Temporal Gaussian Hierarchy","summary":" This paper aims to address the challenge of reconstructing long volumetric\nvideos from multi-view RGB videos. Recent dynamic view synthesis methods\nleverage powerful 4D representations, like feature grids or point cloud\nsequences, to achieve high-quality rendering results. However, they are\ntypically limited to short (1~2s) video clips and often suffer from large\nmemory footprints when dealing with longer videos. To solve this issue, we\npropose a novel 4D representation, named Temporal Gaussian Hierarchy, to\ncompactly model long volumetric videos. Our key observation is that there are\ngenerally various degrees of temporal redundancy in dynamic scenes, which\nconsist of areas changing at different speeds. Motivated by this, our approach\nbuilds a multi-level hierarchy of 4D Gaussian primitives, where each level\nseparately describes scene regions with different degrees of content change,\nand adaptively shares Gaussian primitives to represent unchanged scene content\nover different temporal segments, thus effectively reducing the number of\nGaussian primitives. In addition, the tree-like structure of the Gaussian\nhierarchy allows us to efficiently represent the scene at a particular moment\nwith a subset of Gaussian primitives, leading to nearly constant GPU memory\nusage during the training or rendering regardless of the video length.\nExtensive experimental results demonstrate the superiority of our method over\nalternative methods in terms of training cost, rendering speed, and storage\nusage. 
To our knowledge, this work is the first approach capable of efficiently\nhandling minutes of volumetric video data while maintaining state-of-the-art\nrendering quality. Our project page is available at:\nhttps://zju3dv.github.io/longvolcap.\n","authors":["Zhen Xu","Yinghao Xu","Zhiyuan Yu","Sida Peng","Jiaming Sun","Hujun Bao","Xiaowei Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.09608v1.pdf","comment":"SIGGRAPH Asia 2024 (TOG). Project page:\n https://zju3dv.github.io/longvolcap"},{"id":"http://arxiv.org/abs/2412.09501v1","updated":"2024-12-12T17:50:39Z","published":"2024-12-12T17:50:39Z","title":"Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition","summary":" As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond\nsingle-domain capabilities is essential to meet the demands for more versatile\nand efficient AI. However, previous omni-models have insufficiently explored\nspeech, neglecting its integration with multi-modality. We introduce Lyra, an\nefficient MLLM that enhances multimodal abilities, including advanced\nlong-speech comprehension, sound understanding, cross-modality efficiency, and\nseamless speech interaction. To achieve efficiency and speech-centric\ncapabilities, Lyra employs three strategies: (1) leveraging existing\nopen-source large models and a proposed multi-modality LoRA to reduce training\ncosts and data requirements; (2) using a latent multi-modality regularizer and\nextractor to strengthen the relationship between speech and other modalities,\nthereby enhancing model performance; and (3) constructing a high-quality,\nextensive dataset that includes 1.5M multi-modal (language, vision, audio) data\nsamples and 12K long speech samples, enabling Lyra to handle complex long\nspeech inputs and achieve more robust omni-cognition. Compared to other\nomni-methods, Lyra achieves state-of-the-art performance on various\nvision-language, vision-speech, and speech-language benchmarks, while also\nusing fewer computational resources and less training data.\n","authors":["Zhisheng Zhong","Chengyao Wang","Yuqi Liu","Senqiao Yang","Longxiang Tang","Yuechen Zhang","Jingyao Li","Tianyuan Qu","Yanwei Li","Yukang Chen","Shaozuo Yu","Sitong Wu","Eric Lo","Shu Liu","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2412.09501v1.pdf","comment":"Tech report"},{"id":"http://arxiv.org/abs/2412.09492v1","updated":"2024-12-12T17:41:49Z","published":"2024-12-12T17:41:49Z","title":"Video Seal: Open and Efficient Video Watermarking","summary":" The proliferation of AI-generated content and sophisticated video editing\ntools has made it both important and challenging to moderate digital platforms.\nVideo watermarking addresses these challenges by embedding imperceptible\nsignals into videos, allowing for identification. However, the rare open tools\nand methods often fall short on efficiency, robustness, and flexibility. To\nreduce these gaps, this paper introduces Video Seal, a comprehensive framework\nfor neural video watermarking and a competitive open-sourced model. Our\napproach jointly trains an embedder and an extractor, while ensuring the\nwatermark robustness by applying transformations in-between, e.g., video\ncodecs. This training is multistage and includes image pre-training, hybrid\npost-training and extractor fine-tuning. We also introduce temporal watermark\npropagation, a technique to convert any image watermarking model to an\nefficient video watermarking model without the need to watermark every\nhigh-resolution frame. 
We present experimental results demonstrating the\neffectiveness of the approach in terms of speed, imperceptibility, and\nrobustness. Video Seal achieves higher robustness compared to strong baselines\nespecially under challenging distortions combining geometric transformations\nand video compression. Additionally, we provide new insights such as the impact\nof video compression during training, and how to compare methods operating on\ndifferent payloads. Contributions in this work - including the codebase,\nmodels, and a public demo - are open-sourced under permissive licenses to\nfoster further research and development in the field.\n","authors":["Pierre Fernandez","Hady Elsahar","I. Zeki Yalniz","Alexandre Mourachko"],"pdf_url":"https://arxiv.org/pdf/2412.09492v1.pdf","comment":"Code available at https://github.com/facebookresearch/videoseal"},{"id":"http://arxiv.org/abs/2412.09428v1","updated":"2024-12-12T16:33:21Z","published":"2024-12-12T16:33:21Z","title":"Multimodal Music Generation with Explicit Bridges and Retrieval\n Augmentation","summary":" Multimodal music generation aims to produce music from diverse input\nmodalities, including text, videos, and images. Existing methods use a common\nembedding space for multimodal fusion. Despite their effectiveness in other\nmodalities, their application in multimodal music generation faces challenges\nof data scarcity, weak cross-modal alignment, and limited controllability. This\npaper addresses these issues by using explicit bridges of text and music for\nmultimodal alignment. We introduce a novel method named Visuals Music Bridge\n(VMB). Specifically, a Multimodal Music Description Model converts visual\ninputs into detailed textual descriptions to provide the text bridge; a\nDual-track Music Retrieval module that combines broad and targeted retrieval\nstrategies to provide the music bridge and enable user control. Finally, we\ndesign an Explicitly Conditioned Music Generation framework to generate music\nbased on the two bridges. We conduct experiments on video-to-music,\nimage-to-music, text-to-music, and controllable music generation tasks, along\nwith experiments on controllability. The results demonstrate that VMB\nsignificantly enhances music quality, modality, and customization alignment\ncompared to previous methods. VMB sets a new standard for interpretable and\nexpressive multimodal music generation with applications in various multimedia\nfields. Demos and code are available at https://github.com/wbs2788/VMB.\n","authors":["Baisen Wang","Le Zhuo","Zhaokai Wang","Chenxi Bao","Wu Chengjing","Xuecheng Nie","Jiao Dai","Jizhong Han","Yue Liao","Si Liu"],"pdf_url":"https://arxiv.org/pdf/2412.09428v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09353v1","updated":"2024-12-12T15:22:03Z","published":"2024-12-12T15:22:03Z","title":"Causal Graphical Models for Vision-Language Compositional Understanding","summary":" Recent work has empirically shown that Vision-Language Models (VLMs) struggle\nto fully understand the compositional properties of the human language, usually\nmodeling an image caption as a \"bag of words\". As a result, they perform poorly\non compositional tasks, which require a deeper understanding of the different\nentities of a sentence (subject, verb, etc.) jointly with their mutual\nrelationships in order to be solved. 
In this paper, we model the dependency\nrelations among textual and visual tokens using a Causal Graphical Model (CGM),\nbuilt using a dependency parser, and we train a decoder conditioned by the VLM\nvisual encoder. Differently from standard autoregressive or parallel\npredictions, our decoder's generative process is partially-ordered following\nthe CGM structure. This structure encourages the decoder to learn only the main\ncausal dependencies in a sentence discarding spurious correlations. Using\nextensive experiments on five compositional benchmarks, we show that our method\nsignificantly outperforms all the state-of-the-art compositional approaches by\na large margin, and it also improves over methods trained using much larger\ndatasets.\n","authors":["Fiorenzo Parascandolo","Nicholas Moratelli","Enver Sangineto","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2412.09353v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.09329v1","updated":"2024-12-12T14:53:16Z","published":"2024-12-12T14:53:16Z","title":"Towards Open-Vocabulary Video Semantic Segmentation","summary":" Semantic segmentation in videos has been a focal point of recent research.\nHowever, existing models encounter challenges when faced with unfamiliar\ncategories. To address this, we introduce the Open Vocabulary Video Semantic\nSegmentation (OV-VSS) task, designed to accurately segment every pixel across a\nwide range of open-vocabulary categories, including those that are novel or\npreviously unexplored. To enhance OV-VSS performance, we propose a robust\nbaseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing\nthe model to utilize temporal relationships across consecutive frames.\nAdditionally, we incorporate a random frame enhancement module, broadening the\nmodel's understanding of semantic context throughout the entire video sequence.\nOur approach also includes video text encoding, which strengthens the model's\ncapability to interpret textual information within the video context.\nComprehensive evaluations on benchmark datasets such as VSPW and Cityscapes\nhighlight OV-VSS's zero-shot generalization capabilities, especially in\nhandling novel categories. The results validate OV2VSS's effectiveness,\ndemonstrating improved performance in semantic segmentation tasks across\ndiverse video datasets.\n","authors":["Xinhao Li","Yun Liu","Guolei Sun","Min Wu","Le Zhang","Ce Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.09329v1.pdf","comment":"13 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.09317v1","updated":"2024-12-12T14:42:10Z","published":"2024-12-12T14:42:10Z","title":"Multimodal Sentiment Analysis based on Video and Audio Inputs","summary":" Despite the abundance of current researches working on the sentiment analysis\nfrom videos and audios, finding the best model that gives the highest accuracy\nrate is still considered a challenge for researchers in this field. The main\nobjective of this paper is to prove the usability of emotion recognition models\nthat take video and audio inputs. The datasets used to train the models are the\nCREMA-D dataset for audio and the RAVDESS dataset for video. The fine-tuned\nmodels that been used are: Facebook/wav2vec2-large for audio and the\nGoogle/vivit-b-16x2-kinetics400 for video. The avarage of the probabilities for\neach emotion generated by the two previous models is utilized in the decision\nmaking framework. 
After disparity in the results, if one of the models gets\nmuch higher accuracy, another test framework is created. The methods used are\nthe Weighted Average method, the Confidence Level Threshold method, the Dynamic\nWeighting Based on Confidence method, and the Rule-Based Logic method. This\nlimited approach gives encouraging results that make future research into these\nmethods viable.\n","authors":["Antonio Fernandez","Suzan Awinat"],"pdf_url":"https://arxiv.org/pdf/2412.09317v1.pdf","comment":"Presented as a full paper in the 15th International Conference on\n Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2024) October\n 28-30, 2024, Leuven, Belgium"},{"id":"http://arxiv.org/abs/2408.13712v2","updated":"2024-12-12T11:18:51Z","published":"2024-08-25T03:21:48Z","title":"Riemann-based Multi-scale Attention Reasoning Network for Text-3D\n Retrieval","summary":" Due to the challenges in acquiring paired Text-3D data and the inherent\nirregularity of 3D data structures, combined representation learning of 3D\npoint clouds and text remains unexplored. In this paper, we propose a novel\nRiemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D\nretrieval. Specifically, the extracted text and point cloud features are\nrefined by their respective Adaptive Feature Refiner (AFR). Furthermore, we\nintroduce the innovative Riemann Local Similarity (RLS) module and the Global\nPooling Similarity (GPS) module. However, as 3D point cloud data and text data\noften possess complex geometric structures in high-dimensional space, the\nproposed RLS employs a novel Riemann Attention Mechanism to reflect the\nintrinsic geometric relationships of the data. Without explicitly defining the\nmanifold, RMARN learns the manifold parameters to better represent the\ndistances between text-point cloud samples. To address the challenges of\nlacking paired text-3D data, we have created the large-scale Text-3D Retrieval\ndataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud\ndata. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained\nChinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs,\nrespectively. Experiments on our custom datasets demonstrate the superior\nperformance of the proposed method. Our code and proposed datasets are\navailable at \\url{https://github.com/liwrui/RMARN}.\n","authors":["Wenrui Li","Wei Han","Yandu Chen","Yeyu Chai","Yidan Lu","Xingtao Wang","Xiaopeng Fan"],"pdf_url":"https://arxiv.org/pdf/2408.13712v2.pdf","comment":"Accepted by AAAI25"},{"id":"http://arxiv.org/abs/2412.09168v1","updated":"2024-12-12T10:55:57Z","published":"2024-12-12T10:55:57Z","title":"YingSound: Video-Guided Sound Effects Generation with Multi-modal\n Chain-of-Thought Controls","summary":" Generating sound effects for product-level videos, where only a small amount\nof labeled data is available for diverse scenes, requires the production of\nhigh-quality sounds in few-shot settings. To tackle the challenge of limited\nlabeled data in real-world scenes, we introduce YingSound, a foundation model\ndesigned for video-guided sound generation that supports high-quality audio\ngeneration in few-shot settings. Specifically, YingSound consists of two major\nmodules. The first module uses a conditional flow matching transformer to\nachieve effective semantic alignment in sound generation across audio and\nvisual modalities. 
This module aims to build a learnable audio-visual\naggregator (AVA) that integrates high-resolution visual features with\ncorresponding audio features at multiple stages. The second module is developed\nwith a proposed multi-modal visual-audio chain-of-thought (CoT) approach to\ngenerate finer sound effects in few-shot settings. Finally, an\nindustry-standard video-to-audio (V2A) dataset that encompasses various\nreal-world scenarios is presented. We show that YingSound effectively generates\nhigh-quality synchronized sounds across diverse conditional inputs through\nautomated evaluations and human studies. Project Page:\n\\url{https://giantailab.github.io/yingsound/}\n","authors":["Zihao Chen","Haomin Zhang","Xinhan Di","Haoyu Wang","Sizhe Shan","Junjie Zheng","Yunming Liang","Yihan Fan","Xinfa Zhu","Wenjie Tian","Yihua Wang","Chaofan Ding","Lei Xie"],"pdf_url":"https://arxiv.org/pdf/2412.09168v1.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.09126v1","updated":"2024-12-12T10:03:46Z","published":"2024-12-12T10:03:46Z","title":"Enhancing Modality Representation and Alignment for Multimodal\n Cold-start Active Learning","summary":" Training multimodal models requires a large amount of labeled data. Active\nlearning (AL) aim to reduce labeling costs. Most AL methods employ warm-start\napproaches, which rely on sufficient labeled data to train a well-calibrated\nmodel that can assess the uncertainty and diversity of unlabeled data. However,\nwhen assembling a dataset, labeled data are often scarce initially, leading to\na cold-start problem. Additionally, most AL methods seldom address multimodal\ndata, highlighting a research gap in this field. Our research addresses these\nissues by developing a two-stage method for Multi-Modal Cold-Start Active\nLearning (MMCSAL).\n Firstly, we observe the modality gap, a significant distance between the\ncentroids of representations from different modalities, when only using\ncross-modal pairing information as self-supervision signals. This modality gap\naffects data selection process, as we calculate both uni-modal and cross-modal\ndistances. To address this, we introduce uni-modal prototypes to bridge the\nmodality gap. Secondly, conventional AL methods often falter in multimodal\nscenarios where alignment between modalities is overlooked. Therefore, we\npropose enhancing cross-modal alignment through regularization, thereby\nimproving the quality of selected multimodal data pairs in AL. Finally, our\nexperiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs\nacross three multimodal datasets.\n","authors":["Meng Shen","Yake Wei","Jianxiong Yin","Deepu Rajan","Di Hu","Simon See"],"pdf_url":"https://arxiv.org/pdf/2412.09126v1.pdf","comment":"11 pages, ACMMM Asia 2024, Oral Presentation"},{"id":"http://arxiv.org/abs/2412.09008v1","updated":"2024-12-12T07:20:32Z","published":"2024-12-12T07:20:32Z","title":"MS2Mesh-XR: Multi-modal Sketch-to-Mesh Generation in XR Environments","summary":" We present MS2Mesh-XR, a novel multi-modal sketch-to-mesh generation pipeline\nthat enables users to create realistic 3D objects in extended reality (XR)\nenvironments using hand-drawn sketches assisted by voice inputs. In specific,\nusers can intuitively sketch objects using natural hand movements in mid-air\nwithin a virtual environment. By integrating voice inputs, we devise ControlNet\nto infer realistic images based on the drawn sketches and interpreted text\nprompts. 
Users can then review and select their preferred image, which is\nsubsequently reconstructed into a detailed 3D mesh using the Convolutional\nReconstruction Model. In particular, our proposed pipeline can generate a\nhigh-quality 3D mesh in less than 20 seconds, allowing for immersive\nvisualization and manipulation in run-time XR scenes. We demonstrate the\npracticability of our pipeline through two use cases in XR settings. By\nleveraging natural user inputs and cutting-edge generative AI capabilities, our\napproach can significantly facilitate XR-based creative production and enhance\nuser experiences. Our code and demo will be available at:\nhttps://yueqiu0911.github.io/MS2Mesh-XR/\n","authors":["Yuqi Tong","Yue Qiu","Ruiyang Li","Shi Qiu","Pheng-Ann Heng"],"pdf_url":"https://arxiv.org/pdf/2412.09008v1.pdf","comment":"IEEE AIxVR 2025"},{"id":"http://arxiv.org/abs/2412.08988v1","updated":"2024-12-12T06:39:49Z","published":"2024-12-12T06:39:49Z","title":"EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing","summary":" Given a piece of text, a video clip, and a reference audio, the movie dubbing\ntask aims to generate speech that aligns with the video while cloning the\ndesired voice. The existing methods have two primary deficiencies: (1) They\nstruggle to simultaneously hold audio-visual sync and achieve clear\npronunciation; (2) They lack the capacity to express user-defined emotions. To\naddress these problems, we propose EmoDubber, an emotion-controllable dubbing\narchitecture that allows users to specify emotion type and emotional intensity\nwhile satisfying high-quality lip sync and pronunciation. Specifically, we\nfirst design Lip-related Prosody Aligning (LPA), which focuses on learning the\ninherent consistency between lip motion and prosody variation by duration level\ncontrastive learning to incorporate reasonable alignment. Then, we design\nPronunciation Enhancing (PE) strategy to fuse the video-level phoneme sequences\nby efficient conformer to improve speech intelligibility. Next, the speaker\nidentity adapting module aims to decode acoustics prior and inject the speaker\nstyle embedding. After that, the proposed Flow-based User Emotion Controlling\n(FUEC) is used to synthesize waveform by flow matching prediction network\nconditioned on acoustics prior. In this process, the FUEC determines the\ngradient direction and guidance scale based on the user's emotion instructions\nby the positive and negative guidance mechanism, which focuses on amplifying\nthe desired emotion while suppressing others. Extensive experimental results on\nthree benchmark datasets demonstrate favorable performance compared to several\nstate-of-the-art methods.\n","authors":["Gaoxiang Cong","Jiadong Pan","Liang Li","Yuankai Qi","Yuxin Peng","Anton van den Hengel","Jian Yang","Qingming Huang"],"pdf_url":"https://arxiv.org/pdf/2412.08988v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2310.02422v2","updated":"2024-12-12T03:59:45Z","published":"2023-10-03T20:36:03Z","title":"OneAdapt: Fast Configuration Adaptation for Video Analytics Applications\n via Backpropagation","summary":" Deep learning inference on streaming media data, such as object detection in\nvideo or LiDAR feeds and text extraction from audio waves, is now ubiquitous.\nTo achieve high inference accuracy, these applications typically require\nsignificant network bandwidth to gather high-fidelity data and extensive GPU\nresources to run deep neural networks (DNNs). 
While the high demand for network\nbandwidth and GPU resources could be substantially reduced by optimally\nadapting the configuration knobs, such as video resolution and frame rate,\ncurrent adaptation techniques fail to meet three requirements simultaneously:\nadapt configurations (i) with minimum extra GPU or bandwidth overhead; (ii) to\nreach near-optimal decisions based on how the data affects the final DNN's\naccuracy, and (iii) do so for a range of configuration knobs. This paper\npresents OneAdapt, which meets these requirements by leveraging a\ngradient-ascent strategy to adapt configuration knobs. The key idea is to\nembrace DNNs' differentiability to quickly estimate the accuracy's gradient to\neach configuration knob, called AccGrad. Specifically, OneAdapt estimates\nAccGrad by multiplying two gradients: InputGrad (i.e. how each configuration\nknob affects the input to the DNN) and DNNGrad (i.e. how the DNN input affects\nthe DNN inference output). We evaluate OneAdapt across five types of\nconfigurations, four analytic tasks, and five types of input data. Compared to\nstate-of-the-art adaptation schemes, OneAdapt cuts bandwidth usage and GPU\nusage by 15-59% while maintaining comparable accuracy or improves accuracy by\n1-5% while using equal or fewer resources.\n","authors":["Kuntai Du","Yuhan Liu","Yitian Hao","Qizheng Zhang","Haodong Wang","Yuyang Huang","Ganesh Ananthanarayanan","Junchen Jiang"],"pdf_url":"https://arxiv.org/pdf/2310.02422v2.pdf","comment":"SoCC' 23"},{"id":"http://arxiv.org/abs/2412.06465v3","updated":"2024-12-12T03:56:01Z","published":"2024-12-09T13:10:28Z","title":"Agent Journey Beyond RGB: Unveiling Hybrid Semantic-Spatial\n Environmental Representations for Vision-and-Language Navigation","summary":" Navigating unseen environments based on natural language instructions remains\ndifficult for egocentric agents in Vision-and-Language Navigation (VLN). While\nrecent advancements have yielded promising outcomes, they primarily rely on RGB\nimages for environmental representation, often overlooking the underlying\nsemantic knowledge and spatial cues. Intuitively, humans inherently ground\ntextual semantics within the spatial layout during indoor navigation. Inspired\nby this, we propose a versatile Semantic Understanding and Spatial Awareness\n(SUSA) architecture to facilitate navigation. SUSA includes a Textual Semantic\nUnderstanding (TSU) module, which narrows the modality gap between instructions\nand environments by generating and associating the descriptions of\nenvironmental landmarks in the agent's immediate surroundings. Additionally, a\nDepth-based Spatial Perception (DSP) module incrementally constructs a depth\nexploration map, enabling a more nuanced comprehension of environmental\nlayouts. Experimental results demonstrate that SUSA hybrid semantic-spatial\nrepresentations effectively enhance navigation performance, setting new\nstate-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and\nSOON). 
The source code will be publicly available.\n","authors":["Xuesong Zhang","Yunbo Xu","Jia Li","Zhenzhen Hu","Richnag Hong"],"pdf_url":"https://arxiv.org/pdf/2412.06465v3.pdf","comment":"A technical report consisting of 16 pages, 12 figures, 10 tables"},{"id":"http://arxiv.org/abs/2412.08912v1","updated":"2024-12-12T03:49:22Z","published":"2024-12-12T03:49:22Z","title":"Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K\n Video Restoration under Codec Compression","summary":" In this paper, we introduce DiQP; a novel Transformer-Diffusion model for\nrestoring 8K video quality degraded by codec compression. To the best of our\nknowledge, our model is the first to consider restoring the artifacts\nintroduced by various codecs (AV1, HEVC) by Denoising Diffusion without\nconsidering additional noise. This approach allows us to model the complex,\nnon-Gaussian nature of compression artifacts, effectively learning to reverse\nthe degradation. Our architecture combines the power of Transformers to capture\nlong-range dependencies with an enhanced windowed mechanism that preserves\nspatiotemporal context within groups of pixels across frames. To further\nenhance restoration, the model incorporates auxiliary \"Look Ahead\" and \"Look\nAround\" modules, providing both future and surrounding frame information to aid\nin reconstructing fine details and enhancing overall visual quality. Extensive\nexperiments on different datasets demonstrate that our model outperforms\nstate-of-the-art methods, particularly for high-resolution videos such as 4K\nand 8K, showcasing its effectiveness in restoring perceptually pleasing videos\nfrom highly compressed sources.\n","authors":["Ali Mollaahmadi Dehaghi","Reza Razavi","Mohammad Moshirpour"],"pdf_url":"https://arxiv.org/pdf/2412.08912v1.pdf","comment":"12 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.07689v2","updated":"2024-12-12T02:47:24Z","published":"2024-12-10T17:27:32Z","title":"DriveMM: All-in-One Large Multimodal Model for Autonomous Driving","summary":" Large Multimodal Models (LMMs) have demonstrated exceptional comprehension\nand interpretation capabilities in Autonomous Driving (AD) by incorporating\nlarge language models. Despite the advancements, current data-driven AD\napproaches tend to concentrate on a single dataset and specific tasks,\nneglecting their overall capabilities and ability to generalize. To bridge\nthese gaps, we propose DriveMM, a general large multimodal model designed to\nprocess diverse data inputs, such as images and multi-view videos, while\nperforming a broad spectrum of AD tasks, including perception, prediction, and\nplanning. Initially, the model undergoes curriculum pre-training to process\nvaried visual signals and perform basic visual comprehension and perception\ntasks. Subsequently, we augment and standardize various AD-related datasets to\nfine-tune the model, resulting in an all-in-one LMM for autonomous driving. To\nassess the general capabilities and generalization ability, we conduct\nevaluations on six public benchmarks and undertake zero-shot transfer on an\nunseen dataset, where DriveMM achieves state-of-the-art performance across all\ntasks. We hope DriveMM as a promising solution for future end-to-end autonomous\ndriving applications in the real world. 
Project page with code:\nhttps://github.com/zhijian11/DriveMM.\n","authors":["Zhijian Huang","Chengjian Feng","Feng Yan","Baihui Xiao","Zequn Jie","Yujie Zhong","Xiaodan Liang","Lin Ma"],"pdf_url":"https://arxiv.org/pdf/2412.07689v2.pdf","comment":null}]},"2024-12-11T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.08821v1","updated":"2024-12-11T23:36:20Z","published":"2024-12-11T23:36:20Z","title":"Large Concept Models: Language Modeling in a Sentence Representation\n Space","summary":" LLMs have revolutionized the field of artificial intelligence and have\nemerged as the de-facto tool for many tasks. The current established technology\nof LLMs is to process input and generate output at the token level. This is in\nsharp contrast to humans who operate at multiple levels of abstraction, well\nbeyond single words, to analyze information and to generate creative content.\nIn this paper, we present an attempt at an architecture which operates on an\nexplicit higher-level semantic representation, which we name a concept.\nConcepts are language- and modality-agnostic and represent a higher level idea\nor action in a flow. Hence, we build a \"Large Concept Model\". In this study, as\nproof of feasibility, we assume that a concept corresponds to a sentence, and\nuse an existing sentence embedding space, SONAR, which supports up to 200\nlanguages in both text and speech modalities.\n The Large Concept Model is trained to perform autoregressive sentence\nprediction in an embedding space. We explore multiple approaches, namely MSE\nregression, variants of diffusion-based generation, and models operating in a\nquantized SONAR space. These explorations are performed using 1.6B parameter\nmodels and training data in the order of 1.3T tokens. We then scale one\narchitecture to a model size of 7B parameters and training data of about 2.7T\ntokens. We perform an experimental evaluation on several generative tasks,\nnamely summarization and a new task of summary expansion. Finally, we show that\nour model exhibits impressive zero-shot generalization performance to many\nlanguages, outperforming existing LLMs of the same size. The training code of\nour models is freely available.\n","authors":[" The LCM team","Loïc Barrault","Paul-Ambroise Duquenne","Maha Elbayad","Artyom Kozhevnikov","Belen Alastruey","Pierre Andrews","Mariano Coria","Guillaume Couairon","Marta R. Costa-jussà","David Dale","Hady Elsahar","Kevin Heffernan","João Maria Janeiro","Tuan Tran","Christophe Ropers","Eduardo Sánchez","Robin San Roman","Alexandre Mourachko","Safiyyah Saleem","Holger Schwenk"],"pdf_url":"https://arxiv.org/pdf/2412.08821v1.pdf","comment":"49 pages"},{"id":"http://arxiv.org/abs/2405.05688v3","updated":"2024-12-11T23:21:26Z","published":"2024-05-09T11:38:23Z","title":"Evaluating Dialect Robustness of Language Models via Conversation\n Understanding","summary":" With an evergrowing number of LLMs reporting superlative performance for\nEnglish, their ability to perform equitably for different dialects of English\n($\\textit{i.e.}$, dialect robustness) needs to be ascertained. Specifically, we\nuse English language (US English or Indian English) conversations between\nhumans who play the word-guessing game of 'taboo'. We formulate two evaluative\ntasks: target word prediction (TWP) ($\\textit{i.e.}$, predict the masked target\nword in a conversation) and target word selection (TWS) ($\\textit{i.e.}$,\nselect the most likely masked target word in a conversation, from among a set\nof candidate words). 
Extending MD3, an existing dialectic dataset of\ntaboo-playing conversations, we introduce M-MD3, a target-word-masked version\nof MD3 with the en-US and en-IN subsets. We create two subsets: en-MV (where\nen-US is transformed to include dialectal information) and en-TR (where\ndialectal information is removed from en-IN). We evaluate one open-source\n(Llama3) and two closed-source (GPT-4/3.5) LLMs. LLMs perform significantly\nbetter for US English than Indian English for both TWP and TWS tasks, for all\nsettings, exhibiting marginalisation against the Indian dialect of English.\nWhile GPT-based models perform the best, the comparatively smaller models work\nmore equitably after fine-tuning. Our error analysis shows that the LLMs can\nunderstand the dialect better after fine-tuning using dialectal data. Our\nevaluation methodology exhibits a novel way to examine attributes of language\nmodels using pre-existing dialogue datasets.\n","authors":["Dipankar Srirag","Nihar Ranjan Sahoo","Aditya Joshi"],"pdf_url":"https://arxiv.org/pdf/2405.05688v3.pdf","comment":"SUMEval@COLING'25"},{"id":"http://arxiv.org/abs/2411.15387v2","updated":"2024-12-11T23:00:55Z","published":"2024-11-23T00:02:21Z","title":"From Jack of All Trades to Master of One: Specializing LLM-based\n Autoraters to a Test Set","summary":" As LLMs continue to become more powerful and versatile, human evaluation has\nquickly become intractable at scale and reliance on automatic metrics has\nbecome the norm. Recently, it has been shown that LLMs are themselves\nstate-of-the-art evaluators for many tasks. These Autoraters are typically\ndesigned so that they generalize to new systems and test sets. In practice,\nhowever, evaluation is performed on a small set of fixed, canonical test sets,\nwhich are carefully curated to measure certain capabilities of interest and are\nnot changed frequently. In this work, we design a method which specializes a\nprompted Autorater to a given test set, by leveraging historical ratings on the\ntest set to construct in-context learning (ICL) examples. We evaluate our\nSpecialist method on the task of fine-grained machine translation evaluation,\nand show that it dramatically outperforms the state-of-the-art XCOMET metric by\n54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform\nextensive analyses to understand the representations learned by our Specialist\nmetrics, and how variability in rater behavior affects their performance. We\nalso verify the generalizability and robustness of our Specialist method for\ndesigning automatic metrics across different numbers of ICL examples, LLM\nbackbones, systems to evaluate, and evaluation tasks.\n","authors":["Mara Finkelstein","Dan Deutsch","Parker Riley","Juraj Juraska","Geza Kovacs","Markus Freitag"],"pdf_url":"https://arxiv.org/pdf/2411.15387v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.06592v2","updated":"2024-12-11T22:59:10Z","published":"2024-06-05T19:25:40Z","title":"Improve Mathematical Reasoning in Language Models by Automated Process\n Supervision","summary":" Complex multi-step reasoning tasks, such as solving mathematical problems or\ngenerating code, remain a significant hurdle for even the most advanced large\nlanguage models (LLMs). Verifying LLM outputs with an Outcome Reward Model\n(ORM) is a standard inference-time technique aimed at enhancing the reasoning\nperformance of LLMs. 
However, this still proves insufficient for reasoning\ntasks with a lengthy or multi-hop reasoning chain, where the intermediate\noutcomes are neither properly rewarded nor penalized. Process supervision\naddresses this limitation by assigning intermediate rewards during the\nreasoning process. To date, the methods used to collect process supervision\ndata have relied on either human annotation or per-step Monte Carlo estimation,\nboth prohibitively expensive to scale, thus hindering the broad application of\nthis technique. In response to this challenge, we propose a novel\ndivide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named\n\\textit{OmegaPRM} for the efficient collection of high-quality process\nsupervision data. This algorithm swiftly identifies the first error in the\nChain of Thought (CoT) with binary search and balances the positive and\nnegative examples, thereby ensuring both efficiency and quality. As a result,\nwe are able to collect over 1.5 million process supervision annotations to\ntrain Process Reward Models (PRMs). This fully automated process supervision\nalongside the weighted self-consistency algorithm is able to enhance LLMs' math\nreasoning performances. We improved the success rates of the instruction-tuned\nGemini Pro model from 51\\% to 69.4\\% on MATH500 and from 86.4\\% to 93.6\\% on\nGSM8K. Similarly, we boosted the success rates of Gemma2 27B from 42.3\\% to\n58.2\\% on MATH500 and from 74.0\\% to 92.2\\% on GSM8K. The entire process\noperates without any human intervention or supervision, making our method both\nfinancially and ...\n","authors":["Liangchen Luo","Yinxiao Liu","Rosanne Liu","Samrat Phatale","Meiqi Guo","Harsh Lara","Yunxuan Li","Lei Shu","Yun Zhu","Lei Meng","Jiao Sun","Abhinav Rastogi"],"pdf_url":"https://arxiv.org/pdf/2406.06592v2.pdf","comment":"17 pages, 5 figures, 2 table"},{"id":"http://arxiv.org/abs/2302.04865v3","updated":"2024-12-11T22:55:09Z","published":"2023-02-09T18:59:41Z","title":"ELBA: Learning by Asking for Embodied Visual Navigation and Task\n Completion","summary":" The research community has shown increasing interest in designing intelligent\nembodied agents that can assist humans in accomplishing tasks. Although there\nhave been significant advancements in related vision-language benchmarks, most\nprior work has focused on building agents that follow instructions rather than\nendowing agents the ability to ask questions to actively resolve ambiguities\narising naturally in embodied environments. To address this gap, we propose an\nEmbodied Learning-By-Asking (ELBA) model that learns when and what questions to\nask to dynamically acquire additional information for completing the task. We\nevaluate ELBA on the TEACh vision-dialog navigation and task completion\ndataset. Experimental results show that the proposed method achieves improved\ntask performance compared to baseline models without question-answering\ncapabilities.\n","authors":["Ying Shen","Daniel Bis","Cynthia Lu","Ismini Lourentzou"],"pdf_url":"https://arxiv.org/pdf/2302.04865v3.pdf","comment":"14 pages, 10 figures, WACV 2025"},{"id":"http://arxiv.org/abs/2412.08802v1","updated":"2024-12-11T22:28:12Z","published":"2024-12-11T22:28:12Z","title":"jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images","summary":" Contrastive Language-Image Pretraining (CLIP) is a highly effective method\nfor aligning images and texts in a shared embedding space. 
These models are\nwidely used for tasks such as cross-modal information retrieval and multi-modal\nunderstanding. However, CLIP models often struggle with text-only tasks,\nunderperforming compared to specialized text models. This performance disparity\nforces retrieval systems to rely on separate models for text-only and\nmulti-modal tasks. In this work, we build upon our previous model,\njina-clip-v1, by introducing a refined framework that utilizes multi-task,\nmulti-stage contrastive learning across multiple languages, coupled with an\nimproved training recipe to enhance text-only retrieval. The resulting model,\njina-clip-v2, outperforms its predecessor on text-only and multimodal tasks,\nwhile adding multilingual support, better understanding of complex visual\ndocuments and efficiency gains thanks to Matryoshka Representation Learning and\nvector truncation. The model performs comparably to the state-of-the-art in\nboth multilingual-multimodal and multilingual text retrieval benchmarks,\naddressing the challenge of unifying text-only and multi-modal retrieval\nsystems.\n","authors":["Andreas Koukounas","Georgios Mastrapas","Bo Wang","Mohammad Kalim Akram","Sedigheh Eslami","Michael Günther","Isabelle Mohr","Saba Sturua","Scott Martens","Nan Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2412.08802v1.pdf","comment":"21 pages, 1-10 main paper, 10-12 refs, 12-21 benchmarks"},{"id":"http://arxiv.org/abs/2412.08795v1","updated":"2024-12-11T22:01:30Z","published":"2024-12-11T22:01:30Z","title":"Coverage-based Fairness in Multi-document Summarization","summary":" Fairness in multi-document summarization (MDS) measures whether a system can\ngenerate a summary fairly representing information from documents with\ndifferent social attribute values. Fairness in MDS is crucial since a fair\nsummary can offer readers a comprehensive view. Previous works focus on\nquantifying summary-level fairness using Proportional Representation, a\nfairness measure based on Statistical Parity. However, Proportional\nRepresentation does not consider redundancy in input documents and overlooks\ncorpus-level unfairness. In this work, we propose a new summary-level fairness\nmeasure, Equal Coverage, which is based on coverage of documents with different\nsocial attribute values and considers the redundancy within documents. To\ndetect the corpus-level unfairness, we propose a new corpus-level measure,\nCoverage Parity. Our human evaluations show that our measures align more with\nour definition of fairness. Using our measures, we evaluate the fairness of\nthirteen different LLMs. We find that Claude3-sonnet is the fairest among all\nevaluated LLMs. We also find that almost all LLMs overrepresent different\nsocial attribute values.\n","authors":["Haoyuan Li","Yusen Zhang","Rui Zhang","Snigdha Chaturvedi"],"pdf_url":"https://arxiv.org/pdf/2412.08795v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08313v3","updated":"2024-12-11T21:42:14Z","published":"2024-08-15T17:59:57Z","title":"Can Large Language Models Understand Symbolic Graphics Programs?","summary":" Against the backdrop of enthusiasm for large language models (LLMs), there is\nan urgent need to scientifically assess their capabilities and shortcomings.\nThis is nontrivial in part because it is difficult to find tasks which the\nmodels have not encountered during training. Utilizing symbolic graphics\nprograms, we propose a domain well-suited to test multiple spatial-semantic\nreasoning skills of LLMs. 
Popular in computer graphics, these programs\nprocedurally generate visual data. While LLMs exhibit impressive skills in\ngeneral program synthesis and analysis, symbolic graphics programs offer a new\nlayer of evaluation: they allow us to test an LLM's ability to answer\ndifferent-grained semantic-level questions of the images or 3D geometries\nwithout a vision encoder. To semantically understand the symbolic programs,\nLLMs would need to possess the ability to \"imagine\" and reason how the\ncorresponding graphics content would look with only the symbolic description.\nWe use this task to evaluate LLMs by creating a large benchmark for the\nsemantic visual understanding of symbolic graphics programs, built procedurally\nwith minimal human effort. Particular emphasis is placed on transformations of\nimages that leave the image level semantics invariant while introducing\nsignificant changes to the underlying program. We evaluate commercial and\nopen-source LLMs on our benchmark to assess their ability to reason about\nvisual output of programs, finding that LLMs considered stronger at reasoning\ngenerally perform better. Lastly, we introduce a novel method to improve this\nability -- Symbolic Instruction Tuning (SIT), in which the LLM is finetuned\nwith pre-collected instruction data on symbolic graphics programs.\nInterestingly, we find that SIT not only improves LLM's understanding on\nsymbolic programs, but it also improves general reasoning ability on various\nother benchmarks.\n","authors":["Zeju Qiu","Weiyang Liu","Haiwen Feng","Zhen Liu","Tim Z. Xiao","Katherine M. Collins","Joshua B. Tenenbaum","Adrian Weller","Michael J. Black","Bernhard Schölkopf"],"pdf_url":"https://arxiv.org/pdf/2408.08313v3.pdf","comment":"Technical Report v3 (47 pages, 26 figures, project page:\n https://sgp-bench.github.io/, added visual illusion examples)"},{"id":"http://arxiv.org/abs/2409.19257v2","updated":"2024-12-11T21:21:11Z","published":"2024-09-28T06:20:20Z","title":"LISTN: Lexicon induction with socio-temporal nuance","summary":" In-group language is an important signifier of group dynamics. This paper\nproposes a novel method for inducing lexicons of in-group language, which\nincorporates its socio-temporal context. Existing methods for lexicon induction\ndo not capture the evolving nature of in-group language, nor the social\nstructure of the community. Using dynamic word and user embeddings trained on\nconversations from online anti-women communities, our approach outperforms\nprior methods for lexicon induction. We develop a test set for the task of\nlexicon induction and a new lexicon of manosphere language, validated by human\nexperts, which quantifies the relevance of each term to a specific\nsub-community at a given point in time. Finally, we present novel insights on\nin-group language which illustrate the utility of this approach.\n","authors":["Christine de Kock"],"pdf_url":"https://arxiv.org/pdf/2409.19257v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.18762v2","updated":"2024-12-11T21:19:54Z","published":"2024-06-26T21:17:20Z","title":"Categorical Syllogisms Revisited: A Review of the Logical Reasoning\n Abilities of LLMs for Analyzing Categorical Syllogism","summary":" There have been a huge number of benchmarks proposed to evaluate how large\nlanguage models (LLMs) behave for logic inference tasks. However, it remains an\nopen question how to properly evaluate this ability. 
In this paper, we provide\na systematic overview of prior works on the logical reasoning ability of LLMs\nfor analyzing categorical syllogisms. We first investigate all the possible\nvariations for the categorical syllogisms from a purely logical perspective and\nthen examine the underlying configurations (i.e., mood and figure) tested by\nthe existing datasets. Our results indicate that compared to template-based\nsynthetic datasets, crowdsourcing approaches normally sacrifice the coverage of\nconfigurations (i.e., mood and figure) of categorical syllogisms for more\nlanguage variations, thus bringing challenges to fully testing LLMs under\ndifferent situations. We then proceed to summarize the findings and\nobservations for the performances of LLMs to infer the validity of syllogisms\nfrom the current literature. The error rate breakdown analyses suggest that the\ninterpretation of the quantifiers seems to be the current bottleneck that\nlimits the performances of the LLMs and is thus worth more attention. Finally,\nwe discuss several points that might be worth considering when researchers plan\non the future release of categorical syllogism datasets. We hope our work will\nnot only provide a timely review of the current literature regarding\ncategorical syllogisms, but also motivate more interdisciplinary research\nbetween communities, specifically computational linguists and logicians.\n","authors":["Shi Zong","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2406.18762v2.pdf","comment":"camera-ready version"},{"id":"http://arxiv.org/abs/2412.03681v3","updated":"2024-12-11T20:08:44Z","published":"2024-12-04T19:23:37Z","title":"Acquired TASTE: Multimodal Stance Detection with Textual and Structural\n Embeddings","summary":" Stance detection plays a pivotal role in enabling an extensive range of\ndownstream applications, from discourse parsing to tracing the spread of fake\nnews and the denial of scientific facts. While most stance classification\nmodels rely on textual representation of the utterance in question, prior work\nhas demonstrated the importance of the conversational context in stance\ndetection. In this work we introduce TASTE -- a multimodal architecture for\nstance detection that harmoniously fuses Transformer-based content embedding\nwith unsupervised structural embedding. Through the fine-tuning of a pretrained\ntransformer and the amalgamation with social embedding via a Gated Residual\nNetwork (GRN) layer, our model adeptly captures the complex interplay between\ncontent and conversational structure in determining stance. TASTE achieves\nstate-of-the-art results on common benchmarks, significantly outperforming an\narray of strong baselines. Comparative evaluations underscore the benefits of\nsocial grounding -- emphasizing the criticality of concurrently harnessing both\ncontent and structure for enhanced stance detection.\n","authors":["Guy Barel","Oren Tsur","Dan Vilenchik"],"pdf_url":"https://arxiv.org/pdf/2412.03681v3.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2412.08753v1","updated":"2024-12-11T19:50:37Z","published":"2024-12-11T19:50:37Z","title":"BDA: Bangla Text Data Augmentation Framework","summary":" Data augmentation involves generating synthetic samples that resemble those\nin a given dataset. In resource-limited fields where high-quality data is\nscarce, augmentation plays a crucial role in increasing the volume of training\ndata. 
This paper introduces a Bangla Text Data Augmentation (BDA) Framework\nthat uses both pre-trained models and rule-based methods to create new variants\nof the text. A filtering process is included to ensure that the new text keeps\nthe same meaning as the original while also adding variety in the words used.\nWe conduct a comprehensive evaluation of the framework's effectiveness in\nBangla text classification tasks. Our framework achieved significant\nimprovement in F1 scores across five distinct datasets, delivering performance\nequivalent to models trained on 100\\% of the data while utilizing only 50\\% of\nthe training dataset. Additionally, we explore the impact of data scarcity by\nprogressively reducing the training data and augmenting it through BDA,\nresulting in notable F1 score enhancements. The study offers a thorough\nexamination of BDA's performance, identifying key factors for optimal results\nand addressing its limitations through detailed analysis.\n","authors":["Md. Tariquzzaman","Audwit Nafi Anam","Naimul Haque","Mohsinul Kabir","Hasan Mahmud","Md Kamrul Hasan"],"pdf_url":"https://arxiv.org/pdf/2412.08753v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.01141v2","updated":"2024-12-11T19:37:05Z","published":"2024-10-02T00:43:10Z","title":"Evaluating Deduplication Techniques for Economic Research Paper Titles\n with a Focus on Semantic Similarity using NLP and LLMs","summary":" This study investigates efficient deduplication techniques for a large NLP\ndataset of economic research paper titles. We explore various pairing methods\nalongside established distance measures (Levenshtein distance, cosine\nsimilarity) and a sBERT model for semantic evaluation. Our findings suggest a\npotentially low prevalence of duplicates based on the observed semantic\nsimilarity across different methods. Further exploration with a human-annotated\nground truth set is completed for a more conclusive assessment. The result\nsupports findings from the NLP, LLM based distance metrics.\n","authors":["Doohee You","Samuel Fraiberger"],"pdf_url":"https://arxiv.org/pdf/2410.01141v2.pdf","comment":"6 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.08742v1","updated":"2024-12-11T19:29:36Z","published":"2024-12-11T19:29:36Z","title":"In-Context Learning with Topological Information for Knowledge Graph\n Completion","summary":" Knowledge graphs (KGs) are crucial for representing and reasoning over\nstructured information, supporting a wide range of applications such as\ninformation retrieval, question answering, and decision-making. However, their\neffectiveness is often hindered by incompleteness, limiting their potential for\nreal-world impact. While knowledge graph completion (KGC) has been extensively\nstudied in the literature, recent advances in generative AI models,\nparticularly large language models (LLMs), have introduced new opportunities\nfor innovation. In-context learning has recently emerged as a promising\napproach for leveraging pretrained knowledge of LLMs across a range of natural\nlanguage processing tasks and has been widely adopted in both academia and\nindustry. However, how to utilize in-context learning for effective KGC remains\nrelatively underexplored. 
We develop a novel method that incorporates\ntopological information through in-context learning to enhance KGC performance.\nBy integrating ontological knowledge and graph structure into the context of\nLLMs, our approach achieves strong performance in the transductive setting\ni.e., nodes in the test graph dataset are present in the training graph\ndataset. Furthermore, we apply our approach to KGC in the more challenging\ninductive setting, i.e., nodes in the training graph dataset and test graph\ndataset are disjoint, leveraging the ontology to infer useful information about\nmissing nodes which serve as contextual cues for the LLM during inference. Our\nmethod demonstrates superior performance compared to baselines on the\nILPC-small and ILPC-large datasets.\n","authors":["Udari Madhushani Sehwag","Kassiani Papasotiriou","Jared Vann","Sumitra Ganesh"],"pdf_url":"https://arxiv.org/pdf/2412.08742v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.05348v2","updated":"2024-12-11T19:28:47Z","published":"2024-06-08T04:24:16Z","title":"Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study\n on Two Materials Datasets","summary":" We explore the ability of GPT-4 to perform ad-hoc schema based information\nextraction from scientific literature. We assess specifically whether it can,\nwith a basic prompting approach, replicate two existing material science\ndatasets, given the manuscripts from which they were originally manually\nextracted. We employ materials scientists to perform a detailed manual error\nanalysis to assess where the model struggles to faithfully extract the desired\ninformation, and draw on their insights to suggest research directions to\naddress this broadly important task.\n","authors":["Satanu Ghosh","Neal R. Brodnik","Carolina Frey","Collin Holgate","Tresa M. Pollock","Samantha Daly","Samuel Carton"],"pdf_url":"https://arxiv.org/pdf/2406.05348v2.pdf","comment":"Update on 12/11/2024: Added some relevant literature that we missed\n in previous version of the paper"},{"id":"http://arxiv.org/abs/2412.08737v1","updated":"2024-12-11T19:12:13Z","published":"2024-12-11T19:12:13Z","title":"Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity\n Visual Descriptions","summary":" Multimodal large language models (MLLMs) have made rapid progress in recent\nyears, yet continue to struggle with low-level visual perception (LLVP) --\nparticularly the ability to accurately describe the geometric details of an\nimage. This capability is crucial for applications in areas such as robotics,\nmedical image analysis, and manufacturing. In this paper, we first introduce\nGeoperception, a benchmark designed to evaluate an MLLM's ability to accurately\ntranscribe 2D geometric information from an image. Using this benchmark, we\ndemonstrate the limitations of leading MLLMs, and then conduct a comprehensive\nempirical study to explore strategies for improving their performance on\ngeometric tasks. Our findings highlight the benefits of certain model\narchitectures, training techniques, and data strategies, including the use of\nhigh-fidelity synthetic data and multi-stage training with a data curriculum.\nNotably, we find that a data curriculum enables models to learn challenging\ngeometry understanding tasks which they fail to learn from scratch. Leveraging\nthese insights, we develop Euclid, a family of models specifically optimized\nfor strong low-level geometric perception. 
Although purely trained on synthetic\nmultimodal data, Euclid shows strong generalization ability to novel geometry\nshapes. For instance, Euclid outperforms the best closed-source model,\nGemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and\n10.65% on average across all tasks.\n","authors":["Jiarui Zhang","Ollie Liu","Tianyu Yu","Jinyi Hu","Willie Neiswanger"],"pdf_url":"https://arxiv.org/pdf/2412.08737v1.pdf","comment":"33 pages, 22 figures, 5 tables, 7 algorithms"},{"id":"http://arxiv.org/abs/2412.06845v2","updated":"2024-12-11T19:03:58Z","published":"2024-12-08T02:01:46Z","title":"Fully Open Source Moxin-7B Technical Report","summary":" Recently, Large Language Models (LLMs) have undergone a significant\ntransformation, marked by a rapid rise in both their popularity and\ncapabilities. Leading this evolution are proprietary LLMs like GPT-4 and\nGPT-o1, which have captured widespread attention in the AI community due to\ntheir remarkable performance and versatility. Simultaneously, open-source LLMs,\nsuch as LLaMA and Mistral, have made great contributions to the ever-increasing\npopularity of LLMs due to the ease to customize and deploy the models across\ndiverse applications. Although open-source LLMs present unprecedented\nopportunities for innovation and research, the commercialization of LLMs has\nraised concerns about transparency, reproducibility, and safety. Many\nopen-source LLMs fail to meet fundamental transparency requirements by\nwithholding essential components like training code and data, and some use\nrestrictive licenses whilst claiming to be \"open-source,\" which may hinder\nfurther innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a\nfully open-source LLM developed in accordance with the Model Openness Framework\n(MOF), a ranked classification system that evaluates AI models based on model\ncompleteness and openness, adhering to principles of open science, open source,\nopen data, and open access. Our model achieves the highest MOF classification\nlevel of \"open science\" through the comprehensive release of pre-training code\nand configurations, training and fine-tuning datasets, and intermediate and\nfinal checkpoints. Experiments show that our model achieves superior\nperformance in zero-shot evaluation compared with popular 7B models and\nperforms competitively in few-shot evaluation.\n","authors":["Pu Zhao","Xuan Shen","Zhenglun Kong","Yixin Shen","Sung-En Chang","Timothy Rupprecht","Lei Lu","Enfu Nan","Changdi Yang","Yumei He","Xingchen Xu","Yu Huang","Wei Wang","Yue Chen","Yong He","Yanzhi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.06845v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08686v1","updated":"2024-12-11T18:59:33Z","published":"2024-12-11T18:59:33Z","title":"LatentQA: Teaching LLMs to Decode Activations Into Natural Language","summary":" Interpretability methods seek to understand language model representations,\nyet the outputs of most such methods -- circuits, vectors, scalars -- are not\nimmediately human-interpretable. In response, we introduce LatentQA, the task\nof answering open-ended questions about model activations in natural language.\nTowards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which\nfinetunes a decoder LLM on a dataset of activations and associated\nquestion-answer pairs, similar to how visual instruction tuning trains on\nquestion-answer pairs associated with images. 
We use the decoder for diverse\nreading applications, such as extracting relational knowledge from\nrepresentations or uncovering system prompts governing model behavior. Our\ndecoder also specifies a differentiable loss that we use to control models,\nsuch as debiasing models on stereotyped sentences and controlling the sentiment\nof generations. Finally, we extend LatentQA to reveal harmful model\ncapabilities, such as generating recipes for bioweapons and code for hacking.\n","authors":["Alexander Pan","Lijie Chen","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2412.08686v1.pdf","comment":"Project page is at https://latentqa.github.io"},{"id":"http://arxiv.org/abs/2412.08639v1","updated":"2024-12-11T18:58:41Z","published":"2024-12-11T18:58:41Z","title":"Fast Prompt Alignment for Text-to-Image Generation","summary":" Text-to-image generation has advanced rapidly, yet aligning complex textual\nprompts with generated visuals remains challenging, especially with intricate\nobject relationships and fine-grained details. This paper introduces Fast\nPrompt Alignment (FPA), a prompt optimization framework that leverages a\none-pass approach, enhancing text-to-image alignment efficiency without the\niterative overhead typical of current methods like OPT2I. FPA uses large\nlanguage models (LLMs) for single-iteration prompt paraphrasing, followed by\nfine-tuning or in-context learning with optimized prompts to enable real-time\ninference, reducing computational demands while preserving alignment fidelity.\nExtensive evaluations on the COCO Captions and PartiPrompts datasets\ndemonstrate that FPA achieves competitive text-image alignment scores at a\nfraction of the processing time, as validated through both automated metrics\n(TIFA, VQA) and human evaluation. A human study with expert annotators further\nreveals a strong correlation between human alignment judgments and automated\nscores, underscoring the robustness of FPA's improvements. The proposed method\nshowcases a scalable, efficient alternative to iterative prompt optimization,\nenabling broader applicability in real-time, high-demand settings. The codebase\nis provided to facilitate further research:\nhttps://github.com/tiktok/fast_prompt_alignment\n","authors":["Khalil Mrini","Hanlin Lu","Linjie Yang","Weilin Huang","Heng Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08639v1.pdf","comment":"TikTok Technical Report"},{"id":"http://arxiv.org/abs/2412.08635v1","updated":"2024-12-11T18:57:32Z","published":"2024-12-11T18:57:32Z","title":"Multimodal Latent Language Modeling with Next-Token Diffusion","summary":" Multimodal generative models require a unified approach to handle both\ndiscrete data (e.g., text and code) and continuous data (e.g., image, audio,\nvideo). In this work, we propose Latent Language Modeling (LatentLM), which\nseamlessly integrates continuous and discrete data using causal Transformers.\nSpecifically, we employ a variational autoencoder (VAE) to represent continuous\ndata as latent vectors and introduce next-token diffusion for autoregressive\ngeneration of these vectors. Additionally, we develop $\\sigma$-VAE to address\nthe challenges of variance collapse, which is crucial for autoregressive\nmodeling. Extensive experiments demonstrate the effectiveness of LatentLM\nacross various modalities. In image generation, LatentLM surpasses Diffusion\nTransformers in both performance and scalability. 
When integrated into\nmultimodal large language models, LatentLM provides a general-purpose interface\nthat unifies multimodal generation and understanding. Experimental results show\nthat LatentLM achieves favorable performance compared to Transfusion and vector\nquantized models in the setting of scaling up training tokens. In\ntext-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2\nmodel in speaker similarity and robustness, while requiring 10x fewer decoding\nsteps. The results establish LatentLM as a highly effective and scalable\napproach to advance large multimodal models.\n","authors":["Yutao Sun","Hangbo Bao","Wenhui Wang","Zhiliang Peng","Li Dong","Shaohan Huang","Jianyong Wang","Furu Wei"],"pdf_url":"https://arxiv.org/pdf/2412.08635v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08615v1","updated":"2024-12-11T18:37:56Z","published":"2024-12-11T18:37:56Z","title":"Exploiting the Index Gradients for Optimization-Based Jailbreaking on\n Large Language Models","summary":" Despite the advancements in training Large Language Models (LLMs) with\nalignment techniques to enhance the safety of generated content, these models\nremain susceptible to jailbreak, an adversarial attack method that exposes\nsecurity vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG)\nmethod has demonstrated the ability to automatically generate adversarial\nsuffixes that jailbreak state-of-the-art LLMs. However, the optimization\nprocess involved in GCG is highly time-consuming, rendering the jailbreaking\npipeline inefficient. In this paper, we investigate the process of GCG and\nidentify an issue of Indirect Effect, the key bottleneck of the GCG\noptimization. To this end, we propose the Model Attack Gradient Index GCG\n(MAGIC), that addresses the Indirect Effect by exploiting the gradient\ninformation of the suffix tokens, thereby accelerating the procedure by having\nless computation and fewer iterations. Our experiments on AdvBench show that\nMAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates\n(ASR) on par or even higher than other baselines. Our MAGIC achieved an ASR of\n74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on\nGPT-3.5. Code is available at https://github.com/jiah-li/magic.\n","authors":["Jiahui Li","Yongchang Hao","Haoyu Xu","Xing Wang","Yu Hong"],"pdf_url":"https://arxiv.org/pdf/2412.08615v1.pdf","comment":"13 pages,2 figures, accepted by The 31st International Conference on\n Computational Linguistics"},{"id":"http://arxiv.org/abs/2412.08599v1","updated":"2024-12-11T18:18:07Z","published":"2024-12-11T18:18:07Z","title":"Der Effizienz- und Intelligenzbegriff in der Lexikographie und\n kuenstlichen Intelligenz: kann ChatGPT die lexikographische Textsorte\n nachbilden?","summary":" By means of pilot experiments for the language pair German and Galician, this\npaper examines the concept of efficiency and intelligence in lexicography and\nartificial intelligence, AI. The aim of the experiments is to gain empirically\nand statistically based insights into the lexicographical text type,dictionary\narticle, in the responses of ChatGPT 3.5, as well as into the lexicographical\ndata on which this chatbot was trained. Both quantitative and qualitative\nmethods are used for this purpose. The analysis is based on the evaluation of\nthe outputs of several sessions with the same prompt in ChatGPT 3.5. 
On the one\nhand, the algorithmic performance of intelligent systems is evaluated in\ncomparison with data from lexicographical works. On the other hand, the ChatGPT\ndata supplied is analysed using specific text passages of the aforementioned\nlexicographical text type. The results of this study not only help to evaluate\nthe efficiency of this chatbot regarding the creation of dictionary articles,\nbut also to delve deeper into the concept of intelligence, the thought\nprocesses and the actions to be carried out in both disciplines.\n","authors":["Ivan Arias-Arias","Maria Jose Dominguez Vazquez","Carlos Valcarcel Riveiro"],"pdf_url":"https://arxiv.org/pdf/2412.08599v1.pdf","comment":"25 pages, in German language"},{"id":"http://arxiv.org/abs/2403.12151v3","updated":"2024-12-11T18:12:43Z","published":"2024-03-18T18:08:44Z","title":"Fusing Domain-Specific Content from Large Language Models into Knowledge\n Graphs for Enhanced Zero Shot Object State Classification","summary":" Domain-specific knowledge can significantly contribute to addressing a wide\nvariety of vision tasks. However, the generation of such knowledge entails\nconsiderable human labor and time costs. This study investigates the potential\nof Large Language Models (LLMs) in generating and providing domain-specific\ninformation through semantic embeddings. To achieve this, an LLM is integrated\ninto a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors\nin the context of the Vision-based Zero-shot Object State Classification task.\nWe thoroughly examine the behavior of the LLM through an extensive ablation\nstudy. Our findings reveal that the integration of LLM-based embeddings, in\ncombination with general-purpose pre-trained embeddings, leads to substantial\nperformance improvements. Drawing insights from this ablation study, we conduct\na comparative analysis against competing models, thereby highlighting the\nstate-of-the-art performance achieved by the proposed approach.\n","authors":["Filippos Gouidis","Katerina Papantoniou","Konstantinos Papoutsakis","Theodore Patkos","Antonis Argyros","Dimitris Plexousakis"],"pdf_url":"https://arxiv.org/pdf/2403.12151v3.pdf","comment":"Accepted at the AAAI-MAKE 2024"},{"id":"http://arxiv.org/abs/2402.16822v3","updated":"2024-12-11T18:07:25Z","published":"2024-02-26T18:47:27Z","title":"Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts","summary":" As large language models (LLMs) become increasingly prevalent across many\nreal-world applications, understanding and enhancing their robustness to\nadversarial attacks is of paramount importance. Existing methods for\nidentifying adversarial prompts tend to focus on specific domains, lack\ndiversity, or require extensive human annotations. To address these\nlimitations, we present Rainbow Teaming, a novel black-box approach for\nproducing a diverse collection of adversarial prompts. Rainbow Teaming casts\nadversarial prompt generation as a quality-diversity problem and uses\nopen-ended search to generate prompts that are both effective and diverse.\nFocusing on the safety domain, we use Rainbow Teaming to target various\nstate-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach\nreveals hundreds of effective adversarial prompts, with an attack success rate\nexceeding 90% across all tested models. 
Furthermore, we demonstrate that\nprompts generated by Rainbow Teaming are highly transferable and that\nfine-tuning models with synthetic data generated by our method significantly\nenhances their safety without sacrificing general performance or helpfulness.\nWe additionally explore the versatility of Rainbow Teaming by applying it to\nquestion answering and cybersecurity, showcasing its potential to drive robust\nopen-ended self-improvement in a wide range of applications.\n","authors":["Mikayel Samvelyan","Sharath Chandra Raparthy","Andrei Lupu","Eric Hambro","Aram H. Markosyan","Manish Bhatt","Yuning Mao","Minqi Jiang","Jack Parker-Holder","Jakob Foerster","Tim Rocktäschel","Roberta Raileanu"],"pdf_url":"https://arxiv.org/pdf/2402.16822v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08587v1","updated":"2024-12-11T18:06:44Z","published":"2024-12-11T18:06:44Z","title":"Advancing Single- and Multi-task Text Classification through Large\n Language Model Fine-tuning","summary":" Both encoder-only models (e.g., BERT, RoBERTa) and large language models\n(LLMs, e.g., Llama3) have been widely used for text classification tasks.\nHowever, there is a lack of systematic studies comparing the performance of\nencoder-based models and LLMs in text classification, particularly when\nfine-tuning is involved. This study employed a diverse range of models and\nmethods, varying in size and architecture, and including both fine-tuned and\npre-trained approaches. We first assessed the performances of these LLMs on the\n20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only\nRoBERTa models. Additionally, we explored the multi-task capabilities of both\nmodel types by combining multiple classification tasks, including intent\ndetection and slot-filling, into a single model using data from both datasets.\nOur results indicate that fully fine-tuned Llama3-70B models outperform\nRoBERTa-large and other decoder LLMs across various classification tasks and\ndatasets. Moreover, the consolidated multi-task fine-tuned LLMs matched the\nperformance of dual-model setups in both tasks across both datasets. Overall,\nour study provides a comprehensive benchmark of encoder-only and LLM models on\ntext classification tasks and demonstrates a method to combine two or more\nfully fine-tuned decoder LLMs for reduced latency and equivalent performance.\n","authors":["Hang Zhao","Qile P. Chen","Yijing Barry Zhang","Gang Yang"],"pdf_url":"https://arxiv.org/pdf/2412.08587v1.pdf","comment":"9 pages, 3 tables"},{"id":"http://arxiv.org/abs/2412.08578v1","updated":"2024-12-11T17:54:01Z","published":"2024-12-11T17:54:01Z","title":"Machine Learning Information Retrieval and Summarisation to Support\n Systematic Review on Outcomes Based Contracting","summary":" As academic literature proliferates, traditional review methods are\nincreasingly challenged by the sheer volume and diversity of available\nresearch. This article presents a study that aims to address these challenges\nby enhancing the efficiency and scope of systematic reviews in the social\nsciences through advanced machine learning (ML) and natural language processing\n(NLP) tools. In particular, we focus on automating stages within the systematic\nreviewing process that are time-intensive and repetitive for human annotators\nand which lend themselves to immediate scalability through tools such as\ninformation retrieval and summarisation guided by expert advice. 
The article\nconcludes with a summary of lessons learnt regarding the integrated approach\ntowards systematic reviews and future directions for improvement, including\nexplainability.\n","authors":["Iman Munire Bilal","Zheng Fang","Miguel Arana-Catania","Felix-Anselm van Lier","Juliana Outes Velarde","Harry Bregazzi","Eleanor Carter","Mara Airoldi","Rob Procter"],"pdf_url":"https://arxiv.org/pdf/2412.08578v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08564v1","updated":"2024-12-11T17:32:21Z","published":"2024-12-11T17:32:21Z","title":"Can We Generate Visual Programs Without Prompting LLMs?","summary":" Visual programming prompts LLMs (large language models) to generate\nexecutable code for visual tasks like visual question answering (VQA).\nPrompt-based methods are difficult to improve while also being unreliable and\ncostly in both time and money. Our goal is to develop an efficient visual\nprogramming system without 1) using prompt-based LLMs at inference time and 2)\na large set of program and answer annotations. We develop a synthetic data\naugmentation approach and alternative program generation method based on\ndecoupling programs into higher-level skills called templates and the\ncorresponding arguments. Our results show that with data augmentation,\nprompt-free smaller LLMs ($\\approx$ 1B parameters) are competitive with\nstate-of-the-art models with the added benefit of much faster inference\n","authors":["Michal Shlapentokh-Rothman","Yu-Xiong Wang","Derek Hoiem"],"pdf_url":"https://arxiv.org/pdf/2412.08564v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08548v1","updated":"2024-12-11T17:06:12Z","published":"2024-12-11T17:06:12Z","title":"Bilevel Joint Unsupervised and Supervised Training for Automatic Speech\n Recognition","summary":" In this paper, we propose a bilevel joint unsupervised and supervised\ntraining (BL-JUST) framework for automatic speech recognition. Compared to the\nconventional pre-training and fine-tuning strategy which is a disconnected\ntwo-stage process, BL-JUST tries to optimize an acoustic model such that it\nsimultaneously minimizes both the unsupervised and supervised loss functions.\nBecause BL-JUST seeks matched local optima of both loss functions, acoustic\nrepresentations learned by the acoustic model strike a good balance between\nbeing generic and task-specific. We solve the BL-JUST problem using\npenalty-based bilevel gradient descent and evaluate the trained deep neural\nnetwork acoustic models on various datasets with a variety of architectures and\nloss functions. We show that BL-JUST can outperform the widely-used\npre-training and fine-tuning strategy and some other popular semi-supervised\ntechniques.\n","authors":["Xiaodong Cui","A F M Saif","Songtao Lu","Lisha Chen","Tianyi Chen","Brian Kingsbury","George Saon"],"pdf_url":"https://arxiv.org/pdf/2412.08548v1.pdf","comment":"Accepted by IEEE/ACM Transactions on Audio, Speech and Language\n Processing"},{"id":"http://arxiv.org/abs/2412.08542v1","updated":"2024-12-11T16:59:31Z","published":"2024-12-11T16:59:31Z","title":"MaestroMotif: Skill Design from Artificial Intelligence Feedback","summary":" Describing skills in natural language has the potential to provide an\naccessible way to inject human knowledge about decision-making into an AI\nsystem. We present MaestroMotif, a method for AI-assisted skill design, which\nyields high-performing and adaptable agents. MaestroMotif leverages the\ncapabilities of Large Language Models (LLMs) to effectively create and reuse\nskills. 
It first uses an LLM's feedback to automatically design rewards\ncorresponding to each skill, starting from their natural language description.\nThen, it employs an LLM's code generation abilities, together with\nreinforcement learning, for training the skills and combining them to implement\ncomplex behaviors specified in language. We evaluate MaestroMotif using a suite\nof complex tasks in the NetHack Learning Environment (NLE), demonstrating that\nit surpasses existing approaches in both performance and usability.\n","authors":["Martin Klissarov","Mikael Henaff","Roberta Raileanu","Shagun Sodhani","Pascal Vincent","Amy Zhang","Pierre-Luc Bacon","Doina Precup","Marlos C. Machado","Pierluca D'Oro"],"pdf_url":"https://arxiv.org/pdf/2412.08542v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08529v1","updated":"2024-12-11T16:38:48Z","published":"2024-12-11T16:38:48Z","title":"TECO: Improving Multimodal Intent Recognition with Text Enhancement\n through Commonsense Knowledge Extraction","summary":" The objective of multimodal intent recognition (MIR) is to leverage various\nmodalities-such as text, video, and audio-to detect user intentions, which is\ncrucial for understanding human language and context in dialogue systems.\nDespite advances in this field, two main challenges persist: (1) effectively\nextracting and utilizing semantic information from robust textual features; (2)\naligning and fusing non-verbal modalities with verbal ones effectively. This\npaper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO)\nto address these challenges. We begin by extracting relations from both\ngenerated and retrieved knowledge to enrich the contextual information in the\ntext modality. Subsequently, we align and integrate visual and acoustic\nrepresentations with these enhanced text features to form a cohesive multimodal\nrepresentation. Our experimental results show substantial improvements over\nexisting baseline methods.\n","authors":["Quynh-Mai Thi Nguyen","Lan-Nhi Thi Nguyen","Cam-Van Thi Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.08529v1.pdf","comment":"Accepted at PACLIC 2024"},{"id":"http://arxiv.org/abs/2412.08528v1","updated":"2024-12-11T16:38:34Z","published":"2024-12-11T16:38:34Z","title":"Continual Learning for Encoder-only Language Models via a Discrete\n Key-Value Bottleneck","summary":" Continual learning remains challenging across various natural language\nunderstanding tasks. When models are updated with new training data, they risk\ncatastrophic forgetting of prior knowledge. In the present work, we introduce a\ndiscrete key-value bottleneck for encoder-only language models, allowing for\nefficient continual learning by requiring only localized updates. Inspired by\nthe success of a discrete key-value bottleneck in vision, we address new and\nNLP-specific challenges. We experiment with different bottleneck architectures\nto find the most suitable variants regarding language, and present a generic\ndiscrete key initialization technique for NLP that is task independent. We\nevaluate the discrete key-value bottleneck in four continual learning NLP\nscenarios and demonstrate that it alleviates catastrophic forgetting. 
We\nshowcase that it offers competitive performance to other popular continual\nlearning methods, with lower computational costs.\n","authors":["Andor Diera","Lukas Galke","Fabian Karl","Ansgar Scherp"],"pdf_url":"https://arxiv.org/pdf/2412.08528v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14654v2","updated":"2024-12-11T16:38:01Z","published":"2024-11-22T00:59:25Z","title":"Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis\n Perspective","summary":" Large Language Models (LLMs) have revolutionized natural language processing\n(NLP) by delivering state-of-the-art performance across a variety of tasks.\nAmong these, Transformer-based models like BERT and GPT rely on pooling layers\nto aggregate token-level embeddings into sentence-level representations. Common\npooling mechanisms such as Mean, Max, and Weighted Sum play a pivotal role in\nthis aggregation process. Despite their widespread use, the comparative\nperformance of these strategies on different LLM architectures remains\nunderexplored. To address this gap, this paper investigates the effects of\nthese pooling mechanisms on two prominent LLM families -- BERT and GPT, in the\ncontext of sentence-level sentiment analysis. Comprehensive experiments reveal\nthat each pooling mechanism exhibits unique strengths and weaknesses depending\non the task's specific requirements. Our findings underline the importance of\nselecting pooling methods tailored to the demands of particular applications,\nprompting a re-evaluation of common assumptions regarding pooling operations.\nBy offering actionable insights, this study contributes to the optimization of\nLLM-based models for downstream tasks.\n","authors":["Jinming Xing","Ruilin Xing","Yan Sun"],"pdf_url":"https://arxiv.org/pdf/2411.14654v2.pdf","comment":"4 figures"},{"id":"http://arxiv.org/abs/2412.08521v1","updated":"2024-12-11T16:35:13Z","published":"2024-12-11T16:35:13Z","title":"EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache\n Compression Based on Global-Local Importance","summary":" As large language models (LLMs) continue to advance, the demand for higher\nquality and faster processing of long contexts across various applications is\ngrowing. KV cache is widely adopted as it stores previously generated key and\nvalue tokens, effectively reducing redundant computations during inference.\nHowever, as memory overhead becomes a significant concern, efficient\ncompression of KV cache has gained increasing attention. Most existing methods\nperform compression from two perspectives: identifying important tokens and\ndesigning compression strategies. However, these approaches often produce\nbiased distributions of important tokens due to the influence of accumulated\nattention scores or positional encoding. Furthermore, they overlook the\nsparsity and redundancy across different heads, which leads to difficulties in\npreserving the most effective information at the head level. To this end, we\npropose EMS to overcome these limitations, while achieving better KV cache\ncompression under extreme compression ratios. Specifically, we introduce a\nGlobal-Local score that combines accumulated attention scores from both global\nand local KV tokens to better identify the token importance. For the\ncompression strategy, we design an adaptive and unified Evict-then-Merge\nframework that accounts for the sparsity and redundancy of KV tokens across\ndifferent heads. 
Additionally, we implement the head-wise parallel compression\nthrough a zero-class mechanism to enhance efficiency. Extensive experiments\ndemonstrate our SOTA performance even under extreme compression ratios. EMS\nconsistently achieves the lowest perplexity, improves scores by over 1.28\npoints across four LLMs on LongBench under a 256 cache budget, and preserves\n95% retrieval accuracy with a cache budget less than 2% of the context length\nin the Needle-in-a-Haystack task.\n","authors":["Yingxin Li","Ye Li","Yuan Meng","Xinzhu Ma","Zihan Geng","Shutao Xia","Zhi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08521v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08520v1","updated":"2024-12-11T16:34:23Z","published":"2024-12-11T16:34:23Z","title":"GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek","summary":" We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP)\ntoolkit developed specifically for modern Greek. The toolkit provides\nstate-of-the-art performance in five core NLP tasks, namely part-of-speech\ntagging, morphological tagging, dependency parsing, named entity recognition,\nand Greeklish-to-Greek transliteration. The toolkit is based on pre-trained\nTransformers, it is freely available, and can be easily installed in Python\n(pip install gr-nlp-toolkit). It is also accessible through a demonstration\nplatform on HuggingFace, along with a publicly available API for non-commercial\nuse. We discuss the functionality provided for each task, the underlying\nmethods, experiments against comparable open-source toolkits, and future\npossible enhancements. The toolkit is available at:\nhttps://github.com/nlpaueb/gr-nlp-toolkit\n","authors":["Lefteris Loukas","Nikolaos Smyrnioudis","Chrysa Dikonomaki","Spyros Barbakos","Anastasios Toumazatos","John Koutsikakis","Manolis Kyriakakis","Mary Georgiou","Stavros Vassos","John Pavlopoulos","Ion Androutsopoulos"],"pdf_url":"https://arxiv.org/pdf/2412.08520v1.pdf","comment":"Accepted Demo Paper @ COLING 2025 (Github:\n https://github.com/nlpaueb/gr-nlp-toolkit/, Demo:\n https://huggingface.co/spaces/AUEB-NLP/greek-nlp-toolkit-demo, API:\n https://huggingface.co/spaces/AUEB-NLP/The-Greek-NLP-API)"},{"id":"http://arxiv.org/abs/2412.08519v1","updated":"2024-12-11T16:32:41Z","published":"2024-12-11T16:32:41Z","title":"Bridging Relevance and Reasoning: Rationale Distillation in\n Retrieval-Augmented Generation","summary":" The reranker and generator are two critical components in the\nRetrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking\nrelevant documents and generating responses. However, due to differences in\npre-training data and objectives, there is an inevitable gap between the\ndocuments ranked as relevant by the reranker and those required by the\ngenerator to support answering the query. To address this gap, we propose\nRADIO, a novel and practical preference alignment framework with RAtionale\nDIstillatiOn. Specifically, we first propose a rationale extraction method that\nleverages the reasoning capabilities of Large Language Models (LLMs) to extract\nthe rationales necessary for answering the query. Subsequently, a\nrationale-based alignment process is designed to rerank the documents based on\nthe extracted rationales, and fine-tune the reranker to align the preferences.\nWe conduct extensive experiments on two tasks across three datasets to\ndemonstrate the effectiveness of our approach compared to baseline methods. 
Our\ncode is released online to ease reproduction.\n","authors":["Pengyue Jia","Derong Xu","Xiaopeng Li","Zhaocheng Du","Xiangyang Li","Xiangyu Zhao","Yichao Wang","Yuhao Wang","Huifeng Guo","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2412.08519v1.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2207.01772v3","updated":"2024-12-11T16:30:19Z","published":"2022-07-05T02:18:49Z","title":"Vision-and-Language Pretraining","summary":" With the burgeoning amount of data of image-text pairs and diversity of\nVision-and-Language (V\\&L) tasks, scholars have introduced an abundance of deep\nlearning models in this research domain. Furthermore, in recent years, transfer\nlearning has also shown tremendous success in Computer Vision for tasks such as\nImage Classification, Object Detection, etc., and in Natural Language\nProcessing for Question Answering, Machine Translation, etc. Inheriting the\nspirit of Transfer Learning, research works in V\\&L have devised multiple\npretraining techniques on large-scale datasets in order to enhance the\nperformance of downstream tasks. The aim of this article is to provide a\ncomprehensive revision of contemporary V\\&L pretraining models. In particular,\nwe categorize and delineate pretraining approaches, along with the summary of\nstate-of-the-art vision-and-language pretrained models. Moreover, a list of\ntraining datasets and downstream tasks is supplied to further polish the\nperspective into V\\&L pretraining. Lastly, we decided to take a further step to\ndiscuss numerous directions for future research.\n","authors":["Thong Nguyen","Cong-Duy Nguyen","Xiaobao Wu","See-Kiong Ng","Anh Tuan Luu"],"pdf_url":"https://arxiv.org/pdf/2207.01772v3.pdf","comment":"The content of the paper has been outdated. I would like to rewrite a\n new version with completely new information."},{"id":"http://arxiv.org/abs/2412.08508v1","updated":"2024-12-11T16:18:52Z","published":"2024-12-11T16:18:52Z","title":"Comparative Opinion Mining in Product Reviews: Multi-perspective\n Prompt-based Learning","summary":" Comparative reviews are pivotal in understanding consumer preferences and\ninfluencing purchasing decisions. Comparative Quintuple Extraction (COQE) aims\nto identify five key components in text: the target entity, compared entities,\ncompared aspects, opinions on these aspects, and polarity. Extracting precise\ncomparative information from product reviews is challenging due to nuanced\nlanguage and sequential task errors in traditional methods. To mitigate these\nproblems, we propose MTP-COQE, an end-to-end model designed for COQE.\nLeveraging multi-perspective prompt-based learning, MTP-COQE effectively guides\nthe generative model in comparative opinion mining tasks. Evaluation on the\nCamera-COQE (English) and VCOM (Vietnamese) datasets demonstrates MTP-COQE's\nefficacy in automating COQE, achieving superior performance with a 1.41% higher\nF1 score than the previous baseline models on the English dataset.\nAdditionally, we designed a strategy to limit the generative model's creativity\nto ensure the output meets expectations. 
We also performed data augmentation to\naddress data imbalance and to prevent the model from becoming biased towards\ndominant samples.\n","authors":["Hai-Yen Thi Nguyen","Cam-Van Thi Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.08508v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17196v3","updated":"2024-12-11T15:45:21Z","published":"2024-10-22T17:15:20Z","title":"VoiceBench: Benchmarking LLM-Based Voice Assistants","summary":" Building on the success of large language models (LLMs), recent advancements\nsuch as GPT-4o have enabled real-time speech interactions through LLM-based\nvoice assistants, offering a significantly improved user experience compared to\ntraditional text-based interactions. However, the absence of benchmarks\ndesigned to evaluate these speech interaction capabilities has hindered\nprogress of LLM-based voice assistants development. Current evaluations focus\nprimarily on automatic speech recognition (ASR) or general knowledge evaluation\nwith clean speeches, neglecting the more intricate, real-world scenarios that\ninvolve diverse speaker characteristics, environmental and content factors. To\naddress this, we introduce VoiceBench, the first benchmark designed to provide\na multi-faceted evaluation of LLM-based voice assistants. VoiceBench also\nincludes both real and synthetic spoken instructions that incorporate the above\nthree key real-world variations. Extensive experiments reveal the limitations\nof current LLM-based voice assistant models and offer valuable insights for\nfuture research and development in this field.\n","authors":["Yiming Chen","Xianghu Yue","Chen Zhang","Xiaoxue Gao","Robby T. Tan","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2410.17196v3.pdf","comment":"Work in progress. Data is available at\n https://github.com/MatthewCYM/VoiceBench"},{"id":"http://arxiv.org/abs/2412.08473v1","updated":"2024-12-11T15:42:22Z","published":"2024-12-11T15:42:22Z","title":"Multi-perspective Alignment for Increasing Naturalness in Neural Machine\n Translation","summary":" Neural machine translation (NMT) systems amplify lexical biases present in\ntheir training data, leading to artificially impoverished language in output\ntranslations. These language-level characteristics render automatic\ntranslations different from text originally written in a language and human\ntranslations, which hinders their usefulness in for example creating evaluation\ndatasets. Attempts to increase naturalness in NMT can fall short in terms of\ncontent preservation, where increased lexical diversity comes at the cost of\ntranslation accuracy. Inspired by the reinforcement learning from human\nfeedback framework, we introduce a novel method that rewards both naturalness\nand content preservation. 
We experiment with multiple perspectives to produce\nmore natural translations, aiming at reducing machine and human translationese.\nWe evaluate our method on English-to-Dutch literary translation, and find that\nour best model produces translations that are lexically richer and exhibit more\nproperties of human-written language, without loss in translation accuracy.\n","authors":["Huiyuan Lai","Esther Ploeger","Rik van Noord","Antonio Toral"],"pdf_url":"https://arxiv.org/pdf/2412.08473v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.12910v2","updated":"2024-12-11T15:40:54Z","published":"2024-05-21T16:30:25Z","title":"Topic Classification of Case Law Using a Large Language Model and a New\n Taxonomy for UK Law: AI Insights into Summary Judgment","summary":" This paper addresses a critical gap in legal analytics by developing and\napplying a novel taxonomy for topic classification of summary judgment cases in\nthe United Kingdom. Using a curated dataset of summary judgment cases, we use\nthe Large Language Model Claude 3 Opus to explore functional topics and trends.\nWe find that Claude 3 Opus correctly classified the topic with an accuracy of\n87.13% and an F1 score of 0.87. The analysis reveals distinct patterns in the\napplication of summary judgments across various legal domains. As case law in\nthe United Kingdom is not originally labelled with keywords or a topic\nfiltering option, the findings not only refine our understanding of the\nthematic underpinnings of summary judgments but also illustrate the potential\nof combining traditional and AI-driven approaches in legal classification.\nTherefore, this paper provides a new and general taxonomy for UK law. The\nimplications of this work serve as a foundation for further research and policy\ndiscussions in the field of judicial administration and computational legal\nresearch methodologies.\n","authors":["Holli Sargeant","Ahmed Izzidien","Felix Steffek"],"pdf_url":"https://arxiv.org/pdf/2405.12910v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08467v1","updated":"2024-12-11T15:32:24Z","published":"2024-12-11T15:32:24Z","title":"Bootstrapping Language-Guided Navigation Learning with Self-Refining\n Data Flywheel","summary":" Creating high-quality data for training robust language-instructed agents is\na long-lasting challenge in embodied AI. In this paper, we introduce a\nSelf-Refining Data Flywheel (SRDF) that generates high-quality and large-scale\nnavigational instruction-trajectory pairs by iteratively refining the data pool\nthrough the collaboration between two models, the instruction generator and the\nnavigator, without any human-in-the-loop annotation. Specifically, SRDF starts\nwith using a base generator to create an initial data pool for training a base\nnavigator, followed by applying the trained navigator to filter the data pool.\nThis leads to higher-fidelity data to train a better generator, which can, in\nturn, produce higher-quality data for training the next-round navigator. Such a\nflywheel establishes a data self-refining process, yielding a continuously\nimproved and highly effective dataset for large-scale language-guided\nnavigation learning. Our experiments demonstrate that after several flywheel\nrounds, the navigator elevates the performance boundary from 70% to 78% SPL on\nthe classic R2R test set, surpassing human performance (76%) for the first\ntime. 
Meanwhile, this process results in a superior generator, evidenced by a\nSPICE increase from 23.5 to 26.2, better than all previous VLN instruction\ngeneration methods. Finally, we demonstrate the scalability of our method\nthrough increasing environment and instruction diversity, and the\ngeneralization ability of our pre-trained navigator across various downstream\nnavigation tasks, surpassing state-of-the-art methods by a large margin in all\ncases.\n","authors":["Zun Wang","Jialu Li","Yicong Hong","Songze Li","Kunchang Li","Shoubin Yu","Yi Wang","Yu Qiao","Yali Wang","Mohit Bansal","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08467v1.pdf","comment":"28 pages, Code and data are available at\n https://github.com/wz0919/VLN-SRDF"},{"id":"http://arxiv.org/abs/2410.07095v3","updated":"2024-12-11T15:02:22Z","published":"2024-10-09T17:34:27Z","title":"MLE-bench: Evaluating Machine Learning Agents on Machine Learning\n Engineering","summary":" We introduce MLE-bench, a benchmark for measuring how well AI agents perform\nat machine learning engineering. To this end, we curate 75 ML\nengineering-related competitions from Kaggle, creating a diverse set of\nchallenging tasks that test real-world ML engineering skills such as training\nmodels, preparing datasets, and running experiments. We establish human\nbaselines for each competition using Kaggle's publicly available leaderboards.\nWe use open-source agent scaffolds to evaluate several frontier language models\non our benchmark, finding that the best-performing setup--OpenAI's o1-preview\nwith AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in\n16.9% of competitions. In addition to our main results, we investigate various\nforms of resource scaling for AI agents and the impact of contamination from\npre-training. We open-source our benchmark code (github.com/openai/mle-bench/)\nto facilitate future research in understanding the ML engineering capabilities\nof AI agents.\n","authors":["Jun Shern Chan","Neil Chowdhury","Oliver Jaffe","James Aung","Dane Sherburn","Evan Mays","Giulio Starace","Kevin Liu","Leon Maksin","Tejal Patwardhan","Lilian Weng","Aleksander Mądry"],"pdf_url":"https://arxiv.org/pdf/2410.07095v3.pdf","comment":"10 pages, 17 pages appendix. Equal contribution by first seven\n authors, authors randomized. Corrected footnote 4"},{"id":"http://arxiv.org/abs/2412.08434v1","updated":"2024-12-11T14:55:48Z","published":"2024-12-11T14:55:48Z","title":"Mitigating Out-of-Entity Errors in Named Entity Recognition: A\n Sentence-Level Strategy","summary":" Many previous models of named entity recognition (NER) suffer from the\nproblem of Out-of-Entity (OOE), i.e., the tokens in the entity mentions of the\ntest samples have not appeared in the training samples, which hinders the\nachievement of satisfactory performance. To improve OOE-NER performance, in\nthis paper, we propose a new framework, namely S+NER, which fully leverages\nsentence-level information. Our S+NER achieves better OOE-NER performance\nmainly due to the following two particular designs. 1) It first exploits the\npre-trained language model's capability of understanding the target entity's\nsentence-level context with a template set. 2) Then, it refines the\nsentence-level representation based on the positive and negative templates,\nthrough a contrastive learning strategy and template pooling method, to obtain\nbetter NER results. 
Our extensive experiments on five benchmark datasets have\ndemonstrated that, our S+NER outperforms some state-of-the-art OOE-NER models.\n","authors":["Guochao Jiang","Ziqin Luo","Chengwei Hu","Zepeng Ding","Deqing Yang"],"pdf_url":"https://arxiv.org/pdf/2412.08434v1.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.08430v1","updated":"2024-12-11T14:51:13Z","published":"2024-12-11T14:51:13Z","title":"Assessing Personalized AI Mentoring with Large Language Models in the\n Computing Field","summary":" This paper provides an in-depth evaluation of three state-of-the-art Large\nLanguage Models (LLMs) for personalized career mentoring in the computing\nfield, using three distinct student profiles that consider gender, race, and\nprofessional levels. We evaluated the performance of GPT-4, LLaMA 3, and Palm 2\nusing a zero-shot learning approach without human intervention. A quantitative\nevaluation was conducted through a custom natural language processing analytics\npipeline to highlight the uniqueness of the responses and to identify words\nreflecting each student's profile, including race, gender, or professional\nlevel. The analysis of frequently used words in the responses indicates that\nGPT-4 offers more personalized mentoring compared to the other two LLMs.\nAdditionally, a qualitative evaluation was performed to see if human experts\nreached similar conclusions. The analysis of survey responses shows that GPT-4\noutperformed the other two LLMs in delivering more accurate and useful\nmentoring while addressing specific challenges with encouragement languages.\nOur work establishes a foundation for developing personalized mentoring tools\nbased on LLMs, incorporating human mentors in the process to deliver a more\nimpactful and tailored mentoring experience.\n","authors":["Xiao Luo","Sean O'Connell","Shamima Mithun"],"pdf_url":"https://arxiv.org/pdf/2412.08430v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07303v2","updated":"2024-12-11T14:43:31Z","published":"2024-12-10T08:31:52Z","title":"Filipino Benchmarks for Measuring Sexist and Homophobic Bias in\n Multilingual Language Models from Southeast Asia","summary":" Bias studies on multilingual models confirm the presence of gender-related\nstereotypes in masked models processing languages with high NLP resources. We\nexpand on this line of research by introducing Filipino CrowS-Pairs and\nFilipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in\npretrained language models (PLMs) handling texts in Filipino, a low-resource\nlanguage from the Philippines. The benchmarks consist of 7,074 new challenge\npairs resulting from our cultural adaptation of English bias evaluation\ndatasets, a process that we document in detail to guide similar forthcoming\nefforts. We apply the Filipino benchmarks on masked and causal multilingual\nmodels, including those pretrained on Southeast Asian data, and find that they\ncontain considerable amounts of bias. We also find that for multilingual\nmodels, the extent of bias learned for a particular language is influenced by\nhow much pretraining data in that language a model was exposed to. 
Our\nbenchmarks and insights can serve as a foundation for future work analyzing and\nmitigating bias in multilingual models.\n","authors":["Lance Calvin Lim Gamboa","Mark Lee"],"pdf_url":"https://arxiv.org/pdf/2412.07303v2.pdf","comment":"Accepted for presentation at The First Workshop on Language Models\n for Low-Resource Languages (LoResLM) at The 31st International Conference on\n Computational Linguistics (COLING 2025)"},{"id":"http://arxiv.org/abs/2412.08414v1","updated":"2024-12-11T14:31:39Z","published":"2024-12-11T14:31:39Z","title":"Detecting Conversational Mental Manipulation with Intent-Aware Prompting","summary":" Mental manipulation severely undermines mental wellness by covertly and\nnegatively distorting decision-making. While there is an increasing interest in\nmental health care within the natural language processing community, progress\nin tackling manipulation remains limited due to the complexity of detecting\nsubtle, covert tactics in conversations. In this paper, we propose Intent-Aware\nPrompting (IAP), a novel approach for detecting mental manipulations using\nlarge language models (LLMs), providing a deeper understanding of manipulative\ntactics by capturing the underlying intents of participants. Experimental\nresults on the MentalManip dataset demonstrate superior effectiveness of IAP\nagainst other advanced prompting strategies. Notably, our approach\nsubstantially reduces false negatives, helping detect more instances of mental\nmanipulation with minimal misjudgment of positive cases. The code of this paper\nis available at https://github.com/Anton-Jiayuan-MA/Manip-IAP.\n","authors":["Jiayuan Ma","Hongbin Na","Zimu Wang","Yining Hua","Yue Liu","Wei Wang","Ling Chen"],"pdf_url":"https://arxiv.org/pdf/2412.08414v1.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.08802v1","updated":"2024-12-11T22:28:12Z","published":"2024-12-11T22:28:12Z","title":"jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images","summary":" Contrastive Language-Image Pretraining (CLIP) is a highly effective method\nfor aligning images and texts in a shared embedding space. These models are\nwidely used for tasks such as cross-modal information retrieval and multi-modal\nunderstanding. However, CLIP models often struggle with text-only tasks,\nunderperforming compared to specialized text models. This performance disparity\nforces retrieval systems to rely on separate models for text-only and\nmulti-modal tasks. In this work, we build upon our previous model,\njina-clip-v1, by introducing a refined framework that utilizes multi-task,\nmulti-stage contrastive learning across multiple languages, coupled with an\nimproved training recipe to enhance text-only retrieval. The resulting model,\njina-clip-v2, outperforms its predecessor on text-only and multimodal tasks,\nwhile adding multilingual support, better understanding of complex visual\ndocuments and efficiency gains thanks to Matryoshka Representation Learning and\nvector truncation. 
The model performs comparably to the state-of-the-art in\nboth multilingual-multimodal and multilingual text retrieval benchmarks,\naddressing the challenge of unifying text-only and multi-modal retrieval\nsystems.\n","authors":["Andreas Koukounas","Georgios Mastrapas","Bo Wang","Mohammad Kalim Akram","Sedigheh Eslami","Michael Günther","Isabelle Mohr","Saba Sturua","Scott Martens","Nan Wang","Han Xiao"],"pdf_url":"https://arxiv.org/pdf/2412.08802v1.pdf","comment":"21 pages, 1-10 main paper, 10-12 refs, 12-21 benchmarks"},{"id":"http://arxiv.org/abs/2412.08780v1","updated":"2024-12-11T21:16:37Z","published":"2024-12-11T21:16:37Z","title":"Reducing Popularity Influence by Addressing Position Bias","summary":" Position bias poses a persistent challenge in recommender systems, with much\nof the existing research focusing on refining ranking relevance and driving\nuser engagement. However, in practical applications, the mitigation of position\nbias does not always result in detectable short-term improvements in ranking\nrelevance. This paper provides an alternative, practically useful view of what\nposition bias reduction methods can achieve. It demonstrates that position\ndebiasing can spread visibility and interactions more evenly across the\nassortment, effectively reducing a skew in the popularity of items induced by\nthe position bias through a feedback loop. We offer an explanation of how\nposition bias affects item popularity. This includes an illustrative model of\nthe item popularity histogram and the effect of the position bias on its\nskewness. Through offline and online experiments on our large-scale e-commerce\nplatform, we show that position debiasing can significantly improve assortment\nutilization, without any degradation in user engagement or financial metrics.\nThis makes the ranking fairer and helps attract more partners or content\nproviders, benefiting the customers and the business in the long term.\n","authors":["Andrii Dzhoha","Alexey Kurennoy","Vladimir Vlasov","Marjan Celikik"],"pdf_url":"https://arxiv.org/pdf/2412.08780v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.05348v2","updated":"2024-12-11T19:28:47Z","published":"2024-06-08T04:24:16Z","title":"Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study\n on Two Materials Datasets","summary":" We explore the ability of GPT-4 to perform ad-hoc schema based information\nextraction from scientific literature. We assess specifically whether it can,\nwith a basic prompting approach, replicate two existing material science\ndatasets, given the manuscripts from which they were originally manually\nextracted. We employ materials scientists to perform a detailed manual error\nanalysis to assess where the model struggles to faithfully extract the desired\ninformation, and draw on their insights to suggest research directions to\naddress this broadly important task.\n","authors":["Satanu Ghosh","Neal R. Brodnik","Carolina Frey","Collin Holgate","Tresa M. Pollock","Samantha Daly","Samuel Carton"],"pdf_url":"https://arxiv.org/pdf/2406.05348v2.pdf","comment":"Update on 12/11/2024: Added some relevant literature that we missed\n in previous version of the paper"},{"id":"http://arxiv.org/abs/2412.08604v1","updated":"2024-12-11T18:26:55Z","published":"2024-12-11T18:26:55Z","title":"Preference Discerning with LLM-Enhanced Generative Retrieval","summary":" Sequential recommendation systems aim to provide personalized recommendations\nfor users based on their interaction history. 
To achieve this, they often\nincorporate auxiliary information, such as textual descriptions of items and\nauxiliary tasks, like predicting user preferences and intent. Despite numerous\nefforts to enhance these models, they still suffer from limited\npersonalization. To address this issue, we propose a new paradigm, which we\nterm preference discerning. In preference dscerning, we explicitly condition a\ngenerative sequential recommendation system on user preferences within its\ncontext. To this end, we generate user preferences using Large Language Models\n(LLMs) based on user reviews and item-specific data. To evaluate preference\ndiscerning capabilities of sequential recommendation systems, we introduce a\nnovel benchmark that provides a holistic evaluation across various scenarios,\nincluding preference steering and sentiment following. We assess current\nstate-of-the-art methods using our benchmark and show that they struggle to\naccurately discern user preferences. Therefore, we propose a new method named\nMender ($\\textbf{M}$ultimodal Prefer$\\textbf{en}$ce\n$\\textbf{d}$iscern$\\textbf{er}$), which improves upon existing methods and\nachieves state-of-the-art performance on our benchmark. Our results show that\nMender can be effectively guided by human preferences even though they have not\nbeen observed during training, paving the way toward more personalized\nsequential recommendation systems. We will open-source the code and benchmarks\nupon publication.\n","authors":["Fabian Paischer","Liu Yang","Linfeng Liu","Shuai Shao","Kaveh Hassani","Jiacheng Li","Ricky Chen","Zhang Gabriel Li","Xialo Gao","Wei Shao","Xue Feng","Nima Noorshams","Sem Park","Bo Long","Hamid Eghbalzadeh"],"pdf_url":"https://arxiv.org/pdf/2412.08604v1.pdf","comment":"11 pages + references and appendix"},{"id":"http://arxiv.org/abs/2412.08593v1","updated":"2024-12-11T18:11:39Z","published":"2024-12-11T18:11:39Z","title":"Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based\n Automated Requirement Traceability and Compliance Checks","summary":" Ensuring that Software Requirements Specifications (SRS) align with\nhigher-level organizational or national requirements is vital, particularly in\nregulated environments such as finance and aerospace. In these domains,\nmaintaining consistency, adhering to regulatory frameworks, minimizing errors,\nand meeting critical expectations are essential for the reliable functioning of\nsystems. The widespread adoption of large language models (LLMs) highlights\ntheir immense potential, yet there remains considerable scope for improvement\nin retrieving relevant information and enhancing reasoning capabilities. This\nstudy demonstrates that integrating a robust Graph-RAG framework with advanced\nprompt engineering techniques, such as Chain of Thought and Tree of Thought,\ncan significantly enhance performance. Compared to baseline RAG methods and\nsimple prompting strategies, this approach delivers more accurate and\ncontext-aware results. While this method demonstrates significant improvements\nin performance, it comes with challenges. It is both costly and more complex to\nimplement across diverse contexts, requiring careful adaptation to specific\nscenarios. 
Additionally, its effectiveness heavily relies on having complete\nand accurate input data, which may not always be readily available, posing\nfurther limitations to its scalability and practicality.\n","authors":["Arsalan Masoudifard","Mohammad Mowlavi Sorond","Moein Madadi","Mohammad Sabokrou","Elahe Habibi"],"pdf_url":"https://arxiv.org/pdf/2412.08593v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06954v2","updated":"2024-12-11T16:46:25Z","published":"2024-12-09T20:01:59Z","title":"CURE: A dataset for Clinical Understanding & Retrieval Evaluation","summary":" Given the dominance of dense retrievers that do not generalize well beyond\ntheir training dataset distributions, domain-specific test sets are essential\nin evaluating retrieval. There are few test datasets for retrieval systems\nintended for use by healthcare providers in a point-of-care setting. To fill\nthis gap we have collaborated with medical professionals to create CURE, an\nad-hoc retrieval test dataset for passage ranking with 2000 queries spanning 10\nmedical domains with a monolingual (English) and two cross-lingual\n(French/Spanish -> English) conditions. In this paper, we describe how CURE was\nconstructed and provide baseline results to showcase its effectiveness as an\nevaluation tool. CURE is published with a Creative Commons Attribution Non\nCommercial 4.0 license and can be accessed on Hugging Face.\n","authors":["Nadia Sheikh","Anne-Laure Jousse","Daniel Buades Marcos","Akintunde Oladipo","Olivier Rousseau","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2412.06954v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08516v1","updated":"2024-12-11T16:28:18Z","published":"2024-12-11T16:28:18Z","title":"AltFS: Agency-light Feature Selection with Large Language Models in Deep\n Recommender Systems","summary":" Feature selection is crucial in recommender systems for improving model\nefficiency and predictive performance. Traditional methods rely on agency\nmodels, such as decision trees or neural networks, to estimate feature\nimportance. However, this approach is inherently limited, as the agency models\nmay fail to learn effectively in all scenarios due to suboptimal training\nconditions (e.g., feature collinearity, high-dimensional sparsity, and data\ninsufficiency). In this paper, we propose AltFS, an Agency-light Feature\nSelection method for deep recommender systems. AltFS integrates semantic\nreasoning from Large Language Models (LLMs) with task-specific learning from\nagency models. Initially, LLMs will generate a semantic ranking of feature\nimportance, which is then refined by an agency model, combining world knowledge\nwith task-specific insights. Extensive experiments on three public datasets\nfrom real-world recommender platforms demonstrate the effectiveness of AltFS.\nOur code is publicly available for reproducibility.\n","authors":["Pengyue Jia","Zhaocheng Du","Yichao Wang","Xiangyu Zhao","Xiaopeng Li","Yuhao Wang","Qidong Liu","Huifeng Guo","Ruiming Tang"],"pdf_url":"https://arxiv.org/pdf/2412.08516v1.pdf","comment":"under review"},{"id":"http://arxiv.org/abs/2412.08480v1","updated":"2024-12-11T15:47:11Z","published":"2024-12-11T15:47:11Z","title":"InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models","summary":" As one of the most successful generative models, diffusion models have\ndemonstrated remarkable efficacy in synthesizing high-quality images. These\nmodels learn the underlying high-dimensional data distribution in an\nunsupervised manner. 
Despite their success, diffusion models are highly\ndata-driven and prone to inheriting the imbalances and biases present in\nreal-world data. Some studies have attempted to address these issues by\ndesigning text prompts for known biases or using bias labels to construct\nunbiased data. While these methods have shown improved results, real-world\nscenarios often contain various unknown biases, and obtaining bias labels is\nparticularly challenging. In this paper, we emphasize the necessity of\nmitigating bias in pre-trained diffusion models without relying on auxiliary\nbias annotations. To tackle this problem, we propose a framework, InvDiff,\nwhich aims to learn invariant semantic information for diffusion guidance.\nSpecifically, we propose identifying underlying biases in the training data and\ndesigning a novel debiasing training objective. Then, we employ a lightweight\ntrainable module that automatically preserves invariant semantic information\nand uses it to guide the diffusion model's sampling process toward unbiased\noutcomes simultaneously. Notably, we only need to learn a small number of\nparameters in the lightweight learnable module without altering the pre-trained\ndiffusion model. Furthermore, we provide a theoretical guarantee that the\nimplementation of InvDiff is equivalent to reducing the error upper bound of\ngeneralization. Extensive experimental results on three publicly available\nbenchmarks demonstrate that InvDiff effectively reduces biases while\nmaintaining the quality of image generation. Our code is available at\nhttps://github.com/Hundredl/InvDiff.\n","authors":["Min Hou","Yueying Wu","Chang Xu","Yu-Hao Huang","Chenxi Bai","Le Wu","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2412.08480v1.pdf","comment":"KDD 2025"},{"id":"http://arxiv.org/abs/2207.11759v2","updated":"2024-12-11T14:47:01Z","published":"2022-07-24T15:13:45Z","title":"Spatial-Temporal Federated Learning for Lifelong Person\n Re-identification on Distributed Edges","summary":" Data drift is a thorny challenge when deploying person re-identification\n(ReID) models into real-world devices, where the data distribution is\nsignificantly different from that of the training environment and keeps\nchanging. To tackle this issue, we propose a federated spatial-temporal\nincremental learning approach, named FedSTIL, which leverages both lifelong\nlearning and federated learning to continuously optimize models deployed on\nmany distributed edge clients. Unlike previous efforts, FedSTIL aims to mine\nspatial-temporal correlations among the knowledge learnt from different edge\nclients. Specifically, the edge clients first periodically extract general\nrepresentations of drifted data to optimize their local models. Then, the\nlearnt knowledge from edge clients will be aggregated by centralized parameter\nserver, where the knowledge will be selectively and attentively distilled from\nspatial- and temporal-dimension with carefully designed mechanisms. Finally,\nthe distilled informative spatial-temporal knowledge will be sent back to\ncorrelated edge clients to further improve the recognition accuracy of each\nedge client with a lifelong learning method. Extensive experiments on a mixture\nof five real-world datasets demonstrate that our method outperforms others by\nnearly 4% in Rank-1 accuracy, while reducing communication cost by 62%. 
All\nimplementation codes are publicly available on\nhttps://github.com/MSNLAB/Federated-Lifelong-Person-ReID\n","authors":["Lei Zhang","Guanyu Gao","Huaizheng Zhang"],"pdf_url":"https://arxiv.org/pdf/2207.11759v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08385v1","updated":"2024-12-11T13:50:17Z","published":"2024-12-11T13:50:17Z","title":"NyayaAnumana & INLegalLlama: The Largest Indian Legal Judgment\n Prediction Dataset and Specialized Language Model for Enhanced Decision\n Analysis","summary":" The integration of artificial intelligence (AI) in legal judgment prediction\n(LJP) has the potential to transform the legal landscape, particularly in\njurisdictions like India, where a significant backlog of cases burdens the\nlegal system. This paper introduces NyayaAnumana, the largest and most diverse\ncorpus of Indian legal cases compiled for LJP, encompassing a total of 7,02,945\npreprocessed cases. NyayaAnumana, which combines the words \"Nyay\" (judgment)\nand \"Anuman\" (prediction or inference) respectively for most major Indian\nlanguages, includes a wide range of cases from the Supreme Court, High Courts,\nTribunal Courts, District Courts, and Daily Orders and, thus, provides\nunparalleled diversity and coverage. Our dataset surpasses existing datasets\nlike PredEx and ILDC, offering a comprehensive foundation for advanced AI\nresearch in the legal domain.\n In addition to the dataset, we present INLegalLlama, a domain-specific\ngenerative large language model (LLM) tailored to the intricacies of the Indian\nlegal system. It is developed through a two-phase training approach over a base\nLLaMa model. First, Indian legal documents are injected using continual\npretraining. Second, task-specific supervised finetuning is done. This method\nallows the model to achieve a deeper understanding of legal contexts.\n Our experiments demonstrate that incorporating diverse court data\nsignificantly boosts model accuracy, achieving approximately 90% F1-score in\nprediction tasks. INLegalLlama not only improves prediction accuracy but also\noffers comprehensible explanations, addressing the need for explainability in\nAI-assisted legal decisions.\n","authors":["Shubham Kumar Nigam","Balaramamahanthi Deepak Patnaik","Shivam Mishra","Noel Shallum","Kripabandhu Ghosh","Arnab Bhattacharya"],"pdf_url":"https://arxiv.org/pdf/2412.08385v1.pdf","comment":"Accepted on COLING 2025"},{"id":"http://arxiv.org/abs/2412.08300v1","updated":"2024-12-11T11:29:15Z","published":"2024-12-11T11:29:15Z","title":"Augmenting Sequential Recommendation with Balanced Relevance and\n Diversity","summary":" By generating new yet effective data, data augmentation has become a\npromising method to mitigate the data sparsity problem in sequential\nrecommendation. Existing works focus on augmenting the original data but rarely\nexplore the issue of imbalanced relevance and diversity for augmented data,\nleading to semantic drift problems or limited performance improvements. In this\npaper, we propose a novel Balanced data Augmentation Plugin for Sequential\nRecommendation (BASRec) to generate data that balance relevance and diversity.\nBASRec consists of two modules: Single-sequence Augmentation and Cross-sequence\nAugmentation. 
The former leverages the randomness of the heuristic operators to\ngenerate diverse sequences for a single user, after which the diverse and the\noriginal sequences are fused at the representation level to obtain relevance.\nFurther, we devise a reweighting strategy to enable the model to learn the\npreferences based on the two properties adaptively. The Cross-sequence\nAugmentation performs nonlinear mixing between different sequence\nrepresentations from two directions. It produces virtual sequence\nrepresentations that are diverse enough but retain the vital semantics of the\noriginal sequences. These two modules enhance the model to discover\nfine-grained preferences knowledge from single-user and cross-user\nperspectives. Extensive experiments verify the effectiveness of BASRec. The\naverage improvement is up to 72.0% on GRU4Rec, 33.8% on SASRec, and 68.5% on\nFMLP-Rec. We demonstrate that BASRec generates data with a better balance\nbetween relevance and diversity than existing methods. The source code is\navailable at https://github.com/KingGugu/BASRec.\n","authors":["Yizhou Dang","Jiahui Zhang","Yuting Liu","Enneng Yang","Yuliang Liang","Guibing Guo","Jianzhe Zhao","Xingwei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.08300v1.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.08258v1","updated":"2024-12-11T10:11:41Z","published":"2024-12-11T10:11:41Z","title":"Large Language Models for Scholarly Ontology Generation: An Extensive\n Analysis in the Engineering Field","summary":" Ontologies of research topics are crucial for structuring scientific\nknowledge, enabling scientists to navigate vast amounts of research, and\nforming the backbone of intelligent systems such as search engines and\nrecommendation systems. However, manual creation of these ontologies is\nexpensive, slow, and often results in outdated and overly general\nrepresentations. As a solution, researchers have been investigating ways to\nautomate or semi-automate the process of generating these ontologies. This\npaper offers a comprehensive analysis of the ability of large language models\n(LLMs) to identify semantic relationships between different research topics,\nwhich is a critical step in the development of such ontologies. To this end, we\ndeveloped a gold standard based on the IEEE Thesaurus to evaluate the task of\nidentifying four types of relationships between pairs of topics: broader,\nnarrower, same-as, and other. Our study evaluates the performance of seventeen\nLLMs, which differ in scale, accessibility (open vs. proprietary), and model\ntype (full vs. quantised), while also assessing four zero-shot reasoning\nstrategies. Several models have achieved outstanding results, including\nMixtral-8x7B, Dolphin-Mistral-7B, and Claude 3 Sonnet, with F1-scores of 0.847,\n0.920, and 0.967, respectively. 
Furthermore, our findings demonstrate that\nsmaller, quantised models, when optimised through prompt engineering, can\ndeliver performance comparable to much larger proprietary models, while\nrequiring significantly fewer computational resources.\n","authors":["Tanay Aggarwal","Angelo Salatino","Francesco Osborne","Enrico Motta"],"pdf_url":"https://arxiv.org/pdf/2412.08258v1.pdf","comment":"submitted to Information Processing & Management"},{"id":"http://arxiv.org/abs/2310.15950v5","updated":"2024-12-11T08:40:48Z","published":"2023-10-24T15:51:13Z","title":"Representation Learning with Large Language Models for Recommendation","summary":" Recommender systems have seen significant advancements with the influence of\ndeep learning and graph neural networks, particularly in capturing complex\nuser-item relationships. However, these graph-based recommenders heavily depend\non ID-based data, potentially disregarding valuable textual information\nassociated with users and items, resulting in less informative learned\nrepresentations. Moreover, the utilization of implicit feedback data introduces\npotential noise and bias, posing challenges for the effectiveness of user\npreference learning. While the integration of large language models (LLMs) into\ntraditional ID-based recommenders has gained attention, challenges such as\nscalability issues, limitations in text-only reliance, and prompt input\nconstraints need to be addressed for effective implementation in practical\nrecommender systems. To address these challenges, we propose a model-agnostic\nframework RLMRec that aims to enhance existing recommenders with LLM-empowered\nrepresentation learning. It proposes a recommendation paradigm that integrates\nrepresentation learning with LLMs to capture intricate semantic aspects of user\nbehaviors and preferences. RLMRec incorporates auxiliary textual signals,\ndevelops a user/item profiling paradigm empowered by LLMs, and aligns the\nsemantic space of LLMs with the representation space of collaborative\nrelational signals through a cross-view alignment framework. This work further\nestablish a theoretical foundation demonstrating that incorporating textual\nsignals through mutual information maximization enhances the quality of\nrepresentations. In our evaluation, we integrate RLMRec with state-of-the-art\nrecommender models, while also analyzing its efficiency and robustness to noise\ndata. Our implementation codes are available at\nhttps://github.com/HKUDS/RLMRec.\n","authors":["Xubin Ren","Wei Wei","Lianghao Xia","Lixin Su","Suqi Cheng","Junfeng Wang","Dawei Yin","Chao Huang"],"pdf_url":"https://arxiv.org/pdf/2310.15950v5.pdf","comment":"Published as a WWW'24 full paper"},{"id":"http://arxiv.org/abs/2412.08185v1","updated":"2024-12-11T08:24:15Z","published":"2024-12-11T08:24:15Z","title":"Exploring Multidimensional Checkworthiness: Designing AI-assisted Claim\n Prioritization for Human Fact-checkers","summary":" Given the massive volume of potentially false claims circulating online,\nclaim prioritization is essential in allocating limited human resources\navailable for fact-checking. 
In this study, we perceive claim prioritization as\nan information retrieval (IR) task: just as multidimensional IR relevance, with\nmany factors influencing which search results a user deems relevant,\ncheckworthiness is also multi-faceted, subjective, and even personal, with many\nfactors influencing how fact-checkers triage and select which claims to check.\nOur study investigates both the multidimensional nature of checkworthiness and\neffective tool support to assist fact-checkers in claim prioritization.\nMethodologically, we pursue Research through Design combined with mixed-method\nevaluation. We develop an AI-assisted claim prioritization prototype as a probe\nto explore how fact-checkers use multidimensional checkworthiness factors in\nclaim prioritization, simultaneously probing fact-checker needs while also\nexploring the design space to meet those needs.\n Our study with 16 professional fact-checkers investigates: 1) how\nparticipants assessed the relative importance of different checkworthy\ndimensions and apply different priorities in claim selection; 2) how they\ncreated customized GPT-based search filters and the corresponding benefits and\nlimitations; and 3) their overall user experiences with our prototype. Our work\nmakes a conceptual contribution between multidimensional IR relevance and\nfact-checking checkworthiness, with findings demonstrating the value of\ncorresponding tooling support. Specifically, we uncovered a hierarchical\nprioritization strategy fact-checkers implicitly use, revealing an\nunderexplored aspect of their workflow, with actionable design recommendations\nfor improving claim triage across multi-dimensional checkworthiness and\ntailoring this process with LLM integration.\n","authors":["Houjiang Liu","Jacek Gwizdka","Matthew Lease"],"pdf_url":"https://arxiv.org/pdf/2412.08185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08103v1","updated":"2024-12-11T05:08:19Z","published":"2024-12-11T05:08:19Z","title":"Multimodal Difference Learning for Sequential Recommendation","summary":" Sequential recommendations have drawn significant attention in modeling the\nuser's historical behaviors to predict the next item. With the booming\ndevelopment of multimodal data (e.g., image, text) on internet platforms,\nsequential recommendation also benefits from the incorporation of multimodal\ndata. Most methods introduce modal features of items as side information and\nsimply concatenates them to learn unified user interests. Nevertheless, these\nmethods encounter the limitation in modeling multimodal differences. We argue\nthat user interests and item relationships vary across different modalities. To\naddress this problem, we propose a novel Multimodal Difference Learning\nframework for Sequential Recommendation, MDSRec for brevity. Specifically, we\nfirst explore the differences in item relationships by constructing modal-aware\nitem relation graphs with behavior signal to enhance item representations.\nThen, to capture the differences in user interests across modalities, we design\na interest-centralized attention mechanism to independently model user sequence\nrepresentations in different modalities. Finally, we fuse the user embeddings\nfrom multiple modalities to achieve accurate item recommendation. 
Experimental\nresults on five real-world datasets demonstrate the superiority of MDSRec over\nstate-of-the-art baselines and the efficacy of multimodal difference learning.\n","authors":["Changhong Li","Zhiqiang Guo"],"pdf_url":"https://arxiv.org/pdf/2412.08103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08071v1","updated":"2024-12-11T03:33:51Z","published":"2024-12-11T03:33:51Z","title":"A Tutorial of Personalized Federated Recommender Systems: Recent\n Advances and Future Directions","summary":" Personalization stands as the cornerstone of recommender systems (RecSys),\nstriving to sift out redundant information and offer tailor-made services for\nusers. However, the conventional cloud-based RecSys necessitates centralized\ndata collection, posing significant risks of user privacy breaches. In response\nto this challenge, federated recommender systems (FedRecSys) have emerged,\ngarnering considerable attention. FedRecSys enable users to retain personal\ndata locally and solely share model parameters with low privacy sensitivity for\nglobal model training, significantly bolstering the system's privacy protection\ncapabilities. Within the distributed learning framework, the pronounced non-iid\nnature of user behavior data introduces fresh hurdles to federated\noptimization. Meanwhile, the ability of federated learning to concurrently\nlearn multiple models presents an opportunity for personalized user modeling.\nConsequently, the development of personalized FedRecSys (PFedRecSys) is crucial\nand holds substantial significance. This tutorial seeks to provide an\nintroduction to PFedRecSys, encompassing (1) an overview of existing studies on\nPFedRecSys, (2) a comprehensive taxonomy of PFedRecSys spanning four pivotal\nresearch directions-client-side adaptation, server-side aggregation,\ncommunication efficiency, privacy and protection, and (3) exploration of open\nchallenges and promising future directions in PFedRecSys. This tutorial aims to\nestablish a robust foundation and spark new perspectives for subsequent\nexploration and practical implementations in the evolving realm of RecSys.\n","authors":["Jing Jiang","Chunxu Zhang","Honglei Zhang","Zhiwei Li","Yidong Li","Bo Yang"],"pdf_url":"https://arxiv.org/pdf/2412.08071v1.pdf","comment":"A technical tutorial will appear at The Web Conference 2025"},{"id":"http://arxiv.org/abs/2412.08066v1","updated":"2024-12-11T03:22:04Z","published":"2024-12-11T03:22:04Z","title":"Cluster-Enhanced Federated Graph Neural Network for Recommendation","summary":" Personal interaction data can be effectively modeled as individual graphs for\neach user in recommender systems.Graph Neural Networks (GNNs)-based\nrecommendation techniques have become extremely popular since they can capture\nhigh-order collaborative signals between users and items by aggregating the\nindividual graph into a global interactive graph.However, this centralized\napproach inherently poses a threat to user privacy and security. Recently,\nfederated GNN-based recommendation techniques have emerged as a promising\nsolution to mitigate privacy concerns. Nevertheless, current implementations\neither limit on-device training to an unaccompanied individual graphs or\nnecessitate reliance on an extra third-party server to touch other individual\ngraphs, which also increases the risk of privacy leakage. 
To address this\nchallenge, we propose a Cluster-enhanced Federated Graph Neural Network\nframework for Recommendation, named CFedGR, which introduces high-order\ncollaborative signals to augment individual graphs in a privacy preserving\nmanner. Specifically, the server clusters the pretrained user representations\nto identify high-order collaborative signals. In addition, two efficient\nstrategies are devised to reduce communication between devices and the server.\nExtensive experiments on three benchmark datasets validate the effectiveness of\nour proposed methods.\n","authors":["Haiyan Wang","Ye Yuan"],"pdf_url":"https://arxiv.org/pdf/2412.08066v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07998v1","updated":"2024-12-11T00:44:52Z","published":"2024-12-11T00:44:52Z","title":"RALI@TREC iKAT 2024: Achieving Personalization via Retrieval Fusion in\n Conversational Search","summary":" The Recherche Appliquee en Linguistique Informatique (RALI) team participated\nin the 2024 TREC Interactive Knowledge Assistance (iKAT) Track. In personalized\nconversational search, effectively capturing a user's complex search intent\nrequires incorporating both contextual information and key elements from the\nuser profile into query reformulation. The user profile often contains many\nrelevant pieces, and each could potentially complement the user's information\nneeds. It is difficult to disregard any of them, whereas introducing an\nexcessive number of these pieces risks drifting from the original query and\nhinders search performance. This is a challenge we denote as\nover-personalization. To address this, we propose different strategies by\nfusing ranking lists generated from the queries with different levels of\npersonalization.\n","authors":["Yuchen Hui","Fengran Mo","Milan Mao","Jian-Yun Nie"],"pdf_url":"https://arxiv.org/pdf/2412.07998v1.pdf","comment":"Work presented at NIST Text Retrieval Conference 2024.\n https://www.nist.gov/news-events/events/2024/11/trec2024"}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.12008v3","updated":"2024-12-11T20:34:25Z","published":"2024-11-18T19:48:18Z","title":"Compression of Higher Order Ambisonics with Multichannel RVQGAN","summary":" A multichannel extension to the RVQGAN neural coding method is proposed, and\nrealized for data-driven compression of third-order Ambisonics audio. The\ninput- and output layers of the generator and discriminator models are modified\nto accept multiple (16) channels without increasing the model bitrate. We also\npropose a loss function for accounting for spatial perception in immersive\nreproduction, and transfer learning from single-channel models. Listening test\nresults with 7.1.4 immersive playback show that the proposed extension is\nsuitable for coding scene-based, 16-channel Ambisonics content with good\nquality at 16 kbps when trained and tested on the EigenScape database. The\nmodel has potential applications for learning other types of content and\nmultichannel formats.\n","authors":["Toni Hirvonen","Mahmoud Namazi"],"pdf_url":"https://arxiv.org/pdf/2411.12008v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08577v1","updated":"2024-12-11T17:51:44Z","published":"2024-12-11T17:51:44Z","title":"Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio\n Generation","summary":" Text-to-audio (TTA) model is capable of generating diverse audio from textual\nprompts. 
However, most mainstream TTA models, which predominantly rely on\nMel-spectrograms, still face challenges in producing audio with rich content.\nThe intricate details and texture required in Mel-spectrograms for such audio\noften surpass the models' capacity, leading to outputs that are blurred or lack\ncoherence. In this paper, we begin by investigating the critical role of U-Net\nin Mel-spectrogram generation. Our analysis shows that in U-Net structure,\nhigh-frequency components in skip-connections and the backbone influence\ntexture and detail, while low-frequency components in the backbone are critical\nfor the diffusion denoising process. We further propose ``Mel-Refine'', a\nplug-and-play approach that enhances Mel-spectrogram texture and detail by\nadjusting different component weights during inference. Our method requires no\nadditional training or fine-tuning and is fully compatible with any\ndiffusion-based TTA architecture. Experimental results show that our approach\nboosts performance metrics of the latest TTA model Tango2 by 25\\%,\ndemonstrating its effectiveness.\n","authors":["Hongming Guo","Ruibo Fu","Yizhong Geng","Shuai Liu","Shuchen Shi","Tao Wang","Chunyu Qiang","Chenxing Li","Ya Li","Zhengqi Wen","Yukun Liu","Xuefei Liu"],"pdf_url":"https://arxiv.org/pdf/2412.08577v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08504v1","updated":"2024-12-11T16:15:14Z","published":"2024-12-11T16:15:14Z","title":"PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based\n Talking Head Synthesis","summary":" Talking head synthesis with arbitrary speech audio is a crucial challenge in\nthe field of digital humans. Recently, methods based on radiance fields have\nreceived increasing attention due to their ability to synthesize high-fidelity\nand identity-consistent talking heads from just a few minutes of training\nvideo. However, due to the limited scale of the training data, these methods\noften exhibit poor performance in audio-lip synchronization and visual quality.\nIn this paper, we propose a novel 3D Gaussian-based method called PointTalk,\nwhich constructs a static 3D Gaussian field of the head and deforms it in sync\nwith the audio. It also incorporates an audio-driven dynamic lip point cloud as\na critical component of the conditional information, thereby facilitating the\neffective synthesis of talking heads. Specifically, the initial step involves\ngenerating the corresponding lip point cloud from the audio signal and\ncapturing its topological structure. The design of the dynamic difference\nencoder aims to capture the subtle nuances inherent in dynamic lip movements\nmore effectively. Furthermore, we integrate the audio-point enhancement module,\nwhich not only ensures the synchronization of the audio signal with the\ncorresponding lip point cloud within the feature space, but also facilitates a\ndeeper understanding of the interrelations among cross-modal conditional\nfeatures. 
Extensive experiments demonstrate that our method achieves superior\nhigh-fidelity and audio-lip synchronization in talking head synthesis compared\nto previous methods.\n","authors":["Yifan Xie","Tao Feng","Xin Zhang","Xiangyang Luo","Zixuan Guo","Weijiang Yu","Heng Chang","Fei Ma","Fei Richard Yu"],"pdf_url":"https://arxiv.org/pdf/2412.08504v1.pdf","comment":"9 pages, accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.08489v1","updated":"2024-12-11T15:53:13Z","published":"2024-12-11T15:53:13Z","title":"A Dual-Module Denoising Approach with Curriculum Learning for Enhancing\n Multimodal Aspect-Based Sentiment Analysis","summary":" Multimodal Aspect-Based Sentiment Analysis (MABSA) combines text and images\nto perform sentiment analysis but often struggles with irrelevant or misleading\nvisual information. Existing methodologies typically address either\nsentence-image denoising or aspect-image denoising but fail to comprehensively\ntackle both types of noise. To address these limitations, we propose DualDe, a\nnovel approach comprising two distinct components: the Hybrid Curriculum\nDenoising Module (HCD) and the Aspect-Enhance Denoising Module (AED). The HCD\nmodule enhances sentence-image denoising by incorporating a flexible curriculum\nlearning strategy that prioritizes training on clean data. Concurrently, the\nAED module mitigates aspect-image noise through an aspect-guided attention\nmechanism that filters out noisy visual regions which unrelated to the specific\naspects of interest. Our approach demonstrates effectiveness in addressing both\nsentence-image and aspect-image noise, as evidenced by experimental evaluations\non benchmark datasets.\n","authors":["Nguyen Van Doan","Dat Tran Nguyen","Cam-Van Thi Nguyen"],"pdf_url":"https://arxiv.org/pdf/2412.08489v1.pdf","comment":"Accepted at PACLIC 2024"},{"id":"http://arxiv.org/abs/2412.08443v1","updated":"2024-12-11T15:08:25Z","published":"2024-12-11T15:08:25Z","title":"POINTS1.5: Building a Vision-Language Model towards Real World\n Applications","summary":" Vision-language models have made significant strides recently, demonstrating\nsuperior performance across a range of tasks, e.g. optical character\nrecognition and complex diagram analysis. Building on this trend, we introduce\na new vision-language model, POINTS1.5, designed to excel in various real-world\napplications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several\nkey innovations: i) We replace the original CLIP vision encoder, which had a\nfixed image resolution, with a NaViT-style vision encoder that supports native\ndynamic high resolution. This allows POINTS1.5 to process images of any\nresolution without needing to split them into tiles. ii) We add bilingual\nsupport to POINTS1.5, significantly enhancing its capability in Chinese. Due to\nthe scarcity of open-source Chinese datasets for vision-language models, we\ncollect numerous images from the Internet and annotate them using a combination\nof manual and automatic methods. iii) We propose a set of rigorous filtering\nmethods for visual instruction tuning datasets. We comprehensively evaluate all\nthese filtering methods, and choose the most effective ones to obtain the final\nvisual instruction tuning set. Thanks to these innovations, POINTS1.5\nsignificantly outperforms POINTS1.0 and demonstrates strong performance across\na range of real-world applications. 
Notably, POINTS1.5-7B is trained on fewer\nthan 4 billion tokens and ranks first on the OpenCompass leaderboard among\nmodels with fewer than 10 billion parameters\n","authors":["Yuan Liu","Le Tian","Xiao Zhou","Xinyu Gao","Kavio Yu","Yang Yu","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.08443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05185v2","updated":"2024-12-11T14:43:02Z","published":"2024-12-06T17:04:42Z","title":"LinVT: Empower Your Image-level Large Language Model to Understand\n Videos","summary":" Large Language Models (LLMs) have been widely used in various tasks,\nmotivating us to develop an LLM-based assistant for videos. Instead of training\nfrom scratch, we propose a module to transform arbitrary well-trained\nimage-based LLMs into video-LLMs (after being trained on video data). To better\nadapt image-LLMs for processing videos, we introduce two design principles:\nlinear transformation to preserve the original visual-language alignment and\nrepresentative information condensation from redundant video content. Guided by\nthese principles, we propose a plug-and-play Linear Video Tokenizer(LinVT),\nwhich enables existing image-LLMs to understand videos. We benchmark LinVT with\nsix recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL,\nshowcasing the high compatibility of LinVT. LinVT-based LLMs achieve\nstate-of-the-art performance across various video benchmarks, illustrating the\neffectiveness of LinVT in multi-modal video understanding.\n","authors":["Lishuai Gao","Yujie Zhong","Yingsen Zeng","Haoxian Tan","Dengjie Li","Zheng Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.05185v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08312v1","updated":"2024-12-11T11:47:39Z","published":"2024-12-11T11:47:39Z","title":"A Unified Model For Voice and Accent Conversion In Speech and Singing\n using Self-Supervised Learning and Feature Extraction","summary":" This paper presents a new voice conversion model capable of transforming both\nspeaking and singing voices. It addresses key challenges in current systems,\nsuch as conveying emotions, managing pronunciation and accent changes, and\nreproducing non-verbal sounds. One of the model's standout features is its\nability to perform accent conversion on hybrid voice samples that encompass\nboth speech and singing, allowing it to change the speaker's accent while\npreserving the original content and prosody. The proposed model uses an\nencoder-decoder architecture: the encoder is based on HuBERT to process the\nspeech's acoustic and linguistic content, while the HiFi-GAN decoder audio\nmatches the target speaker's voice. The model incorporates fundamental\nfrequency (f0) features and singer embeddings to enhance performance while\nensuring the pitch & tone accuracy and vocal identity are preserved during\ntransformation. 
This approach improves how naturally and flexibly voice style\ncan be transformed, showing strong potential for applications in voice dubbing,\ncontent creation, and technologies like Text-to-Speech (TTS) and Interactive\nVoice Response (IVR) systems.\n","authors":["Sowmya Cheripally"],"pdf_url":"https://arxiv.org/pdf/2412.08312v1.pdf","comment":"7 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.08247v1","updated":"2024-12-11T09:55:09Z","published":"2024-12-11T09:55:09Z","title":"MoMuSE: Momentum Multi-modal Target Speaker Extraction for Real-time\n Scenarios with Impaired Visual Cues","summary":" Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of\na specific target speaker from an audio mixture using time-synchronized visual\ncues. In real-world scenarios, visual cues are not always available due to\nvarious impairments, which undermines the stability of AV-TSE. Despite this\nchallenge, humans can maintain attentional momentum over time, even when the\ntarget speaker is not visible. In this paper, we introduce the Momentum\nMulti-modal target Speaker Extraction (MoMuSE), which retains a speaker\nidentity momentum in memory, enabling the model to continuously track the\ntarget speaker. Designed for real-time inference, MoMuSE extracts the current\nspeech window with guidance from both visual cues and dynamically updated\nspeaker momentum. Experimental results demonstrate that MoMuSE exhibits\nsignificant improvement, particularly in scenarios with severe impairment of\nvisual cues.\n","authors":["Junjie Li","Ke Zhang","Shuai Wang","Kong Aik Lee","Haizhou Li"],"pdf_url":"https://arxiv.org/pdf/2412.08247v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08197v1","updated":"2024-12-11T08:40:37Z","published":"2024-12-11T08:40:37Z","title":"SAFIRE: Segment Any Forged Image Region","summary":" Most techniques approach the problem of image forgery localization as a\nbinary segmentation task, training neural networks to label original areas as 0\nand forged areas as 1. In contrast, we tackle this issue from a more\nfundamental perspective by partitioning images according to their originating\nsources. To this end, we propose Segment Any Forged Image Region (SAFIRE),\nwhich solves forgery localization using point prompting. Each point on an image\nis used to segment the source region containing itself. This allows us to\npartition images into multiple source regions, a capability achieved for the\nfirst time. Additionally, rather than memorizing certain forgery traces, SAFIRE\nnaturally focuses on uniform characteristics within each source region. This\napproach leads to more stable and effective learning, achieving superior\nperformance in both the new task and the traditional binary forgery\nlocalization.\n","authors":["Myung-Joon Kwon","Wonjun Lee","Seung-Hun Nam","Minji Son","Changick Kim"],"pdf_url":"https://arxiv.org/pdf/2412.08197v1.pdf","comment":"Accepted at AAAI 2025. Code is available at:\n https://github.com/mjkwon2021/SAFIRE"},{"id":"http://arxiv.org/abs/2412.08176v1","updated":"2024-12-11T08:07:12Z","published":"2024-12-11T08:07:12Z","title":"TextRefiner: Internal Visual Feature as Efficient Refiner for\n Vision-Language Models Prompt Tuning","summary":" Despite the efficiency of prompt learning in transferring vision-language\nmodels (VLMs) to downstream tasks, existing methods mainly learn the prompts in\na coarse-grained manner where the learned prompt vectors are shared across all\ncategories. 
Consequently, the tailored prompts often fail to discern\nclass-specific visual concepts, thereby hindering the transferred performance\nfor classes that share similar or complex visual attributes. Recent advances\nmitigate this challenge by leveraging external knowledge from Large Language\nModels (LLMs) to furnish class descriptions, yet incurring notable inference\ncosts. In this paper, we introduce TextRefiner, a plug-and-play method to\nrefine the text prompts of existing methods by leveraging the internal\nknowledge of VLMs. Particularly, TextRefiner builds a novel local cache module\nto encapsulate fine-grained visual concepts derivedfrom local tokens within the\nimage branch. By aggregating and aligning the cached visual descriptions with\nthe original output of the text branch, TextRefiner can efficiently refine and\nenrich the learned prompts from existing methods without relying on any\nexternal expertise. For example, it improves the performance of CoOp from 71.66\n% to 76.94 % on 11 benchmarks, surpassing CoCoOp which introduces instance-wise\nfeatures for text prompts. Equipped with TextRefiner, PromptKD achieves\nstate-of-the-art performance and is efficient in inference. Our code is relesed\nat https://github.com/xjjxmu/TextRefiner\n","authors":["Jingjing Xie","Yuxin Zhang","Jun Peng","Zhaohong Huang","Liujuan Cao"],"pdf_url":"https://arxiv.org/pdf/2412.08176v1.pdf","comment":"Accepted by AAAI2025"},{"id":"http://arxiv.org/abs/2412.08161v1","updated":"2024-12-11T07:33:18Z","published":"2024-12-11T07:33:18Z","title":"Collaborative Hybrid Propagator for Temporal Misalignment in\n Audio-Visual Segmentation","summary":" Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of\nsound-producing objects that accurately align with the corresponding audio.\nHowever, existing methods often face temporal misalignment, where audio cues\nand segmentation results are not temporally coordinated. Audio provides two\ncritical pieces of information: i) target object-level details and ii) the\ntiming of when objects start and stop producing sounds. Current methods focus\nmore on object-level information but neglect the boundaries of audio semantic\nchanges, leading to temporal misalignment. To address this issue, we propose a\nCollaborative Hybrid Propagator Framework~(Co-Prop). This framework includes\ntwo main steps: Preliminary Audio Boundary Anchoring and Frame-by-Frame\nAudio-Insert Propagation. To Anchor the audio boundary, we employ\nretrieval-assist prompts with Qwen large language models to identify control\npoints of audio semantic changes. These control points split the audio into\nsemantically consistent audio portions. After obtaining the control point\nlists, we propose the Audio Insertion Propagator to process each audio portion\nusing a frame-by-frame audio insertion propagation and matching approach. We\ncurated a compact dataset comprising diverse source conversion cases and\ndevised a metric to assess alignment rates. Compared to traditional\nsimultaneous processing methods, our approach reduces memory requirements and\nfacilitates frame alignment. Experimental results demonstrate the effectiveness\nof our approach across three datasets and two backbones. 
Furthermore, our\nmethod can be integrated with existing AVVS approaches, offering plug-and-play\nfunctionality to enhance their performance.\n","authors":["Kexin Li","Zongxin Yang","Yi Yang","Jun Xiao"],"pdf_url":"https://arxiv.org/pdf/2412.08161v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08117v1","updated":"2024-12-11T05:55:06Z","published":"2024-12-11T05:55:06Z","title":"LatentSpeech: Latent Diffusion for Text-To-Speech Generation","summary":" Diffusion-based Generative AI gains significant attention for its superior\nperformance over other generative techniques like Generative Adversarial\nNetworks and Variational Autoencoders. While it has achieved notable\nadvancements in fields such as computer vision and natural language processing,\nits application in speech generation remains under-explored. Mainstream\nText-to-Speech systems primarily map outputs to Mel-Spectrograms in the\nspectral space, leading to high computational loads due to the sparsity of\nMelSpecs. To address these limitations, we propose LatentSpeech, a novel TTS\ngeneration approach utilizing latent diffusion models. By using latent\nembeddings as the intermediate representation, LatentSpeech reduces the target\ndimension to 5% of what is required for MelSpecs, simplifying the processing\nfor the TTS encoder and vocoder and enabling efficient high-quality speech\ngeneration. This study marks the first integration of latent diffusion models\nin TTS, enhancing the accuracy and naturalness of generated speech.\nExperimental results on benchmark datasets demonstrate that LatentSpeech\nachieves a 25% improvement in Word Error Rate and a 24% improvement in Mel\nCepstral Distortion compared to existing models, with further improvements\nrising to 49.5% and 26%, respectively, with additional training data. These\nfindings highlight the potential of LatentSpeech to advance the\nstate-of-the-art in TTS technology.\n","authors":["Haowei Lou","Helen Paik","Pari Delir Haghighi","Wen Hu","Lina Yao"],"pdf_url":"https://arxiv.org/pdf/2412.08117v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07316v2","updated":"2024-12-11T03:05:26Z","published":"2024-12-10T08:58:51Z","title":"Preserving Speaker Information in Direct Speech-to-Speech Translation\n with Non-Autoregressive Generation and Pretraining","summary":" Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one\nlanguage into semantically equivalent speech in another language, facilitating\ncommunication between speakers of different languages. Speech-to-Discrete Unit\nTranslation (S2UT), a mainstream approach for end-to-end S2ST, addresses\nchallenges such as error propagation across modules and slow inference speed\noften encountered in traditional cascade systems. However, as discrete units\nprimarily capture content information, conventional S2UT methods fail to retain\nspeaker-specific characteristics from the source. Our previous work, SC-S2UT,\nintroduced a speaker adapter and a unit-to-mel structure, enabling the\npreservation of speaker information and non-autoregressive speech generation.\nBuilding on this foundation, this study proposes a self-supervised pretraining\nmethod to enrich the information extracted by both the speaker adapter and the\nunit-to-mel structure. 
Additionally, we investigate different feature fusion\nstrategies to further improve the integration of speaker and content features.\nExperiments conducted on the CVSS-T dataset for ES-EN and FR-EN tasks\ndemonstrate that our proposed method achieves a BLEU score improvement of 1.14\ncompared to SC-S2UT, along with significant enhancements in MOS and speaker\nsimilarity. Furthermore, our approach achieves translation quality comparable\nto traditional S2UT, with only a minimal increase of 0.04s per utterance in\ninference time, while maintaining high speaker similarity. These results\nvalidate the effectiveness of the proposed method.\n","authors":["Rui Zhou","Akinori Ito","Takashi Nose"],"pdf_url":"https://arxiv.org/pdf/2412.07316v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.08029v1","updated":"2024-12-11T02:17:33Z","published":"2024-12-11T02:17:33Z","title":"NeRF-NQA: No-Reference Quality Assessment for Scenes Generated by NeRF\n and Neural View Synthesis Methods","summary":" Neural View Synthesis (NVS) has demonstrated efficacy in generating\nhigh-fidelity dense viewpoint videos using an image set with sparse views.\nHowever, existing quality assessment methods like PSNR, SSIM, and LPIPS are not\ntailored for the scenes with dense viewpoints synthesized by NVS and NeRF\nvariants; thus, they often fall short in capturing the perceptual quality,\nincluding spatial and angular aspects of NVS-synthesized scenes. Furthermore,\nthe lack of dense ground truth views makes the full reference quality\nassessment on NVS-synthesized scenes challenging. For instance, datasets such\nas LLFF provide only sparse images, insufficient for complete full-reference\nassessments. To address the issues above, we propose NeRF-NQA, the first\nno-reference quality assessment method for densely-observed scenes synthesized\nfrom the NVS and NeRF variants. NeRF-NQA employs a joint quality assessment\nstrategy, integrating both viewwise and pointwise approaches, to evaluate the\nquality of NVS-generated scenes. The viewwise approach assesses the spatial\nquality of each individual synthesized view and the overall inter-views\nconsistency, while the pointwise approach focuses on the angular qualities of\nscene surface points and their compound inter-point quality. Extensive\nevaluations are conducted to compare NeRF-NQA with 23 mainstream visual quality\nassessment methods (from fields of image, video, and light-field assessment).\nThe results demonstrate NeRF-NQA outperforms the existing assessment methods\nsignificantly and it shows substantial superiority on assessing NVS-synthesized\nscenes without references. An implementation of this paper is available at\nhttps://github.com/VincentQQu/NeRF-NQA.\n","authors":["Qiang Qu","Hanxue Liang","Xiaoming Chen","Yuk Ying Chung","Yiran Shen"],"pdf_url":"https://arxiv.org/pdf/2412.08029v1.pdf","comment":null}]},"2024-12-10T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2404.13298v3","updated":"2024-12-10T19:40:49Z","published":"2024-04-20T07:04:46Z","title":"MARec: Metadata Alignment for cold-start Recommendation","summary":" For many recommender systems, the primary data source is a historical record\nof user clicks. The associated click matrix is often very sparse, as the number\nof users x products can be far larger than the number of clicks. Such sparsity\nis accentuated in cold-start settings, which makes the efficient use of\nmetadata information of paramount importance. 
In this work, we propose a simple\napproach to address cold-start recommendations by leveraging content metadata,\nMetadata Alignment for cold-start Recommendation. We show that this approach\ncan readily augment existing matrix factorization and autoencoder approaches,\nenabling a smooth transition to top performing algorithms in warmer set-ups.\nOur experimental results indicate three separate contributions: first, we show\nthat our proposed framework largely beats SOTA results on 4 cold-start datasets\nwith different sparsity and scale characteristics, with gains ranging from\n+8.4% to +53.8% on reported ranking metrics; second, we provide an ablation\nstudy on the utility of semantic features, and prove that the additional gain\nobtained by leveraging such features ranges between +46.8% and +105.5%; and\nthird, our approach is by construction highly competitive in warm set-ups, and\nwe propose a closed-form solution outperformed by SOTA results by only 0.8% on\naverage.\n","authors":["Julien Monteil","Volodymyr Vaskovych","Wentao Lu","Anirban Majumder","Anton van den Hengel"],"pdf_url":"https://arxiv.org/pdf/2404.13298v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.16780v2","updated":"2024-12-10T18:45:18Z","published":"2024-10-22T07:53:41Z","title":"Beyond Retrieval: Generating Narratives in Conversational Recommender\n Systems","summary":" The recent advances in Large Language Models' generation and reasoning\ncapabilities present an opportunity to develop truly conversational\nrecommendation systems. However, effectively integrating recommender system\nknowledge into LLMs for natural language generation which is tailored towards\nrecommendation tasks remains a challenge. This paper addresses this challenge\nby making two key contributions.\n First, we introduce a new dataset (REGEN) for natural language generation\ntasks in conversational recommendations. REGEN (Reviews Enhanced with\nGEnerative Narratives) extends the Amazon Product Reviews dataset with rich\nuser narratives, including personalized explanations of product preferences,\nproduct endorsements for recommended items, and summaries of user purchase\nhistory. REGEN is made publicly available to facilitate further research.\nFurthermore, we establish benchmarks using well-known generative metrics, and\nperform an automated evaluation of the new dataset using a rater LLM. Second,\nthe paper introduces a fusion architecture (CF model with an LLM) which serves\nas a baseline for REGEN and, to the best of our knowledge, represents the first\nattempt to analyze the capabilities of LLMs in understanding recommender\nsignals and generating rich narratives. We demonstrate that LLMs can\neffectively learn from simple fusion architectures utilizing interaction-based\nCF embeddings, and this can be further enhanced using the metadata and\npersonalization data associated with items. Our experiments show that combining\nCF and content embeddings leads to improvements of 4-12% in key language\nmetrics compared to using either type of embedding individually. 
We also\nprovide an analysis to interpret how CF and content embeddings contribute to\nthis new generative task.\n","authors":["Krishna Sayana","Raghavendra Vasudeva","Yuri Vasilevski","Kun Su","Liam Hebert","James Pine","Hubert Pham","Ambarish Jash","Sukhdeep Sodhi"],"pdf_url":"https://arxiv.org/pdf/2410.16780v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07713v1","updated":"2024-12-10T18:01:33Z","published":"2024-12-10T18:01:33Z","title":"Benchmark for Evaluation and Analysis of Citation Recommendation Models","summary":" Citation recommendation systems have attracted much academic interest,\nresulting in many studies and implementations. These systems help authors\nautomatically generate proper citations by suggesting relevant references based\non the text they have written. However, the methods used in citation\nrecommendation differ across various studies and implementations. Some\napproaches focus on the overall content of papers, while others consider the\ncontext of the citation text. Additionally, the datasets used in these studies\ninclude different aspects of papers, such as metadata, citation context, or\neven the full text of the paper in various formats and structures. The\ndiversity in models, datasets, and evaluation metrics makes it challenging to\nassess and compare citation recommendation methods effectively. To address this\nissue, a standardized dataset and evaluation metrics are needed to evaluate\nthese models consistently. Therefore, we propose developing a benchmark\nspecifically designed to analyze and compare citation recommendation models.\nThis benchmark will evaluate the performance of models on different features of\nthe citation context and provide a comprehensive evaluation of the models\nacross all these tasks, presenting the results in a standardized way. By\ncreating a benchmark with standardized evaluation metrics, researchers and\npractitioners in the field of citation recommendation will have a common\nplatform to assess and compare different models. This will enable meaningful\ncomparisons and help identify promising approaches for further research and\ndevelopment in the field.\n","authors":["Puja Maharjan"],"pdf_url":"https://arxiv.org/pdf/2412.07713v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2405.12207v3","updated":"2024-12-10T17:06:57Z","published":"2024-05-20T17:47:18Z","title":"Optimistic Query Routing in Clustering-based Approximate Maximum Inner\n Product Search","summary":" Clustering-based nearest neighbor search is an effective method in which\npoints are partitioned into geometric shards to form an index, with only a few\nshards searched during query processing to find a set of top-$k$ vectors. Even\nthough the search efficacy is heavily influenced by the algorithm that\nidentifies the shards to probe, it has received little attention in the\nliterature. This work bridges that gap by studying routing in clustering-based\nmaximum inner product search. We unpack existing routers and notice the\nsurprising contribution of optimism. We then take a page from the sequential\ndecision making literature and formalize that insight following the principle\nof ``optimism in the face of uncertainty.'' In particular, we present a\nframework that incorporates the moments of the distribution of inner products\nwithin each shard to estimate the maximum inner product. 
We then present an\ninstance of our algorithm that uses only the first two moments to reach the\nsame accuracy as state-of-the-art routers such as ScaNN by probing up to $50\\%$\nfewer points on benchmark datasets. Our algorithm is also space-efficient: we\ndesign a sketch of the second moment whose size is independent of the number of\npoints and requires $\\mathcal{O}(1)$ vectors per shard.\n","authors":["Sebastian Bruch","Aditya Krishnan","Franco Maria Nardini"],"pdf_url":"https://arxiv.org/pdf/2405.12207v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07626v1","updated":"2024-12-10T16:05:56Z","published":"2024-12-10T16:05:56Z","title":"OmniDocBench: Benchmarking Diverse PDF Document Parsing with\n Comprehensive Annotations","summary":" Document content extraction is crucial in computer vision, especially for\nmeeting the high-quality data needs of large language models (LLMs) and\nretrieval-augmented generation (RAG) technologies. However, current document\nparsing methods suffer from significant limitations in terms of diversity and\ncomprehensive evaluation. To address these challenges, we introduce\nOmniDocBench, a novel multi-source benchmark designed to advance automated\ndocument content extraction. OmniDocBench includes a meticulously curated and\nannotated high-quality evaluation dataset comprising nine diverse document\ntypes, such as academic papers, textbooks, and slides, among others. Our benchmark\nprovides a flexible and comprehensive evaluation framework with 19 layout\ncategory labels and 14 attribute labels, enabling multi-level assessments\nacross entire datasets, individual modules, or specific data types. Using\nOmniDocBench, we perform an exhaustive comparative analysis of existing modular\npipelines and multimodal end-to-end methods, highlighting their limitations in\nhandling document diversity and ensuring fair evaluation. OmniDocBench\nestablishes a robust, diverse, and fair evaluation standard for the document\ncontent extraction field, offering crucial insights for future advancements and\nfostering the development of document parsing technologies. The code and\ndataset are available at https://github.com/opendatalab/OmniDocBench.\n","authors":["Linke Ouyang","Yuan Qu","Hongbin Zhou","Jiawei Zhu","Rui Zhang","Qunshu Lin","Bin Wang","Zhiyuan Zhao","Man Jiang","Xiaomeng Zhao","Jin Shi","Fan Wu","Pei Chu","Minghao Liu","Zhenxiang Li","Chao Xu","Bo Zhang","Botian Shi","Zhongying Tu","Conghui He"],"pdf_url":"https://arxiv.org/pdf/2412.07626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.05668v2","updated":"2024-12-10T16:00:55Z","published":"2024-03-08T20:44:59Z","title":"CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model\n Recommender System","summary":" This work takes a critical stance on previous studies concerning fairness\nevaluation in Large Language Model (LLM)-based recommender systems, which have\nprimarily assessed consumer fairness by comparing recommendation lists\ngenerated with and without sensitive user attributes. Such approaches\nimplicitly treat discrepancies in recommended items as biases, overlooking\nwhether these changes might stem from genuine personalization aligned with true\npreferences of users. Moreover, these earlier studies typically address single\nsensitive attributes in isolation, neglecting the complex interplay of\nintersectional identities. 
In response to these shortcomings, we introduce\nCFaiRLLM, an enhanced evaluation framework that not only incorporates true\npreference alignment but also rigorously examines intersectional fairness by\nconsidering overlapping sensitive attributes. Additionally, CFaiRLLM introduces\ndiverse user profile sampling strategies-random, top-rated, and\nrecency-focused-to better understand the impact of profile generation fed to\nLLMs in light of inherent token limitations in these systems. Given that\nfairness depends on accurately understanding users' tastes and preferences,\nthese strategies provide a more realistic assessment of fairness within\nRecLLMs.\n The results demonstrated that true preference alignment offers a more\npersonalized and fair assessment compared to similarity-based measures,\nrevealing significant disparities when sensitive and intersectional attributes\nare incorporated. Notably, our study finds that intersectional attributes\namplify fairness gaps more prominently, especially in less structured domains\nsuch as music recommendations in LastFM.\n","authors":["Yashar Deldjoo","Tommaso di Noia"],"pdf_url":"https://arxiv.org/pdf/2403.05668v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07573v1","updated":"2024-12-10T15:06:48Z","published":"2024-12-10T15:06:48Z","title":"SST framework for Document Matching","summary":" Long-form document matching aims to judge the relevance between two documents\nand has been applied to various scenarios. Most existing works utilize\nhierarchical or long context models to process documents, which achieve coarse\nunderstanding but may ignore details. Some researchers construct a document\nview with similar sentences about aligned document subtopics to focus on\ndetailed matching signals. However, a long document generally contains multiple\nsubtopics. The matching signals from multiple topics are heterogeneous.\nConsidering only the homologous aligned subtopics may not be representative\nenough and may cause biased modeling. In this paper, we introduce a new\nframework to model representative matching signals. First, we propose to\ncapture various matching signals through subtopics of document pairs. Next, we\nconstruct multiple document views based on subtopics to cover heterogeneous and\nvaluable details. However, existing spatial aggregation methods like attention,\nwhich integrate all these views simultaneously, struggle to integrate\nheterogeneous information. Instead, we propose temporal aggregation, which\neffectively integrates different views gradually as the training progresses.\nExperimental results show that our learning framework is effective on several\ndocument-matching tasks, including news duplication and legal case retrieval.\n","authors":["Youchao Zhou","Heyan Huang","Zhijing Wu","Yuhang Liu","Xinglin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.07573v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.17643v2","updated":"2024-12-10T12:36:09Z","published":"2024-03-26T12:23:34Z","title":"S+t-SNE -- Bringing Dimensionality Reduction to Data Streams","summary":" We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle\ninfinite data streams. The core idea behind S+t-SNE is to update the t-SNE\nembedding incrementally as new data arrives, ensuring scalability and\nadaptability to handle streaming scenarios. By selecting the most important\npoints at each step, the algorithm ensures scalability while keeping\ninformative visualisations. 
By employing a blind method for drift management,\nthe algorithm adjusts the embedding space, which facilitates the visualisation\nof evolving data dynamics. Our experimental evaluations demonstrate the\neffectiveness and efficiency of S+t-SNE, whilst highlighting its ability to\ncapture patterns in a streaming scenario. We hope our approach offers\nresearchers and practitioners a real-time tool for understanding and\ninterpreting high-dimensional data.\n","authors":["Pedro C. Vieira","João P. Montrezol","João T. Vieira","João Gama"],"pdf_url":"https://arxiv.org/pdf/2403.17643v2.pdf","comment":"This preprint has undergone peer review but does not have any\n post-submission improvements or corrections. Full version after peer-review\n and post-acceptance improvements was presented at IDA2024\n (https://ida2024.org/)"},{"id":"http://arxiv.org/abs/2412.07462v1","updated":"2024-12-10T12:31:33Z","published":"2024-12-10T12:31:33Z","title":"Bilingual BSARD: Extending Statutory Article Retrieval to Dutch","summary":" Statutory article retrieval plays a crucial role in making legal information\nmore accessible to both laypeople and legal professionals. Multilingual\ncountries like Belgium present unique challenges for retrieval models due to\nthe need for handling legal issues in multiple languages. Building on the\nBelgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the\nbilingual version of this dataset, bBSARD. The dataset contains parallel\nBelgian statutory articles in both French and Dutch, along with legal questions\nfrom BSARD and their Dutch translation. Using bBSARD, we conduct extensive\nbenchmarking of retrieval models available for Dutch and French. Our\nbenchmarking setup includes lexical models, zero-shot dense models, and\nfine-tuned small foundation models. Our experiments show that BM25 remains a\ncompetitive baseline compared to many zero-shot dense models in both languages.\nWe also observe that while proprietary models outperform open alternatives in\nthe zero-shot setting, they can be matched or surpassed by fine-tuning small\nlanguage-specific models. Our dataset and evaluation code are publicly\navailable.\n","authors":["Ehsan Lotfi","Nikolay Banar","Nerses Yuzbashyan","Walter Daelemans"],"pdf_url":"https://arxiv.org/pdf/2412.07462v1.pdf","comment":"To be presented at RegNLP-2025 (COLING)"},{"id":"http://arxiv.org/abs/2312.11018v2","updated":"2024-12-10T11:20:21Z","published":"2023-12-18T08:35:10Z","title":"Hypergrah-Enhanced Dual Convolutional Network for Bundle Recommendation","summary":" Bundle recommendations strive to offer users a set of items as a package\nnamed bundle, enhancing convenience and contributing to the seller's revenue.\nWhile previous approaches have demonstrated notable performance, we argue that\nthey may compromise the ternary relationship among users, items, and bundles.\nThis compromise can result in information loss, ultimately impacting the\noverall model performance. To address this gap, we develop a unified model for\nbundle recommendation, termed hypergraph-enhanced dual convolutional neural\nnetwork (HED). Our approach is characterized by two key aspects. Firstly, we\nconstruct a complete hypergraph to capture interaction dynamics among users,\nitems, and bundles. Secondly, we incorporate U-B interaction information to\nenhance the information representation derived from users and bundle embedding\nvectors. 
Extensive experimental results on the Youshu and Netease datasets have\ndemonstrated that HED surpasses state-of-the-art baselines, proving its\neffectiveness. In addition, various ablation studies and sensitivity analyses\nrevealed the working mechanism and proved our effectiveness. Codes and datasets\nare available at https://github.com/AAI-Lab/HED\n","authors":["Yang Li","Kangbo Liu","Yaoxin Wu","Zhaoxuan Wang","Erik Cambria","Xiaoxu Wang"],"pdf_url":"https://arxiv.org/pdf/2312.11018v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07420v1","updated":"2024-12-10T11:18:29Z","published":"2024-12-10T11:18:29Z","title":"RAG-based Question Answering over Heterogeneous Data and Text","summary":" This article presents the QUASAR system for question answering over\nunstructured text, structured tables, and knowledge graphs, with unified\ntreatment of all sources. The system adopts a RAG-based architecture, with a\npipeline of evidence retrieval followed by answer generation, with the latter\npowered by a moderate-sized language model. Additionally and uniquely, QUASAR\nhas components for question understanding, to derive crisper input for evidence\nretrieval, and for re-ranking and filtering the retrieved evidence before\nfeeding the most informative pieces into the answer generation. Experiments\nwith three different benchmarks demonstrate the high answering quality of our\napproach, being on par with or better than large GPT models, while keeping the\ncomputational cost and energy consumption orders of magnitude lower.\n","authors":["Philipp Christmann","Gerhard Weikum"],"pdf_url":"https://arxiv.org/pdf/2412.07420v1.pdf","comment":"IEEE Data Engineering Bulletin -- December 2024 Edition on RAG"},{"id":"http://arxiv.org/abs/2412.07403v1","updated":"2024-12-10T10:52:44Z","published":"2024-12-10T10:52:44Z","title":"RLT4Rec: Reinforcement Learning Transformer for User Cold Start and Item\n Recommendation","summary":" We introduce a new sequential transformer reinforcement learning architecture\nRLT4Rec and demonstrate that it achieves excellent performance in a range of\nitem recommendation tasks. RLT4Rec uses a relatively simple transformer\narchitecture that takes as input the user's (item,rating) history and outputs\nthe next item to present to the user. Unlike existing RL approaches, there is\nno need to input a state observation or estimate. RLT4Rec handles new users and\nestablished users within the same consistent framework and automatically\nbalances the \"exploration\" needed to discover the preferences of a new user\nwith the \"exploitation\" that is more appropriate for established users.\nTraining of RLT4Rec is robust and fast and is insensitive to the choice of\ntraining data, learning to generate \"good\" personalised sequences that the user\ntends to rate highly even when trained on \"bad\" data.\n","authors":["Dilina Chandika Rajapakse","Douglas Leith"],"pdf_url":"https://arxiv.org/pdf/2412.07403v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07382v1","updated":"2024-12-10T10:28:32Z","published":"2024-12-10T10:28:32Z","title":"Temporal Linear Item-Item Model for Sequential Recommendation","summary":" In sequential recommendation (SR), neural models have been actively explored\ndue to their remarkable performance, but they suffer from inefficiency inherent\nto their complexity. On the other hand, linear SR models exhibit high\nefficiency and achieve competitive or superior accuracy compared to neural\nmodels. 
However, they solely deal with the sequential order of items (i.e.,\nsequential information) and overlook the actual timestamp (i.e., temporal\ninformation). This limits their ability to effectively capture various user preference\ndrifts over time. To address this issue, we propose a novel linear SR model,\nnamed TemporAl LinEar item-item model (TALE), incorporating temporal\ninformation while preserving training/inference efficiency, with three key\ncomponents. (i) Single-target augmentation concentrates on a single target\nitem, enabling us to learn the temporal correlation for the target item. (ii)\nTime interval-aware weighting utilizes the actual timestamp to discern the item\ncorrelation depending on time intervals. (iii) Trend-aware normalization\nreflects the dynamic shift of item popularity over time. Our empirical studies\nshow that TALE outperforms ten competing SR models with gains of up to 18.71% on\nfive benchmark datasets. It also exhibits remarkable effectiveness on\nlong-tail items, with gains of up to 30.45%. The source code is available\nat https://github.com/psm1206/TALE.\n","authors":["Seongmin Park","Mincheol Yoon","Minjin Choi","Jongwuk Lee"],"pdf_url":"https://arxiv.org/pdf/2412.07382v1.pdf","comment":"Accepted by WSDM 2025"},{"id":"http://arxiv.org/abs/2404.11180v3","updated":"2024-12-10T10:22:57Z","published":"2024-04-17T08:50:29Z","title":"Causal Deconfounding via Confounder Disentanglement for Dual-Target\n Cross-Domain Recommendation","summary":" In recent years, dual-target Cross-Domain Recommendation (CDR) has been\nproposed to capture comprehensive user preferences in order to ultimately\nenhance the recommendation accuracy in both data-richer and data-sparser\ndomains simultaneously. However, in addition to users' true preferences, the\nuser-item interactions might also be affected by confounders (e.g., free\nshipping, sales promotion). As a result, dual-target CDR has to meet two\nchallenges: (1) how to effectively decouple observed confounders, including\nsingle-domain confounders and cross-domain confounders, and (2) how to preserve\nthe positive effects of observed confounders on predicted interactions, while\neliminating their negative effects on capturing comprehensive user preferences.\nTo address the above two challenges, we propose a Causal Deconfounding\nframework via Confounder Disentanglement for dual-target Cross-Domain\nRecommendation, called CD2CDR. In CD2CDR, we first propose a confounder\ndisentanglement module to effectively decouple observed single-domain and\ncross-domain confounders. We then propose a causal deconfounding module to\npreserve the positive effects of such observed confounders and eliminate their\nnegative effects via backdoor adjustment, thereby enhancing the recommendation\naccuracy in each domain. Extensive experiments conducted on five real-world\ndatasets demonstrate that CD2CDR significantly outperforms the state-of-the-art\nmethods.\n","authors":["Jiajie Zhu","Yan Wang","Feng Zhu","Zhu Sun"],"pdf_url":"https://arxiv.org/pdf/2404.11180v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04629v2","updated":"2024-12-10T08:02:24Z","published":"2024-12-05T21:51:05Z","title":"Argumentative Experience: Reducing Confirmation Bias on Controversial\n Issues through LLM-Generated Multi-Persona Debates","summary":" Large language models (LLMs) are enabling designers to give life to exciting\nnew user experiences for information access. 
In this work, we present a system\nthat generates LLM personas to debate a topic of interest from different\nperspectives. How might information seekers use and benefit from such a system?\nCan centering information access around diverse viewpoints help to mitigate\nthorny challenges like confirmation bias in which information seekers\nover-trust search results matching existing beliefs? How do potential biases\nand hallucinations in LLMs play out alongside human users who are also fallible\nand possibly biased?\n Our study exposes participants to multiple viewpoints on controversial issues\nvia a mixed-methods, within-subjects study. We use eye-tracking metrics to\nquantitatively assess cognitive engagement alongside qualitative feedback.\nCompared to a baseline search system, we see more creative interactions and\ndiverse information-seeking with our multi-persona debate system, which more\neffectively reduces user confirmation bias and conviction toward their initial\nbeliefs. Overall, our study contributes to the emerging design space of\nLLM-based information access systems, specifically investigating the potential\nof simulated personas to promote greater exposure to information diversity,\nemulate collective intelligence, and mitigate bias in information seeking.\n","authors":["Li Shi","Houjiang Liu","Yian Wong","Utkarsh Mujumdar","Dan Zhang","Jacek Gwizdka","Matthew Lease"],"pdf_url":"https://arxiv.org/pdf/2412.04629v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.03988v2","updated":"2024-12-10T07:40:54Z","published":"2024-05-07T04:00:30Z","title":"LEARN: Knowledge Adaptation from Large Language Model to Recommendation\n for Practical Industrial Application","summary":" Contemporary recommendation systems predominantly rely on ID embedding to\ncapture latent associations among users and items. However, this approach\noverlooks the wealth of semantic information embedded within textual\ndescriptions of items, leading to suboptimal performance and poor\ngeneralizations. Leveraging the capability of large language models to\ncomprehend and reason about textual content presents a promising avenue for\nadvancing recommendation systems. To achieve this, we propose an Llm-driven\nknowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world\nknowledge with collaborative knowledge. We address computational complexity\nconcerns by utilizing pretrained LLMs as item encoders and freezing LLM\nparameters to avoid catastrophic forgetting and preserve open-world knowledge.\nTo bridge the gap between the open-world and collaborative domains, we design a\ntwin-tower structure supervised by the recommendation task and tailored for\npractical industrial application. Through experiments on the real large-scale\nindustrial dataset and online A/B tests, we demonstrate the efficacy of our\napproach in industry application. 
We also achieve state-of-the-art performance\non six Amazon Review datasets to verify the superiority of our method.\n","authors":["Jian Jia","Yipei Wang","Yan Li","Honggang Chen","Xuehan Bai","Zhaocheng Liu","Jian Liang","Quan Chen","Han Li","Peng Jiang","Kun Gai"],"pdf_url":"https://arxiv.org/pdf/2405.03988v2.pdf","comment":"Accepted by AAAI 2025"},{"id":"http://arxiv.org/abs/2412.07213v1","updated":"2024-12-10T06:09:49Z","published":"2024-12-10T06:09:49Z","title":"IntellectSeeker: A Personalized Literature Management System with the\n Probabilistic Model and Large Language Model","summary":" Faced with the burgeoning volume of academic literature, researchers often\nneed help with uncertain article quality and mismatches in term searches using\ntraditional academic engines. We introduce IntellectSeeker, an innovative and\npersonalized intelligent academic literature management platform to address\nthese challenges. This platform integrates a Large Language Model (LLM)--based\nsemantic enhancement bot with a sophisticated probability model to personalize\nand streamline literature searches. We adopted the GPT-3.5-turbo model to\ntransform everyday language into professional academic terms across various\nscenarios using multiple rounds of few-shot learning. This adaptation mainly\nbenefits academic newcomers, effectively bridging the gap between general\ninquiries and academic terminology. The probabilistic model intelligently\nfilters academic articles to align closely with the specific interests of\nusers, which are derived from explicit needs and behavioral patterns. Moreover,\nIntellectSeeker incorporates an advanced recommendation system and text\ncompression tools. These features enable intelligent article recommendations\nbased on user interactions and present search results through concise one-line\nsummaries and innovative word cloud visualizations, significantly enhancing\nresearch efficiency and user experience. IntellectSeeker offers academic\nresearchers a highly customizable literature management solution with\nexceptional search precision and matching capabilities. The code can be found\nhere: https://github.com/LuckyBian/ISY5001\n","authors":["Weizhen Bian","Siyan Liu","Yubo Zhou","Dezhi Chen","Yijie Liao","Zhenzhen Fan","Aobo Wang"],"pdf_url":"https://arxiv.org/pdf/2412.07213v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05579v2","updated":"2024-12-10T05:49:12Z","published":"2024-12-07T08:07:24Z","title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","summary":" The rapid advancement of Large Language Models (LLMs) has driven their\nexpanding application across various fields. One of the most promising\napplications is their role as evaluators based on natural language responses,\nreferred to as ''LLMs-as-judges''. This framework has attracted growing\nattention from both academia and industry due to their excellent effectiveness,\nability to generalize across tasks, and interpretability in the form of natural\nlanguage. This paper presents a comprehensive survey of the LLMs-as-judges\nparadigm from five key perspectives: Functionality, Methodology, Applications,\nMeta-evaluation, and Limitations. We begin by providing a systematic definition\nof LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then\nwe address methodology to construct an evaluation system with LLMs (How to use\nLLM judges?). Additionally, we investigate the potential domains for their\napplication (Where to use LLM judges?) 
and discuss methods for evaluating them\nin various contexts (How to evaluate LLM judges?). Finally, we provide a\ndetailed analysis of the limitations of LLM judges and discuss potential future\ndirections. Through a structured and comprehensive analysis, we aim to\nprovide insights on the development and application of LLMs-as-judges in both\nresearch and practice. We will continue to maintain the relevant resource list\nat https://github.com/CSHaitao/Awesome-LLMs-as-Judges.\n","authors":["Haitao Li","Qian Dong","Junjie Chen","Huixue Su","Yujia Zhou","Qingyao Ai","Ziyi Ye","Yiqun Liu"],"pdf_url":"https://arxiv.org/pdf/2412.05579v2.pdf","comment":"60 pages, comprehensive and continuously updated"},{"id":"http://arxiv.org/abs/2408.08931v2","updated":"2024-12-10T03:39:16Z","published":"2024-08-16T05:49:14Z","title":"Personalized Federated Collaborative Filtering: A Variational\n AutoEncoder Approach","summary":" Federated Collaborative Filtering (FedCF) is an emerging field focused on\ndeveloping a new recommendation framework that preserves privacy in a\nfederated setting. Existing FedCF methods typically combine distributed\nCollaborative Filtering (CF) algorithms with privacy-preserving mechanisms, and\nthen preserve personalized information into a user embedding vector. However,\nthe user embedding is usually insufficient to preserve the rich information of\nthe fine-grained personalization across heterogeneous clients. This paper\nproposes a novel personalized FedCF method by preserving users' personalized\ninformation into a latent variable and a neural model simultaneously.\nSpecifically, we decompose the modeling of user knowledge into two encoders,\neach designed to capture shared knowledge and personalized knowledge\nseparately. A personalized gating network is then applied to balance\npersonalization and generalization between the global and local encoders.\nMoreover, to effectively train the proposed framework, we model the CF problem\nas a specialized Variational AutoEncoder (VAE) task by integrating user\ninteraction vector reconstruction with missing value prediction. The decoder is\ntrained to reconstruct the implicit feedback from items the user has interacted\nwith, while also predicting items the user might be interested in but has not\nyet interacted with. Experimental results on benchmark datasets demonstrate\nthat the proposed method outperforms other baseline methods, showcasing\nsuperior performance. Our code is available at https://github.com/mtics/FedDAE.\n","authors":["Zhiwei Li","Guodong Long","Tianyi Zhou","Jing Jiang","Chengqi Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.08931v2.pdf","comment":"10 pages, 3 figures, 4 tables, conference"},{"id":"http://arxiv.org/abs/2410.02126v2","updated":"2024-12-10T02:14:00Z","published":"2024-10-03T01:14:30Z","title":"BayesCNS: A Unified Bayesian Approach to Address Cold Start and\n Non-Stationarity in Search Systems at Scale","summary":" Information Retrieval (IR) systems used in search and recommendation\nplatforms frequently employ Learning-to-Rank (LTR) models to rank items in\nresponse to user queries. These models heavily rely on features derived from\nuser interactions, such as clicks and engagement data. This dependence\nintroduces cold start issues for items lacking user engagement and poses\nchallenges in adapting to non-stationary shifts in user behavior over time. 
We\naddress both challenges holistically as an online learning problem and propose\nBayesCNS, a Bayesian approach designed to handle cold start and non-stationary\ndistribution shifts in search systems at scale. BayesCNS achieves this by\nestimating prior distributions for user-item interactions, which are\ncontinuously updated with new user interactions gathered online. This online\nlearning procedure is guided by a ranker model, enabling efficient exploration\nof relevant items using contextual information provided by the ranker. We\nsuccessfully deployed BayesCNS in a large-scale search system and demonstrated\nits efficacy through comprehensive offline and online experiments. Notably, an\nonline A/B experiment showed a 10.60% increase in new item interactions and a\n1.05% improvement in overall success metrics over the existing production\nbaseline.\n","authors":["Randy Ardywibowo","Rakesh Sunki","Lucy Kuo","Sankalp Nayak"],"pdf_url":"https://arxiv.org/pdf/2410.02126v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.07948v1","updated":"2024-12-10T22:22:19Z","published":"2024-12-10T22:22:19Z","title":"Frechet Music Distance: A Metric For Generative Symbolic Music\n Evaluation","summary":" In this paper we introduce the Frechet Music Distance (FMD), a novel\nevaluation metric for generative symbolic music models, inspired by the Frechet\nInception Distance (FID) in computer vision and Frechet Audio Distance (FAD) in\ngenerative audio. FMD calculates the distance between distributions of\nreference and generated symbolic music embeddings, capturing abstract musical\nfeatures. We validate FMD across several datasets and models. Results indicate\nthat FMD effectively differentiates model quality, providing a domain-specific\nmetric for evaluating symbolic music generation, and establishing a\nreproducible standard for future research in symbolic music modeling.\n","authors":["Jan Retkowski","Jakub Stępniak","Mateusz Modrzejewski"],"pdf_url":"https://arxiv.org/pdf/2412.07948v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.22046v3","updated":"2024-12-10T19:51:42Z","published":"2024-10-29T13:53:09Z","title":"CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions","summary":" Chord progressions encapsulate important information about music, pertaining\nto its structure and conveyed emotions. They serve as the backbone of musical\ncomposition, and in many cases, they are the sole information required for a\nmusician to play along and follow the music. Despite their importance, chord\nprogressions as a data domain remain underexplored. There is a lack of\nlarge-scale datasets suitable for deep learning applications, and limited\nresearch exploring chord progressions as an input modality. In this work, we\npresent Chordonomicon, a dataset of over 666,000 songs and their chord\nprogressions, annotated with structural parts, genre, and release date -\ncreated by scraping various sources of user-generated progressions and\nassociated metadata. We demonstrate the practical utility of the Chordonomicon\ndataset for classification and generation tasks, and discuss its potential to\nprovide valuable insights to the research community. Chord progressions are\nunique in their ability to be represented in multiple formats (e.g. text,\ngraph) and the wealth of information chords convey in given contexts, such as\ntheir harmonic function . 
These characteristics make the Chordonomicon an ideal\ntestbed for exploring advanced machine learning techniques, including\ntransformers, graph machine learning, and hybrid systems that combine knowledge\nrepresentation and machine learning.\n","authors":["Spyridon Kantarelis","Konstantinos Thomas","Vassilis Lyberatos","Edmund Dervakos","Giorgos Stamou"],"pdf_url":"https://arxiv.org/pdf/2410.22046v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07889v1","updated":"2024-12-10T19:48:57Z","published":"2024-12-10T19:48:57Z","title":"Low-Latency Scalable Streaming for Event-Based Vision","summary":" Recently, we have witnessed the rise of novel ``event-based'' camera sensors\nfor high-speed, low-power video capture. Rather than recording discrete image\nframes, these sensors output asynchronous ``event'' tuples with microsecond\nprecision, only when the brightness change of a given pixel exceeds a certain\nthreshold. Although these sensors have enabled compelling new computer vision\napplications, these applications often require expensive, power-hungry GPU\nsystems, rendering them incompatible for deployment on the low-power devices\nfor which event cameras are optimized. Whereas receiver-driven rate adaptation\nis a crucial feature of modern video streaming solutions, this topic is\nunderexplored in the realm of event-based vision systems. On a real-world event\ncamera dataset, we first demonstrate that a state-of-the-art object detection\napplication is resilient to dramatic data loss, and that this loss may be\nweighted towards the end of each temporal window. We then propose a scalable\nstreaming method for event-based data based on Media Over QUIC, prioritizing\nobject detection performance and low latency. The application server can\nreceive complementary event data across several streams simultaneously, and\ndrop streams as needed to maintain a certain latency. With a latency target of\n5 ms for end-to-end transmission across a small network, we observe an average\nreduction in detection mAP as low as 0.36. With a more relaxed latency target\nof 50 ms, we observe an average mAP reduction as low as 0.19.\n","authors":["Andrew Hamara","Benjamin Kilpatrick","Alex Baratta","Brendon Kofink","Andrew C. Freeman"],"pdf_url":"https://arxiv.org/pdf/2412.07889v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07730v1","updated":"2024-12-10T18:27:06Z","published":"2024-12-10T18:27:06Z","title":"STIV: Scalable Text and Image Conditioned Video Generation","summary":" The field of video generation has made remarkable advancements, yet there\nremains a pressing need for a clear, systematic recipe that can guide the\ndevelopment of robust and scalable models. In this work, we present a\ncomprehensive study that systematically explores the interplay of model\narchitectures, training recipes, and data curation strategies, culminating in a\nsimple and scalable text-image-conditioned video generation method, named STIV.\nOur framework integrates image condition into a Diffusion Transformer (DiT)\nthrough frame replacement, while incorporating text conditioning via a joint\nimage-text conditional classifier-free guidance. This design enables STIV to\nperform both text-to-video (T2V) and text-image-to-video (TI2V) tasks\nsimultaneously. Additionally, STIV can be easily extended to various\napplications, such as video prediction, frame interpolation, multi-view\ngeneration, and long video generation, etc. 
With comprehensive ablation studies\non T2I, T2V, and TI2V, STIV demonstrates strong performance, despite its simple\ndesign. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V,\nsurpassing both leading open and closed-source models like CogVideoX-5B, Pika,\nKling, and Gen-3. The same-sized model also achieves a state-of-the-art result\nof 90.1 on the VBench I2V task at 512 resolution. By providing a transparent and\nextensible recipe for building cutting-edge video generation models, we aim to\nempower future research and accelerate progress toward more versatile and\nreliable video generation solutions.\n","authors":["Zongyu Lin","Wei Liu","Chen Chen","Jiasen Lu","Wenze Hu","Tsu-Jui Fu","Jesse Allardice","Zhengfeng Lai","Liangchen Song","Bowen Zhang","Cha Chen","Yiran Fei","Yifan Jiang","Lezhi Li","Yizhou Sun","Kai-Wei Chang","Yinfei Yang"],"pdf_url":"https://arxiv.org/pdf/2412.07730v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12140v2","updated":"2024-12-10T18:24:13Z","published":"2024-09-18T17:03:30Z","title":"MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion","summary":" We introduce MoRAG, a novel multi-part fusion based retrieval-augmented\ngeneration strategy for text-based human motion generation. The method enhances\nmotion diffusion models by leveraging additional knowledge obtained through an\nimproved motion retrieval process. By effectively prompting large language\nmodels (LLMs), we address spelling errors and rephrasing issues in motion\nretrieval. Our approach utilizes a multi-part retrieval strategy to improve the\ngeneralizability of motion retrieval across the language space. We create\ndiverse samples through the spatial composition of the retrieved motions.\nFurthermore, by utilizing low-level, part-specific motion information, we can\nconstruct motion samples for unseen text descriptions. Our experiments\ndemonstrate that our framework can serve as a plug-and-play module, improving\nthe performance of motion diffusion models. Code, pretrained models and sample\nvideos are available at: https://motion-rag.github.io/\n","authors":["Sai Shashank Kalakonda","Shubh Maheshwari","Ravi Kiran Sarvadevabhatla"],"pdf_url":"https://arxiv.org/pdf/2409.12140v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.13117v2","updated":"2024-12-10T16:17:50Z","published":"2024-07-18T02:55:52Z","title":"SOMONITOR: Combining Explainable AI & Large Language Models for\n Marketing Analytics","summary":" Online marketing faces formidable challenges in managing and interpreting\nimmense volumes of data necessary for competitor analysis, content research,\nand strategic branding. It is impossible to review hundreds to thousands of\ntransient online content items by hand, and partial analysis often leads to\nsuboptimal outcomes and poorly performing campaigns. We introduce an\nexplainable AI framework, SOMONITOR, that aims to synergize human intuition with\nAI-based efficiency, helping marketers across all stages of the marketing\nfunnel, from strategic planning to content creation and campaign execution.\nSOMONITOR incorporates a CTR prediction and ranking model for advertising\ncontent and uses large language models (LLMs) to process high-performing\ncompetitor content, identifying core content pillars such as target audiences,\ncustomer needs, and product features. These pillars are then organized into\nbroader categories, including communication themes and targeted customer\npersonas. 
By integrating these insights with data from the brand's own\nadvertising campaigns, SOMONITOR constructs a narrative for addressing new\ncustomer personas and simultaneously generates detailed content briefs in the\nform of user stories that, as shown in the conducted case study, can be\ndirectly applied by marketing teams to streamline content production and\ncampaign execution. The adoption of SOMONITOR in daily operations allows\ndigital marketers to quickly parse through extensive datasets, offering\nactionable insights that significantly enhance campaign effectiveness and\noverall job satisfaction.\n","authors":["Aleksandr Farseev","Qi Yang","Marlo Ongpin","Ilia Gossoudarev","Yu-Yi Chu-Farseeva","Sergey Nikolenko"],"pdf_url":"https://arxiv.org/pdf/2407.13117v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07406v1","updated":"2024-12-10T10:56:02Z","published":"2024-12-10T10:56:02Z","title":"Learning Self-Supervised Audio-Visual Representations for Sound\n Recommendations","summary":" We propose a novel self-supervised approach for learning audio and visual\nrepresentations from unlabeled videos, based on their correspondence. The\napproach uses an attention mechanism to learn the relative importance of\nconvolutional features extracted at different resolutions from the audio and\nvisual streams and uses the attention features to encode the audio and visual\ninput based on their correspondence. We evaluated the representations learned\nby the model to classify audio-visual correlation as well as to recommend sound\neffects for visual scenes. Our results show that the representations generated\nby the attention model improve the correlation accuracy by 18% and the\nrecommendation accuracy by 10% compared to the baseline for VGG-Sound, which is\na public video dataset. Additionally, audio-visual representations learned by\ntraining the attention model with cross-modal contrastive learning further\nimprove the recommendation performance, based on our evaluation using\nVGG-Sound and a more challenging dataset consisting of gameplay video\nrecordings.\n","authors":["Sudha Krishnamurthy"],"pdf_url":"https://arxiv.org/pdf/2412.07406v1.pdf","comment":"Published in the Proceedings of the International Symposium on Visual\n Computing, 2021 https://dl.acm.org/doi/10.1007/978-3-030-90436-4_10"},{"id":"http://arxiv.org/abs/2412.07292v1","updated":"2024-12-10T08:21:19Z","published":"2024-12-10T08:21:19Z","title":"Multimodal Sentiment Analysis Based on Causal Reasoning","summary":" With the rapid development of multimedia, the shift from unimodal textual\nsentiment analysis to multimodal image-text sentiment analysis has attracted\nacademic and industrial attention in recent years. However, multimodal\nsentiment analysis is affected by unimodal data bias, e.g., text sentiment is\nmisleading due to explicit sentiment semantics, leading to low accuracy in the\nfinal sentiment classification. In this paper, we propose a novel\nCounterFactual Multimodal Sentiment Analysis framework (CF-MSA) using causal\ncounterfactual inference to construct multimodal sentiment causal inference.\nCF-MSA mitigates the direct effect from unimodal bias and ensures heterogeneity\nacross modalities by differentiating the treatment variables between\nmodalities. In addition, considering the information complementarity and bias\ndifferences between modalities, we propose a new optimisation objective to\neffectively integrate different modalities and reduce the inherent bias from\neach modality. 
Experimental results on two public datasets, MVSA-Single and\nMVSA-Multiple, demonstrate that the proposed CF-MSA has superior debiasing\ncapability and achieves new state-of-the-art performance. We will release the\ncode and datasets to facilitate future research.\n","authors":["Fuhai Chen","Pengpeng Huang","Xuri Ge","Jie Huang","Zishuo Bao"],"pdf_url":"https://arxiv.org/pdf/2412.07292v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07270v1","updated":"2024-12-10T07:52:23Z","published":"2024-12-10T07:52:23Z","title":"Reducing Traffic Wastage in Video Streaming via Bandwidth-Efficient\n Bitrate Adaptation","summary":" Bitrate adaptation (also known as ABR) is a crucial technique to improve the\nquality of experience (QoE) for video streaming applications. However, existing\nABR algorithms suffer from severe traffic wastage, which refers to the traffic\ncost of downloading the video segments that users do not finally consume, for\nexample, due to early departure or video skipping. In this paper, we carefully\nformulate the dynamics of buffered data volume (BDV), a strongly correlated\nindicator of traffic wastage, which, to the best of our knowledge, is the first\nrigorous clarification of the effect of downloading plans on potential\nwastage. To reduce wastage while keeping a high QoE, we present a\nbandwidth-efficient bitrate adaptation algorithm (named BE-ABR), achieving\nconsistently low BDV without distinct QoE losses. Specifically, we design a\nprecise, time-aware transmission delay prediction model over the Transformer\narchitecture, and develop a fine-grained buffer control scheme. Through\nextensive experiments conducted on emulated and real network environments\nincluding WiFi, 4G, and 5G, we demonstrate that BE-ABR performs well in both\nQoE and bandwidth savings, enabling a 60.87\\% wastage reduction and a\ncomparable, or even better, QoE, compared to the state-of-the-art methods.\n","authors":["Hairong Su","Shibo Wang","Shusen Yang","Tianchi Huang","Xuebin Ren"],"pdf_url":"https://arxiv.org/pdf/2412.07270v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07268v1","updated":"2024-12-10T07:49:07Z","published":"2024-12-10T07:49:07Z","title":"PTSBench: A Comprehensive Post-Training Sparsity Benchmark Towards\n Algorithms and Models","summary":" With the increased attention to model efficiency, post-training sparsity\n(PTS) has become more and more prevalent because of its effectiveness and\nefficiency. However, there remain open questions about best practices for PTS\nalgorithms and the sparsification ability of models, which hinders the further\ndevelopment of this area. Therefore, a benchmark to comprehensively investigate\nthe issues above is urgently needed. In this paper, we propose the first\ncomprehensive post-training sparsity benchmark called PTSBench towards\nalgorithms and models. We benchmark 10+ PTS general-pluggable fine-grained\ntechniques on 3 typical tasks using over 40 off-the-shelf model architectures.\nThrough extensive experiments and analyses, we obtain valuable conclusions and\nprovide several insights from both algorithms and model aspects. Our PTSBench\ncan provide (1) new observations for a better understanding of the PTS\nalgorithms, (2) in-depth and comprehensive evaluations for the sparsification\nability of models, and (3) a well-structured and easy-to-integrate open-source\nframework. We hope this work will provide illuminating conclusions and advice\nfor future studies of post-training sparsity methods and\nsparsification-friendly model design. 
The code for our PTSBench is released at\n\\href{https://github.com/ModelTC/msbench}{https://github.com/ModelTC/msbench}.\n","authors":["Zining Wnag","Jinyang Guo","Ruihao Gong","Yang Yong","Aishan Liu","Yushi Huang","Jiaheng Liu","Xianglong Liu"],"pdf_url":"https://arxiv.org/pdf/2412.07268v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07215v1","updated":"2024-12-10T06:11:59Z","published":"2024-12-10T06:11:59Z","title":"RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation","summary":" In recent years, robotics has advanced significantly through the integration\nof larger models and large-scale datasets. However, challenges remain in\napplying these models to 3D spatial interactions and managing data collection\ncosts. To address these issues, we propose the multimodal robotic manipulation\nmodel, RoboMM, along with the comprehensive dataset, RoboData. RoboMM enhances\n3D perception through camera parameters and occupancy supervision. Building on\nOpenFlamingo, it incorporates Modality-Isolation-Mask and multimodal decoder\nblocks, improving modality fusion and fine-grained perception. RoboData offers\nthe complete evaluation system by integrating several well-known datasets,\nachieving the first fusion of multi-view images, camera parameters, depth maps,\nand actions, and the space alignment facilitates comprehensive learning from\ndiverse robotic datasets. Equipped with RoboData and the unified physical\nspace, RoboMM is the generalist policy that enables simultaneous evaluation\nacross all tasks within multiple datasets, rather than focusing on limited\nselection of data or tasks. Its design significantly enhances robotic\nmanipulation performance, increasing the average sequence length on the CALVIN\nfrom 1.7 to 3.3 and ensuring cross-embodiment capabilities, achieving\nstate-of-the-art results across multiple datasets.\n","authors":["Feng Yan","Fanfan Liu","Liming Zheng","Yufeng Zhong","Yiyang Huang","Zechao Guan","Chengjian Feng","Lin Ma"],"pdf_url":"https://arxiv.org/pdf/2412.07215v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07155v1","updated":"2024-12-10T03:24:14Z","published":"2024-12-10T03:24:14Z","title":"Annotation Techniques for Judo Combat Phase Classification from\n Tournament Footage","summary":" This paper presents a semi-supervised approach to extracting and analyzing\ncombat phases in judo tournaments using live-streamed footage. The objective is\nto automate the annotation and summarization of live streamed judo matches. We\ntrain models that extract relevant entities and classify combat phases from\nfixed-perspective judo recordings. We employ semi-supervised methods to address\nlimited labeled data in the domain. We build a model of combat phases via\ntransfer learning from a fine-tuned object detector to classify the presence,\nactivity, and standing state of the match. We evaluate our approach on a\ndataset of 19 thirty-second judo clips, achieving an F1 score on a $20\\%$ test\nhold-out of 0.66, 0.78, and 0.87 for the three classes, respectively. 
Our\nresults show initial promise for automating more complex information retrieval\ntasks using rigorous methods with limited labeled data.\n","authors":["Anthony Miyaguchi","Jed Moutahir","Tanmay Sutar"],"pdf_url":"https://arxiv.org/pdf/2412.07155v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.07080v1","updated":"2024-12-10T00:42:54Z","published":"2024-12-10T00:42:54Z","title":"EvRepSL: Event-Stream Representation via Self-Supervised Learning for\n Event-Based Vision","summary":" Event-stream representation is the first step for many computer vision tasks\nusing event cameras. It converts the asynchronous event-streams into a\nformatted structure so that conventional machine learning models can be applied\neasily. However, most of the state-of-the-art event-stream representations are\nmanually designed and the quality of these representations cannot be guaranteed\ndue to the noisy nature of event-streams. In this paper, we introduce a\ndata-driven approach aiming at enhancing the quality of event-stream\nrepresentations. Our approach commences with the introduction of a new\nevent-stream representation based on spatial-temporal statistics, denoted as\nEvRep. Subsequently, we theoretically derive the intrinsic relationship between\nasynchronous event-streams and synchronous video frames. Building upon this\ntheoretical relationship, we train a representation generator, RepGen, in a\nself-supervised learning manner accepting EvRep as input. Finally, the\nevent-streams are converted to high-quality representations, termed as EvRepSL,\nby going through the learned RepGen (without the need of fine-tuning or\nretraining). Our methodology is rigorously validated through extensive\nevaluations on a variety of mainstream event-based classification and optical\nflow datasets (captured with various types of event cameras). The experimental\nresults highlight not only our approach's superior performance over existing\nevent-stream representations but also its versatility, being agnostic to\ndifferent event cameras and tasks.\n","authors":["Qiang Qu","Xiaoming Chen","Yuk Ying Chung","Yiran Shen"],"pdf_url":"https://arxiv.org/pdf/2412.07080v1.pdf","comment":"Published on IEEE Transactions on Image Processing"}]},"2024-12-09T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.07030v1","updated":"2024-12-09T22:35:44Z","published":"2024-12-09T22:35:44Z","title":"FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge\n Distillation for Question Answering","summary":" Multimodal multihop question answering is a complex task that requires\nreasoning over multiple sources of information, such as images and text, to\nanswer questions. While there has been significant progress in visual question\nanswering, the multihop setting remains unexplored due to the lack of\nhigh-quality datasets. Current methods focus on single-hop question answering\nor a single modality, which makes them unsuitable for real-world scenarios such\nas analyzing multimodal educational materials, summarizing lengthy academic\narticles, or interpreting scientific studies that combine charts, images, and\ntext. To address this gap, we propose a novel methodology, introducing the\nfirst framework for creating a high-quality dataset that enables training\nmodels for multimodal multihop question answering. 
Our approach consists of a\n5-stage pipeline that involves acquiring relevant multimodal documents from\nWikipedia, synthetically generating high-level questions and answers, and\nvalidating them through rigorous criteria to ensure quality data. We evaluate\nour methodology by training models on our synthesized dataset and testing on\ntwo benchmarks, our results demonstrate that, with an equal sample size, models\ntrained on our synthesized data outperform those trained on human-collected\ndata by 1.9 in exact match (EM) on average. We believe our data synthesis\nmethod will serve as a strong foundation for training and evaluating multimodal\nmultihop question answering models.\n","authors":["Amirhossein Abaskohi","Spandana Gella","Giuseppe Carenini","Issam H. Laradji"],"pdf_url":"https://arxiv.org/pdf/2412.07030v1.pdf","comment":"20 pages, 11 figures, 10 tables, Submitted to CVPR 2025"},{"id":"http://arxiv.org/abs/2412.06949v1","updated":"2024-12-09T19:53:13Z","published":"2024-12-09T19:53:13Z","title":"Bridging Conversational and Collaborative Signals for Conversational\n Recommendation","summary":" Conversational recommendation systems (CRS) leverage contextual information\nfrom conversations to generate recommendations but often struggle due to a lack\nof collaborative filtering (CF) signals, which capture user-item interaction\npatterns essential for accurate recommendations. We introduce Reddit-ML32M, a\ndataset that links reddit conversations with interactions on MovieLens 32M, to\nenrich item representations by leveraging collaborative knowledge and\naddressing interaction sparsity in conversational datasets. We propose an\nLLM-based framework that uses Reddit-ML32M to align LLM-generated\nrecommendations with CF embeddings, refining rankings for better performance.\nWe evaluate our framework against three sets of baselines: CF-based\nrecommenders using only interactions from CRS tasks, traditional CRS models,\nand LLM-based methods relying on conversational context without item\nrepresentations. Our approach achieves consistent improvements, including a\n12.32% increase in Hit Rate and a 9.9% improvement in NDCG, outperforming the\nbest-performing baseline that relies on conversational context but lacks\ncollaborative item representations.\n","authors":["Ahmad Bin Rabiah","Nafis Sadeq","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2412.06949v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06924v1","updated":"2024-12-09T19:10:03Z","published":"2024-12-09T19:10:03Z","title":"Efficient user history modeling with amortized inference for deep\n learning recommendation models","summary":" We study user history modeling via Transformer encoders in deep learning\nrecommendation models (DLRM). Such architectures can significantly improve\nrecommendation quality, but usually incur high latency cost necessitating\ninfrastructure upgrades or very small Transformer models. An important part of\nuser history modeling is early fusion of the candidate item and various methods\nhave been studied. We revisit early fusion and compare concatenation of the\ncandidate to each history item against appending it to the end of the list as a\nseparate item. Using the latter method, allows us to reformulate the recently\nproposed amortized history inference algorithm M-FALCON \\cite{zhai2024actions}\nfor the case of DLRM models. We show via experimental results that appending\nwith cross-attention performs on par with concatenation and that amortization\nsignificantly reduces inference costs. 
We conclude with results from deploying\nthis model on the LinkedIn Feed and Ads surfaces, where amortization reduces\nlatency by 30\\% compared to non-amortized inference.\n","authors":["Lars Hertel","Neil Daftary","Fedor Borisyuk","Aman Gupta","Rahul Mazumder"],"pdf_url":"https://arxiv.org/pdf/2412.06924v1.pdf","comment":"5 pages, 3 figures, WWW 2025"},{"id":"http://arxiv.org/abs/2412.00430v3","updated":"2024-12-09T18:46:37Z","published":"2024-11-30T10:56:30Z","title":"Predictive Models in Sequential Recommendations: Bridging Performance\n Laws with Data Quality Insights","summary":" Sequential Recommendation (SR) plays a critical role in predicting users'\nsequential preferences. Despite its growing prominence in various industries,\nthe increasing scale of SR models incurs substantial computational costs and\nunpredictability, challenging developers to manage resources efficiently. Under\nthis predicament, Scaling Laws have achieved significant success by examining\nthe loss as models scale up. However, there remains a disparity between loss\nand model performance, which is of greater concern in practical applications.\nMoreover, as data continues to expand, it incorporates repetitive and\ninefficient data. In response, we introduce the Performance Law for SR models,\nwhich aims to theoretically investigate and model the relationship between\nmodel performance and data quality. Specifically, we first fit the HR and NDCG\nmetrics to transformer-based SR models. Subsequently, we propose Approximate\nEntropy (ApEn) to assess data quality, presenting a more nuanced approach\ncompared to traditional data quantity metrics. Our method enables accurate\npredictions across various dataset scales and model sizes, demonstrating a\nstrong correlation in large SR models and offering insights into achieving\noptimal performance for any given model configuration.\n","authors":["Tingjia Shen","Hao Wang","Chuhan Wu","Jin Yao Chin","Wei Guo","Yong Liu","Huifeng Guo","Defu Lian","Ruiming Tang","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00430v3.pdf","comment":"12 pages, 5 figures"},{"id":"http://arxiv.org/abs/2403.19546v3","updated":"2024-12-09T18:37:55Z","published":"2024-03-28T16:27:26Z","title":"Croissant: A Metadata Format for ML-Ready Datasets","summary":" Data is a critical resource for machine learning (ML), yet working with data\nremains a key friction point. This paper introduces Croissant, a metadata\nformat for datasets that creates a shared representation across ML tools,\nframeworks, and platforms. Croissant makes datasets more discoverable,\nportable, and interoperable, thereby addressing significant challenges in ML\ndata management. Croissant is already supported by several popular dataset\nrepositories, spanning hundreds of thousands of datasets, enabling easy loading\ninto the most commonly-used ML frameworks, regardless of where the data is\nstored. 
Our initial evaluation by human raters shows that Croissant metadata is\nreadable, understandable, complete, yet concise.\n","authors":["Mubashara Akhtar","Omar Benjelloun","Costanza Conforti","Luca Foschini","Joan Giner-Miguelez","Pieter Gijsbers","Sujata Goswami","Nitisha Jain","Michalis Karamousadakis","Michael Kuchnik","Satyapriya Krishna","Sylvain Lesage","Quentin Lhoest","Pierre Marcenac","Manil Maskey","Peter Mattson","Luis Oala","Hamidah Oderinwale","Pierre Ruyssen","Tim Santos","Rajat Shinde","Elena Simperl","Arjun Suresh","Goeffry Thomas","Slava Tykhonov","Joaquin Vanschoren","Susheel Varma","Jos van der Velde","Steffen Vogler","Carole-Jean Wu","Luyao Zhang"],"pdf_url":"https://arxiv.org/pdf/2403.19546v3.pdf","comment":"Published at the NeurIPS 2024 Datasets and Benchmark Track. A shorter\n version appeared earlier in Proceedings of ACM SIGMOD/PODS'24 Data Management\n for End-to-End Machine Learning (DEEM) Workshop\n https://dl.acm.org/doi/10.1145/3650203.3663326"},{"id":"http://arxiv.org/abs/2412.06695v1","updated":"2024-12-09T17:41:25Z","published":"2024-12-09T17:41:25Z","title":"DEEPER: Dense Electroencephalography Passage Retrieval","summary":" Information retrieval systems have historically relied on explicit query\nformulation, requiring users to translate their information needs into text.\nThis process is particularly disruptive during reading tasks, where users must\ninterrupt their natural flow to formulate queries. We present DEEPER (Dense\nElectroencephalography Passage Retrieval), a novel framework that enables\ndirect retrieval of relevant passages from users' neural signals during\nnaturalistic reading without intermediate text translation. Building on dense\nretrieval architectures, DEEPER employs a dual-encoder approach with\nspecialised components for processing neural data, mapping EEG signals and text\npassages into a shared semantic space. Through careful architecture design and\ncross-modal negative sampling strategies, our model learns to align neural\npatterns with their corresponding textual content. Experimental results on the\nZuCo dataset demonstrate that direct brain-to-passage retrieval significantly\noutperforms current EEG-to-text baselines, achieving a 571% improvement in\nPrecision@1. Our ablation studies reveal that the model successfully learns\naligned representations between EEG and text modalities (0.29 cosine\nsimilarity), while our hard negative sampling strategy contributes to overall\nperformance increases.\n","authors":["Niall McGuire","Yashar Moshfeghi"],"pdf_url":"https://arxiv.org/pdf/2412.06695v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06649v1","updated":"2024-12-09T16:43:23Z","published":"2024-12-09T16:43:23Z","title":"Semantic Search and Recommendation Algorithm","summary":" This paper introduces a new semantic search algorithm that uses Word2Vec and\nAnnoy Index to improve the efficiency of information retrieval from large\ndatasets. The proposed approach addresses the limitations of traditional search\nmethods by offering enhanced speed, accuracy, and scalability. 
Testing on\ndatasets up to 100GB demonstrates the method's effectiveness in processing vast\namounts of data while maintaining high precision and performance.\n","authors":["Aryan Duhan","Aryan Singhal","Shourya Sharma"," Neeraj","Arti MK"],"pdf_url":"https://arxiv.org/pdf/2412.06649v1.pdf","comment":"6 pages, 5 Figures"},{"id":"http://arxiv.org/abs/2409.05633v2","updated":"2024-12-09T09:44:27Z","published":"2024-09-09T14:04:17Z","title":"Enhancing Graph Contrastive Learning with Reliable and Informative\n Augmentation for Recommendation","summary":" Graph neural network(GNN) has been a powerful approach in collaborative\nfiltering(CF) due to its ability to model high-order user-item relationships.\nRecently, to alleviate the data sparsity and enhance representation learning,\nmany efforts have been conducted to integrate contrastive learning(CL) with\nGNNs. Despite the promising improvements, the contrastive view generation based\non structure and representation perturbations in existing methods potentially\ndisrupts the collaborative information in contrastive views, resulting in\nlimited effectiveness of positive alignment. To overcome this issue, we propose\nCoGCL, a novel framework that aims to enhance graph contrastive learning by\nconstructing contrastive views with stronger collaborative information via\ndiscrete codes. The core idea is to map users and items into discrete codes\nrich in collaborative information for reliable and informative contrastive view\ngeneration. To this end, we initially introduce a multi-level vector quantizer\nin an end-to-end manner to quantize user and item representations into discrete\ncodes. Based on these discrete codes, we enhance the collaborative information\nof contrastive views by considering neighborhood structure and semantic\nrelevance respectively. For neighborhood structure, we propose virtual neighbor\naugmentation by treating discrete codes as virtual neighbors, which expands an\nobserved user-item interaction into multiple edges involving discrete codes.\nRegarding semantic relevance, we identify similar users/items based on shared\ndiscrete codes and interaction targets to generate the semantically relevant\nview. Through these strategies, we construct contrastive views with stronger\ncollaborative information and develop a triple-view graph contrastive learning\napproach. Extensive experiments on four public datasets demonstrate the\neffectiveness of our proposed approach.\n","authors":["Bowen Zheng","Junjie Zhang","Hongyu Lu","Yu Chen","Ming Chen","Wayne Xin Zhao","Ji-Rong Wen"],"pdf_url":"https://arxiv.org/pdf/2409.05633v2.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2412.05248v2","updated":"2024-12-09T09:21:49Z","published":"2024-12-06T18:27:15Z","title":"Enhancing FKG.in: automating Indian food composition analysis","summary":" This paper presents a novel approach to compute food composition data for\nIndian recipes using a knowledge graph for Indian food (FKG.in) and LLMs. The\nprimary focus is to provide a broad overview of an automated food composition\nanalysis workflow and describe its core functionalities: nutrition data\naggregation, food composition analysis, and LLM-augmented information\nresolution. This workflow aims to complement FKG.in and iteratively supplement\nfood composition data from verified knowledge bases. Additionally, this paper\nhighlights the challenges of representing Indian food and accessing food\ncomposition data digitally. 
It also reviews three key sources of food\ncomposition data: the Indian Food Composition Tables, the Indian Nutrient\nDatabank, and the Nutritionix API. Furthermore, it briefly outlines how users\ncan interact with the workflow to obtain diet-based health recommendations and\ndetailed food composition information for numerous recipes. We then explore the\ncomplex challenges of analyzing Indian recipe information across dimensions\nsuch as structure, multilingualism, and uncertainty as well as present our\nongoing work on LLM-based solutions to address these issues. The methods\nproposed in this workshop paper for AI-driven knowledge curation and\ninformation resolution are application-agnostic, generalizable, and replicable\nfor any domain.\n","authors":["Saransh Kumar Gupta","Lipika Dey","Partha Pratim Das","Geeta Trilok-Kumar","Ramesh Jain"],"pdf_url":"https://arxiv.org/pdf/2412.05248v2.pdf","comment":"15 pages, 5 figures, 30 references, International Conference on\n Pattern Recognition 2024 - Multimedia Assisted Dietary Management Workshop"},{"id":"http://arxiv.org/abs/2412.06308v1","updated":"2024-12-09T08:55:48Z","published":"2024-12-09T08:55:48Z","title":"PRECISE: Pre-training Sequential Recommenders with Collaborative and\n Semantic Information","summary":" Real-world recommendation systems commonly offer diverse content scenarios\nfor users to interact with. Considering the enormous number of users in\nindustrial platforms, it is infeasible to utilize a single unified\nrecommendation model to meet the requirements of all scenarios. Usually,\nseparate recommendation pipelines are established for each distinct scenario.\nThis practice leads to challenges in comprehensively grasping users' interests.\nRecent research endeavors have been made to tackle this problem by pre-training\nmodels to encapsulate the overall interests of users. Traditional pre-trained\nrecommendation models mainly capture user interests by leveraging collaborative\nsignals. Nevertheless, a prevalent drawback of these systems is their\nincapacity to handle long-tail items and cold-start scenarios. With the recent\nadvent of large language models, there has been a significant increase in\nresearch efforts focused on exploiting LLMs to extract semantic information for\nusers and items. However, text-based recommendations highly rely on elaborate\nfeature engineering and frequently fail to capture collaborative similarities.\nTo overcome these limitations, we propose a novel pre-training framework for\nsequential recommendation, termed PRECISE. This framework combines\ncollaborative signals with semantic information. Moreover, PRECISE employs a\nlearning framework that initially models users' comprehensive interests across\nall recommendation scenarios and subsequently concentrates on the specific\ninterests of target-scene behaviors. We demonstrate that PRECISE precisely\ncaptures the entire range of user interests and effectively transfers them to\nthe target interests. 
Empirical findings reveal that the PRECISE framework\nattains outstanding performance on both public and industrial datasets.\n","authors":["Chonggang Song","Chunxu Shen","Hao Gu","Yaoming Wu","Lingling Yi","Jie Wen","Chuan Chen"],"pdf_url":"https://arxiv.org/pdf/2412.06308v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06272v1","updated":"2024-12-09T07:46:14Z","published":"2024-12-09T07:46:14Z","title":"Methods for Legal Citation Prediction in the Age of LLMs: An Australian\n Law Case Study","summary":" In recent years, Large Language Models (LLMs) have shown great potential\nacross a wide range of legal tasks. Despite these advances, mitigating\nhallucination remains a significant challenge, with state-of-the-art LLMs still\nfrequently generating incorrect legal references. In this paper, we focus on\nthe problem of legal citation prediction within the Australian law context,\nwhere correctly identifying and citing relevant legislations or precedents is\ncritical. We compare several approaches: prompting general purpose and\nlaw-specialised LLMs, retrieval-only pipelines with both generic and\ndomain-specific embeddings, task-specific instruction-tuning of LLMs, and\nhybrid strategies that combine LLMs with retrieval augmentation, query\nexpansion, or voting ensembles. Our findings indicate that domain-specific\npre-training alone is insufficient for achieving satisfactory citation accuracy\neven after law-specialised pre-training. In contrast, instruction tuning on our\ntask-specific dataset dramatically boosts performance reaching the best results\nacross all settings. We also highlight that database granularity along with the\ntype of embeddings play a critical role in the performance of retrieval\nsystems. Among retrieval-based approaches, hybrid methods consistently\noutperform retrieval-only setups, and among these, ensemble voting delivers the\nbest result by combining the predictive quality of instruction-tuned LLMs with\nthe retrieval system.\n","authors":["Ehsan Shareghi","Jiuzhou Han","Paul Burgess"],"pdf_url":"https://arxiv.org/pdf/2412.06272v1.pdf","comment":"For code, data, and models see https://auslawbench.github.io"},{"id":"http://arxiv.org/abs/2405.13792v2","updated":"2024-12-09T06:07:03Z","published":"2024-05-22T16:15:17Z","title":"xRAG: Extreme Context Compression for Retrieval-augmented Generation\n with One Token","summary":" This paper introduces xRAG, an innovative context compression method tailored\nfor retrieval-augmented generation. xRAG reinterprets document embeddings in\ndense retrieval--traditionally used solely for retrieval--as features from the\nretrieval modality. By employing a modality fusion methodology, xRAG seamlessly\nintegrates these embeddings into the language model representation space,\neffectively eliminating the need for their textual counterparts and achieving\nan extreme compression rate. In xRAG, the only trainable component is the\nmodality bridge, while both the retriever and the language model remain frozen.\nThis design choice allows for the reuse of offline-constructed document\nembeddings and preserves the plug-and-play nature of retrieval augmentation.\nExperimental results demonstrate that xRAG achieves an average improvement of\nover 10% across six knowledge-intensive tasks, adaptable to various language\nmodel backbones, ranging from a dense 7B model to an 8x7B Mixture of Experts\nconfiguration. 
xRAG not only significantly outperforms previous context\ncompression methods but also matches the performance of uncompressed models on\nseveral datasets, while reducing overall FLOPs by a factor of 3.53. Our work\npioneers new directions in retrieval-augmented generation from the perspective\nof multimodality fusion, and we hope it lays the foundation for future\nefficient and scalable retrieval-augmented systems\n","authors":["Xin Cheng","Xun Wang","Xingxing Zhang","Tao Ge","Si-Qing Chen","Furu Wei","Huishuai Zhang","Dongyan Zhao"],"pdf_url":"https://arxiv.org/pdf/2405.13792v2.pdf","comment":"Neurips 2024"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.06693v1","updated":"2024-12-09T17:39:43Z","published":"2024-12-09T17:39:43Z","title":"OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large\n Language Model and its Omni-Extensions","summary":" The rapid advancements in Large Language Models (LLMs) have significantly\nexpanded their applications, ranging from multilingual support to\ndomain-specific tasks and multimodal integration. In this paper, we present\nOmniEvalKit, a novel benchmarking toolbox designed to evaluate LLMs and their\nomni-extensions across multilingual, multidomain, and multimodal capabilities.\nUnlike existing benchmarks that often focus on a single aspect, OmniEvalKit\nprovides a modular, lightweight, and automated evaluation system. It is\nstructured with a modular architecture comprising a Static Builder and Dynamic\nData Flow, promoting the seamless integration of new models and datasets.\nOmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering\ncomprehensive evaluations across thousands of model-dataset combinations.\nOmniEvalKit is dedicated to creating an ultra-lightweight and fast-deployable\nevaluation framework, making downstream applications more convenient and\nversatile for the AI community.\n","authors":["Yi-Kai Zhang","Xu-Xiang Zhong","Shiyin Lu","Qing-Guo Chen","De-Chuan Zhan","Han-Jia Ye"],"pdf_url":"https://arxiv.org/pdf/2412.06693v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06660v1","updated":"2024-12-09T16:59:35Z","published":"2024-12-09T16:59:35Z","title":"MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large\n Language Models","summary":" Research on large language models has advanced significantly across text,\nspeech, images, and videos. However, multi-modal music understanding and\ngeneration remain underexplored due to the lack of well-annotated datasets. To\naddress this, we introduce a dataset with 167.69 hours of multi-modal data,\nincluding text, images, videos, and music annotations. Based on this dataset,\nwe propose MuMu-LLaMA, a model that leverages pre-trained encoders for music,\nimages, and videos. 
For music generation, we integrate AudioLDM 2 and MusicGen.\nOur evaluation across four tasks--music understanding, text-to-music\ngeneration, prompt-based music editing, and multi-modal music\ngeneration--demonstrates that MuMu-LLaMA outperforms state-of-the-art models,\nshowing its potential for multi-modal music applications.\n","authors":["Shansong Liu","Atin Sakkeer Hussain","Qilong Wu","Chenshuo Sun","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2412.06660v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.11255v5","updated":"2024-12-09T16:15:07Z","published":"2023-11-19T06:50:52Z","title":"M$^{2}$UGen: Multi-modal Music Understanding and Generation with the\n Power of Large Language Models","summary":" The current landscape of research leveraging large language models (LLMs) is\nexperiencing a surge. Many works harness the powerful reasoning capabilities of\nthese models to comprehend various modalities, such as text, speech, images,\nvideos, etc. They also utilize LLMs to understand human intention and generate\ndesired outputs like images, videos, and music. However, research that combines\nboth understanding and generation using LLMs is still limited and in its\nnascent stage. To address this gap, we introduce a Multi-modal Music\nUnderstanding and Generation (M$^{2}$UGen) framework that integrates LLM's\nabilities to comprehend and generate music for different modalities. The\nM$^{2}$UGen framework is purpose-built to unlock creative potential from\ndiverse sources of inspiration, encompassing music, image, and video through\nthe use of pretrained MERT, ViT, and ViViT models, respectively. To enable\nmusic generation, we explore the use of AudioLDM 2 and MusicGen. Bridging\nmulti-modal understanding and music generation is accomplished through the\nintegration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA\nmodel to generate extensive datasets that support text/image/video-to-music\ngeneration, facilitating the training of our M$^{2}$UGen framework. We conduct\na thorough evaluation of our proposed framework. The experimental results\ndemonstrate that our model achieves or surpasses the performance of the current\nstate-of-the-art models.\n","authors":["Shansong Liu","Atin Sakkeer Hussain","Qilong Wu","Chenshuo Sun","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2311.11255v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06617v1","updated":"2024-12-09T16:09:44Z","published":"2024-12-09T16:09:44Z","title":"AI TrackMate: Finally, Someone Who Will Give Your Music More Than Just\n \"Sounds Great!\"","summary":" The rise of \"bedroom producers\" has democratized music creation, while\nchallenging producers to objectively evaluate their work. To address this, we\npresent AI TrackMate, an LLM-based music chatbot designed to provide\nconstructive feedback on music productions. By combining LLMs' inherent musical\nknowledge with direct audio track analysis, AI TrackMate offers\nproduction-specific insights, distinguishing it from text-only approaches. Our\nframework integrates a Music Analysis Module, an LLM-Readable Music Report, and\nMusic Production-Oriented Feedback Instruction, creating a plug-and-play,\ntraining-free system compatible with various LLMs and adaptable to future\nadvancements. We demonstrate AI TrackMate's capabilities through an interactive\nweb interface and present findings from a pilot study with a music producer. 
By\nbridging AI capabilities with the needs of independent producers, AI TrackMate\noffers on-demand analytical feedback, potentially supporting the creative\nprocess and skill development in music production. This system addresses the\ngrowing demand for objective self-assessment tools in the evolving landscape of\nindependent music production.\n","authors":["Yi-Lin Jiang","Chia-Ho Hsiung","Yen-Tung Yeh","Lu-Rong Chen","Bo-Yu Chen"],"pdf_url":"https://arxiv.org/pdf/2412.06617v1.pdf","comment":"Accepted for the NeurIPS 2024 Creative AI Track"},{"id":"http://arxiv.org/abs/2412.06602v1","updated":"2024-12-09T15:50:25Z","published":"2024-12-09T15:50:25Z","title":"Towards Controllable Speech Synthesis in the Era of Large Language\n Models: A Survey","summary":" Text-to-speech (TTS), also known as speech synthesis, is a prominent research\narea that aims to generate natural-sounding human speech from text. Recently,\nwith the increasing industrial demand, TTS technologies have evolved beyond\nsynthesizing human-like speech to enabling controllable speech generation. This\nincludes fine-grained control over various attributes of synthesized speech\nsuch as emotion, prosody, timbre, and duration. Besides, advancements in deep\nlearning, such as diffusion and large language models, have significantly\nenhanced controllable TTS over the past several years. In this paper, we\nconduct a comprehensive survey of controllable TTS, covering approaches ranging\nfrom basic control techniques to methods utilizing natural language prompts,\naiming to provide a clear understanding of the current state of research. We\nexamine the general controllable TTS pipeline, challenges, model architectures,\nand control strategies, offering a comprehensive and clear taxonomy of existing\nmethods. Additionally, we provide a detailed summary of datasets and evaluation\nmetrics and shed some light on the applications and future directions of\ncontrollable TTS. To the best of our knowledge, this survey paper provides the\nfirst comprehensive review of emerging controllable TTS methods, which can\nserve as a beneficial resource for both academic researchers and industry\npractitioners.\n","authors":["Tianxin Xie","Yan Rong","Pengfei Zhang","Li Liu"],"pdf_url":"https://arxiv.org/pdf/2412.06602v1.pdf","comment":"A comprehensive survey on controllable TTS, 23 pages, 6 tables, 4\n figures, 280 references"},{"id":"http://arxiv.org/abs/2405.05691v2","updated":"2024-12-09T14:43:56Z","published":"2024-05-09T11:41:27Z","title":"StableMoFusion: Towards Robust and Efficient Diffusion-based Motion\n Generation Framework","summary":" Thanks to the powerful generative capacity of diffusion models, recent years\nhave witnessed rapid progress in human motion generation. Existing\ndiffusion-based methods employ disparate network architectures and training\nstrategies. The effect of the design of each component is still unclear. In\naddition, the iterative denoising process consumes considerable computational\noverhead, which is prohibitive for real-time scenarios such as virtual\ncharacters and humanoid robots. For this reason, we first conduct a\ncomprehensive investigation into network architectures, training strategies,\nand inference processs. Based on the profound analysis, we tailor each\ncomponent for efficient high-quality human motion generation. Despite the\npromising performance, the tailored model still suffers from foot skating which\nis an ubiquitous issue in diffusion-based solutions. 
To eliminate footskate, we\nidentify foot-ground contact and correct foot motions along the denoising\nprocess. By organically combining these well-designed components together, we\npresent StableMoFusion, a robust and efficient framework for human motion\ngeneration. Extensive experimental results show that our StableMoFusion\nperforms favorably against current state-of-the-art methods. Project page:\nhttps://h-y1heng.github.io/StableMoFusion-page/\n","authors":["Yiheng Huang","Hui Yang","Chuanchen Luo","Yuxi Wang","Shibiao Xu","Zhaoxiang Zhang","Man Zhang","Junran Peng"],"pdf_url":"https://arxiv.org/pdf/2405.05691v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.17335v2","updated":"2024-12-09T13:55:27Z","published":"2023-11-29T03:24:30Z","title":"Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and\n Baseline","summary":" Nowadays, short-form videos (SVs) are essential to web information\nacquisition and sharing in our daily life. The prevailing use of SVs to spread\nemotions leads to the necessity of conducting video emotion analysis (VEA)\ntowards SVs. Considering the lack of SVs emotion data, we introduce a\nlarge-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we\nalleviate the impact of subjectivities on labeling quality by emphasizing\nbetter personnel allocations and multi-stage annotations. In addition, we\nprovide the category-balanced and test-oriented variants through targeted data\nsampling. Some commonly used videos, such as facial expressions, have been well\nstudied. However, it is still challenging to analysis the emotions in SVs.\nSince the broader content diversity brings more distinct semantic gaps and\ndifficulties in learning emotion-related features, and there exists local\nbiases and collective information gaps caused by the emotion inconsistence\nunder the prevalently audio-visual co-expressions. To tackle these challenges,\nwe present an end-to-end audio-visual baseline AV-CANet which employs the video\ntransformer to better learn semantically relevant representations. We further\ndesign the Local-Global Fusion Module to progressively capture the correlations\nof audio-visual features. The EP-CE Loss is then introduced to guide model\noptimization. Extensive experimental results on seven datasets demonstrate the\neffectiveness of AV-CANet, while providing broad insights for future works.\nBesides, we investigate the key components of AV-CANet by ablation studies.\nDatasets and code will be fully open soon.\n","authors":["Xuecheng Wu","Heli Sun","Junxiao Xue","Jiayu Nie","Xiangyan Kong","Ruofan Zhai","Liang He"],"pdf_url":"https://arxiv.org/pdf/2311.17335v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06299v1","updated":"2024-12-09T08:44:19Z","published":"2024-12-09T08:44:19Z","title":"4D Gaussian Splatting with Scale-aware Residual Field and Adaptive\n Optimization for Real-time Rendering of Temporally Complex Dynamic Scenes","summary":" Reconstructing dynamic scenes from video sequences is a highly promising task\nin the multimedia domain. While previous methods have made progress, they often\nstruggle with slow rendering and managing temporal complexities such as\nsignificant motion and object appearance/disappearance. In this paper, we\npropose SaRO-GS as a novel dynamic scene representation capable of achieving\nreal-time rendering while effectively handling temporal complexities in dynamic\nscenes. 
To address the issue of slow rendering speed, we adopt a Gaussian\nprimitive-based representation and optimize the Gaussians in 4D space, which\nfacilitates real-time rendering with the assistance of 3D Gaussian Splatting.\nAdditionally, to handle temporally complex dynamic scenes, we introduce a\nScale-aware Residual Field. This field considers the size information of each\nGaussian primitive while encoding its residual feature and aligns with the\nself-splitting behavior of Gaussian primitives. Furthermore, we propose an\nAdaptive Optimization Schedule, which assigns different optimization strategies\nto Gaussian primitives based on their distinct temporal properties, thereby\nexpediting the reconstruction of dynamic regions. Through evaluations on\nmonocular and multi-view datasets, our method has demonstrated state-of-the-art\nperformance. Please see our project page at\nhttps://yjb6.github.io/SaRO-GS.github.io.\n","authors":["Jinbo Yan","Rui Peng","Luyang Tang","Ronggang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.06299v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06211v1","updated":"2024-12-09T05:15:44Z","published":"2024-12-09T05:15:44Z","title":"MSCrackMamba: Leveraging Vision Mamba for Crack Detection in Fused\n Multispectral Imagery","summary":" Crack detection is a critical task in structural health monitoring, aimed at\nassessing the structural integrity of bridges, buildings, and roads to prevent\npotential failures. Vision-based crack detection has become the mainstream\napproach due to its ease of implementation and effectiveness. Fusing infrared\n(IR) channels with red, green and blue (RGB) channels can enhance feature\nrepresentation and thus improve crack detection. However, IR and RGB channels\noften differ in resolution. To align them, higher-resolution RGB images\ntypically need to be downsampled to match the IR image resolution, which leads\nto the loss of fine details. Moreover, crack detection performance is\nrestricted by the limited receptive fields and high computational complexity of\ntraditional image segmentation networks. Inspired by the recently proposed\nMamba neural architecture, this study introduces a two-stage paradigm called\nMSCrackMamba, which leverages Vision Mamba along with a super-resolution\nnetwork to address these challenges. Specifically, to align IR and RGB\nchannels, we first apply super-resolution to IR channels to match the\nresolution of RGB channels for data fusion. Vision Mamba is then adopted as the\nbackbone network, while UperNet is employed as the decoder for crack detection.\nOur approach is validated on the large-scale Crack Detection dataset Crack900,\ndemonstrating an improvement of 3.55% in mIoU compared to the best-performing\nbaseline methods.\n","authors":["Qinfeng Zhu","Yuan Fang","Lei Fan"],"pdf_url":"https://arxiv.org/pdf/2412.06211v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06209v1","updated":"2024-12-09T05:04:50Z","published":"2024-12-09T05:04:50Z","title":"Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal\n Latent Alignment","summary":" How does audio describe the world around us? In this work, we propose a\nmethod for generating images of visual scenes from diverse in-the-wild sounds.\nThis cross-modal generation task is challenging due to the significant\ninformation gap between auditory and visual signals. 
We address this challenge\nby designing a model that aligns audio-visual modalities by enriching audio\nfeatures with visual information and translating them into the visual latent\nspace. These features are then fed into the pre-trained image generator to\nproduce images. To enhance image quality, we use sound source localization to\nselect audio-visual pairs with strong cross-modal correlations. Our method\nachieves substantially better results on the VEGAS and VGGSound datasets\ncompared to previous work and demonstrates control over the generation process\nthrough simple manipulations to the input waveform or latent space.\nFurthermore, we analyze the geometric properties of the learned embedding space\nand demonstrate that our learning approach effectively aligns audio-visual\nsignals for cross-modal generation. Based on this analysis, we show that our\nmethod is agnostic to specific design choices, showing its generalizability by\nintegrating various model architectures and different types of audio-visual\ndata.\n","authors":["Kim Sung-Bin","Arda Senocak","Hyunwoo Ha","Tae-Hyun Oh"],"pdf_url":"https://arxiv.org/pdf/2412.06209v1.pdf","comment":"Under-review"},{"id":"http://arxiv.org/abs/2412.06208v1","updated":"2024-12-09T04:58:49Z","published":"2024-12-09T04:58:49Z","title":"Pilot-guided Multimodal Semantic Communication for Audio-Visual Event\n Localization","summary":" Multimodal semantic communication, which integrates various data modalities\nsuch as text, images, and audio, significantly enhances communication\nefficiency and reliability. It has broad application prospects in fields such\nas artificial intelligence, autonomous driving, and smart homes. However,\ncurrent research primarily relies on analog channels and assumes constant\nchannel states (perfect CSI), which is inadequate for addressing dynamic\nphysical channels and noise in real-world scenarios. Existing methods often\nfocus on single modality tasks and fail to handle multimodal stream data, such\nas video and audio, and their corresponding tasks. Furthermore, current\nsemantic encoding and decoding modules mainly transmit single modality\nfeatures, neglecting the need for multimodal semantic enhancement and\nrecognition tasks.\n To address these challenges, this paper proposes a pilot-guided framework for\nmultimodal semantic communication specifically tailored for audio-visual event\nlocalization tasks. This framework utilizes digital pilot codes and channel\nmodules to guide the state of analog channels in real-wold scenarios and\ndesigns Euler-based multimodal semantic encoding and decoding that consider\ntime-frequency characteristics based on dynamic channel state. This approach\neffectively handles multimodal stream source data, especially for audio-visual\nevent localization tasks. Extensive numerical experiments demonstrate the\nrobustness of the proposed framework in channel changes and its support for\nvarious communication scenarios. 
The experimental results show that the\nframework outperforms existing benchmark methods in terms of Signal-to-Noise\nRatio (SNR), highlighting its advantage in semantic communication quality.\n","authors":["Fei Yu","Zhe Xiang","Nan Che","Zhuoran Zhang","Yuandi Li","Junxiao Xue","Zhiguo Wan"],"pdf_url":"https://arxiv.org/pdf/2412.06208v1.pdf","comment":null}]},"2024-12-08T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.06078v1","updated":"2024-12-08T21:55:12Z","published":"2024-12-08T21:55:12Z","title":"Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse\n GraphRAG","summary":" Recent advances have extended the context window of frontier LLMs\ndramatically, from a few thousand tokens up to millions, enabling entire books\nand codebases to fit into context. However, the compute costs of inferencing\nlong-context LLMs are massive and often prohibitive in practice. RAG offers an\nefficient and effective alternative: retrieve and process only the subset of\nthe context most important for the current task. Although promising, recent\nwork applying RAG to long-context tasks has two core limitations: 1) there has\nbeen little focus on making the RAG pipeline compute efficient, and 2) such\nworks only test on simple QA tasks, and their performance on more challenging\ntasks is unclear. To address this, we develop an algorithm based on PageRank, a\ngraph-based retrieval algorithm, which we call mixture-of-PageRanks (MixPR).\nMixPR uses a mixture of PageRank-based graph-retrieval algorithms implemented\nusing sparse matrices for efficent, cheap retrieval that can deal with a\nvariety of complex tasks. Our MixPR retriever achieves state-of-the-art results\nacross a wide range of long-context benchmark tasks, outperforming both\nexisting RAG methods, specialized retrieval architectures, and long-context\nLLMs despite being far more compute efficient. Due to using sparse embeddings,\nour retriever is extremely compute efficient, capable of embedding and\nretrieving millions of tokens within a few seconds and runs entirely on CPU.\n","authors":["Nicholas Alonso","Beren Millidge"],"pdf_url":"https://arxiv.org/pdf/2412.06078v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06069v1","updated":"2024-12-08T21:14:57Z","published":"2024-12-08T21:14:57Z","title":"Fuzzy Norm-Explicit Product Quantization for Recommender Systems","summary":" As the data resources grow, providing recommendations that best meet the\ndemands has become a vital requirement in business and life to overcome the\ninformation overload problem. However, building a system suggesting relevant\nrecommendations has always been a point of debate. One of the most\ncost-efficient techniques in terms of producing relevant recommendations at a\nlow complexity is Product Quantization (PQ). PQ approaches have continued\ndeveloping in recent years. This system's crucial challenge is improving\nproduct quantization performance in terms of recall measures without\ncompromising its complexity. This makes the algorithm suitable for problems\nthat require a greater number of potentially relevant items without\ndisregarding others, at high-speed and low-cost to keep up with traffic. This\nis the case of online shops where the recommendations for the purpose are\nimportant, although customers can be susceptible to scoping other products.\nThis research proposes a fuzzy approach to perform norm-based product\nquantization. 
Type-2 Fuzzy sets (T2FSs) define the codebook allowing\nsub-vectors (T2FSs) to be associated with more than one element of the\ncodebook, and next, its norm calculus is resolved by means of integration. Our\nmethod finesses the recall measure up, making the algorithm suitable for\nproblems that require querying at most possible potential relevant items\nwithout disregarding others. The proposed method outperforms all PQ approaches\nsuch as NEQ, PQ, and RQ up to +6%, +5%, and +8% by achieving a recall of 94%,\n69%, 59% in Netflix, Audio, Cifar60k datasets, respectively. More and over,\ncomputing time and complexity nearly equals the most computationally efficient\nexisting PQ method in the state-of-the-art.\n","authors":["Mohammadreza Jamalifard","Javier Andreu-Perez","Hani Hagras","Luis Martínez López"],"pdf_url":"https://arxiv.org/pdf/2412.06069v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.06009v1","updated":"2024-12-08T17:53:43Z","published":"2024-12-08T17:53:43Z","title":"1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval\n (LeSeR) for Regulatory Question Answering","summary":" This paper presents the system description of our entry for the COLING 2025\nRegNLP RIRAG (Regulatory Information Retrieval and Answer Generation)\nchallenge, focusing on leveraging advanced information retrieval and answer\ngeneration techniques in regulatory domains. We experimented with a combination\nof embedding models, including Stella, BGE, CDE, and Mpnet, and leveraged\nfine-tuning and reranking for retrieving relevant documents in top ranks. We\nutilized a novel approach, LeSeR, which achieved competitive results with a\nrecall@10 of 0.8201 and map@10 of 0.6655 for retrievals. This work highlights\nthe transformative potential of natural language processing techniques in\nregulatory applications, offering insights into their capabilities for\nimplementing a retrieval augmented generation system while identifying areas\nfor future improvement in robustness and domain adaptation.\n","authors":["Jebish Purbey","Drishti Sharma","Siddhant Gupta","Khawaja Murad","Siddartha Pullakhandam","Ram Mohan Rao Kadiyala"],"pdf_url":"https://arxiv.org/pdf/2412.06009v1.pdf","comment":"5 pages, Accepted to RegNLP @ COLING 2025"},{"id":"http://arxiv.org/abs/2409.02864v3","updated":"2024-12-08T15:45:30Z","published":"2024-09-04T16:43:14Z","title":"Language Model Powered Digital Biology with BRAD","summary":" Recent advancements in Large Language Models (LLMs) are transforming biology,\ncomputer science, engineering, and every day life. However, integrating the\nwide array of computational tools, databases, and scientific literature\ncontinues to pose a challenge to biological research. LLMs are well-suited for\nunstructured integration, efficient information retrieval, and automating\nstandard workflows and actions from these diverse resources. To harness these\ncapabilities in bioinformatics, we present a prototype Bioinformatics Retrieval\nAugmented Digital assistant (BRAD). BRAD is a chatbot and agentic system that\nintegrates a variety of bioinformatics tools. The Python package implements an\nAI \\texttt{Agent} that is powered by LLMs and connects to a local file system,\nonline databases, and a user's software. 
The \\texttt{Agent} is highly\nconfigurable, enabling tasks such as Retrieval-Augmented Generation, searches\nacross bioinformatics databases, and the execution of software pipelines.\nBRAD's coordinated integration of bioinformatics tools delivers a context-aware\nand semi-autonomous system that extends beyond the capabilities of conventional\nLLM-based chatbots. A graphical user interface (GUI) provides an intuitive\ninterface to the system.\n","authors":["Joshua Pickard","Ram Prakash","Marc Andrew Choi","Natalie Oliven","Cooper Stansbury","Jillian Cwycyshyn","Alex Gorodetsky","Alvaro Velasquez","Indika Rajapakse"],"pdf_url":"https://arxiv.org/pdf/2409.02864v3.pdf","comment":"12 pages, 3 figures, 1 table. See: https://github.com/Jpickard1/BRAD"},{"id":"http://arxiv.org/abs/2412.05937v1","updated":"2024-12-08T13:36:42Z","published":"2024-12-08T13:36:42Z","title":"Accelerating Manufacturing Scale-Up from Material Discovery Using\n Agentic Web Navigation and Retrieval-Augmented AI for Process Engineering\n Schematics Design","summary":" Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (PIDs)\nare critical tools for industrial process design, control, and safety. However,\nthe generation of precise and regulation-compliant diagrams remains a\nsignificant challenge, particularly in scaling breakthroughs from material\ndiscovery to industrial production in an era of automation and digitalization.\nThis paper introduces an autonomous agentic framework to address these\nchallenges through a twostage approach involving knowledge acquisition and\ngeneration. The framework integrates specialized sub-agents for retrieving and\nsynthesizing multimodal data from publicly available online sources and\nconstructs ontological knowledge graphs using a Graph Retrieval-Augmented\nGeneration (Graph RAG) paradigm. These capabilities enable the automation of\ndiagram generation and open-domain question answering (ODQA) tasks with high\ncontextual accuracy. Extensive empirical experiments demonstrate the frameworks\nability to deliver regulation-compliant diagrams with minimal expert\nintervention, highlighting its practical utility for industrial applications.\n","authors":["Sakhinana Sagar Srinivas","Akash Das","Shivam Gupta","Venkataramana Runkana"],"pdf_url":"https://arxiv.org/pdf/2412.05937v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05921v1","updated":"2024-12-08T12:31:32Z","published":"2024-12-08T12:31:32Z","title":"Learning Cluster Representatives for Approximate Nearest Neighbor Search","summary":" Developing increasingly efficient and accurate algorithms for approximate\nnearest neighbor search is a paramount goal in modern information retrieval. A\nprimary approach to addressing this question is clustering, which involves\npartitioning the dataset into distinct groups, with each group characterized by\na representative data point. By this method, retrieving the top-k data points\nfor a query requires identifying the most relevant clusters based on their\nrepresentatives -- a routing step -- and then conducting a nearest neighbor\nsearch within these clusters only, drastically reducing the search space.\n The objective of this thesis is not only to provide a comprehensive\nexplanation of clustering-based approximate nearest neighbor search but also to\nintroduce and delve into every aspect of our novel state-of-the-art method,\nwhich originated from a natural observation: The routing function solves a\nranking problem, making the function amenable to learning-to-rank. 
The\ndevelopment of this intuition and applying it to maximum inner product search\nhas led us to demonstrate that learning cluster representatives using a simple\nlinear function significantly boosts the accuracy of clustering-based\napproximate nearest neighbor search.\n","authors":["Thomas Vecchiato"],"pdf_url":"https://arxiv.org/pdf/2412.05921v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01269v3","updated":"2024-12-08T12:07:00Z","published":"2024-12-02T08:35:54Z","title":"CPRM: A LLM-based Continual Pre-training Framework for Relevance\n Modeling in Commercial Search","summary":" Relevance modeling between queries and items stands as a pivotal component in\ncommercial search engines, directly affecting the user experience. Given the\nremarkable achievements of large language models (LLMs) in various natural\nlanguage processing (NLP) tasks, LLM-based relevance modeling is gradually\nbeing adopted within industrial search systems. Nevertheless, foundational LLMs\nlack domain-specific knowledge and do not fully exploit the potential of\nin-context learning. Furthermore, structured item text remains underutilized,\nand there is a shortage in the supply of corresponding queries and background\nknowledge. We thereby propose CPRM (Continual Pre-training for Relevance\nModeling), a framework designed for the continual pre-training of LLMs to\naddress these issues. Our CPRM framework includes three modules: 1) employing\nboth queries and multi-field item to jointly pre-train for enhancing domain\nknowledge, 2) applying in-context pre-training, a novel approach where LLMs are\npre-trained on a sequence of related queries or items, and 3) conducting\nreading comprehension on items to produce associated domain knowledge and\nbackground information (e.g., generating summaries and corresponding queries)\nto further strengthen LLMs. Results on offline experiments and online A/B\ntesting demonstrate that our model achieves convincing performance compared to\nstrong baselines.\n","authors":["Kaixin Wu","Yixin Ji","Zeyuan Chen","Qiang Wang","Cunxiang Wang","Hong Liu","Baijun Ji","Jia Xu","Zhongyi Liu","Jinjie Gu","Yuan Zhou","Linjian Mo"],"pdf_url":"https://arxiv.org/pdf/2412.01269v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05868v1","updated":"2024-12-08T09:20:25Z","published":"2024-12-08T09:20:25Z","title":"Automated Extraction and Creation of FBS Design Reasoning Knowledge\n Graphs from Structured Data in Product Catalogues Lacking Contextual\n Information","summary":" Ontology-based knowledge graphs (KG) are desirable for effective knowledge\nmanagement and reuse in various decision making scenarios, including design.\nCreating and populating extensive KG based on specific ontological models can\nbe highly labour and time-intensive unless automated processes are developed\nfor knowledge extraction and graph creation. Most research and development on\nautomated extraction and creation of KG is based on extensive unstructured data\nsets that provide contextual information. However, some of the most useful\ninformation about the products and services of a company has traditionally been\nrecorded as structured data. Such structured data sets rarely follow a standard\nontology, do not capture explicit mapping of relationships between the\nentities, and provide no contextual information. Therefore, this research\nreports a method and digital workflow developed to address this gap. 
The\ndeveloped method and workflow employ rule-based techniques to extract and\ncreate a Function Behaviour-Structure (FBS) ontology-based KG from legacy\nstructured data, especially specification sheets and product catalogues. The\nsolution approach consists of two main components: a process for deriving\ncontext and context-based classification rules for FBS ontology concepts and a\nworkflow for populating and retrieving the FBS ontology-based KG. KG and\nNatural Language Processing (NLP) are used to automate knowledge extraction,\nrepresentation, and retrieval. The workflow's effectiveness is demonstrated via\npilot implementation in an industrial context. Insights gained from the pilot\nstudy are reported regarding the challenges and opportunities, including\ndiscussing the FBS ontology and concepts.\n","authors":["Vijayalaxmi Sahadevan","Sushil Mario","Yash Jaiswal","Divyanshu Bajpai","Vishal Singh","Hiralal Aggarwal","Suhas Suresh","Manjunath Maigur"],"pdf_url":"https://arxiv.org/pdf/2412.05868v1.pdf","comment":"31 pages, with 17 figures and 10 tables"},{"id":"http://arxiv.org/abs/2412.00424v2","updated":"2024-12-08T07:07:37Z","published":"2024-11-30T10:30:49Z","title":"FairSort: Learning to Fair Rank for Personalized Recommendations in\n Two-Sided Platforms","summary":" Traditional recommendation systems focus on maximizing user satisfaction by\nsuggesting their favourite items. This user-centric approach may lead to unfair\nexposure distribution among the providers. On the contrary, a provider-centric\ndesign might become unfair to the users. Therefore, this paper proposes a\nre-ranking model FairSort to find a trade-off solution among user-side\nfairness, provider-side fairness, and personalized recommendations utility.\nPrevious works habitually treat this issue as a knapsack problem, incorporating\nboth-side fairness as constraints.\n In this paper, we adopt a novel perspective, treating each recommendation\nlist as a runway rather than a knapsack. In this perspective, each item on the\nrunway gains a velocity and runs within a specific time, achieving re-ranking\nfor both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for\npersonalized recommendations by designing a Binary Search approach. This can\nprovide more reliable recommendations compared to the conventional greedy\nstrategy based on the knapsack problem. We further broaden the applicability of\nFairSort, designing two versions for online and offline recommendation\nscenarios. Theoretical analysis and extensive experiments on real-world\ndatasets indicate that FairSort can ensure more reliable personalized\nrecommendations while considering fairness for both the provider and user.\n","authors":["Guoli Wu","Zhiyong Feng","Shizhan Chen","Hongyue Wu","Xiao Xue","Jianmao Xiao","Guodong Fan","Hongqi Chen","Jingyu Li"],"pdf_url":"https://arxiv.org/pdf/2412.00424v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.06001v1","updated":"2024-12-08T17:23:03Z","published":"2024-12-08T17:23:03Z","title":"M6: Multi-generator, Multi-domain, Multi-lingual and cultural,\n Multi-genres, Multi-instrument Machine-Generated Music Detection Databases","summary":" Machine-generated music (MGM) has emerged as a powerful tool with\napplications in music therapy, personalised editing, and creative inspiration\nfor the music community. However, its unregulated use threatens the\nentertainment, education, and arts sectors by diminishing the value of\nhigh-quality human compositions. 
Detecting machine-generated music (MGMD) is,\ntherefore, critical to safeguarding these domains, yet the field lacks\ncomprehensive datasets to support meaningful progress. To address this gap, we\nintroduce \\textbf{M6}, a large-scale benchmark dataset tailored for MGMD\nresearch. M6 is distinguished by its diversity, encompassing multiple\ngenerators, domains, languages, cultural contexts, genres, and instruments. We\noutline our methodology for data selection and collection, accompanied by\ndetailed data analysis, providing all WAV form of music. Additionally, we\nprovide baseline performance scores using foundational binary classification\nmodels, illustrating the complexity of MGMD and the significant room for\nimprovement. By offering a robust and multifaceted resource, we aim to empower\nfuture research to develop more effective detection methods for MGM. We believe\nM6 will serve as a critical step toward addressing this societal challenge. The\ndataset and code will be freely available to support open collaboration and\ninnovation in this field.\n","authors":["Yupei Li","Hanqian Li","Lucia Specia","Björn W. Schuller"],"pdf_url":"https://arxiv.org/pdf/2412.06001v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05831v1","updated":"2024-12-08T06:37:27Z","published":"2024-12-08T06:37:27Z","title":"Semi-Supervised Contrastive Learning for Controllable Video-to-Music\n Retrieval","summary":" Content creators often use music to enhance their videos, from soundtracks in\nmovies to background music in video blogs and social media content. However,\nidentifying the best music for a video can be a difficult and time-consuming\ntask. To address this challenge, we propose a novel framework for automatically\nretrieving a matching music clip for a given video, and vice versa. Our\napproach leverages annotated music labels, as well as the inherent artistic\ncorrespondence between visual and music elements. Distinct from previous\ncross-modal music retrieval works, our method combines both self-supervised and\nsupervised training objectives. We use self-supervised and label-supervised\ncontrastive learning to train a joint embedding space between music and video.\nWe show the effectiveness of our approach by using music genre labels for the\nsupervised training component, and our framework can be generalized to other\nmusic annotations (e.g., emotion, instrument, etc.). Furthermore, our method\nenables fine-grained control over how much the retrieval process focuses on\nself-supervised vs. label information at inference time. We evaluate the\nlearned embeddings through a variety of video-to-music and music-to-video\nretrieval tasks. Our experiments show that the proposed approach successfully\ncombines self-supervised and supervised objectives and is effective for\ncontrollable music-video retrieval.\n","authors":["Shanti Stewart","Gouthaman KV","Lie Lu","Andrea Fanelli"],"pdf_url":"https://arxiv.org/pdf/2412.05831v1.pdf","comment":"4 pages + 1 reference page, 2 figures, 2 tables. Under review"},{"id":"http://arxiv.org/abs/2412.05818v1","updated":"2024-12-08T05:28:08Z","published":"2024-12-08T05:28:08Z","title":"SILMM: Self-Improving Large Multimodal Models for Compositional\n Text-to-Image Generation","summary":" Large Multimodal Models (LMMs) have demonstrated impressive capabilities in\nmultimodal understanding and generation, pushing forward advancements in\ntext-to-image generation. 
However, achieving accurate text-image alignment for\nLMMs, particularly in compositional scenarios, remains challenging. Existing\napproaches, such as layout planning for multi-step generation and learning from\nhuman feedback or AI feedback, depend heavily on prompt engineering, costly\nhuman annotations, and continual upgrading, limiting flexibility and\nscalability. In this work, we introduce a model-agnostic iterative\nself-improvement framework (SILMM) that can enable LMMs to provide helpful and\nscalable self-feedback and optimize text-image alignment via Direct Preference\nOptimization (DPO). DPO can readily applied to LMMs that use discrete visual\ntokens as intermediate image representations; while it is less suitable for\nLMMs with continuous visual features, as obtaining generation probabilities is\nchallenging. To adapt SILMM to LMMs with continuous features, we propose a\ndiversity mechanism to obtain diverse representations and a kernel-based\ncontinuous DPO for alignment. Extensive experiments on three compositional\ntext-to-image generation benchmarks validate the effectiveness and superiority\nof SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20%\non DPG-Bench.\n","authors":["Leigang Qu","Haochuan Li","Wenjie Wang","Xiang Liu","Juncheng Li","Liqiang Nie","Tat-Seng Chua"],"pdf_url":"https://arxiv.org/pdf/2412.05818v1.pdf","comment":"project page: https://silmm.github.io/"},{"id":"http://arxiv.org/abs/2308.12610v3","updated":"2024-12-08T05:26:25Z","published":"2023-08-24T07:20:47Z","title":"Emotion-Aligned Contrastive Learning Between Images and Music","summary":" Traditional music search engines rely on retrieval methods that match natural\nlanguage queries with music metadata. There have been increasing efforts to\nexpand retrieval methods to consider the audio characteristics of music itself,\nusing queries of various modalities including text, video, and speech. While\nmost approaches aim to match general music semantics to the input queries, only\na few focus on affective qualities. In this work, we address the task of\nretrieving emotionally-relevant music from image queries by learning an\naffective alignment between images and music audio. Our approach focuses on\nlearning an emotion-aligned joint embedding space between images and music.\nThis embedding space is learned via emotion-supervised contrastive learning,\nusing an adapted cross-modal version of the SupCon loss. We evaluate the joint\nembeddings through cross-modal retrieval tasks (image-to-music and\nmusic-to-image) based on emotion labels. Furthermore, we investigate the\ngeneralizability of the learned music embeddings via automatic music tagging.\nOur experiments show that the proposed approach successfully aligns images and\nmusic, and that the learned embedding space is effective for cross-modal\nretrieval applications.\n","authors":["Shanti Stewart","Kleanthis Avramidis","Tiantian Feng","Shrikanth Narayanan"],"pdf_url":"https://arxiv.org/pdf/2308.12610v3.pdf","comment":"Published at ICASSP 2024. Code:\n https://github.com/shantistewart/Emo-CLIM"},{"id":"http://arxiv.org/abs/2412.05808v1","updated":"2024-12-08T04:09:14Z","published":"2024-12-08T04:09:14Z","title":"SizeGS: Size-aware Compression of 3D Gaussians with Hierarchical Mixed\n Precision Quantization","summary":" Effective compression technology is crucial for 3DGS to adapt to varying\nstorage and transmission conditions. However, existing methods fail to address\nsize constraints while maintaining optimal quality. 
In this paper, we introduce\nSizeGS, a framework that compresses 3DGS within a specified size budget while\noptimizing visual quality. We start with a size estimator to establish a clear\nrelationship between file size and hyperparameters. Leveraging this estimator,\nwe incorporate mixed precision quantization (MPQ) into 3DGS attributes,\nstructuring MPQ in two hierarchical level -- inter-attribute and\nintra-attribute -- to optimize visual quality under the size constraint. At the\ninter-attribute level, we assign bit-widths to each attribute channel by\nformulating the combinatorial optimization as a 0-1 integer linear program,\nwhich can be efficiently solved. At the intra-attribute level, we divide each\nattribute channel into blocks of vectors, quantizing each vector based on the\noptimal bit-width derived at the inter-attribute level. Dynamic programming\ndetermines block lengths. Using the size estimator and MPQ, we develop a\ncalibrated algorithm to identify optimal hyperparameters in just 10 minutes,\nachieving a 1.69$\\times$ efficiency increase with quality comparable to\nstate-of-the-art methods.\n","authors":["Shuzhao Xie","Jiahang Liu","Weixiang Zhang","Shijia Ge","Sicheng Pan","Chen Tang","Yunpeng Bai","Zhi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05808v1.pdf","comment":"Automatically compressing 3DGS into the desired file size while\n maximizing the visual quality"},{"id":"http://arxiv.org/abs/2409.09403v2","updated":"2024-12-08T03:25:51Z","published":"2024-09-14T10:27:36Z","title":"AI-Driven Virtual Teacher for Enhanced Educational Efficiency:\n Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction","summary":" Students frequently make mistakes while solving mathematical problems, and\ntraditional error correction methods are both time-consuming and\nlabor-intensive. This paper introduces an innovative \\textbf{V}irtual\n\\textbf{A}I \\textbf{T}eacher system designed to autonomously analyze and\ncorrect student \\textbf{E}rrors (VATE). Leveraging advanced large language\nmodels (LLMs), the system uses student drafts as a primary source for error\nanalysis, which enhances understanding of the student's learning process. It\nincorporates sophisticated prompt engineering and maintains an error pool to\nreduce computational overhead. The AI-driven system also features a real-time\ndialogue component for efficient student interaction. Our approach demonstrates\nsignificant advantages over traditional and machine learning-based error\ncorrection methods, including reduced educational costs, high scalability, and\nsuperior generalizability. The system has been deployed on the Squirrel AI\nlearning platform for elementary mathematics education, where it achieves\n78.3\\% accuracy in error analysis and shows a marked improvement in student\nlearning efficiency. 
Satisfaction surveys indicate a strong positive reception,\nhighlighting the system's potential to transform educational practices.\n","authors":["Tianlong Xu","Yi-Fan Zhang","Zhendong Chu","Shen Wang","Qingsong Wen"],"pdf_url":"https://arxiv.org/pdf/2409.09403v2.pdf","comment":"AAAI/IAAI 2025 Innovative Application Award"}]},"2024-12-07T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2310.14379v2","updated":"2024-12-07T19:03:35Z","published":"2023-10-22T18:22:35Z","title":"The Impact of User-Level Explanation Properties on Explanation Goals in\n Recommender Systems","summary":" Explanations are crucial for improving users' transparency, persuasiveness,\nengagement, and trust in Recommender Systems (RSs) by connecting interacted\nitems to recommended items based on shared attributes. However, evaluating the\neffectiveness of explanation algorithms regarding those goals offline remains\nchallenging due to their subjectiveness. This paper investigates the impact of\nuser-level explanation properties, such as diversity and popularity of\nattributes, on the user perception of explanation goals. In an offline setting,\nwe used metrics adapted from ranking to evaluate the characteristics of\nexplanations generated by three state-of-the-art post-hoc explanation\nalgorithms, based on the items and properties used to form the explanation\nsentence, across six recommendation systems. We compared the offline metrics\nresults with those of an online user study. The findings highlight a trade-off\nbetween the goals of transparency and trust, which are related to popular\nproperties, and the goals of engagement and persuasiveness, which are\nassociated with the diversification of properties displayed to users.\nFurthermore, the study contributes to developing more robust evaluation methods\nfor explanation algorithms in RSs.\n","authors":["André Levi Zanon","Marcelo Garcia Manzato","Leonardo Rocha"],"pdf_url":"https://arxiv.org/pdf/2310.14379v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05710v1","updated":"2024-12-07T17:51:31Z","published":"2024-12-07T17:51:31Z","title":"PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic\n Languages with Example Selection from Related Example Banks","summary":" Large Language Models (LLMs) have recently demonstrated impressive few-shot\nlearning capabilities through in-context learning (ICL). However, ICL\nperformance is highly dependent on the choice of few-shot demonstrations,\nmaking the selection of the most optimal examples a persistent research\nchallenge. This issue is further amplified in low-resource Indic languages,\nwhere the scarcity of ground-truth data complicates the selection process. In\nthis work, we propose PromptRefine, a novel Alternating Minimization approach\nfor example selection that improves ICL performance on low-resource Indic\nlanguages. PromptRefine leverages auxiliary example banks from related\nhigh-resource Indic languages and employs multi-task learning techniques to\nalign language-specific retrievers, enabling effective cross-language\nretrieval. Additionally, we incorporate diversity in the selected examples to\nenhance generalization and reduce bias. 
Through comprehensive evaluations on\nfour text generation tasks -- Cross-Lingual Question Answering, Multilingual\nQuestion Answering, Machine Translation, and Cross-Lingual Summarization using\nstate-of-the-art LLMs such as LLAMA-3.1-8B, LLAMA-2-7B, Qwen-2-7B, and\nQwen-2.5-7B, we demonstrate that PromptRefine significantly outperforms\nexisting frameworks for retrieving examples.\n","authors":["Soumya Suvra Ghosal","Soumyabrata Pal","Koyel Mukherjee","Dinesh Manocha"],"pdf_url":"https://arxiv.org/pdf/2412.05710v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05708v1","updated":"2024-12-07T17:43:21Z","published":"2024-12-07T17:43:21Z","title":"On the effective transfer of knowledge from English to Hindi Wikipedia","summary":" Although Wikipedia is the largest multilingual encyclopedia, it remains\ninherently incomplete. There is a significant disparity in the quality of\ncontent between high-resource languages (HRLs, e.g., English) and low-resource\nlanguages (LRLs, e.g., Hindi), with many LRL articles lacking adequate\ninformation. To bridge these content gaps, we propose a lightweight framework\nto enhance knowledge equity between English and Hindi. In case the English\nWikipedia page is not up-to-date, our framework extracts relevant information\nfrom external resources readily available (such as English books) and adapts it\nto align with Wikipedia's distinctive style, including its \\textit{neutral\npoint of view} (NPOV) policy, using in-context learning capabilities of large\nlanguage models. The adapted content is then machine-translated into Hindi for\nintegration into the corresponding Wikipedia articles. On the other hand, if\nthe English version is comprehensive and up-to-date, the framework directly\ntransfers knowledge from English to Hindi. Our framework effectively generates\nnew content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles\nrespectively by 65% and 62% according to automatic and human judgment-based\nevaluations.\n","authors":["Paramita Das","Amartya Roy","Ritabrata Chakraborty","Animesh Mukherjee"],"pdf_url":"https://arxiv.org/pdf/2412.05708v1.pdf","comment":"accepted at COLING Industry Track 2025"},{"id":"http://arxiv.org/abs/2409.15346v2","updated":"2024-12-07T16:00:31Z","published":"2024-09-10T13:46:14Z","title":"Big data searching using words","summary":" Big data analytics is one of the most promising areas of new research and\ndevelopment in computer science, enterprises, e-commerce, and defense. For many\norganizations, big data is regarded as one of their most important strategic\nassets. This explosive growth has made it necessary to develop effective\ntechniques for examining and analyzing big data from a mathematical\nperspective. Among various methods of analyzing big data, topological data\nanalysis (TDA) is now considered one of the useful tools. However, there is no\nfundamental concept related to topological structure in big data. In this\npaper, we introduce some fundamental ideas related to the neighborhood\nstructure of words in data searching, which can be extended to form important\ntopological structures of big data in the future. 
Additionally, we introduce\nbig data primal in big data searching and discuss the application of\nneighborhood structures in detecting anomalies in data searching using the\nJaccard similarity coefficient.\n","authors":["Santanu Acharjee","Ripunjoy Choudhury"],"pdf_url":"https://arxiv.org/pdf/2409.15346v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05547v1","updated":"2024-12-07T05:49:14Z","published":"2024-12-07T05:49:14Z","title":"KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large\n Language Models","summary":" Large language models with retrieval-augmented generation encounter a pivotal\nchallenge in intricate retrieval tasks, e.g., multi-hop question answering,\nwhich requires the model to navigate across multiple documents and generate\ncomprehensive responses based on fragmented information. To tackle this\nchallenge, we introduce a novel Knowledge Graph-based RAG framework with a\nhierarchical knowledge retriever, termed KG-Retriever. The retrieval indexing\nin KG-Retriever is constructed on a hierarchical index graph that consists of a\nknowledge graph layer and a collaborative document layer. The associative\nnature of graph structures is fully utilized to strengthen intra-document and\ninter-document connectivity, thereby fundamentally alleviating the information\nfragmentation problem and meanwhile improving the retrieval efficiency in\ncross-document retrieval of LLMs. With the coarse-grained collaborative\ninformation from neighboring documents and concise information from the\nknowledge graph, KG-Retriever achieves marked improvements on five public QA\ndatasets, showing the effectiveness and efficiency of our proposed RAG\nframework.\n","authors":["Weijie Chen","Ting Bai","Jinbo Su","Jian Luan","Wei Liu","Chuan Shi"],"pdf_url":"https://arxiv.org/pdf/2412.05547v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05543v1","updated":"2024-12-07T05:37:00Z","published":"2024-12-07T05:37:00Z","title":"ULMRec: User-centric Large Language Model for Sequential Recommendation","summary":" Recent advances in Large Language Models (LLMs) have demonstrated promising\nperformance in sequential recommendation tasks, leveraging their superior\nlanguage understanding capabilities. However, existing LLM-based recommendation\napproaches predominantly focus on modeling item-level co-occurrence patterns\nwhile failing to adequately capture user-level personalized preferences. This\nis problematic since even users who display similar behavioral patterns (e.g.,\nclicking or purchasing similar items) may have fundamentally different\nunderlying interests. To alleviate this problem, in this paper, we propose\nULMRec, a framework that effectively integrates user personalized preferences\ninto LLMs for sequential recommendation. Considering there has the semantic gap\nbetween item IDs and LLMs, we replace item IDs with their corresponding titles\nin user historical behaviors, enabling the model to capture the item semantics.\nFor integrating the user personalized preference, we design two key components:\n(1) user indexing: a personalized user indexing mechanism that leverages vector\nquantization on user reviews and user IDs to generate meaningful and unique\nuser representations, and (2) alignment tuning: an alignment-based tuning stage\nthat employs comprehensive preference alignment tasks to enhance the model's\ncapability in capturing personalized information. 
Through this design, ULMRec\nachieves deep integration of language semantics with user personalized\npreferences, facilitating effective adaptation to recommendation. Extensive\nexperiments on two public datasets demonstrate that ULMRec significantly\noutperforms existing methods, validating the effectiveness of our approach.\n","authors":["Minglai Shao","Hua Huang","Qiyao Peng","Hongtao Liu"],"pdf_url":"https://arxiv.org/pdf/2412.05543v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.07335v2","updated":"2024-12-07T16:56:16Z","published":"2024-11-11T19:53:05Z","title":"Multimodal Fusion Balancing Through Game-Theoretic Regularization","summary":" Multimodal learning can complete the picture of information extraction by\nuncovering key dependencies between data sources. However, current systems fail\nto fully leverage multiple modalities for optimal performance. This has been\nattributed to modality competition, where modalities strive for training\nresources, leaving some underoptimized. We show that current balancing methods\nstruggle to train multimodal models that surpass even simple baselines, such as\nensembles. This raises the question: how can we ensure that all modalities in\nmultimodal training are sufficiently trained, and that learning from new\nmodalities consistently improves performance? This paper proposes the\nMultimodal Competition Regularizer (MCR), a new loss component inspired by\nmutual information (MI) decomposition designed to prevent the adverse effects\nof competition in multimodal training. Our key contributions are: 1)\nIntroducing game-theoretic principles in multimodal learning, where each\nmodality acts as a player competing to maximize its influence on the final\noutcome, enabling automatic balancing of the MI terms. 2) Refining lower and\nupper bounds for each MI term to enhance the extraction of task-relevant unique\nand shared information across modalities. 3) Suggesting latent space\npermutations for conditional MI estimation, significantly improving\ncomputational efficiency. MCR outperforms all previously suggested training\nstrategies and is the first to consistently improve multimodal learning beyond\nthe ensemble baseline, clearly demonstrating that combining modalities leads to\nsignificant performance gains on both synthetic and large real-world datasets.\n","authors":["Konstantinos Kontras","Thomas Strypsteen","Christos Chatzichristos","Paul Pu Liang","Matthew Blaschko","Maarten De Vos"],"pdf_url":"https://arxiv.org/pdf/2411.07335v2.pdf","comment":"21 pages, 6 figures, 4 tables, 1 algorithm"},{"id":"http://arxiv.org/abs/2412.05694v1","updated":"2024-12-07T16:43:02Z","published":"2024-12-07T16:43:02Z","title":"Combining Genre Classification and Harmonic-Percussive Features with\n Diffusion Models for Music-Video Generation","summary":" This study presents a novel method for generating music visualisers using\ndiffusion models, combining audio input with user-selected artwork. The process\ninvolves two main stages: image generation and video creation. First, music\ncaptioning and genre classification are performed, followed by the retrieval of\nartistic style descriptions. A diffusion model then generates images based on\nthe user's input image and the derived artistic style descriptions. The video\ngeneration stage utilises the same diffusion model to interpolate frames,\ncontrolled by audio energy vectors derived from key musical features of\nharmonics and percussives. 
The method demonstrates promising results across\nvarious genres, and a new metric, Audio-Visual Synchrony (AVS), is introduced\nto quantitatively evaluate the synchronisation between visual and audio\nelements. Comparative analysis shows significantly higher AVS values for videos\ngenerated using the proposed method with audio energy vectors, compared to\nlinear interpolation. This approach has potential applications in diverse\nfields, including independent music video creation, film production, live music\nevents, and enhancing audio-visual experiences in public spaces.\n","authors":["Leonardo Pina","Yongmin Li"],"pdf_url":"https://arxiv.org/pdf/2412.05694v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05558v1","updated":"2024-12-07T06:43:39Z","published":"2024-12-07T06:43:39Z","title":"WavFusion: Towards wav2vec 2.0 Multimodal Speech Emotion Recognition","summary":" Speech emotion recognition (SER) remains a challenging yet crucial task due\nto the inherent complexity and diversity of human emotions. To address this\nproblem, researchers attempt to fuse information from other modalities via\nmultimodal learning. However, existing multimodal fusion techniques often\noverlook the intricacies of cross-modal interactions, resulting in suboptimal\nfeature representations. In this paper, we propose WavFusion, a multimodal\nspeech emotion recognition framework that addresses critical research problems\nin effective multimodal fusion, heterogeneity among modalities, and\ndiscriminative representation learning. By leveraging a gated cross-modal\nattention mechanism and multimodal homogeneous feature discrepancy learning,\nWavFusion demonstrates improved performance over existing state-of-the-art\nmethods on benchmark datasets. Our work highlights the importance of capturing\nnuanced cross-modal interactions and learning discriminative representations\nfor accurate multimodal SER. Experimental results on two benchmark datasets\n(IEMOCAP and MELD) demonstrate that WavFusion succeeds over the\nstate-of-the-art strategies on emotion recognition.\n","authors":["Feng Li","Jiusong Luo","Wanjun Xia"],"pdf_url":"https://arxiv.org/pdf/2412.05558v1.pdf","comment":"Accepted by 31st International Conference on MultiMedia Modeling\n (MMM2025)"},{"id":"http://arxiv.org/abs/2412.05487v1","updated":"2024-12-07T01:17:21Z","published":"2024-12-07T01:17:21Z","title":"Securing Social Media Against Deepfakes using Identity, Behavioral, and\n Geometric Signatures","summary":" Trust in social media is a growing concern due to its ability to influence\nsignificant societal changes. However, this space is increasingly compromised\nby various types of deepfake multimedia, which undermine the authenticity of\nshared content. Although substantial efforts have been made to address the\nchallenge of deepfake content, existing detection techniques face a major\nlimitation in generalization: they tend to perform well only on specific types\nof deepfakes they were trained on.This dependency on recognizing specific\ndeepfake artifacts makes current methods vulnerable when applied to unseen or\nvaried deepfakes, thereby compromising their performance in real-world\napplications such as social media platforms. 
To address the generalizability of\ndeepfake detection, there is a need for a holistic approach that can capture a\nbroader range of facial attributes and manipulations beyond isolated artifacts.\nTo address this, we propose a novel deepfake detection framework featuring an\neffective feature descriptor that integrates Deep identity, Behavioral, and\nGeometric (DBaG) signatures, along with a classifier named DBaGNet.\nSpecifically, the DBaGNet classifier utilizes the extracted DBaG signatures,\nleveraging a triplet loss objective to enhance generalized representation\nlearning for improved classification. Specifically, the DBaGNet classifier\nutilizes the extracted DBaG signatures and applies a triplet loss objective to\nenhance generalized representation learning for improved classification. To\ntest the effectiveness and generalizability of our proposed approach, we\nconduct extensive experiments using six benchmark deepfake datasets: WLDR,\nCelebDF, DFDC, FaceForensics++, DFD, and NVFAIR. Specifically, to ensure the\neffectiveness of our approach, we perform cross-dataset evaluations, and the\nresults demonstrate significant performance gains over several state-of-the-art\nmethods.\n","authors":["Muhammad Umar Farooq","Awais Khan","Ijaz Ul Haq","Khalid Mahmood Malik"],"pdf_url":"https://arxiv.org/pdf/2412.05487v1.pdf","comment":null}]},"2024-12-06T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2411.18814v2","updated":"2024-12-06T23:09:05Z","published":"2024-11-27T23:36:59Z","title":"Unifying Generative and Dense Retrieval for Sequential Recommendation","summary":" Sequential dense retrieval models utilize advanced sequence learning\ntechniques to compute item and user representations, which are then used to\nrank relevant items for a user through inner product computation between the\nuser and all item representations. However, this approach requires storing a\nunique representation for each item, resulting in significant memory\nrequirements as the number of items grow. In contrast, the recently proposed\ngenerative retrieval paradigm offers a promising alternative by directly\npredicting item indices using a generative model trained on semantic IDs that\nencapsulate items' semantic information. Despite its potential for large-scale\napplications, a comprehensive comparison between generative retrieval and\nsequential dense retrieval under fair conditions is still lacking, leaving open\nquestions regarding performance, and computation trade-offs. To address this,\nwe compare these two approaches under controlled conditions on academic\nbenchmarks and propose LIGER (LeveragIng dense retrieval for GEnerative\nRetrieval), a hybrid model that combines the strengths of these two widely used\nmethods. LIGER integrates sequential dense retrieval into generative retrieval,\nmitigating performance differences and enhancing cold-start item recommendation\nin the datasets evaluated. 
This hybrid approach provides insights into the\ntrade-offs between these approaches and demonstrates improvements in efficiency\nand effectiveness for recommendation systems in small-scale benchmarks.\n","authors":["Liu Yang","Fabian Paischer","Kaveh Hassani","Jiacheng Li","Shuai Shao","Zhang Gabriel Li","Yun He","Xue Feng","Nima Noorshams","Sem Park","Bo Long","Robert D Nowak","Xiaoli Gao","Hamid Eghbalzadeh"],"pdf_url":"https://arxiv.org/pdf/2411.18814v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05447v1","updated":"2024-12-06T22:05:39Z","published":"2024-12-06T22:05:39Z","title":"A Graph-Based Approach for Conversational AI-Driven Personal Memory\n Capture and Retrieval in a Real-world Application","summary":" TOBU is a novel mobile application that captures and retrieves `personal\nmemories' (pictures/videos together with stories and context around those\nmoments) in a user-engaging AI-guided conversational approach. Our initial\nprototype showed that existing retrieval techniques such as retrieval-augmented\ngeneration (RAG) systems fall short due to their limitations in understanding\nmemory relationships, causing low recall, hallucination, and unsatisfactory\nuser experience. We design TOBUGraph, a novel graph-based retrieval approach.\nDuring capturing, TOBUGraph leverages large language models (LLMs) to\nautomatically create a dynamic knowledge graph of memories, establishing\ncontext and relationships of those memories. During retrieval, TOBUGraph\ncombines LLMs with the memory graph to achieve comprehensive recall through\ngraph traversal. Our evaluation using real user data demonstrates that\nTOBUGraph outperforms multiple RAG implementations in both precision and\nrecall, significantly improving user experience through improved retrieval\naccuracy and reduced hallucination.\n","authors":["Savini Kashmira","Jayanaka L. Dantanarayana","Joshua Brodsky","Ashish Mahendra","Yiping Kang","Krisztian Flautner","Lingjia Tang","Jason Mars"],"pdf_url":"https://arxiv.org/pdf/2412.05447v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05206v1","updated":"2024-12-06T17:35:52Z","published":"2024-12-06T17:35:52Z","title":"ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented\n Argumentation with LLM Judges","summary":" Computational argumentation, which involves generating answers or summaries\nfor controversial topics like abortion bans and vaccination, has become\nincreasingly important in today's polarized environment. Sophisticated LLM\ncapabilities offer the potential to provide nuanced, evidence-based answers to\nsuch questions through Retrieval-Augmented Argumentation (RAArg), leveraging\nreal-world evidence for high-quality, grounded arguments. However, evaluating\nRAArg remains challenging, as human evaluation is costly and difficult for\ncomplex, lengthy answers on complicated topics. At the same time, re-using\nexisting argumentation datasets is no longer sufficient, as they lack long,\ncomplex arguments and realistic evidence from potentially misleading sources,\nlimiting holistic evaluation of retrieval effectiveness and argument quality.\nTo address these gaps, we investigate automated evaluation methods using\nmultiple fine-grained LLM judges, providing better and more interpretable\nassessments than traditional single-score metrics and even previously reported\nhuman crowdsourcing. 
To validate the proposed techniques, we introduce ConQRet,\na new benchmark featuring long and complex human-authored arguments on debated\ntopics, grounded in real-world websites, allowing an exhaustive evaluation\nacross retrieval effectiveness, argument quality, and groundedness. We validate\nour LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed\nLLM Judges and the ConQRet benchmark can enable rapid progress in computational\nargumentation and can be naturally extended to other complex\nretrieval-augmented generation tasks.\n","authors":["Kaustubh D. Dhole","Kai Shu","Eugene Agichtein"],"pdf_url":"https://arxiv.org/pdf/2412.05206v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09439v2","updated":"2024-12-06T12:09:15Z","published":"2024-08-18T11:07:38Z","title":"Towards Boosting LLMs-driven Relevance Modeling with Progressive\n Retrieved Behavior-augmented Prompting","summary":" Relevance modeling is a critical component for enhancing user experience in\nsearch engines, with the primary objective of identifying items that align with\nusers' queries. Traditional models only rely on the semantic congruence between\nqueries and items to ascertain relevance. However, this approach represents\nmerely one aspect of the relevance judgement, and is insufficient in isolation.\nEven powerful Large Language Models (LLMs) still cannot accurately judge the\nrelevance of a query and an item from a semantic perspective. To augment\nLLMs-driven relevance modeling, this study proposes leveraging user\ninteractions recorded in search logs to yield insights into users' implicit\nsearch intentions. The challenge lies in the effective prompting of LLMs to\ncapture dynamic search intentions, which poses several obstacles in real-world\nrelevance scenarios, i.e., the absence of domain-specific knowledge, the\ninadequacy of an isolated prompt, and the prohibitive costs associated with\ndeploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved\nBehavior-augmented Prompting framework for integrating search scenario-oriented\nknowledge with LLMs effectively. Specifically, we perform the user-driven\nbehavior neighbors retrieval from the daily search logs to obtain\ndomain-specific knowledge in time, retrieving candidates that users consider to\nmeet their expectations. Then, we guide LLMs for relevance modeling by\nemploying advanced prompting techniques that progressively improve the outputs\nof the LLMs, followed by a progressive aggregation with comprehensive\nconsideration of diverse aspects. For online serving, we have developed an\nindustrial application framework tailored for the deployment of LLMs in\nrelevance modeling. Experiments on real-world industry data and online A/B\ntesting demonstrate our proposal achieves promising performance.\n","authors":["Zeyuan Chen","Haiyan Wu","Kaixin Wu","Wei Chen","Mingjie Zhong","Jia Xu","Zhongyi Liu","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.09439v2.pdf","comment":"Accepted By COLING 2025"},{"id":"http://arxiv.org/abs/2412.04846v1","updated":"2024-12-06T08:33:49Z","published":"2024-12-06T08:33:49Z","title":"eXpath: Explaining Knowledge Graph Link Prediction with Ontological\n Closed Path Rules","summary":" Link prediction (LP) is crucial for Knowledge Graphs (KG) completion but\ncommonly suffers from interpretability issues. 
While several methods have been\nproposed to explain embedding-based LP models, they are generally limited to\nlocal explanations on KG and are deficient in providing human interpretable\nsemantics. Based on real-world observations of the characteristics of KGs from\nmultiple domains, we propose to explain LP models in KG with path-based\nexplanations. An integrated framework, namely eXpath, is introduced which\nincorporates the concept of relation path with ontological closed path rules to\nenhance both the efficiency and effectiveness of LP interpretation. Notably,\nthe eXpath explanations can be fused with other single-link explanation\napproaches to achieve a better overall solution. Extensive experiments across\nbenchmark datasets and LP models demonstrate that introducing eXpath can boost\nthe quality of resulting explanations by about 20% on two key metrics and\nreduce the required explanation time by 61.4%, in comparison to the best\nexisting method. Case studies further highlight eXpath's ability to provide\nmore semantically meaningful explanations through path-based evidence.\n","authors":["Ye Sun","Lei Shi","Yongxin Tong"],"pdf_url":"https://arxiv.org/pdf/2412.04846v1.pdf","comment":"13 pages, 5 figures. Submitted to PVLDB volumn 18 on 20241201"},{"id":"http://arxiv.org/abs/2401.13509v2","updated":"2024-12-06T05:54:55Z","published":"2024-01-24T15:06:44Z","title":"TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient\n and Effective Retrieval","summary":" This paper considers Pseudo-Relevance Feedback (PRF) methods for dense\nretrievers in a resource constrained environment such as that of cheap cloud\ninstances or embedded systems (e.g., smartphones and smartwatches), where\nmemory and CPU are limited and GPUs are not present. For this, we propose a\ntransformer-based PRF method (TPRF), which has a much smaller memory footprint\nand faster inference time compared to other deep language models that employ\nPRF mechanisms, with a marginal effectiveness loss. TPRF learns how to\neffectively combine the relevance feedback signals from dense passage\nrepresentations. Specifically, TPRF provides a mechanism for modelling\nrelationships and weights between the query and the relevance feedback signals.\nThe method is agnostic to the specific dense representation used and thus can\nbe generally applied to any dense retriever.\n","authors":["Hang Li","Chuting Yu","Ahmed Mourad","Bevan Koopman","Guido Zuccon"],"pdf_url":"https://arxiv.org/pdf/2401.13509v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.17740v3","updated":"2024-12-06T05:26:40Z","published":"2024-03-26T14:29:34Z","title":"All-in-One: Heterogeneous Interaction Modeling for Cold-Start Rating\n Prediction","summary":" Cold-start rating prediction is a fundamental problem in recommender systems\nthat has been extensively studied. Many methods have been proposed that exploit\nexplicit relations among existing data, such as collaborative filtering, social\nrecommendations and heterogeneous information network, to alleviate the data\ninsufficiency issue for cold-start users and items. However, the explicit\nrelations constructed based on data between different roles may be unreliable\nand irrelevant, which limits the performance ceiling of the specific\nrecommendation task. Motivated by this, in this paper, we propose a flexible\nframework dubbed heterogeneous interaction rating network (HIRE). 
HIRE dose not\nsolely rely on the pre-defined interaction pattern or the manually constructed\nheterogeneous information network. Instead, we devise a Heterogeneous\nInteraction Module (HIM) to jointly model the heterogeneous interactions and\ndirectly infer the important interactions via the observed data. In the\nexperiments, we evaluate our model under three cold-start settings on three\nreal-world datasets. The experimental results show that HIRE outperforms other\nbaselines by a large margin. Furthermore, we visualize the inferred\ninteractions of HIRE to confirm the contribution of our model.\n","authors":["Shuheng Fang","Kangfei Zhao","Yu Rong","Zhixun Li","Jeffrey Xu Yu"],"pdf_url":"https://arxiv.org/pdf/2403.17740v3.pdf","comment":"14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.05339v1","updated":"2024-12-06T04:30:00Z","published":"2024-12-06T04:30:00Z","title":"PyTerrier-GenRank: The PyTerrier Plugin for Reranking with Large\n Language Models","summary":" Using LLMs as rerankers requires experimenting with various hyperparameters,\nsuch as prompt formats, model choice, and reformulation strategies. We\nintroduce PyTerrier-GenRank, a PyTerrier plugin to facilitate seamless\nreranking experiments with LLMs, supporting popular ranking strategies like\npointwise and listwise prompting. We validate our plugin through HuggingFace\nand OpenAI hosted endpoints.\n","authors":["Kaustubh D. Dhole"],"pdf_url":"https://arxiv.org/pdf/2412.05339v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04746v1","updated":"2024-12-06T03:18:18Z","published":"2024-12-06T03:18:18Z","title":"Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval\n with Semantic Guidance","summary":" Modern music retrieval systems often rely on fixed representations of user\npreferences, limiting their ability to capture users' diverse and uncertain\nretrieval needs. To address this limitation, we introduce Diff4Steer, a novel\ngenerative retrieval framework that employs lightweight diffusion models to\nsynthesize diverse seed embeddings from user queries that represent potential\ndirections for music exploration. Unlike deterministic methods that map user\nquery to a single point in embedding space, Diff4Steer provides a statistical\nprior on the target modality (audio) for retrieval, effectively capturing the\nuncertainty and multi-faceted nature of user preferences. Furthermore,\nDiff4Steer can be steered by image or text inputs, enabling more flexible and\ncontrollable music discovery combined with nearest neighbor search. Our\nframework outperforms deterministic regression methods and LLM-based generative\nretrieval baseline in terms of retrieval and ranking metrics, demonstrating its\neffectiveness in capturing user preferences, leading to more diverse and\nrelevant recommendations. 
Listening examples are available at\ntinyurl.com/diff4steer.\n","authors":["Xuchan Bao","Judith Yue Li","Zhong Yi Wan","Kun Su","Timo Denk","Joonseok Lee","Dima Kuzmin","Fei Sha"],"pdf_url":"https://arxiv.org/pdf/2412.04746v1.pdf","comment":"NeurIPS 2024 Creative AI Track"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.05436v1","updated":"2024-12-06T21:44:33Z","published":"2024-12-06T21:44:33Z","title":"pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data\n Estimation and Multi-modal Processing","summary":" pyAMPACT (Python-based Automatic Music Performance Analysis and Comparison\nToolkit) links symbolic and audio music representations to facilitate\nscore-informed estimation of performance data in audio as well as general\nlinking of symbolic and audio music representations with a variety of\nannotations. pyAMPACT can read a range of symbolic formats and can output\nnote-linked audio descriptors/performance data into MEI-formatted files. The\naudio analysis uses score alignment to calculate time-frequency regions of\nimportance for each note in the symbolic representation from which to estimate\na range of parameters. These include tuning-, dynamics-, and timbre-related\nperformance descriptors, with timing-related information available from the\nscore alignment. Beyond performance data estimation, pyAMPACT also facilitates\nmulti-modal investigations through its infrastructure for linking symbolic\nrepresentations and annotations to audio.\n","authors":["Johanna Devaney","Daniel McKemie","Alex Morgan"],"pdf_url":"https://arxiv.org/pdf/2412.05436v1.pdf","comment":"International Society for Music Information Retrieval, Late Breaking\n Demo"},{"id":"http://arxiv.org/abs/2412.05035v1","updated":"2024-12-06T13:39:36Z","published":"2024-12-06T13:39:36Z","title":"SMIC: Semantic Multi-Item Compression based on CLIP dictionary","summary":" Semantic compression, a compression scheme where the distortion metric,\ntypically MSE, is replaced with semantic fidelity metrics, tends to become more\nand more popular. Most recent semantic compression schemes rely on the\nfoundation model CLIP. In this work, we extend such a scheme to image\ncollection compression, where inter-item redundancy is taken into account\nduring the coding phase. For that purpose, we first show that CLIP's latent\nspace allows for easy semantic additions and subtractions. From this property,\nwe define a dictionary-based multi-item codec that outperforms state-of-the-art\ngenerative codec in terms of compression rate, around $10^{-5}$ BPP per image,\nwhile not sacrificing semantic fidelity. We also show that the learned\ndictionary is of a semantic nature and works as a semantic projector for the\nsemantic content of images.\n","authors":["Tom Bachard","Thomas Maugey"],"pdf_url":"https://arxiv.org/pdf/2412.05035v1.pdf","comment":"12 pages, 14 figures, 3 tables, journal paper, preprint"},{"id":"http://arxiv.org/abs/2411.19772v2","updated":"2024-12-06T07:24:10Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. 
The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v2.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2412.04746v1","updated":"2024-12-06T03:18:18Z","published":"2024-12-06T03:18:18Z","title":"Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval\n with Semantic Guidance","summary":" Modern music retrieval systems often rely on fixed representations of user\npreferences, limiting their ability to capture users' diverse and uncertain\nretrieval needs. To address this limitation, we introduce Diff4Steer, a novel\ngenerative retrieval framework that employs lightweight diffusion models to\nsynthesize diverse seed embeddings from user queries that represent potential\ndirections for music exploration. Unlike deterministic methods that map user\nquery to a single point in embedding space, Diff4Steer provides a statistical\nprior on the target modality (audio) for retrieval, effectively capturing the\nuncertainty and multi-faceted nature of user preferences. Furthermore,\nDiff4Steer can be steered by image or text inputs, enabling more flexible and\ncontrollable music discovery combined with nearest neighbor search. Our\nframework outperforms deterministic regression methods and LLM-based generative\nretrieval baseline in terms of retrieval and ranking metrics, demonstrating its\neffectiveness in capturing user preferences, leading to more diverse and\nrelevant recommendations. Listening examples are available at\ntinyurl.com/diff4steer.\n","authors":["Xuchan Bao","Judith Yue Li","Zhong Yi Wan","Kun Su","Timo Denk","Joonseok Lee","Dima Kuzmin","Fei Sha"],"pdf_url":"https://arxiv.org/pdf/2412.04746v1.pdf","comment":"NeurIPS 2024 Creative AI Track"},{"id":"http://arxiv.org/abs/2411.12825v2","updated":"2024-12-06T01:32:53Z","published":"2024-11-19T19:22:24Z","title":"TopoCode: Topologically Informed Error Detection and Correction in\n Communication Systems","summary":" Traditional error detection and correction codes focus on bit-level fidelity,\nwhich is insufficient for emerging technologies like eXtended Reality (XR) and\nholographic communications requiring high-data-rate, low-latency systems.\nBit-level metrics cannot comprehensively evaluate Quality-of-Service (QoS) in\nthese scenarios. This letter proposes TopoCode which leverages Topological Data\nAnalysis (TDA) and persistent homology to encode topological information for\nmessage-level error detection and correction. 
It introduces minimal redundancy\nwhile enabling effective data reconstruction, especially in low Signal-to-Noise\nRatio (SNR) conditions. TopoCode offers a promising approach to meet the\ndemands of next-generation communication systems prioritizing semantic accuracy\nand message-level integrity.\n","authors":["Hongzhi Guo"],"pdf_url":"https://arxiv.org/pdf/2411.12825v2.pdf","comment":null}]}}
\ No newline at end of file
diff --git a/favicon.ico b/favicon.ico
new file mode 100644
index 00000000..7f5166c7
Binary files /dev/null and b/favicon.ico differ
diff --git a/index.css b/index.css
new file mode 100644
index 00000000..9ded9d94
--- /dev/null
+++ b/index.css
@@ -0,0 +1,355 @@
+:root {
+ /* Palette: Nord (https://www.nordtheme.com) */
+ --nord00: #2e3440;
+ --nord01: #3b4252;
+ --nord02: #434c5e;
+ --nord03: #4c566a;
+ --nord04: #d8dee9;
+ --nord05: #e5e9f0;
+ --nord06: #eceff4;
+ --nord07: #8fbcbb;
+ --nord08: #88c0d0;
+ --nord09: #81a1c1;
+ --nord0A: #5e81ac;
+ --nord0B: #bf616a;
+ --nord0C: #d08770;
+ --nord0D: #ebcb8b;
+ --nord0E: #a3be8c;
+ --nord0F: #b48ead;
+
+
+ /* Typography */
+ --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue",
+ sans-serif;
+ --font-size-scaler: 62.5%;
+ --font-size-m: 1.6rem;
+ --font-size-s: 1.4rem;
+
+ /* Components */
+ --body-color: var(--nord06);
+ --body-bg: var(--nord00);
+
+ --header-title: var(--nord06);
+ --header-container: var(--nord00);
+ --header-title-preffix: var(--nord0F);
+
+ --chip-font: var(--nord08);
+ --chip-color: var(--nord0B);
+
+ --icons: var(--nord06);
+ --icons-hover: var(--nord0F);
+
+ --day-container: var(--nord01);
+ --date: var(--nord09);
+
+ --summary: var(--nord0E);
+ --summary-hover: var(--nord0F);
+
+ --details-open: var(--nord02);
+ --details-content: var(--nord05);
+ --details-a: var(--nord07);
+ --details-a-hover: var(--nord0F);
+
+ --highlight-title: var(--nord0B);
+ --highlight-author: var(--nord0B);
+
+ --article-summary-hover-color: var(--nord0D);
+ --article-summary-color: var(--nord04);
+
+ --article-title-color: var(--nord05);
+ --article-title-hover-color: var(--nord0E);
+
+ --accordion-content-rail-color: var(--nord01);
+ --accordion-content-hover-rail-color: var(--nord0D);
+ --accordion-title-marker-color: var(--nord01);
+ --accordion-title-hover-marker-color: var(--nord0E);
+
+ --footer-color: var(--nord04);
+ --footer-link-hover-color: var(--nord0D);
+}
+
+[data-theme="light"] {
+ /* Theme design */
+
+ --color-primary: var(--nord07);
+ --color-primary-second: var(--nord00);
+ --color-info: var(--nord0A);
+ --color-success: var(--nord0E);
+ --color-warning: var(--nord0C);
+ --color-danger: var(--nord0B);
+
+ --color-text: var(--nord00);
+ --color-hover: var(--nord0D);
+ --color-shadow: var(--nord03);
+
+ --color-primary-h: var(--nord09);
+ --color-primary-s: var(--nord08);
+ --color-primary-l: var(--nord07);
+
+ --color-contrast-higher-h: var(--nord01);
+ --color-contrast-higher-l: var(--nord02);
+ --color-contrast-higher-s: var(--nord03);
+
+ --color-content: white;
+
+ --background: var(--nord06);
+ --background-content: var(--nord05);
+ --background-color: var(--nord04);
+
+ /* Components */
+
+ --chip-font: var(--nord06);
+ --chip-color: var(--nord09);
+
+ --body-color: var(--background-color);
+ --body-bg: var(--background);
+
+ --header-title: var(--color-shadow);
+ --header-container: var(--background);
+ --header-title-preffix: var(--color-primary-h);
+
+ --icons: var(--color-shadow);
+ --icons-hover: var(--color-hover);
+
+ --day-container: var(--background-content);
+ --date: var(--color-primary-l);
+
+ --summary: var(--color-info);
+ --summary-hover: var(--color-success);
+
+ --details-open: var(--color-content);
+ --details-content: var(--color-text);
+ --details-a: var(--color-primary-h);
+ --details-a-hover: var(--color-hover);
+
+ --highlight-title: var(--color-danger);
+ --highlight-author: var(--color-warning);
+
+ --article-summary-color: var(--color-text);
+ --article-summary-hover-color: var(--color-primary-s);
+
+ --article-title-color: var(--color-primary);
+ --article-title-hover-color: var(--color-success);
+
+ --accordion-content-rail-color: var(--color-warning);
+ --accordion-content-hover-rail-color: var(--color-warning);
+ --accordion-title-marker-color: var(--color-success);
+ --accordion-title-hover-marker-color: var(--color-success);
+
+ --footer-color: var(--color-text);
+ --footer-link-hover-color: var(--color-hover);
+}
+
+html {
+ font-size: var(--font-size-scaler);
+}
+
+body {
+ background-color: var(--body-bg);
+ font-family: var(--font-family-default);
+ color: var(--body-color);
+ margin: 0;
+ padding-top: 16px;
+ display: grid;
+}
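+/* The body is laid out as a single-column implicit grid: with no template defined,
+   the header, day containers, description and footer stack as rows. */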
+
+.header-container {
+ width: 90%;
+ max-width: 1200px;
+ background: var(--header-container);
+ margin: 0 auto;
+}
+
+.header-title {
+ font-size: 32px;
+ font-weight: bold;
+ color: var(--header-title);
+ margin: 0;
+ padding-bottom: 14px;
+}
+
+.header-title-preffix {
+ color: var(--header-title-preffix);
+}
+
+.icons {
+ color: var(--icons);
+ padding-bottom: 16px;
+}
+
+.icons a {
+ color: var(--icons);
+ text-decoration: none;
+}
+
+.icons a:hover {
+ color: var(--icons-hover);
+}
+
+.day-container {
+ padding: 16px;
+ background: var(--day-container);
+ width: 90%;
+ max-width: 1200px;
+ margin: 0 auto;
+ margin-bottom: 8px;
+ border-radius: 10px;
+}
+
+.date {
+ font-size: 24px;
+ font-weight: 700;
+ margin: 0;
+ color: var(--date);
+}
+
+p {
+ margin: 0;
+}
+
+summary {
+ font-weight: 600;
+ color: var(--summary);
+}
+
+summary:hover {
+ text-decoration: underline;
+ cursor: pointer;
+ color: var(--summary-hover);
+}
+
+details {
+ --border-color: transparent;
+
+ padding: 2px 4px;
+ font-size: 20px;
+ border: 1px solid var(--border-color);
+ border-radius: 4px;
+}
+
+details[open] {
+ background-color: var(--details-open);
+ margin-bottom: 8px;
+}
+
+.details-content {
+ padding: 12px 3px;
+ gap: 16px;
+ color: var(--details-content);
+}
+
+details a {
+ color: var(--details-a);
+}
+
+details a:hover {
+ color: var(--details-a-hover);
+}
+
+footer {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ justify-content: space-between;
+}
+
+.description {
+ margin: 0 auto;
+ color: var(--footer-color);
+ font-size: var(--font-size-s);
+ display: flex;
+ padding: 0 16px;
+ text-align: center;
+}
+
+.highlight-author {
+ color: var(--highlight-author);
+ font-weight: bold;
+}
+
+.highlight-title {
+ color: var(--highlight-title);
+ font-weight: bold;
+}
+
+.channel-description {
+ text-align: center;
+ font-size: var(--font-size-scaler);
+}
+
+.article-summary-link {
+ color: var(--article-summary-color);
+ font-size: var(--font-size-s);
+ text-decoration: none;
+}
+
+.article-summary-link:hover {
+ color: var(--article-summary-hover-color);
+ --accordion-content-rail-color: var(--accordion-content-hover-rail-color);
+}
+
+.article-summary-box-outer {
+ display: block;
+ padding: 4px 8px 8px 4px;
+}
+
+.article-summary-box-inner {
+ padding-left: 8px;
+ border-left: 1px solid var(--accordion-content-rail-color);
+ font-size: var(--font-size-m);
+}
+
+.article-expander {
+ padding: 10px 4px;
+ border-radius: 4px;
+}
+
+.article-authors {
+ font-size: var(--font-size-m);
+ padding: 0.25em 1em;
+}
+
+.article-authors a {
+ text-decoration: none;
+}
+
+.article-expander-title {
+ font-size: var(--font-size-m);
+ font-weight: 600;
+}
+
+.article-expander-title:hover {
+ cursor: pointer;
+}
+
+.article-expander-title::marker {
+ color: var(--accordion-title-marker-color);
+}
+
+.article-expander-title:hover::marker {
+ color: var(--accordion-title-hover-marker-color);
+}
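+/* ::marker recolors the native disclosure triangle; .article-expander-title is
+   presumably the class of each article's <summary> element inside its <details>. */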
+
+/* for switcher */
+.theme-switch {
+ display: inline-block;
+ position: relative;
+}
+
+.theme-switch input {
+ display: none;
+}
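+/* The checkbox input is hidden; the light palette is presumably applied by a script
+   outside this stylesheet that toggles the data-theme="light" attribute targeted by
+   the block above. */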
+
+/* chip */
+.chip {
+ font-size: 90%;
+ align-items: center;
+ color: var(--chip-font);
+ background: var(--chip-color);
+ border-radius: 5rem;
+ display: inline-flex;
+ padding: .2rem .4rem;
+ vertical-align: middle;
+}
\ No newline at end of file
diff --git a/index.html b/index.html
new file mode 100644
index 00000000..7c00d584
--- /dev/null
+++ b/index.html
@@ -0,0 +1,23986 @@
+
+
+
+
+
+
+
+
+
+
+ Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag
+
+
+ We introduce a novel approach to enhance the capabilities of text-to-image
+models by incorporating a graph-based RAG. Our system dynamically retrieves
+detailed character information and relational data from the knowledge graph,
+enabling the generation of visually accurate and contextually rich images. This
+capability significantly improves upon the limitations of existing T2I models,
+which often struggle with the accurate depiction of complex or culturally
+specific subjects due to dataset constraints. Furthermore, we propose a novel
+self-correcting mechanism for text-to-image models to ensure consistency and
+fidelity in visual outputs, leveraging the rich context from the graph to guide
+corrections. Our qualitative and quantitative experiments demonstrate that
+Context Canvas significantly enhances the capabilities of popular models such
+as Flux, Stable Diffusion, and DALL-E, and improves the functionality of
+ControlNet for fine-grained image editing tasks. To our knowledge, Context
+Canvas represents the first application of graph-based RAG in enhancing T2I
+models, representing a significant advancement for producing high-fidelity,
+context-aware multi-faceted images.
+
+
+
+
+
+
+
+ ☆ Olympus: A Universal Task Router for Computer Vision Tasks
+
+
+
+
+
+
+
+
+ Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr
+
+
+ We introduce Olympus, a new approach that transforms Multimodal Large
+Language Models (MLLMs) into a unified framework capable of handling a wide
+array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates
+over 20 specialized tasks across images, videos, and 3D objects to dedicated
+modules. This instruction-based routing enables complex workflows through
+chained actions without the need for training heavy generative models. Olympus
+easily integrates with existing MLLMs, expanding their capabilities with
+comparable performance. Experimental results demonstrate that Olympus achieves
+an average routing accuracy of 94.75% across 20 tasks and precision of 91.82%
+in chained action scenarios, showcasing its effectiveness as a universal task
+router that can solve a diverse range of computer vision tasks. Project page:
+https://github.com/yuanze-lin/Olympus_page
+
+
+
+ comment: Technical Report
+
+
+
+
+
+
+ ☆ AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web
+ Tutorials
+
+
+ Graphical User Interface (GUI) agents hold great potential for automating
+complex tasks across diverse digital environments, from web applications to
+desktop software. However, the development of such agents is hindered by the
+lack of high-quality, multi-step trajectory data required for effective
+training. Existing approaches rely on expensive and labor-intensive human
+annotation, making them unsustainable at scale. To address this challenge, we
+propose AgentTrek, a scalable data synthesis pipeline that generates
+high-quality GUI agent trajectories by leveraging web tutorials. Our method
+automatically gathers tutorial-like texts from the internet, transforms them
+into task goals with step-by-step instructions, and employs a visual-language
+model agent to simulate their execution in a real digital environment. A
+VLM-based evaluator ensures the correctness of the generated trajectories. We
+demonstrate that training GUI agents with these synthesized trajectories
+significantly improves their grounding and planning performance over the
+current models. Moreover, our approach is more cost-efficient compared to
+traditional human annotation methods. This work underscores the potential of
+guided replay with web tutorials as a viable strategy for large-scale GUI agent
+training, paving the way for more capable and autonomous digital agents.
+
+
+
+ comment: https://agenttrek.github.io
+
+
+
+
+
+
+ ☆ TimeRefine: Temporal Grounding with Time Refining Video LLM
+
+
+ Video temporal grounding aims to localize relevant temporal boundaries in a
+video given a textual prompt. Recent work has focused on enabling Video LLMs to
+perform video temporal grounding via next-token prediction of temporal
+timestamps. However, accurately localizing timestamps in videos remains
+challenging for Video LLMs when relying solely on temporal token prediction.
+Our proposed TimeRefine addresses this challenge in two ways. First, instead of
+directly predicting the start and end timestamps, we reformulate the temporal
+grounding task as a temporal refining task: the model first makes rough
+predictions and then refines them by predicting offsets to the target segment.
+This refining process is repeated multiple times, through which the model
+progressively self-improves its temporal localization accuracy. Second, to
+enhance the model's temporal perception capabilities, we incorporate an
+auxiliary prediction head that penalizes the model more if a predicted segment
+deviates further from the ground truth, thus encouraging the model to make
+closer and more accurate predictions. Our plug-and-play method can be
+integrated into most LLM-based temporal grounding approaches. The experimental
+results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on
+the ActivityNet and Charades-STA datasets, respectively. Code and pretrained
+models will be released.
+
+
+
+
+
+
+
+ ☆ InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
+ Long-term Streaming Video and Audio Interactions
+
+
+
+
+
+
+
+
+ Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
+
+
+ Creating AI systems that can interact with environments over long periods,
+similar to human cognition, has been a longstanding research goal. Recent
+advancements in multimodal large language models (MLLMs) have made significant
+strides in open-world understanding. However, the challenge of continuous and
+simultaneous streaming perception, memory, and reasoning remains largely
+unexplored. Current MLLMs are constrained by their sequence-to-sequence
+architecture, which limits their ability to process inputs and generate
+responses simultaneously, akin to being unable to think while perceiving.
+Furthermore, relying on long contexts to store historical data is impractical
+for long-term interactions, as retaining all information becomes costly and
+inefficient. Therefore, rather than relying on a single foundation model to
+perform all functions, this project draws inspiration from the concept of the
+Specialized Generalist AI and introduces disentangled streaming perception,
+reasoning, and memory mechanisms, enabling real-time interaction with streaming
+video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive
+(IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module:
+Processes multimodal information in real-time, storing key details in memory
+and triggering reasoning in response to user queries. (2) Multi-modal Long
+Memory Module: Integrates short-term and long-term memory, compressing
+short-term memories into long-term ones for efficient retrieval and improved
+accuracy. (3) Reasoning Module: Responds to queries and executes reasoning
+tasks, coordinating with the perception and memory modules. This project
+simulates human-like cognition, enabling multimodal large language models to
+provide continuous and adaptive service over time.
+
+
+ We present OpenNER 1.0, a standardized collection of openly available named
+entity recognition (NER) datasets. OpenNER contains 34 datasets spanning 51
+languages, annotated in varying named entity ontologies. We correct annotation
+format issues, standardize the original datasets into a uniform representation,
+map entity type names to be more consistent across corpora, and provide the
+collection in a structure that enables research in multilingual and
+multi-ontology NER. We provide baseline models using three pretrained
+multilingual language models to compare the performance of recent models and
+facilitate future research in NER.
+
+
+
+
+
+
+
+ ☆ DISHONEST: Dissecting misInformation Spread using Homogeneous sOcial
+ NEtworks and Semantic Topic classification
+
+
+ The emergence of the COVID-19 pandemic resulted in a significant rise in the
+spread of misinformation on online platforms such as Twitter. Oftentimes this
+growth is blamed on the idea of the "echo chamber." However, the behavior said
+to characterize these echo chambers exists in two dimensions. The first is in a
+user's social interactions, where they are said to stick with the same clique
+of like-minded users. The second is in the content of their posts, where they
+are said to repeatedly espouse homogeneous ideas. In this study, we link the
+two by using Twitter's network of retweets to study social interactions and
+topic modeling to study tweet content. In order to measure the diversity of a
+user's interactions over time, we develop a novel metric to track the speed at
+which they travel through the social network. The application of these analysis
+methods to misinformation-focused data from the pandemic demonstrates
+correlation between social behavior and tweet content. We believe this
+correlation supports the common intuition about how antisocial users behave,
+and further suggests that it holds even in subcommunities already rife with
+misinformation.
+
+
+
+
+
+
+
+ ☆ DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through
+ Diverse Perspectives and Multi-Agent Interaction
+
+
+
+
+
+
+
+
+ Yu Feng, Phu Mon Htut, Zheng Qi, Wei Xiao, Manuel Mager, Nikolaos Pappas, Kishaloy Halder, Yang Li, Yassine Benajiba, Dan Roth
+
+
+ Quantifying the uncertainty in the factual parametric knowledge of Large
+Language Models (LLMs), especially in a black-box setting, poses a significant
+challenge. Existing methods, which gauge a model's uncertainty through
+evaluating self-consistency in responses to the original query, do not always
+capture true uncertainty. Models might respond consistently to the original query
+with a wrong answer, yet respond correctly to varied questions from different
+perspectives about the same query, and vice versa. In this paper, we propose a
+novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using
+multi-agent interaction under the assumption that if a model is certain, it
+should consistently recall the answer to the original query across a diverse
+collection of questions about the same original query. We further implement an
+abstention policy to withhold responses when uncertainty is high. Our method
+offers a more accurate prediction of the model's reliability and further
+detects hallucinations, outperforming other self-consistency-based methods.
+Additionally, it demonstrates that existing models often fail to consistently
+retrieve the correct answer to the same query under diverse varied questions
+even when knowing the correct answer.
+
+
+
+
+
+
+
+ ☆ JuStRank: Benchmarking LLM Judges for System Ranking
+
+
+ Given the rapid progress of generative AI, there is a pressing need to
+systematically compare and choose between the numerous models and
+configurations available. The scale and versatility of such evaluations make
+the use of LLM-based judges a compelling solution for this challenge.
+Crucially, this approach requires first to validate the quality of the LLM
+judge itself. Previous work has focused on instance-based assessment of LLM
+judges, where a judge is evaluated over a set of responses, or response pairs,
+while being agnostic to their source systems. We argue that this setting
+overlooks critical factors affecting system-level ranking, such as a judge's
+positive or negative bias towards certain systems. To address this gap, we
+conduct the first large-scale study of LLM judges as system rankers. System
+scores are generated by aggregating judgment scores over multiple system
+outputs, and the judge's quality is assessed by comparing the resulting system
+ranking to a human-based ranking. Beyond overall judge assessment, our analysis
+provides a fine-grained characterization of judge behavior, including their
+decisiveness and bias.
+
+
+
+
+
+
+
+ ★ Does Representation Matter? Exploring Intermediate Layers in Large
+ Language Models
+
+
+ Understanding what defines a good representation in large language models
+(LLMs) is fundamental to both theoretical understanding and practical
+applications. In this paper, we investigate the quality of intermediate
+representations in various LLM architectures, including Transformers and State
+Space Models (SSMs). We find that intermediate layers often yield more
+informative representations for downstream tasks than the final layers. To
+measure the representation quality, we adapt and apply a suite of metrics -
+such as prompt entropy, curvature, and augmentation-invariance - originally
+proposed in other contexts. Our empirical study reveals significant
+architectural differences, how representations evolve throughout training, and
+how factors like input randomness and prompt length affect each layer. Notably,
+we observe a bimodal pattern in the entropy of some intermediate layers and
+consider potential explanations tied to training data. Overall, our results
+illuminate the internal mechanics of LLMs and guide strategies for
+architectural optimization and training.
+
+
+
+ comment: Accepted to 2024 NeurIPS Workshop on Machine Learning and Compression
+
+
+
+
+
+
+ ☆ Foundational Large Language Models for Materials Research
+
+
+
+
+
+
+
+
+ Vaibhav Mishra, Somaditya Singh, Dhruv Ahlawat, Mohd Zaki, Vaibhav Bihani, Hargun Singh Grover, Biswajit Mishra, Santiago Miret, Mausam, N. M. Anoop Krishnan
+
+
+ Materials discovery and development are critical for addressing global
+challenges. Yet, the exponential growth in materials science literature
+comprising vast amounts of textual data has created significant bottlenecks in
+knowledge extraction, synthesis, and scientific reasoning. Large Language
+Models (LLMs) offer unprecedented opportunities to accelerate materials
+research through automated analysis and prediction. Still, their effective
+deployment requires domain-specific adaptation for understanding and solving
+domain-relevant tasks. Here, we present LLaMat, a family of foundational models
+for materials science developed through continued pretraining of LLaMA models
+on an extensive corpus of materials literature and crystallographic data.
+Through systematic evaluation, we demonstrate that LLaMat excels in
+materials-specific NLP and structured information extraction while maintaining
+general linguistic capabilities. The specialized LLaMat-CIF variant
+demonstrates unprecedented capabilities in crystal structure generation,
+predicting stable crystals with high coverage across the periodic table.
+Intriguingly, despite LLaMA-3's superior performance in comparison to LLaMA-2,
+we observe that LLaMat-2 demonstrates unexpectedly enhanced domain-specific
+performance across diverse materials science tasks, including structured
+information extraction from text and tables, and more particularly in crystal
+structure generation, suggesting a potential adaptation rigidity in overtrained LLMs.
+Altogether, the present work demonstrates the effectiveness of domain
+adaptation towards developing practically deployable LLM copilots for materials
+research. Beyond materials science, our findings reveal important
+considerations for domain adaptation of LLMs, such as model selection, training
+methodology, and domain-specific performance, which may influence the
+development of specialized scientific AI systems.
+
+
+ With the rapid development of artificial intelligence technology, the
+application of deepfake technology in the audio field has gradually increased,
+resulting in a wide range of security risks. Especially in the financial and
+social security fields, the misuse of deepfake audios has raised serious
+concerns. To address this challenge, this study proposes an audio deepfake
+detection method based on multi-frequency channel attention mechanism (MFCA)
+and 2D discrete cosine transform (DCT). By processing the audio signal into a
+melspectrogram, using MobileNet V2 to extract deep features, and combining it
+with the MFCA module to weight different frequency channels in the audio
+signal, this method can effectively capture the fine-grained frequency domain
+features in the audio signal and enhance the classification capability for fake
+audios. Experimental results show that compared with traditional methods, the
+model proposed in this study shows significant advantages in accuracy,
+precision, recall, F1 score, and other indicators. Especially in complex audio
+scenarios, this method shows stronger robustness and generalization
+capabilities, offering a new approach to audio deepfake detection with
+important practical application value. In the future, more advanced audio
+detection technologies and optimization strategies will be explored to further
+improve the accuracy and generalization capabilities of audio deepfake
+detection.
+
+
+
+
+
+
+
+ ☆ The Impact of Copyrighted Material on Large Language Models: A Norwegian
+ Perspective
+
+
+
+
+
+
+
+
+ Javier de la Rosa, Vladislav Mikhailov, Lemei Zhang, Freddy Wetjen, David Samuel, Peng Liu, Rolv-Arild Braaten, Petter Mæhlum, Magnus Breder Birkenes, Andrey Kutuzov, Tita Enstad, Svein Arne Brygfjeld, Jon Atle Gulla, Stephan Oepen, Erik Velldal, Wilfred Østgulen, Liljia Øvrelid, Aslak Sira Myhre
+
+
+ The use of copyrighted materials in training generative language models
+raises critical legal and ethical questions. This paper presents a framework
+for and the results of empirically assessing the impact of copyrighted
+materials on the performance of large language models (LLMs) for Norwegian. We
+found that both books and newspapers contribute positively when the models are
+evaluated on a diverse set of Norwegian benchmarks, while fiction works
+possibly lead to decreased performance. Our experiments could inform the
+creation of a compensation scheme for authors whose works contribute to AI
+development.
+
+
+
+ comment: pre-print, under review
+
+
+
+
+
+
+ ☆ From Intention To Implementation: Automating Biomedical Research via
+ LLMs
+
+
+ Conventional biomedical research is increasingly labor-intensive due to the
+exponential growth of scientific literature and datasets. Artificial
+intelligence (AI), particularly Large Language Models (LLMs), has the potential
+to revolutionize this process by automating various steps. Still, significant
+challenges remain, including the need for multidisciplinary expertise,
+logicality of experimental design, and performance measurements. This paper
+introduces BioResearcher, the first end-to-end automated system designed to
+streamline the entire biomedical research process involving dry lab
+experiments. BioResearcher employs a modular multi-agent architecture,
+integrating specialized agents for search, literature processing, experimental
+design, and programming. By decomposing complex tasks into logically related
+sub-tasks and utilizing a hierarchical learning approach, BioResearcher
+effectively addresses the challenges of multidisciplinary requirements and
+logical complexity. Furthermore, BioResearcher incorporates an LLM-based
+reviewer for in-process quality control and introduces novel evaluation metrics
+to assess the quality and automation of experimental protocols. BioResearcher
+successfully achieves an average execution success rate of 63.07% across eight
+previously unmet research objectives. The generated protocols outperform those of
+typical agent systems by an average of 22.0% on five quality metrics. The system
+demonstrates significant potential to reduce researchers' workloads and
+accelerate biomedical discoveries, paving the way for future innovations in
+automated research systems.
+
+
+
+
+
+
+
+ ☆ Unifying AI Tutor Evaluation: An Evaluation Taxonomy for Pedagogical
+ Ability Assessment of LLM-Powered AI Tutors
+
+
+ In this paper, we investigate whether current state-of-the-art large language
+models (LLMs) are effective as AI tutors and whether they demonstrate
+pedagogical abilities necessary for good AI tutoring in educational dialogues.
+Previous efforts towards evaluation have been limited to subjective protocols
+and benchmarks. To bridge this gap, we propose a unified evaluation taxonomy
+with eight pedagogical dimensions based on key learning sciences principles,
+which is designed to assess the pedagogical value of LLM-powered AI tutor
+responses grounded in student mistakes or confusion in the mathematical domain.
+We release MRBench -- a new evaluation benchmark containing 192 conversations
+and 1,596 responses from seven state-of-the-art LLM-based and human tutors,
+providing gold annotations for eight pedagogical dimensions. We assess
+the reliability of the popular Prometheus2 LLM as an evaluator and analyze each
+tutor's pedagogical abilities, highlighting which LLMs are good tutors and
+which ones are more suitable as question-answering systems. We believe that the
+presented taxonomy, benchmark, and human-annotated labels will streamline the
+evaluation process and help track the progress in AI tutors' development.
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ☆ Text Generation Models for Luxembourgish with Limited Data: A Balanced
+ Multilingual Strategy
+
+
+ This paper addresses the challenges in developing language models for
+less-represented languages, with a focus on Luxembourgish. Despite its active
+development, Luxembourgish faces a digital data scarcity, exacerbated by
+Luxembourg's multilingual context. We propose a novel text generation model
+based on the T5 architecture, combining limited Luxembourgish data with equal
+amounts, in terms of size and type, of German and French data. We hypothesise
+that a model trained on Luxembourgish, German, and French will improve the
+model's cross-lingual transfer learning capabilities and outperform monolingual
+and large multilingual models. To verify this, the study at hand explores
+whether multilingual or monolingual training is more beneficial for
+Luxembourgish language generation. For the evaluation, we introduce LuxGen, a
+text generation benchmark that is the first of its kind for Luxembourgish.
+
+
+
+ comment: Accepted at VarDial 2025
+
+
+
+
+
+
+ ☆ Imitate, Explore, and Self-Improve: A Reproduction Report on
+ Slow-thinking Reasoning Systems
+
+
+ Recently, slow-thinking reasoning systems, such as o1, have demonstrated
+remarkable capabilities in solving complex reasoning tasks. These systems
+typically engage in an extended thinking process before responding to a query,
+allowing them to generate more thorough, accurate, and well-reasoned solutions.
+These systems are primarily developed and maintained by industry, with their
+core techniques not publicly disclosed. In response, an increasing number of
+studies from the research community aim to explore the technical foundations
+underlying these powerful reasoning systems. Building on these prior efforts,
+this paper presents a reproduction report on implementing o1-like reasoning
+systems. We introduce an "imitate, explore, and self-improve" framework as our
+primary technical approach to train the reasoning model. In the initial phase,
+we use distilled long-form thought data to fine-tune the reasoning model,
+enabling it to invoke a slow-thinking mode. The model is then encouraged to
+explore challenging problems by generating multiple rollouts, which can result
+in increasingly high-quality trajectories that lead to correct answers.
+Furthermore, the model undergoes self-improvement by iteratively refining its
+training dataset. To verify the effectiveness of this approach, we conduct
+extensive experiments on three challenging benchmarks. The experimental results
+demonstrate that our approach achieves competitive performance compared to
+industry-level reasoning systems on these benchmarks.
+
+
+
+ comment: Technical Report on Slow Thinking with LLMs: Part II
+
+
+
+
+
+
+ ☆ Neural Text Normalization for Luxembourgish using Real-Life Variation
+ Data
+
+
+
+
+
+
+
+
+ Anne-Marie Lutgen, Alistair Plum, Christoph Purschke, Barbara Plank
+
+
+ Orthographic variation is very common in Luxembourgish texts due to the
+absence of a fully-fledged standard variety. Additionally, developing NLP tools
+for Luxembourgish is a difficult task given the lack of annotated and parallel
+data, which is exacerbated by ongoing standardization. In this paper, we
+propose the first sequence-to-sequence normalization models using the ByT5 and
+mT5 architectures with training data obtained from word-level real-life
+variation data. We perform a fine-grained, linguistically-motivated evaluation
+to test byte-based, word-based and pipeline-based models for their strengths
+and weaknesses in text normalization. We show that our sequence model using
+real-life variation data is an effective approach for tailor-made normalization
+in Luxembourgish.
+
+
+
+ comment: Accepted at VarDial 2025
+
+
+
+
+
+
+ ☆ From Bench to Bedside: A Review of Clinical Trials in Drug Discovery and
+ Development
+
+
+
+
+
+
+
+
+ Tianyang Wang, Ming Liu, Benji Peng, Xinyuan Song, Charles Zhang, Xintian Sun, Qian Niu, Junyu Liu, Silin Chen, Keyu Chen, Ming Li, Pohsun Feng, Ziqian Bi, Yunze Wang, Yichao Zhang, Cheng Fei, Lawrence KQ Yan
+
+
+ Clinical trials are an indispensable part of the drug development process,
+bridging the gap between basic research and clinical application. During the
+development of new drugs, clinical trials are used not only to evaluate the
+safety and efficacy of the drug but also to explore its dosage, treatment
+regimens, and potential side effects. This review discusses the various stages
+of clinical trials, including Phase I (safety assessment), Phase II
+(preliminary efficacy evaluation), Phase III (large-scale validation), and
+Phase IV (post-marketing surveillance), highlighting the characteristics of
+each phase and their interrelationships. Additionally, the paper addresses the
+major challenges encountered in clinical trials, such as ethical issues,
+subject recruitment difficulties, diversity and representativeness concerns,
+and proposes strategies for overcoming these challenges. With the advancement
+of technology, innovative technologies such as artificial intelligence, big
+data, and digitalization are gradually transforming clinical trial design and
+implementation, improving trial efficiency and data quality. The article also
+looks forward to the future of clinical trials, particularly the impact of
+emerging therapies such as gene therapy and immunotherapy on trial design, as
+well as the importance of regulatory reforms and global collaboration. In
+conclusion, the core role of clinical trials in drug development will continue
+to drive the progress of innovative drug development and clinical treatment.
+
+
+
+ comment: 11 pages
+
+
+
+
+
+
+ ☆ Word Sense Linking: Disambiguating Outside the Sandbox
+
+
+
+
+
+
+
+
+ Andrei Stefan Bejgu, Edoardo Barba, Luigi Procopio, Alberte Fernández-Castro, Roberto Navigli
+
+
+ Word Sense Disambiguation (WSD) is the task of associating a word in a given
+context with its most suitable meaning among a set of possible candidates.
+While the task has recently witnessed renewed interest, with systems achieving
+performances above the estimated inter-annotator agreement, at the time of
+writing it still struggles to find downstream applications. We argue that one
+of the reasons behind this is the difficulty of applying WSD to plain text.
+Indeed, in the standard formulation, models work under the assumptions that a)
+all the spans to disambiguate have already been identified, and b) all the
+possible candidate senses of each span are provided, both of which are
+requirements that are far from trivial. In this work, we present a new task
+called Word Sense Linking (WSL) where, given an input text and a reference
+sense inventory, systems have to both identify which spans to disambiguate and
+then link them to their most suitable meaning. We put forward a
+transformer-based architecture for the task and thoroughly evaluate both its
+performance and those of state-of-the-art WSD systems scaled to WSL,
+iteratively relaxing the assumptions of WSD. We hope that our work will foster
+easier integration of lexical semantics into downstream applications.
+
+
+
+
+
+
+
+ ☆ Falcon-UI: Understanding GUI Before Following User Instructions
+
+
+ Pursuing human-like interaction for Graphical User Interface (GUI) agents
+requires understanding the GUI context and following user instructions.
+However, existing works typically couple these two aspects and focus more on
+instruction-following abilities, while ignoring the importance of understanding
+the GUI context. In this paper, we introduce an instruction-free GUI navigation
+dataset, termed Insight-UI Dataset, to enhance model comprehension of GUI
+environments. Insight-UI Dataset is automatically generated from the Common
+Crawl corpus, simulating various platforms -- including iOS, Android, Windows,
+and Linux -- across multiple resolutions on 312K domains. Although GUI
+interactions vary by context, diverse interfaces share common internal
+patterns, such as clicking an item to view its details. This implies the
+feasibility of independent GUI operation learning, followed by joint
+optimization with instruction tuning. Thereby, we develop the GUI agent model
+Falcon-UI, which is initially pretrained on Insight-UI Dataset and subsequently
+fine-tuned on Android and Web GUI datasets, including AITW, AITZ, Android
+Control, and Mind2Web. With 7 billion parameters, Falcon-UI achieves accuracy
+comparable to the 72 billion-parameter Qwen2VL on AITZ, validating the
+alignment between GUI context comprehension and agent performance. Our code and
+dataset will be open-sourced.
+
+
+
+
+
+
+
+
+ Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
+
+
+ Recent work has empirically shown that Vision-Language Models (VLMs) struggle
+to fully understand the compositional properties of the human language, usually
+modeling an image caption as a "bag of words". As a result, they perform poorly
+on compositional tasks, which require a deeper understanding of the different
+entities of a sentence (subject, verb, etc.) jointly with their mutual
+relationships in order to be solved. In this paper, we model the dependency
+relations among textual and visual tokens using a Causal Graphical Model (CGM),
+built using a dependency parser, and we train a decoder conditioned by the VLM
+visual encoder. Differently from standard autoregressive or parallel
+predictions, our decoder's generative process is partially-ordered following
+the CGM structure. This structure encourages the decoder to learn only the main
+causal dependencies in a sentence discarding spurious correlations. Using
+extensive experiments on five compositional benchmarks, we show that our method
+significantly outperforms all the state-of-the-art compositional approaches by
+a large margin, and it also improves over methods trained using much larger
+datasets.
+
+
+
+
+
+
+
+ ☆ Training LayoutLM from Scratch for Efficient Named-Entity Recognition in
+ the Insurance Domain
+
+
+ Generic pre-trained neural networks may struggle to produce good results in
+specialized domains like finance and insurance. This is due to a domain
+mismatch between training data and downstream tasks, as in-domain data are
+often scarce due to privacy constraints. In this work, we compare different
+pre-training strategies for LayoutLM. We show that using domain-relevant
+documents improves results on a named-entity recognition (NER) problem using a
+novel dataset of anonymized insurance-related financial documents called
+Payslips. Moreover, we show that we can achieve competitive results using a
+smaller and faster model.
+
+
+
+ comment: Coling 2025 workshop (FinNLP)
+
+
+
+
+
+
+ ☆ Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction
+
+
+ LLMs can generate human-like dialogues, yet their ability to simulate early
+child-adult interactions remains largely unexplored. In this paper, we examined
+how effectively LLMs can capture the distinctive features of child-caregiver
+language in interaction, using both static and interactive benchmarking
+methods. We found that state-of-the-art LLMs like Llama 3 and GPT-4o can
+approximate child-caregiver dialogues at the word and utterance level, but they
+struggle to reproduce the child and caregiver's discursive patterns, exaggerate
+alignment, and fail to reach the level of diversity shown by humans. The
+broader goal of this work is to initiate the development of a comprehensive
+benchmark for LLMs in child-oriented applications.
+
+
+
+
+
+
+
+ ☆ CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of
+ LLMs
+
+
+
+
+
+
+
+
+ Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
+
+
+ Powerful large language models (LLMs) are increasingly expected to be
+deployed with lower computational costs, enabling their capabilities on
+resource-constrained devices. Post-training quantization (PTQ) has emerged as a
+star approach to achieve this ambition, with best methods compressing weights
+to less than 2 bit on average. In this paper, we propose Channel-Relaxed Vector
+Quantization (CRVQ), a novel technique that significantly improves the
+performance of PTQ baselines at the cost of only minimal additional bits. This
+state-of-the-art extreme compression method achieves its results through two
+key innovations: (1) carefully selecting and reordering a very small subset of
+critical weight channels, and (2) leveraging multiple codebooks to relax the
+constraint of critical channels. With our method, we demonstrate a 38.9%
+improvement over the current strongest sub-2-bit PTQ baseline, enabling
+near-lossless 1-bit compression. Furthermore, our approach offers flexible
+customization of quantization bit-width and performance, providing a wider
+range of deployment options for diverse hardware platforms.
+
+
+
+ comment: 5 figures, 4 tables
+
+
+
+
+
+
+ ☆ Learning to Solve Domain-Specific Calculation Problems with
+ Knowledge-Intensive Programs Generator
+
+
+
+
+
+
+
+
+ Chengyuan Liu, Shihang Wang, Lizhi Qing, Jun Lin, Ji Zhang, Fei Wu, Kun Kuang
+
+
+ Domain Large Language Models (LLMs) are developed for domain-specific tasks
+based on general LLMs. However, some domain-specific tasks still require
+professional knowledge to supply the necessary expertise. In this paper, we
+investigate knowledge-intensive calculation problems. We find that such math
+problems are challenging for LLMs when they involve complex
+domain-specific rules and knowledge documents, rather than simple formulations
+of terminologies. Therefore, we propose a pipeline to solve the domain-specific
+calculation problems with Knowledge-Intensive Programs Generator more
+effectively, named as KIPG. It generates knowledge-intensive programs according
+to the domain-specific documents. For each query, key variables are extracted,
+then outcomes which are dependent on domain knowledge are calculated with the
+programs. By iterative preference alignment, the code generator learns to
+improve the logic consistency with the domain knowledge. Taking legal domain as
+an example, we have conducted experiments to prove the effectiveness of our
+pipeline, and extensive analysis on the modules. We also find that the code
+generator is also adaptable to other domains, without training on the new
+knowledge.
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ☆ Towards Understanding the Robustness of LLM-based Evaluations under
+ Perturbations
+
+
+ Traditional evaluation metrics like BLEU and ROUGE fall short when capturing
+the nuanced qualities of generated text, particularly when there is no single
+ground truth. In this paper, we explore the potential of Large Language Models
+(LLMs), specifically Google Gemini 1, to serve as automatic evaluators for
+non-standardized metrics in summarization and dialog-based tasks. We conduct
+experiments across multiple prompting strategies to examine how LLMs fare as
+quality evaluators when compared with human judgments on the SummEval and USR
+datasets, asking the model to generate both a score as well as a justification
+for the score. Furthermore, we explore the robustness of the LLM evaluator by
+using perturbed inputs. Our findings suggest that while LLMs show promise,
+their alignment with human evaluators is limited, they are not robust against
+perturbations, and significant improvements are required for their standalone
+use as reliable evaluators for subjective metrics.
+
+
+
+ comment: Accepted at ICON 2024
+
+
+
+
+
+
+ ☆ First Train to Generate, then Generate to Train: UnitedSynT5 for
+ Few-Shot NLI
+
+
+ Natural Language Inference (NLI) tasks require identifying the relationship
+between sentence pairs, typically classified as entailment, contradiction, or
+neutrality. While the current state-of-the-art (SOTA) model, Entailment
+Few-Shot Learning (EFL), achieves a 93.1% accuracy on the Stanford Natural
+Language Inference (SNLI) dataset, further advancements are constrained by the
+dataset's limitations. To address this, we propose a novel approach leveraging
+synthetic data augmentation to enhance dataset diversity and complexity. We
+present UnitedSynT5, an advanced extension of EFL that leverages a T5-based
+generator to synthesize additional premise-hypothesis pairs, which are
+rigorously cleaned and integrated into the training data. These augmented
+examples are processed within the EFL framework, embedding labels directly into
+hypotheses for consistency. We train a GTR-T5-XL model on this expanded
+dataset, achieving a new benchmark of 94.7% accuracy on the SNLI dataset,
+94.01% accuracy on the E-SNLI dataset, and 92.57% accuracy on the MultiNLI
+dataset, surpassing the previous SOTA models. This research demonstrates the
+potential of synthetic data augmentation in improving NLI models, offering a
+path forward for further advancements in natural language understanding tasks.
+
+
+
+ comment: 14 pages
+
+
+
+
+
+
+ ☆ Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by
+ Utilizing Generative LLMs COLING2025
+
+
+ Satire detection is essential for accurately extracting opinions from textual
+data and combating misinformation online. However, the lack of diverse corpora
+for satire leads to the problem of stylistic bias which impacts the models'
+detection performances. This study proposes a debiasing approach for satire
+detection, focusing on reducing biases in training data by utilizing generative
+large language models. The approach is evaluated in both cross-domain (irony
+detection) and cross-lingual (English) settings. Results show that the
+debiasing method enhances the robustness and generalizability of the models for
+satire and irony detection tasks in Turkish and English. However, its impact on
+causal language models, such as Llama-3.1, is limited. Additionally, this work
+curates and presents the Turkish Satirical News Dataset with detailed human
+annotations, with case studies on classification, debiasing, and
+explainability.
+
+
+
+ comment: Accepted to BUCC2025 Workshop @COLING2025
+
+
+
+
+
+
+
+ Dmitry Vikhorev, Daria Galimzianova, Svetlana Gorovaia, Elizaveta Zhemchuzhina, Ivan P. Yamshchikov
+
+
+ Humor generation is a challenging task in natural language processing due to
+limited resources and the quality of existing datasets. Available humor
+language resources often suffer from toxicity and duplication, limiting their
+effectiveness for training robust models. This paper proposes CleanComedy, a
+specialized, partially annotated toxicity-filtered corpus of English and
+Russian jokes collected from various sources. We study the effectiveness of our
+data filtering approach through a survey on humor and toxicity levels in
+various joke groups. In addition, we study advances in computer humor
+generation by comparing jokes written by humans with various groups of
+generative jokes, including our baseline models trained on the CleanComedy
+datasets.
+
+
+
+
+
+
+
+ ☆ ReFF: Reinforcing Format Faithfulness in Language Models across Varied
+ Tasks AAAI 2025
+
+
+ Following formatting instructions to generate well-structured content is a
+fundamental yet often unmet capability for large language models (LLMs). To
+study this capability, which we refer to as format faithfulness, we present
+FormatBench, a comprehensive format-related benchmark. Compared to previous
+format-related benchmarks, FormatBench involves a greater variety of tasks in
+terms of application scenes (traditional NLP tasks, creative works, autonomous
+agency tasks), human-LLM interaction styles (single-turn instruction,
+multi-turn chat), and format types (inclusion, wrapping, length, coding).
+Moreover, each task in FormatBench is attached with a format checker program.
+Extensive experiments on the benchmark reveal that state-of-the-art open- and
+closed-source LLMs still suffer from severe deficiency in format faithfulness.
+By virtue of the decidable nature of formats, we propose to Reinforce Format
+Faithfulness (ReFF) to help LLMs generate formatted output as instructed
+without compromising general quality. Without any annotated data, ReFF can
+substantially improve the format faithfulness rate (e.g., from 21.6% in
+original LLaMA3 to 95.0% on a caption segmentation task), while keeping the general
+quality comparable (e.g., from 47.3 to 46.4 in F1 scores). Combined with
+labeled training data, ReFF can simultaneously improve both format faithfulness
+(e.g., from 21.6% in original LLaMA3 to 75.5%) and general quality (e.g., from
+47.3 to 61.6 in F1 scores). We further offer an interpretability analysis to
+explain how ReFF improves both format faithfulness and general quality.
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ☆ When Text Embedding Meets Large Language Model: A Comprehensive Survey
+
+
+ Text embedding has become a foundational technology in natural language
+processing (NLP) during the deep learning era, driving advancements across a
+wide array of downstream tasks. While many natural language understanding
+challenges can now be modeled using generative paradigms and leverage the
+robust generative and comprehension capabilities of large language models
+(LLMs), numerous practical applications, such as semantic matching, clustering,
+and information retrieval, continue to rely on text embeddings for their
+efficiency and effectiveness. In this survey, we categorize the interplay
+between LLMs and text embeddings into three overarching themes: (1)
+LLM-augmented text embedding, enhancing traditional embedding methods with
+LLMs; (2) LLMs as text embedders, utilizing their innate capabilities for
+embedding generation; and (3) Text embedding understanding with LLMs,
+leveraging LLMs to analyze and interpret embeddings. By organizing these
+efforts based on interaction patterns rather than specific downstream
+applications, we offer a novel and systematic overview of contributions from
+various research and application domains in the era of LLMs. Furthermore, we
+highlight the unresolved challenges that persisted in the pre-LLM era with
+pre-trained language models (PLMs) and explore the emerging obstacles brought
+forth by LLMs. Building on this analysis, we outline prospective directions for
+the evolution of text embedding, addressing both theoretical and practical
+opportunities in the rapidly advancing landscape of NLP.
+
+
+ This paper presents PolyIPA, a novel multilingual phoneme-to-grapheme
+conversion model designed for multilingual name transliteration, onomastic
+research, and information retrieval. The model leverages two helper models
+developed for data augmentation: IPA2vec for finding soundalikes across
+languages, and similarIPA for handling phonetic notation variations. Evaluated
+on a test set that spans multiple languages and writing systems, the model
+achieves a mean Character Error Rate of 0.055 and a character-level BLEU score
+of 0.914, with particularly strong performance on languages with shallow
+orthographies. The implementation of beam search further improves practical
+utility, with top-3 candidates reducing the effective error rate by 52.7% (to
+CER: 0.026), demonstrating the model's effectiveness for cross-linguistic
+applications.
+
+
+
+
+
+
+
+ ☆ Filter-then-Generate: Large Language Models with Structure-Text Adapter
+ for Knowledge Graph Completion COLING 2025
+
+
+
+
+
+
+
+
+ Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng
+
+
+ Large Language Models (LLMs) present massive inherent knowledge and superior
+semantic comprehension capability, which have revolutionized various tasks in
+natural language processing. Despite their success, a critical gap remains in
+enabling LLMs to perform knowledge graph completion (KGC). Empirical evidence
+suggests that LLMs consistently perform worse than conventional KGC approaches,
+even through sophisticated prompt design or tailored instruction-tuning.
+Fundamentally, applying LLMs on KGC introduces several critical challenges,
+including a vast set of entity candidates, hallucination issue of LLMs, and
+under-exploitation of the graph structure. To address these challenges, we
+propose a novel instruction-tuning-based method, namely FtG. Specifically, we
+present a "filter-then-generate" paradigm and formulate the KGC task
+into a multiple-choice question format. In this way, we can harness the
+capability of LLMs while mitigating the issue caused by hallucinations.
+Moreover, we devise a flexible ego-graph serialization prompt and employ a
+structure-text adapter to couple structure and text information in a
+contextualized manner. Experimental results demonstrate that FtG achieves
+substantial performance gain compared to existing state-of-the-art methods. The
+instruction dataset and code are available at
+https://github.com/LB0828/FtG.
+
+
+
+ comment: COLING 2025 Main Conference
+
+
+
+
+
+
+ ☆ Evaluating Pixel Language Models on Non-Standardized Languages COLING 2025
+
+
+
+
+
+
+
+
+ Alberto Muñoz-Ortiz, Verena Blaschke, Barbara Plank
+
+
+ We explore the potential of pixel-based models for transfer learning from
+standard languages to dialects. These models convert text into images that are
+divided into patches, enabling a continuous vocabulary representation that
+proves especially useful for out-of-vocabulary words common in dialectal data.
+Using German as a case study, we compare the performance of pixel-based models
+to token-based models across various syntactic and semantic tasks. Our results
+show that pixel-based models outperform token-based models in part-of-speech
+tagging, dependency parsing and intent detection for zero-shot dialect
+evaluation by up to 26 percentage points in some scenarios, though not in
+Standard German. However, pixel-based models fall short in topic
+classification. These findings emphasize the potential of pixel-based models
+for handling dialectal data, though further research should be conducted to
+assess their effectiveness in various linguistic contexts.
+
+
+
+
+
+
+
+
+ Zhenni Bi, Kai Han, Chuanjian Liu, Yehui Tang, Yunhe Wang
+
+
+ Large Language Models (LLMs) have shown remarkable abilities across various
+language tasks, but solving complex reasoning problems remains a challenge.
+While existing methods like Chain-of-Thought (CoT) and Tree-of-Thought (ToT)
+enhance reasoning by decomposing problems or structuring prompts, they
+typically perform a single pass of reasoning and may fail to revisit flawed
+paths, compromising accuracy. To address this, we propose a novel reasoning
+framework called Forest-of-Thought (FoT), which integrates multiple reasoning
+trees to leverage collective decision-making for solving complex logical
+problems. FoT utilizes sparse activation strategies to select the most relevant
+reasoning paths, improving both efficiency and accuracy. Additionally, we
+introduce a dynamic self-correction strategy that enables real-time error
+correction and learning from past mistakes, as well as consensus-guided
+decision making strategies to optimize correctness and computational resources.
+Experimental results demonstrate that the FoT framework, combined with these
+strategies, significantly enhances the reasoning capabilities of LLMs, enabling
+them to solve complex tasks with greater precision and efficiency.
+
+
+ The discovery of customer intention from dialogue plays an important role in
+automated support system. However, traditional text clustering methods are
+poorly aligned with human perceptions due to the shift from embedding distance
+to semantic distance, and existing quantitative metrics for text clustering may
+not accurately reflect the true quality of intent clusters. In this paper, we
+leverage the superior language understanding capabilities of Large Language
+Models (LLMs) for designing better-calibrated intent clustering algorithms. We
+first establish the foundation by verifying the robustness of fine-tuned LLM
+utility in semantic coherence evaluation and cluster naming, resulting in an
+accuracy of 97.50% and 94.40%, respectively, when compared to the human-labeled
+ground truth. Then, we propose an iterative clustering algorithm that
+facilitates cluster-level refinement and the continuous discovery of
+high-quality intent clusters. Furthermore, we present several LLM-in-the-loop
+semi-supervised clustering techniques tailored for intent discovery from
+customer service dialogue. Experiments on a large-scale industrial dataset
+comprising 1,507 intent clusters demonstrate the effectiveness of the proposed
+techniques. The methods outperformed existing counterparts, achieving 6.25%
+improvement in quantitative metrics and 12% enhancement in application-level
+performance when constructing an intent classifier.
+
+
+
+
+
+
+
+ ☆ Multi-Task Learning with LLMs for Implicit Sentiment Analysis:
+ Data-level and Task-level Automatic Weight Learning
+
+
+ Implicit sentiment analysis (ISA) presents significant challenges due to the
+absence of salient cue words. Previous methods have struggled with insufficient
+data and limited reasoning capabilities to infer underlying opinions.
+Integrating multi-task learning (MTL) with large language models (LLMs) offers
+the potential to enable models of varying sizes to reliably perceive and
+recognize genuine opinions in ISA. However, existing MTL approaches are
+constrained by two sources of uncertainty: data-level uncertainty, arising from
+hallucination problems in LLM-generated contextual information, and task-level
+uncertainty, stemming from the varying capacities of models to process
+contextual information. To handle these uncertainties, we introduce MT-ISA, a
+novel MTL framework that enhances ISA by leveraging the generation and
+reasoning capabilities of LLMs through automatic MTL. Specifically, MT-ISA
+constructs auxiliary tasks using generative LLMs to supplement sentiment
+elements and incorporates automatic MTL to fully exploit auxiliary data. We
+introduce data-level and task-level automatic weight learning (AWL), which
+dynamically identifies relationships and prioritizes more reliable data and
+critical tasks, enabling models of varying sizes to adaptively learn
+fine-grained weights based on their reasoning capabilities. We investigate
+three strategies for data-level AWL, while also introducing homoscedastic
+uncertainty for task-level AWL. Extensive experiments reveal that models of
+varying sizes achieve an optimal balance between primary prediction and
+auxiliary tasks in MT-ISA. This underscores the effectiveness and adaptability
+of our approach.
+
+
+
+ comment: 11 pages, 6 figures, and 6 tables
+
+
+
+
+
+
+ ☆ Mining Word Boundaries from Speech-Text Parallel Data for Cross-domain
+ Chinese Word Segmentation COLING 2025
+
+
+
+
+
+
+
+
+ Xuebin Wang, Lei Zhang, Zhenghua Li, Shilin Zhou, Chen Gong, Yang Hou
+
+
+ Inspired by early research on exploring naturally annotated data for Chinese
+Word Segmentation (CWS), and also by recent research on integration of speech
+and text processing, this work for the first time proposes to explicitly mine
+word boundaries from speech-text parallel data. We employ the Montreal Forced
+Aligner (MFA) toolkit to perform character-level alignment on speech-text data,
+giving pauses as candidate word boundaries. Based on detailed analysis of
+collected pauses, we propose an effective probability-based strategy for
+filtering unreliable word boundaries. To more effectively utilize word
+boundaries as extra training data, we also propose a robust complete-then-train
+(CTT) strategy. We conduct cross-domain CWS experiments on two target domains,
+i.e., ZX and AISHELL2. We have annotated about 1,000 sentences as the
+evaluation data of AISHELL2. Experiments demonstrate the effectiveness of our
+proposed approach.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ☆ ZigZagkv: Dynamic KV Cache Compression for Long-context Modeling based
+ on Layer Uncertainty
+
+
+
+
+
+
+
+
+ Meizhi Zhong, Xikai Liu, Chen Zhang, Yikun Lei, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
+
+
+ Large Language Models (LLMs) have become a research hotspot. To accelerate
+LLM inference, storing computed key-value (KV) caches in memory has become the
+standard technique. However, as the inference length increases, growing KV
+caches might lead to out-of-memory issues. Many existing methods address this
+issue through KV cache compression, primarily by preserving key tokens
+throughout all layers to reduce information loss. Most of them allocate a
+uniform retention budget to each layer. However, we observe that the
+minimum budget sizes needed to retain essential information vary across layers
+and models based on the perspectives of attention and hidden state output.
+Building on this observation, this paper proposes a simple yet effective KV
+cache compression method that leverages layer uncertainty to allocate budget
+size for each layer. Experimental results show that the proposed method can
+reduce memory usage of the KV caches to only $\sim$20\% when compared to Full
+KV inference while achieving nearly lossless performance.
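+
+ A rough sketch of the budget-allocation idea (give layers with higher
+uncertainty a larger share of a global KV cache budget) is shown below; the
+proportional allocation rule and the minimum-per-layer floor are assumptions
+for illustration, not the paper's exact procedure.
+
+# Illustrative per-layer KV cache budget allocation; not the ZigZagkv implementation.
+import numpy as np
+
+def allocate_kv_budget(layer_uncertainty, total_budget, min_per_layer=16):
+    """Split a total token budget across layers in proportion to their uncertainty."""
+    u = np.asarray(layer_uncertainty, dtype=float)
+    share = u / u.sum()
+    budgets = np.maximum(min_per_layer, np.floor(share * total_budget)).astype(int)
+    # Trim the largest allocations if the per-layer floor pushed us over the cap.
+    while budgets.sum() > total_budget:
+        budgets[np.argmax(budgets)] -= 1
+    return budgets
+
+# e.g. allocate_kv_budget([0.2, 0.5, 0.9, 1.4], total_budget=1024)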
+
+
+
+
+
+
+
+ ☆ Dialogue Language Model with Large-Scale Persona Data Engineering
+
+
+ Maintaining persona consistency is paramount in the application of
+open-domain dialogue systems, as exemplified by models like ChatGPT. Despite
+significant advancements, the limited scale and diversity of current persona
+dialogue datasets remain challenges to achieving robust persona-consistent
+dialogue models. In this study, drawing inspiration from the success of
+large-scale pre-training, we introduce PPDS, an open-domain persona dialogue
+system that employs extensive generative pre-training on a persona dialogue
+dataset to enhance persona consistency. Specifically, we present a persona
+extraction model designed to autonomously and precisely generate vast persona
+dialogue datasets. Additionally, we unveil a pioneering persona augmentation
+technique to address the invalid persona bias inherent in the constructed
+dataset. Both quantitative and human evaluations consistently highlight the
+superior response quality and persona consistency of our proposed model,
+underscoring its effectiveness.
+
+
+
+
+
+
+
+ ☆ Shiksha: A Technical Domain focused Translation Dataset and Model for
+ Indian Languages
+
+
+ Neural Machine Translation (NMT) models are typically trained on datasets
+with limited exposure to Scientific, Technical and Educational domains.
+Translation models thus generally struggle with tasks that involve
+scientific understanding or technical jargon. Their performance is even worse
+for low-resource Indian languages. Finding a translation dataset that caters to
+these domains in particular poses a difficult challenge. In this
+paper, we address this by creating a multilingual parallel corpus containing
+more than 2.8 million rows of English-to-Indic and Indic-to-Indic high-quality
+translation pairs across 8 Indian languages. We achieve this by bitext mining
+human-translated transcriptions of NPTEL video lectures. We also finetune and
+evaluate NMT models using this corpus and surpass all other publicly available
+models at in-domain tasks. We also demonstrate the potential for generalizing
+to out-of-domain translation tasks by improving the baseline by over 2 BLEU on
+average for these Indian languages on the Flores+ benchmark. We are pleased to
+release our model and dataset via this link: https://huggingface.co/SPRINGLab.
+
+
+
+
+
+
+
+ ☆ Improvement in Sign Language Translation Using Text CTC Alignment
+
+
+ Current sign language translation (SLT) approaches often rely on gloss-based
+supervision with Connectionist Temporal Classification (CTC), limiting their
+ability to handle non-monotonic alignments between sign language video and
+spoken text. In this work, we propose a novel method combining joint
+CTC/Attention and transfer learning. The joint CTC/Attention introduces
+hierarchical encoding and integrates CTC with the attention mechanism during
+decoding, effectively managing both monotonic and non-monotonic alignments.
+Meanwhile, transfer learning helps bridge the modality gap between vision and
+language in SLT. Experimental results on two widely adopted benchmarks,
+RWTH-PHOENIX-Weather 2014T and CSL-Daily, show that our method achieves
+results comparable to the state of the art and outperforms the pure-attention
+baseline. Additionally, this work opens a new door for future research into
+gloss-free SLT using text-based CTC alignment.
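+
+ The joint CTC/attention objective referred to above is conventionally a
+weighted sum of the two losses; the sketch below uses standard PyTorch losses
+with an assumed weight lam and is not the authors' training code.
+
+# Standard joint CTC/attention training objective (sketch; the weight lam is an assumption).
+import torch.nn.functional as F
+
+def joint_ctc_attention_loss(ctc_log_probs, ctc_targets, input_lens, target_lens,
+                             att_logits, att_targets, lam=0.3):
+    # CTC branch: monotonic alignment between encoder frames and text tokens
+    ctc = F.ctc_loss(ctc_log_probs, ctc_targets, input_lens, target_lens, blank=0)
+    # Attention branch: autoregressive cross-entropy over decoder outputs
+    att = F.cross_entropy(att_logits.transpose(1, 2), att_targets, ignore_index=-100)
+    return lam * ctc + (1.0 - lam) * att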
+
+
+
+
+
+
+
+ ☆ What Makes Cryptic Crosswords Challenging for LLMs? COLING 2025
+
+
+ Cryptic crosswords are puzzles that rely on general knowledge and the
+solver's ability to manipulate language on different levels, dealing with
+various types of wordplay. Previous research suggests that solving such puzzles
+is challenging even for modern NLP models, including Large Language Models
+(LLMs). However, there is little to no research on the reasons for their poor
+performance on this task. In this paper, we establish the benchmark results for
+three popular LLMs: Gemma2, LLaMA3 and ChatGPT, showing that their performance
+on this task is still significantly below that of humans. We also investigate
+why these models struggle to achieve superior performance. We release our code
+and introduced datasets at
+https://github.com/bodasadallah/decrypting-crosswords.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ☆ Assessing the Robustness of Retrieval-Augmented Generation Systems in
+ K-12 Educational Question Answering with Knowledge Discrepancies
+
+
+ Retrieval-Augmented Generation (RAG) systems have demonstrated remarkable
+potential as question answering systems in the K-12 Education domain, where
+knowledge is typically queried within the restricted scope of authoritative
+textbooks. However, the discrepancy between textbooks and the parametric
+knowledge in Large Language Models (LLMs) could undermine the effectiveness of
+RAG systems. To systematically investigate the robustness of RAG systems under
+such knowledge discrepancies, we present EduKDQA, a question answering dataset
+that simulates knowledge discrepancies in real applications by applying
+hypothetical knowledge updates in answers and source documents. EduKDQA
+includes 3,005 questions covering five subjects, under a comprehensive question
+typology from the perspective of context utilization and knowledge integration.
+We conducted extensive experiments on retrieval and question answering
+performance. We find that most RAG systems suffer from a substantial
+performance drop in question answering with knowledge discrepancies, while
+questions that require integration of contextual knowledge and parametric
+knowledge pose a challenge to LLMs.
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World
+ Scenarios
+
+
+
+
+
+
+
+
+ Ruiwen Zhou, Wenyue Hua, Liangming Pan, Sitao Cheng, Xiaobao Wu, En Yu, William Yang Wang
+
+
+ This paper introduces RuleArena, a novel and challenging benchmark designed
+to evaluate the ability of large language models (LLMs) to follow complex,
+real-world rules in reasoning. Covering three practical domains -- airline
+baggage fees, NBA transactions, and tax regulations -- RuleArena assesses LLMs'
+proficiency in handling intricate natural language instructions that demand
+long-context understanding, logical reasoning, and accurate mathematical
+computation. Two key attributes distinguish RuleArena from traditional
+rule-based reasoning benchmarks: (1) it extends beyond standard first-order
+logic representations, and (2) it is grounded in authentic, practical
+scenarios, providing insights into the suitability and reliability of LLMs for
+real-world applications. Our findings reveal several notable limitations in
+LLMs: (1) they struggle to identify and apply the appropriate rules, frequently
+becoming confused by similar but distinct regulations, (2) they cannot
+consistently perform accurate mathematical computations, even when they
+correctly identify the relevant rules, and (3) in general, they perform poorly
+on the benchmark. These results highlight significant challenges in advancing
+LLMs' rule-guided reasoning capabilities in real-life applications.
+
+
+
+ comment: Data and Codes are available at
+ https://github.com/skyriver-2000/RuleArena
+
+
+
+
+
+
+ ☆ Reasoning-Aware Query-Focused Summarization over Multi-Table Data
+
+
+ Query-focused summarization over multi-table data is a challenging yet
+critical task for extracting precise and relevant information from structured
+data. Existing methods often rely on complex preprocessing steps and struggle
+to generalize across domains or handle the logical reasoning required for
+multi-table queries. In this paper, we propose QueryTableSummarizer++, an
+end-to-end generative framework leveraging large language models (LLMs)
+enhanced with table-aware pre-training, query-aligned fine-tuning, and
+reinforcement learning with feedback. Our method eliminates the need for
+intermediate serialization steps and directly generates query-relevant
+summaries. Experiments on a benchmark dataset demonstrate that
+QueryTableSummarizer++ significantly outperforms state-of-the-art baselines in
+terms of BLEU, ROUGE, and F1-score. Additional analyses highlight its
+scalability, generalization across domains, and robust handling of complex
+queries. Human evaluation further validates the superior quality and practical
+applicability of the generated summaries, establishing QueryTableSummarizer++
+as a highly effective solution for multi-table summarization tasks.
+
+
+ Cross-lingual in-context learning (XICL) has emerged as a transformative
+paradigm for leveraging large language models (LLMs) to tackle multilingual
+tasks, especially for low-resource languages. However, existing approaches
+often rely on external retrievers or task-specific fine-tuning, limiting their
+scalability and generalizability. In this paper, we propose a novel
+self-supervised framework that harnesses the generative capabilities of LLMs to
+internally select and utilize task-relevant examples. Our method introduces two
+key objectives: a retrieval-generation alignment loss to optimize the quality
+of selected examples and a semantic coherence loss to ensure cross-lingual
+consistency. Through extensive experiments on multilingual benchmarks, our
+approach achieves state-of-the-art performance, significantly outperforming
+existing baselines. Further analysis highlights its robustness across diverse
+language families and its ability to generalize to unseen tasks. Human
+evaluations confirm the superior fluency, relevance, and semantic correctness
+of outputs generated by our method. This work provides a scalable, effective,
+and generalizable solution for cross-lingual in-context learning.
+
+
+
+
+
+
+
+ ☆ Mojito: Motion Trajectory and Intensity Control for Video Generation
+
+
+ Recent advancements in diffusion models have shown great promise in producing
+high-quality video content. However, efficiently training diffusion models
+capable of integrating directional guidance and controllable motion intensity
+remains a challenging and under-explored area. This paper introduces Mojito, a
+diffusion model that incorporates both \textbf{Mo}tion tra\textbf{j}ectory and
+\textbf{i}ntensi\textbf{t}y contr\textbf{o}l for text-to-video generation.
+Specifically, Mojito features a Directional Motion Control module that
+leverages cross-attention to efficiently direct the generated object's motion
+without additional training, alongside a Motion Intensity Modulator that uses
+optical flow maps generated from videos to guide varying levels of motion
+intensity. Extensive experiments demonstrate Mojito's effectiveness in
+achieving precise trajectory and intensity control with high computational
+efficiency, generating motion patterns that closely match specified directions
+and intensities, providing realistic dynamics that align well with natural
+motion in real-world scenarios.
+
+
+ Recently, LoRA has emerged as a crucial technique for fine-tuning large
+pre-trained models, yet its performance in multi-task learning scenarios often
+falls short. In contrast, the MoE architecture presents a natural solution to
+this issue. However, it introduces challenges such as mutual interference of
+data across multiple domains and knowledge forgetting of various tasks.
+Additionally, MoE significantly increases the number of parameters, posing a
+computational cost challenge. Therefore, in this paper, we propose MoSLD, a
+mixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these
+challenges by sharing the upper projection matrix in LoRA among different
+experts, encouraging the model to learn general knowledge across tasks, while
+still allowing the lower projection matrix to focus on the unique features of
+each task. The application of dropout alleviates the imbalanced update of
+parameter matrix and mitigates parameter overfitting in LoRA. Extensive
+experiments demonstrate that our model exhibits excellent performance in both
+single-task and multi-task scenarios, with robust out-of-domain generalization
+capabilities.
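+
+ To make the sharing idea concrete, a minimal mixture-of-shared-LoRAs layer
+might share one projection matrix across experts while keeping the other
+per-expert, as sketched below; which matrix is shared, the router, and the
+dropout placement are illustrative assumptions rather than the MoSLD design.
+
+# Sketch of a mixture-of-shared-LoRAs linear layer (illustrative, not the MoSLD code).
+import torch
+import torch.nn as nn
+
+class SharedLoRAMoE(nn.Module):
+    def __init__(self, d_in, d_out, rank=8, num_experts=4, p_drop=0.1):
+        super().__init__()
+        self.base = nn.Linear(d_in, d_out, bias=False)            # frozen pretrained weight
+        self.base.weight.requires_grad_(False)
+        self.shared_up = nn.Parameter(torch.zeros(d_out, rank))   # shared across experts
+        self.down = nn.ParameterList(
+            [nn.Parameter(torch.randn(rank, d_in) * 0.01) for _ in range(num_experts)]
+        )                                                          # expert-specific matrices
+        self.router = nn.Linear(d_in, num_experts)
+        self.dropout = nn.Dropout(p_drop)
+
+    def forward(self, x):
+        gates = torch.softmax(self.router(x), dim=-1)              # (..., num_experts)
+        lora_out = 0.0
+        for e, down in enumerate(self.down):
+            h = self.dropout(x @ down.t())                         # low-rank down-projection
+            lora_out = lora_out + gates[..., e:e + 1] * (h @ self.shared_up.t())
+        return self.base(x) + lora_out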
+
+
+
+ comment: Accept by COLING 2025
+
+
+
+
+
+
+ ☆ Multi-Scale Heterogeneous Text-Attributed Graph Datasets From Diverse
+ Domains
+
+
+ Heterogeneous Text-Attributed Graphs (HTAGs), where different types of
+entities are not only associated with texts but also connected by diverse
+relationships, have gained widespread popularity and application across various
+domains. However, current research on text-attributed graph learning
+predominantly focuses on homogeneous graphs, which feature a single node and
+edge type, thus leaving a gap in understanding how methods perform on HTAGs.
+One crucial reason is the lack of comprehensive HTAG datasets that offer
+original textual content and span multiple domains of varying sizes. To this
+end, we introduce a collection of challenging and diverse benchmark datasets
+for realistic and reproducible evaluation of machine learning models on HTAGs.
+Our HTAG datasets are multi-scale, span years in duration, and cover a wide
+range of domains, including movie, community question answering, academic,
+literature, and patent networks. We further conduct benchmark experiments on
+these datasets with various graph neural networks. All source data, dataset
+construction codes, processed HTAGs, data loaders, benchmark codes, and
+evaluation setup are publicly available at GitHub and Hugging Face.
+
+
+
+
+
+
+
+ ☆ From Text to Trajectory: Exploring Complex Constraint Representation and
+ Decomposition in Safe Reinforcement Learning NeurIPS 2024
+
+
+ Safe reinforcement learning (RL) requires the agent to finish a given task
+while obeying specific constraints. Giving constraints in natural language form
+has great potential for practical scenarios due to its flexible transfer
+capability and accessibility. Previous safe RL methods with natural language
+constraints typically need to design cost functions manually for each
+constraint, which requires domain expertise and lacks flexibility. In this
+paper, we harness the dual role of text in this task, using it not only to
+provide constraints but also as a training signal. We introduce the
+Trajectory-level Textual Constraints Translator (TTCT) to replace the manually
+designed cost function. Our empirical results demonstrate that TTCT effectively
+comprehends textual constraint and trajectory, and the policies trained by TTCT
+can achieve a lower violation rate than the standard cost function. Extra
+studies are conducted to demonstrate that the TTCT has zero-shot transfer
+capability to adapt to constraint-shift environments.
+
+
+
+ comment: Accepted by NeurIPS 2024
+
+
+
+
+
+
+ ☆ Phi-4 Technical Report
+
+
+
+
+
+
+
+
+ Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, Yi Zhang
+
+
+ We present phi-4, a 14-billion parameter language model developed with a
+training recipe that is centrally focused on data quality. Unlike most language
+models, where pre-training is based primarily on organic data sources such as
+web content or code, phi-4 strategically incorporates synthetic data throughout
+the training process. While previous models in the Phi family largely distill
+the capabilities of a teacher model (specifically GPT-4), phi-4 substantially
+surpasses its teacher model on STEM-focused QA capabilities, giving evidence
+that our data-generation and post-training techniques go beyond distillation.
+Despite minimal changes to the phi-3 architecture, phi-4 achieves strong
+performance relative to its size -- especially on reasoning-focused benchmarks
+-- due to improved data, training curriculum, and innovations in the
+post-training scheme.
+
+
+
+
+
+
+
+ ☆ AI-assisted Knowledge Discovery in Biomedical Literature to Support
+ Decision-making in Precision Oncology
+
+
+ The delivery of appropriate targeted therapies to cancer patients requires
+the complete analysis of the molecular profiling of tumors and the patient's
+clinical characteristics in the context of existing knowledge and recent
+findings described in biomedical literature and several other sources. We
+evaluated the potential contributions of specific natural language processing
+solutions to support knowledge discovery from biomedical literature. Two models
+from the Bidirectional Encoder Representations from Transformers (BERT) family,
+two Large Language Models, and PubTator 3.0 were tested for their ability to
+support the named entity recognition (NER) and the relation extraction (RE)
+tasks. PubTator 3.0 and the BioBERT model performed best in the NER task (best
+F1-score equal to 0.93 and 0.89, respectively), while BioBERT outperformed all
+other solutions in the RE task (best F1-score 0.79) and in a specific use case
+to which it was applied, recognizing nearly all entity mentions and most of the
+relations.
+
+
+
+ comment: Accepted at AMIA Annual Symposium 2024
+
+
+
+
+
+
+ ☆ A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning
+ Instructions
+
+
+ Synthesizing high-quality reasoning data for continual training has been
+proven to be effective in enhancing the performance of Large Language Models
+(LLMs). However, previous synthetic approaches struggle to easily scale up data
+and incur high costs in the pursuit of high quality. In this paper, we propose
+the Graph-based Synthetic Data Pipeline (GSDP), an economical and scalable
+framework for high-quality reasoning data synthesis. Inspired by knowledge
+graphs, we extracted knowledge points from seed data and constructed a
+knowledge point relationships graph to explore their interconnections. By
+exploring the implicit relationships among knowledge, our method achieves
+$\times$255 data expansion. Furthermore, GSDP, led by open-source models,
+achieves synthesis quality comparable to GPT-4-0613 while maintaining
+$\times$100 lower costs. To tackle the most challenging mathematical reasoning
+task, we present the GSDP-MATH dataset comprising over 1.91 million pairs of
+math problems and answers. After fine-tuning on GSDP-MATH, GSDP-7B based on
+Mistral-7B achieves 37.7% accuracy on MATH and 78.4% on GSM8K, demonstrating
+the effectiveness of our method. The dataset and models trained in this paper
+will be available.
+
+
+
+
+
+
+
+ ☆ Exploring Large Language Models on Cross-Cultural Values in Connection
+ with Training Methodology
+
+
+ Large language models (LLMs) closely interact with humans, and thus need an
+intimate understanding of the cultural values of human society. In this paper,
+we explore how open-source LLMs make judgments on diverse categories of
+cultural values across countries, and its relation to training methodology such
+as model sizes, training corpus, alignment, etc. Our analysis shows that LLMs
+can judge socio-cultural norms similar to humans but less so on social systems
+and progress. In addition, LLMs tend to judge cultural values biased toward
+Western culture, which can be improved with training on the multilingual
+corpus. We also find that increasing model size helps models better understand
+social values, but smaller models can be enhanced by using synthetic data. Our
+analysis reveals valuable insights into the design methodology of LLMs in
+connection with their understanding of cultural values.
+
+
+
+
+
+
+
+ ♻ ☆ Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large
+ Language Models Reasoning
+
+
+
+
+
+
+
+
+ Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, Linda Ruth Petzold
+
+
+ Instruction Fine-Tuning (IFT) significantly enhances the zero-shot
+capabilities of pretrained Large Language Models (LLMs). While coding data is
+known to boost LLM reasoning abilities during pretraining, its role in
+activating internal reasoning capacities during IFT remains understudied. This
+paper investigates a key question: How does coding data impact LLMs' reasoning
+capacities during the IFT stage? To explore this, we thoroughly examine the impact
+of coding data across different coding data proportions, model families, sizes,
+and reasoning domains, from various perspectives. Specifically, we create three
+IFT datasets with increasing coding data proportions, fine-tune six LLM
+backbones across different families and scales on these datasets, evaluate the
+tuned models' performance across twelve tasks in three reasoning domains, and
+analyze the outcomes from three broad-to-granular perspectives: overall,
+domain-level, and task-specific. Our holistic analysis provides valuable
+insights into each perspective. First, coding data tuning enhances the overall
+reasoning capabilities of LLMs across different model families and scales.
+Moreover, while the impact of coding data varies by domain, it shows consistent
+trends within each domain across different model families and scales.
+Additionally, coding data generally provides comparable task-specific benefits
+across model families, with optimal proportions in IFT datasets being
+task-dependent.
+
+
+
+
+
+
+
+
+ Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer
+
+
+ This paper focuses on creating synthetic data to improve the quality of image
+captions. Existing works typically have two shortcomings. First, they caption
+images from scratch, ignoring existing alt-text metadata; second, they lack
+transparency when the captioners' training data (e.g., from GPT) is unknown. In this
+paper, we study a principled approach Altogether based on the key idea to edit
+and re-align existing alt-texts associated with the images. To generate
+training data, we perform human annotation where annotators start with the
+existing alt-text and re-align it to the image content in multiple rounds,
+consequently constructing captions with rich visual concepts. This differs from
+prior work that carries out human annotation as a one-time description task
+solely based on images and annotator knowledge. We train a captioner on this
+data that generalizes the process of re-aligning alt-texts at scale. Our
+results show our Altogether approach leads to richer image captions that also
+improve text-to-image generation and zero-shot image classification tasks.
+
+
+
+ comment: accepted by EMNLP 2024; Meta CLIP 1.2 Data Engine
+
+
+
+
+
+
+ ♻ ☆ LCFO: Long Context and Long Form Output Dataset and Benchmarking
+
+
+
+
+
+
+
+
+ Marta R. Costa-jussà, Pierre Andrews, Mariano Coria Meglioli, Joy Chen, Joe Chuang, David Dale, Christophe Ropers, Alexandre Mourachko, Eduardo Sánchez, Holger Schwenk, Tuan Tran, Arina Turkatenko, Carleigh Wood
+
+
+ This paper presents the Long Context and Form Output (LCFO) benchmark, a
+novel evaluation framework for assessing gradual summarization and summary
+expansion capabilities across diverse domains. LCFO consists of long input
+documents (5k words average length), each of which comes with three summaries
+of different lengths (20%, 10%, and 5% of the input text), as well as
+approximately 15 questions and answers (QA) related to the input content.
+Notably, LCFO also provides alignments between specific QA pairs and
+corresponding summaries in 7 domains. The primary motivation behind providing
+summaries of different lengths is to establish a controllable framework for
+generating long texts from shorter inputs, i.e. summary expansion. To establish
+an evaluation metric framework for summarization and summary expansion, we
+provide human evaluation scores for human-generated outputs, as well as results
+from various state-of-the-art large language models (LLMs). GPT-4o-mini
+achieves the best human-evaluation scores among automatic systems in both
+summarization and summary expansion tasks (~ +10% and +20%, respectively). It
+even surpasses human output quality for short summaries (~ +7%). Overall automatic
+metrics achieve low correlations with human evaluation scores (~ 0.4) but
+moderate correlation on specific evaluation aspects such as fluency and
+attribution (~ 0.6). The LCFO benchmark offers a standardized platform for
+evaluating summarization and summary expansion performance, as well as
+corresponding automatic metrics, thereby providing an important evaluation
+framework to advance generative AI.
+
+
+
+
+
+
+
+ ♻ ☆ Improving the Validity of Automatically Generated Feedback via
+ Reinforcement Learning
+
+
+
+
+
+
+
+
+ Alexander Scarlatos, Digory Smith, Simon Woodhead, Andrew Lan
+
+
+ Automatically generating feedback via large language models (LLMs) in
+intelligent tutoring systems and online learning platforms has the potential to
+improve the learning outcomes of many students. However, both feedback
+generation and evaluation are challenging: feedback content has to be valid
+especially in subjects like math, which requires models to understand the
+problem, the solution, and where the student's error lies. Feedback also has to
+be pedagogically valid to reflect effective tutoring strategies, such as
+explaining possible misconceptions and encouraging the student, among other
+desirable features. In this work, we address both problems of automatically
+generating and evaluating feedback while considering both correctness and
+alignment. First, we propose a rubric for evaluating math feedback and show
+that GPT-4 is able to effectively use it to annotate human-written and
+LLM-generated feedback. Second, we propose a framework for feedback generation
+that optimizes both correctness and alignment using reinforcement learning
+(RL). Specifically, we use GPT-4's annotations to create preferences over
+feedback pairs in an augmented dataset for training via direct preference
+optimization (DPO). We show that our methods significantly increase the
+correctness and alignment of generated feedback with Llama 2, an open-source
+LLM, qualitatively analyze our generation and evaluation systems using case
+studies, and outline several areas for future work.
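+
+ For context, the DPO objective used in such preference-based pipelines is the
+standard log-sigmoid margin over policy and reference log-probabilities; the
+sketch below shows that generic loss, with beta as an assumed hyperparameter,
+and is not the authors' specific training setup.
+
+# Standard DPO loss over (chosen, rejected) feedback pairs (sketch; beta is an assumption).
+import torch.nn.functional as F
+
+def dpo_loss(policy_chosen_logp, policy_rejected_logp,
+             ref_chosen_logp, ref_rejected_logp, beta=0.1):
+    # Margin between how strongly the policy prefers the chosen feedback vs. the reference
+    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
+                     - (policy_rejected_logp - ref_rejected_logp))
+    return -F.logsigmoid(logits).mean()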
+
+
+
+ comment: Best student paper award, Published in AIED 2024: The 25th
+ International Conference on Artificial Intelligence in Education
+
+
+
+
+
+
+ ♻ ☆ Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams
+
+
+
+
+
+
+
+
+ Adriana Caraeni, Alexander Scarlatos, Andrew Lan
+
+
+ Recent advances in generative artificial intelligence (AI) have shown promise
+in accurately grading open-ended student responses. However, few prior works
+have explored grading handwritten responses due to a lack of data and the
+challenge of combining visual and textual information. In this work, we
+leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to
+automatically grade handwritten responses to college-level math exams. Using
+real student responses to questions in a probability theory exam, we evaluate
+GPT-4o's alignment with ground-truth scores from human graders using various
+prompting techniques. We find that while providing rubrics improves alignment,
+the model's overall accuracy is still too low for real-world settings, showing
+there is significant room for growth in this task.
+
+
+
+ comment: Published in LAK 2025: The 15th International Learning Analytics and
+ Knowledge Conference
+
+
+
+
+
+
+ ♻ ☆ SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with
+ Customisable Fairness Calibration COLING 2025
+
+
+ The development of unbiased large language models is widely recognized as
+crucial, yet existing benchmarks fall short in detecting biases due to limited
+scope, contamination, and lack of a fairness baseline. SAGED(bias) is the first
+holistic benchmarking pipeline to address these problems. The pipeline
+encompasses five core stages: scraping materials, assembling benchmarks,
+generating responses, extracting numeric features, and diagnosing with
+disparity metrics. SAGED includes metrics for max disparity, such as impact
+ratio, and bias concentration, such as Max Z-scores. Noticing that metric tool
+bias and contextual bias in prompts can distort evaluation, SAGED implements
+counterfactual branching and baseline calibration for mitigation. For
+demonstration, we use SAGED on G20 Countries with popular 8b-level models
+including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we
+find that while Mistral and Qwen2 show lower max disparity and higher bias
+concentration than Gemma2 and Llama3.1, all models are notably biased against
+countries like Russia and (except for Qwen2) China. In further experiments
+where models role-play U.S. presidents, we see bias amplify and shift in
+heterogeneous directions. Moreover, Qwen2 and Mistral do not engage in
+role-playing, while Llama3.1 and Gemma2 role-play Trump notably more
+intensively than Biden and Harris, indicating role-playing performance bias in
+these models.
+
+
+
+ comment: COLING 2025 Main Conference
+
+
+
+
+
+
+ ♻ ☆ Few-Shot Domain Adaptation for Named-Entity Recognition via Joint
+ Constrained k-Means and Subspace Selection COLING 2025
+
+
+ Named-entity recognition (NER) is a task that typically requires large
+annotated datasets, which limits its applicability across domains with varying
+entity definitions. This paper addresses few-shot NER, aiming to transfer
+knowledge to new domains with minimal supervision. Unlike previous approaches
+that rely solely on limited annotated data, we propose a weakly supervised
+algorithm that combines small labeled datasets with large amounts of unlabeled
+data. Our method extends the k-means algorithm with label supervision, cluster
+size constraints and domain-specific discriminative subspace selection. This
+unified framework achieves state-of-the-art results in few-shot NER on several
+English datasets.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Dspy-based Neural-Symbolic Pipeline to Enhance Spatial Reasoning in LLMs
+
+
+ Large Language Models (LLMs) have demonstrated remarkable capabilities across
+various tasks, yet they often struggle with spatial reasoning. This paper
+presents a novel neural-symbolic framework that enhances LLMs' spatial
+reasoning abilities through iterative feedback between LLMs and Answer Set
+Programming (ASP). We evaluate our approach on two benchmark datasets: StepGame
+and SparQA, implementing three distinct strategies: (1) direct prompting
+baseline, (2) Facts+Rules prompting, and (3) DSPy-based LLM+ASP pipeline with
+iterative refinement. Our experimental results demonstrate that the LLM+ASP
+pipeline significantly outperforms baseline methods, achieving an average 82%
+accuracy on StepGame and 69% on SparQA, marking improvements of 40-50% and
+8-15% respectively over direct prompting. The success stems from three key
+innovations: (1) effective separation of semantic parsing and logical reasoning
+through a modular pipeline, (2) iterative feedback mechanism between LLMs and
+ASP solvers that improves program rate, and (3) robust error handling that
+addresses parsing, grounding, and solving failures. Additionally, we propose
+Facts+Rules as a lightweight alternative that achieves comparable performance
+on the complex SparQA dataset, while reducing computational overhead. Our analysis
+across different LLM architectures (Deepseek, Llama3-70B, GPT-4.0 mini)
+demonstrates the framework's generalizability and provides insights into the
+trade-offs between implementation complexity and reasoning capability,
+contributing to the development of more interpretable and reliable AI systems.
+
+
+
+
+
+
+
+ ♻ ☆ EVQAScore: Efficient Video Question Answering Data Evaluation
+
+
+ Video question-answering (QA) is a core task in video understanding.
+Evaluating the quality of video QA and video caption data quality for training
+video large language models (VideoLLMs) is an essential challenge. Although
+various methods have been proposed for assessing video caption quality, there
+remains a lack of dedicated evaluation methods for Video QA. To address this
+gap, we introduce EVQAScore, a reference-free method that leverages keyword
+extraction to assess both video caption and video QA data quality.
+Additionally, we incorporate frame sampling and rescaling techniques to enhance
+the efficiency and robustness of our evaluation; this enables our score to
+assess the quality of extremely long videos. Our approach achieves
+state-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for
+Spearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on
+the VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using
+EVQAScore for data selection, we achieved SOTA results with only 12.5\% of the
+original data volume, outperforming the previous SOTA method PAC-S trained on
+100\% of the data.
+
+
+
+
+
+
+
+ ♻ ☆ Detection of Non-recorded Word Senses in English and Swedish
+
+
+
+
+
+
+
+
+ Jonathan Lautenschlager, Emma Sköldberg, Simon Hengchen, Dominik Schlechtweg
+
+
+ This study addresses the task of Unknown Sense Detection in English and
+Swedish. The primary objective of this task is to determine whether the meaning
+of a particular word usage is documented in a dictionary or not. For this
+purpose, sense entries are compared with word usages from modern and historical
+corpora using a pre-trained Word-in-Context embedder that allows us to model
+this task in a few-shot scenario. Additionally, we use human annotations on the
+target corpora to adapt hyperparameters and evaluate our models using 5-fold
+cross-validation. Compared to a random sample from a corpus, our model is able
+to considerably increase the detected number of word usages with non-recorded
+senses.
+
+
+
+ comment: 9 pages
+
+
+
+
+
+
+ ♻ ☆ Importance Weighting Can Help Large Language Models Self-Improve
+
+
+ Large language models (LLMs) have shown remarkable capability in numerous
+tasks and applications. However, fine-tuning LLMs using high-quality datasets
+under external supervision remains prohibitively expensive. In response, LLM
+self-improvement approaches have been vibrantly developed recently. The typical
+paradigm of LLM self-improvement involves training LLM on self-generated data,
+part of which may be detrimental and should be filtered out due to the unstable
+data quality. While current works primarily employ filtering strategies based
+on answer correctness, in this paper we demonstrate that filtering out
+correct samples with a high distribution shift extent (DSE) could also benefit the
+results of self-improvement. Given that the actual sample distribution is
+usually inaccessible, we propose a new metric called DS weight to approximate
+DSE, inspired by the Importance Weighting methods. Consequently, we integrate
+DS weight with self-consistency to comprehensively filter the self-generated
+samples and fine-tune the language model. Experiments show that with only a
+tiny valid set (up to 5\% size of the training set) to compute DS weight, our
+approach can notably promote the reasoning ability of current LLM
+self-improvement methods. The resulting performance is on par with methods that
+rely on external supervision from pre-trained reward models.
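+
+ The filtering step can be pictured as keeping only those self-generated
+samples whose estimated shift from a small validation set is low; the sketch
+below uses a Mahalanobis-distance proxy as a stand-in for the paper's DS
+weight, so the scoring rule and threshold are assumptions for illustration.
+
+# Illustrative shift-based filter for self-generated samples (not the paper's DS weight).
+import numpy as np
+
+def filter_by_shift(sample_feats, valid_feats, keep_quantile=0.8):
+    """Score samples by squared Mahalanobis distance to the valid set; keep the closest."""
+    mu = valid_feats.mean(axis=0)
+    cov = np.cov(valid_feats, rowvar=False) + 1e-6 * np.eye(valid_feats.shape[1])
+    inv = np.linalg.inv(cov)
+    d = np.array([(f - mu) @ inv @ (f - mu) for f in sample_feats])
+    return d <= np.quantile(d, keep_quantile)   # boolean mask over self-generated samples
+
+# usage: mask = filter_by_shift(gen_embeddings, valid_embeddings); fine-tune on gen[mask]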
+
+
+
+
+
+
+
+ ♻ ☆ Exploring Language Model Generalization in Low-Resource Extractive QA COLING 2025
+
+
+ In this paper, we investigate Extractive Question Answering (EQA) with Large
+Language Models (LLMs) under domain drift, i.e., can LLMs generalize to domains
+that require specific knowledge such as medicine and law in a zero-shot fashion
+without additional in-domain training? To this end, we devise a series of
+experiments to explain the performance gap empirically. Our findings suggest
+that: (a) LLMs struggle with dataset demands of closed domains such as
+retrieving long answer spans; (b) Certain LLMs, despite showing strong overall
+performance, display weaknesses in meeting basic requirements such as
+discriminating between domain-specific senses of words, which we link to
+pre-processing decisions; (c) Scaling model parameters is not always effective
+for cross-domain generalization; and (d) Closed-domain datasets differ
+substantially from open-domain EQA datasets and current LLMs struggle to deal with
+them. Our findings point out important directions for improving existing LLMs.
+
+
+ We study extractive question-answering in the medical domain (Medical-EQA).
+This problem has two main challenges: (i) domain specificity, as most AI models
+lack necessary domain knowledge, and (ii) extraction-based answering style,
+which restricts most autoregressive LLMs due to potential hallucinations. To
+handle those challenges, we propose TOP-Training, a target-oriented
+pre-training paradigm that stands out among all domain adaptation techniques
+with two desirable features: (i) TOP-Training moves one step further than
+popular domain-oriented fine-tuning since it not only moves closer to the
+target domain, but also familiarizes itself with the target dataset, and (ii)
+it does not assume the existence of a large set of unlabeled instances from the
+target domain. Specifically, for a target Medical-EQA dataset, we extract its
+entities and leverage large language models (LLMs) to generate synthetic texts
+containing those entities; we then demonstrate that pretraining on this
+synthetic text data yields better performance on the target Medical-EQA
+benchmarks. Overall, our contributions are threefold: (i) TOP-Training, a new
+pretraining technique to effectively adapt LLMs to better solve a target
+problem, (ii) TOP-Training has a wide application scope because it does not
+require the target problem to have a large set of unlabeled data, and (iii) our
+experiments highlight the limitations of autoregressive LLMs, emphasizing
+TOP-Training as a means to unlock the true potential of bidirectional LLMs.
+
+
+
+
+
+
+
+
+ Hui Ma, Bo Zhang, Bo Xu, Jian Wang, Hongfei Lin, Xiao Sun
+
+
+ Empathetic response generation, aiming to understand the user's situation and
+feelings and respond empathically, is crucial in building human-like dialogue
+systems. Traditional approaches typically employ maximum likelihood estimation
+as the optimization objective during training, yet fail to align the empathy
+levels between generated and target responses. To this end, we propose an
+empathetic response generation framework using reinforcement learning (EmpRL).
+The framework develops an effective empathy reward function and generates
+empathetic responses by maximizing the expected reward through reinforcement
+learning. EmpRL utilizes the pre-trained T5 model as the generator and further
+fine-tunes it to initialize the policy. To align the empathy levels between
+generated and target responses within a given context, an empathy reward
+function containing three empathy communication mechanisms -- emotional
+reaction, interpretation, and exploration -- is constructed using pre-designed
+and pre-trained empathy identifiers. During reinforcement learning training,
+the proximal policy optimization algorithm is used to fine-tune the policy,
+enabling the generation of empathetic responses. Both automatic and human
+evaluations demonstrate that the proposed EmpRL framework significantly
+improves the quality of generated responses, enhances the similarity in empathy
+levels between generated and target responses, and produces empathetic
+responses covering both affective and cognitive aspects.
+
+
+
+
+
+
+
+ ♻ ☆ How Likely Do LLMs with CoT Mimic Human Reasoning? COLING 2025
+
+
+ Chain-of-thought emerges as a promising technique for eliciting reasoning
+capabilities from Large Language Models (LLMs). However, it does not always
+improve task performance or accurately represent reasoning processes, leaving
+unresolved questions about its usage. In this paper, we diagnose the underlying
+mechanism by comparing the reasoning process of LLMs with humans, using causal
+analysis to understand the relationships between the problem instruction,
+reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often
+deviate from the ideal causal chain, resulting in spurious correlations and
+potential consistency errors (inconsistent reasoning and answers). We also
+examine various factors influencing the causal structure, finding that
+in-context learning with examples strengthens it, while post-training
+techniques like supervised fine-tuning and reinforcement learning from human
+feedback weaken it. To our surprise, the causal structure cannot be
+strengthened by enlarging the model size alone, urging research on new
+techniques. We hope that this preliminary study will shed light on
+understanding and improving the reasoning process in LLMs.
+
+
+
+ comment: COLING 2025 Camera Version (8 pages, 3 figures, 18 tables)
+
+
+
+
+
+
+ ♻ ☆ NLPineers@ NLU of Devanagari Script Languages 2025: Hate Speech
+ Detection using Ensembling of BERT-based models
+
+
+ This paper explores hate speech detection in Devanagari-scripted languages,
+focusing on Hindi and Nepali, for Subtask B of the CHIPSAL@COLING 2025 Shared
+Task. Using a range of transformer-based models such as XLM-RoBERTa, MURIL, and
+IndicBERT, we examine their effectiveness in navigating the nuanced boundary
+between hate speech and free expression. Our best performing model, implemented
+as ensemble of multilingual BERT models achieve Recall of 0.7762 (Rank 3/31 in
+terms of recall) and F1 score of 0.6914 (Rank 17/31). To address class
+imbalance, we used backtranslation for data augmentation, and cosine similarity
+to preserve label consistency after augmentation. This work emphasizes the need
+for hate speech detection in Devanagari-scripted languages and presents a
+foundation for further research.
+
+
+
+
+
+
+
+ ♻ ☆ ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity
+ within Large Language Models
+
+
+ Activation sparsity refers to the existence of considerable
+weakly-contributed elements among activation outputs. As a prevalent property
+of the models using the ReLU activation function, activation sparsity has been
+proven a promising paradigm to boost model inference efficiency. Nevertheless,
+most large language models (LLMs) adopt activation functions without intrinsic
+activation sparsity (e.g., GELU and Swish). Some recent efforts have explored
+introducing ReLU or its variants as the substitutive activation function to
+help LLMs achieve activation sparsity and inference acceleration, but few can
+simultaneously obtain high sparsity and comparable model performance. This
+paper introduces a simple and effective sparsification method named "ProSparse"
+to push LLMs for higher activation sparsity while maintaining comparable
+performance. Specifically, after substituting the activation function of LLMs
+with ReLU, ProSparse adopts progressive sparsity regularization with a factor
+smoothly increasing along the multi-stage sine curves. This can enhance
+activation sparsity and mitigate performance degradation by avoiding radical
+shifts in activation distributions. With ProSparse, we obtain high sparsity of
+89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size
+MiniCPM-1B, respectively, achieving comparable performance to their original
+Swish-activated versions. These present the most sparsely activated models
+among open-source LLaMA versions and competitive end-size models, considerably
+surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference
+acceleration experiments further demonstrate the significant practical
+acceleration potential of LLMs with higher activation sparsity, obtaining up to
+4.52$\times$ inference speedup.
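+
+ The progressive regularization described above can be pictured as an L1
+penalty whose coefficient rises smoothly within each training stage; the
+number of stages, coefficient values, and sine shape below are assumptions for
+illustration, not the exact ProSparse schedule.
+
+# Sketch of a multi-stage, smoothly increasing sparsity-regularization factor (illustrative).
+import math
+
+def sparsity_factor(step, stage_steps, stage_peaks):
+    """Return the L1 coefficient at `step`, rising along a sine curve within each stage."""
+    start, prev_peak = 0, 0.0
+    for steps, peak in zip(stage_steps, stage_peaks):
+        if step < start + steps:
+            t = (step - start) / steps                      # progress within this stage
+            return prev_peak + (peak - prev_peak) * math.sin(0.5 * math.pi * t)
+        start += steps
+        prev_peak = peak
+    return stage_peaks[-1]
+
+# e.g. add sparsity_factor(step, [1000, 2000], [1e-5, 5e-5]) * activations.abs().mean()
+# to the training loss after switching the model's activation function to ReLU.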
+
+
+
+ comment: 19 pages, 4 figures, 9 tables
+
+
+
+
+
+
+ ♻ ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission
+ of the Global South
+
+
+ Recent advances in generative AI have sparked renewed interest and expanded
+possibilities for music generation. However, the performance and versatility of
+these systems across musical genres are heavily influenced by the availability
+of training data. We conducted an extensive analysis of over one million hours
+of audio datasets used in AI music generation research and manually reviewed
+more than 200 papers from eleven prominent AI and music conferences and
+organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,
+NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and
+inclusion of the musical genres of the Global South in AI research. Our
+findings reveal a stark imbalance: approximately 86% of the total dataset hours
+and over 93% of researchers focus primarily on music from the Global North.
+Although around 40% of these datasets include some form of non-Western music,
+genres from the Global South account for only 14.6% of the data. Furthermore,
+approximately 51% of the papers surveyed concentrate on symbolic music
+generation, a method that often fails to capture the cultural nuances inherent
+in music from regions such as South Asia, the Middle East, and Africa. As AI
+increasingly shapes the creation and dissemination of music, the significant
+underrepresentation of music genres in datasets and research presents a serious
+threat to global musical diversity. We also propose some important steps to
+mitigate these risks and foster a more inclusive future for AI-driven music
+generation.
+
+
+
+ comment: Submitted to CACM, 12 pages, 2 figures
+
+
+
+
+
+
+ ♻ ☆ Large language models as oracles for instantiating ontologies with
+ domain-specific knowledge
+
+
+
+
+
+
+
+
+ Giovanni Ciatto, Andrea Agiollo, Matteo Magnini, Andrea Omicini
+
+
+ Background. Endowing intelligent systems with semantic data commonly requires
+designing and instantiating ontologies with domain-specific knowledge.
+Especially in the early phases, those activities are typically performed
+manually by human experts, possibly leveraging their own experience. The
+resulting process is therefore time-consuming, error-prone, and often biased by
+the personal background of the ontology designer. Objective. To mitigate that
+issue, we propose a novel domain-independent approach to automatically
+instantiate ontologies with domain-specific knowledge, by leveraging large
+language models (LLMs) as oracles. Method. Starting from (i) an initial schema
+composed of inter-related classes and properties and (ii) a set of query
+templates, our method queries the LLM multiple times, and generates instances
+for both classes and properties from its replies. Thus, the ontology is
+automatically filled with domain-specific knowledge, compliant to the initial
+schema. As a result, the ontology is quickly and automatically enriched with
+manifold instances, which experts may consider to keep, adjust, discard, or
+complement according to their own needs and expertise. Contribution. We
+formalise our method in a general way and instantiate it over various LLMs, as
+well as on a concrete case study. We report experiments rooted in the
+nutritional domain where an ontology of food meals and their ingredients is
+automatically instantiated from scratch, starting from a categorisation of
+meals and their relationships. There, we analyse the quality of the generated
+ontologies and compare ontologies attained by exploiting different LLMs.
+Experimentally, our approach achieves a quality metric that is up to five times
+higher than the state-of-the-art, while reducing erroneous entities and
+relations by up to ten times. Finally, we provide a SWOT analysis of the
+proposed method.
+
+
+
+
+
+
+
+ ♻ ☆ UniBias: Unveiling and Mitigating LLM Bias through Internal Attention
+ and FFN Manipulation NeurIPS 2024
+
+
+ Large language models (LLMs) have demonstrated impressive capabilities in
+various tasks using the in-context learning (ICL) paradigm. However, their
+effectiveness is often compromised by inherent bias, leading to prompt
+brittleness, i.e., sensitivity to design settings such as example selection,
+order, and prompt formatting. Previous studies have addressed LLM bias through
+external adjustment of model outputs, but the internal mechanisms that lead to
+such bias remain unexplored. Our work delves into these mechanisms,
+particularly investigating how feedforward neural networks (FFNs) and attention
+heads result in the bias of LLMs. By interpreting the contribution of
+individual FFN vectors and attention heads, we identify the biased LLM
+components that skew LLMs' prediction toward specific labels. To mitigate these
+biases, we introduce UniBias, an inference-only method that effectively
+identifies and eliminates biased FFN vectors and attention heads. Extensive
+experiments across 12 NLP datasets demonstrate that UniBias significantly
+enhances ICL performance and alleviates prompt brittleness of LLMs.
+
+
+
+ comment: Accepted to NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ TCM-FTP: Fine-Tuning Large Language Models for Herbal Prescription
+ Prediction
+
+
+
+
+
+
+
+
+ Xingzhi Zhou, Xin Dong, Chunhao Li, Yuning Bai, Yulong Xu, Ka Chun Cheung, Simon See, Xinpeng Song, Runshun Zhang, Xuezhong Zhou, Nevin L. Zhang
+
+
+ Traditional Chinese medicine (TCM) has relied on specific combinations of
+herbs in prescriptions to treat various symptoms and signs for thousands of
+years. Predicting TCM prescriptions poses a fascinating technical challenge
+with significant practical implications. However, this task faces limitations
+due to the scarcity of high-quality clinical datasets and the complex
+relationship between symptoms and herbs. To address these issues, we introduce
+\textit{DigestDS}, a novel dataset comprising practical medical records from
+experienced experts in digestive system diseases. We also propose a method,
+TCM-FTP (TCM Fine-Tuning Pre-trained), to leverage pre-trained large language
+models (LLMs) via supervised fine-tuning on \textit{DigestDS}. Additionally, we
+enhance computational efficiency using a low-rank adaptation technique.
+Moreover, TCM-FTP incorporates data augmentation by permuting herbs within
+prescriptions, exploiting their order-agnostic nature. Impressively, TCM-FTP
+achieves an F1-score of 0.8031, significantly outperforming previous methods.
+Furthermore, it demonstrates remarkable accuracy in dosage prediction,
+achieving a normalized mean square error of 0.0604. In contrast, LLMs without
+fine-tuning exhibit poor performance. Although LLMs have demonstrated
+wide-ranging capabilities, our work underscores the necessity of fine-tuning
+for TCM prescription prediction and presents an effective way to accomplish
+this.
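+
+ The order-agnostic augmentation mentioned above amounts to shuffling the herb
+list within each training prescription; a minimal sketch follows, where the
+record fields and number of copies are assumptions for illustration.
+
+# Sketch of permutation-based augmentation for order-agnostic prescriptions (illustrative).
+import random
+
+def augment_prescriptions(records, copies=3, seed=0):
+    """For each record {'symptoms': str, 'herbs': [str, ...]}, emit shuffled-herb variants."""
+    rng = random.Random(seed)
+    augmented = []
+    for rec in records:
+        for _ in range(copies):
+            herbs = list(rec["herbs"])
+            rng.shuffle(herbs)                 # herb order carries no prescriptive meaning
+            augmented.append({"symptoms": rec["symptoms"], "herbs": herbs})
+    return records + augmented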
+
+
+
+ comment: Camera-ready version to be published in BIBM 2024
+
+
+
+
+
+
+ ♻ ☆ TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
+
+
+ It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS
+works typically employ complex data processing pipelines to obtain high-quality
+training data. These sophisticated pipelines require excellent models at each
+stage (e.g., speech denoising, speech enhancement, speaker diarization, and
+punctuation models), which themselves demand high-quality training data and are
+rarely open-sourced. Even with state-of-the-art models, issues persist, such as
+incomplete background noise removal and misalignment between punctuation and
+actual speech pauses. Moreover, the stringent filtering strategies often retain
+only 10-30\% of the original data, significantly impeding data scaling efforts.
+In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to
+design a simplified yet effective TTS data processing pipeline that maintains
+data quality while substantially reducing data acquisition costs, achieving a
+data retention rate of over 50\%. Beyond data scaling challenges, LLM-based TTS
+systems also incur higher deployment costs compared to conventional approaches.
+Current systems typically use LLMs solely for text-to-token generation, while
+requiring separate models (e.g., flow matching models) for token-to-waveform
+generation, which cannot be directly executed by LLM inference engines, further
+complicating deployment. To address these challenges, we eliminate redundant
+modules in both LLM and flow components, replacing the flow model backbone with
+an LLM architecture. Building upon this simplified flow backbone, we propose a
+unified architecture for both streaming and non-streaming inference,
+significantly reducing deployment costs. Finally, we explore the feasibility of
+unifying TTS and ASR tasks using the same data for training, thanks to the
+simplified pipeline and the S3Tokenizer that reduces the quality requirements
+for TTS training data.
+
+
+
+ comment: Technical Report
+
+
+
+
+
+
+ ♻ ☆ ProSwitch: Knowledge-Guided Instruction Tuning to Switch Between
+ Professional and Non-Professional Responses
+
+
+ Large Language Models (LLMs) have demonstrated efficacy in various linguistic
+applications, including question answering and controlled text generation.
+However, studies into their ability to switch between opposite styles of
+responses in professional domains remain underexplored. This study introduces a
+novel approach, named ProSwitch, which enables a language model to switch
+between professional and non-professional answers, by tuning and evaluating
+through the guidance of domain and style knowledge. ProSwitch unfolds in three
+phases: LLM-augmented preparation to collect domain knowledge and QA pairs,
+instruction tuning to optimize LLMs with multiple levels of knowledge, and
+comprehensive evaluation to assess both style discrimination and
+reference-based quality of the generated text. Comparative analysis of
+ProSwitch against general and specialized LLMs reveals that our approach
+outperforms baselines in switching between professional and non-professional
+responses.
+
+
+
+
+
+
+
+
+ Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V. Le, Qijun Tan, Yuan Liu
+
+
+ Current RLHF frameworks for aligning large language models (LLMs) typically
+assume a fixed prompt distribution, which is sub-optimal and limits the
+scalability of alignment and generalizability of models. To address this, we
+introduce a general open-ended RLHF framework that casts alignment as an
+asymmetric game between two players: (i) a creator that generates increasingly
+informative prompt distributions using reward signals, and (ii) a solver that
+learns to produce more preferred responses on prompts produced by the creator.
+This framework of Evolving Alignment via Asymmetric Self-Play (eva) results in
+a simple and efficient approach that can utilize any existing RLHF algorithm
+for scalable alignment. eva outperforms state-of-the-art methods on widely used
+benchmarks without the need for any additional human-crafted prompts.
+Specifically, eva improves the win rate of Gemma-2-9B-it on Arena-Hard from
+51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7%
+with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and
+matching claude-3-opus. This improvement is persistent even when new human
+crafted prompts are introduced. Finally, we show eva is effective and robust
+under various ablation settings.
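+
+ As a rough illustration of the creator-solver loop described above (not the
+authors' code; every function below is a hypothetical placeholder), one
+iteration of such an asymmetric self-play scheme might look like this Python
+sketch:
+
+# Minimal sketch of an asymmetric creator/solver alignment loop.
+# All components below are hypothetical placeholders, not eva's actual code.
+import random
+
+def creator_evolve(prompts, informativeness, k=2):
+    """Keep the most informative prompts and mutate them into new ones."""
+    ranked = sorted(zip(prompts, informativeness), key=lambda x: -x[1])
+    seeds = [p for p, _ in ranked[:k]]
+    return seeds + [p + " (harder variant)" for p in seeds]  # toy mutation
+
+def solver_respond(prompt):
+    return f"response to: {prompt}"          # stand-in for LLM sampling
+
+def reward(prompt, response):
+    return random.random()                   # stand-in for a reward model
+
+def preference_update(pairs):
+    pass                                     # stand-in for a DPO/SPPO/SimPO step
+
+prompts = ["explain RLHF", "summarize a paper"]
+for step in range(3):
+    # Solver plays: sample two responses per prompt, build preference pairs.
+    pairs, gaps = [], []
+    for p in prompts:
+        r1, r2 = solver_respond(p), solver_respond(p)
+        s1, s2 = reward(p, r1), reward(p, r2)
+        chosen, rejected = (r1, r2) if s1 >= s2 else (r2, r1)
+        pairs.append((p, chosen, rejected))
+        gaps.append(abs(s1 - s2))            # proxy for how informative p is
+    preference_update(pairs)                 # solver update with any RLHF algo
+    prompts = creator_evolve(prompts, gaps)  # creator evolves the prompt set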
+
+
+ Entity matching (EM) is a critical step in entity resolution (ER). Recently,
+entity matching based on large language models (LLMs) has shown great promise.
+However, current LLM-based entity matching approaches typically follow a binary
+matching paradigm that ignores the global consistency among record
+relationships. In this paper, we investigate various methodologies for
+LLM-based entity matching that incorporate record interactions from different
+perspectives. Specifically, we comprehensively compare three representative
+strategies: matching, comparing, and selecting, and analyze their respective
+advantages and challenges in diverse scenarios. Based on our findings, we
+further design a compound entity matching framework (ComEM) that leverages the
+composition of multiple strategies and LLMs. ComEM combines the advantages of
+the different strategies and achieves improvements in both effectiveness and
+efficiency. Experimental results on 8 ER datasets and 10 LLMs verify the
+superiority of incorporating record interactions through the selecting
+strategy, as well as the further cost-effectiveness brought by ComEM.
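+
+ For readers unfamiliar with the three strategies, the toy Python sketch below
+shows how "matching", "comparing", and "selecting" prompts could be phrased;
+these are simplified illustrations, not the ComEM prompts.
+
+# Illustrative prompt builders for the three record-interaction strategies
+# discussed above (matching, comparing, selecting). Simplified sketches only.
+
+def matching_prompt(anchor, candidate):
+    return (f"Record A: {anchor}\nRecord B: {candidate}\n"
+            "Do A and B refer to the same real-world entity? Answer yes or no.")
+
+def comparing_prompt(anchor, cand1, cand2):
+    return (f"Record A: {anchor}\nRecord B1: {cand1}\nRecord B2: {cand2}\n"
+            "Which of B1 or B2 is more likely the same entity as A?")
+
+def selecting_prompt(anchor, candidates):
+    listing = "\n".join(f"({i}) {c}" for i, c in enumerate(candidates))
+    return (f"Record A: {anchor}\nCandidates:\n{listing}\n"
+            "Select the number of the candidate that matches A, "
+            "or answer 'none' if no candidate matches.")
+
+anchor = "iPhone 15 Pro, 256GB, black"
+cands = ["Apple iPhone 15 Pro 256 GB (Black)", "iPhone 15, 128GB, blue"]
+print(selecting_prompt(anchor, cands))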
+
+
+
+ comment: Accepted at COLING 2025. Our code is available at
+ https://github.com/tshu-w/ComEM
+
+
+
+
+
+
+ ♻ ☆ Understanding the RoPE Extensions of Long-Context LLMs: An Attention
+ Perspective
+
+
+
+
+
+
+
+
+ Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
+
+
+ Enabling LLMs to handle lengthy context is currently a research hotspot. Most
+LLMs are built upon rotary position embedding (RoPE), a popular position
+encoding method. Therefore, a prominent path is to extrapolate the RoPE trained
+on comparably short texts to far longer texts. Substantial effort has been
+dedicated to boosting extrapolation by extending the formulations of RoPE;
+however, few of these works have attempted to explain their inner workings
+comprehensively. In this paper, we offer a straightforward yet
+in-depth understanding of RoPE extensions from an attention perspective and on
+two benchmarking tasks. A broad array of experiments reveals several valuable
+findings: 1) Maintaining attention patterns to those at the pretrained length
+improves extrapolation; 2) Large attention uncertainty leads to retrieval
+errors; 3) Using longer continual pretraining lengths for RoPE extensions could
+reduce attention uncertainty and significantly enhance extrapolation.
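+
+ As background, the numpy sketch below implements plain rotary position
+embedding (using one common pairing convention); raising the base, as several
+RoPE extensions do, slows the rotation at long positions. It is illustrative
+only and not taken from the paper.
+
+import numpy as np
+
+def rope(x, positions, base=10000.0):
+    """Apply rotary position embedding to x of shape (seq_len, dim).
+    A larger `base` slows the rotation, which is one way RoPE extensions
+    stretch a model's usable context."""
+    seq_len, dim = x.shape
+    half = dim // 2
+    inv_freq = base ** (-np.arange(half) / half)        # (half,)
+    angles = positions[:, None] * inv_freq[None, :]     # (seq_len, half)
+    cos, sin = np.cos(angles), np.sin(angles)
+    x1, x2 = x[:, :half], x[:, half:]
+    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
+
+q = np.random.randn(8, 64)
+pos = np.arange(8, dtype=np.float64)
+q_short = rope(q, pos)                    # pretrained-length behavior
+q_long = rope(q, pos * 100, base=5e5)     # extrapolated positions, larger base
+print(q_short.shape, q_long.shape)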
+
+
+ Large language models (LLMs) have exhibited remarkable few-shot learning
+capabilities and unified the paradigm of NLP tasks through the in-context
+learning (ICL) technique. Despite the success of ICL, the quality of the
+exemplar demonstrations can significantly influence the LLM's performance.
+Existing exemplar selection methods mainly focus on the semantic similarity
+between queries and candidate exemplars. On the other hand, the logical
+connections between reasoning steps can be beneficial to depict the
+problem-solving process as well. In this paper, we propose a novel method
+named Reasoning Graph-enhanced Exemplar Retrieval (RGER). RGER first queries
+the LLM to generate an initial response, then converts the intermediate
+problem-solving steps into a graph structure. After that, it employs a graph
+kernel to select exemplars with both semantic and structural similarity.
+Extensive experiments demonstrate that the structural relationship helps align
+queries and candidate exemplars. The efficacy of RGER on math and logical
+reasoning tasks showcases its superiority over state-of-the-art
+retrieval-based approaches. Our
+code is released at https://github.com/Yukang-Lin/RGER.
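+
+ A toy sketch of the retrieval idea, assuming a Weisfeiler-Lehman-style
+relabeling as a stand-in for the graph kernel and bag-of-words overlap as a
+stand-in for semantic similarity (neither is the released RGER code):
+
+from collections import Counter
+import math
+
+def wl_features(adj_list, labels, iters=2):
+    """Weisfeiler-Lehman style multiset features for a small labeled graph."""
+    feats = Counter(labels.values())
+    for _ in range(iters):
+        labels = {n: hash((labels[n], tuple(sorted(labels[m] for m in adj_list[n]))))
+                  for n in adj_list}
+        feats.update(labels.values())
+    return feats
+
+def cosine(c1, c2):
+    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
+    norm = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
+    return dot / norm if norm else 0.0
+
+def score(query, exemplar, alpha=0.5):
+    sem = cosine(Counter(query["text"].split()), Counter(exemplar["text"].split()))
+    struct = cosine(wl_features(*query["graph"]), wl_features(*exemplar["graph"]))
+    return alpha * sem + (1 - alpha) * struct
+
+# Each item: bag-of-words text plus a tiny reasoning graph (adjacency, node labels).
+q = {"text": "add then multiply", "graph": ({0: [1], 1: [0, 2], 2: [1]},
+                                            {0: "add", 1: "mul", 2: "ans"})}
+e = {"text": "multiply after adding", "graph": ({0: [1], 1: [0]},
+                                                {0: "add", 1: "mul"})}
+print(round(score(q, e), 3))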
+
+
+
+
+
+
+
+ ♻ ☆ Training on the Test Task Confounds Evaluation and Emergence
+
+
+
+
+
+
+
+
+ Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt
+
+
+ We study a fundamental problem in the evaluation of large language models
+that we call training on the test task. Unlike wrongful practices like training
+on the test data, leakage, or data contamination, training on the test task is
+not a malpractice. Rather, the term describes a growing set of practices that
+utilize knowledge about evaluation tasks at training time. We demonstrate that
+training on the test task confounds both relative model evaluations and claims
+about emergent capabilities. We argue that the seeming superiority of one model
+family over another may be explained by a different degree of training on the
+test task. To this end, we propose an effective method to adjust for the effect
+of training on the test task on benchmark evaluations. Put simply, we fine-tune
+each model under comparison on the same task-relevant data before evaluation.
+We then show that instances of emergent behavior disappear gradually as models
+train on the test task. Our work promotes a new perspective on the evaluation
+of large language models with broad implications for benchmarking and the study
+of emergent capabilities.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Learning and Machine Learning, Advancing Big Data Analytics and
+ Management: Unveiling AI's Potential Through Tools, Techniques, and
+ Applications
+
+
+
+
+
+
+
+
+ Pohsun Feng, Ziqian Bi, Yizhu Wen, Xuanhe Pan, Benji Peng, Ming Liu, Jiawei Xu, Keyu Chen, Junyu Liu, Caitlyn Heqi Yin, Sen Zhang, Jinlang Wang, Qian Niu, Ming Li, Tianyang Wang
+
+
+ Artificial intelligence (AI), machine learning, and deep learning have become
+transformative forces in big data analytics and management, enabling
+groundbreaking advancements across diverse industries. This article delves into
+the foundational concepts and cutting-edge developments in these fields, with a
+particular focus on large language models (LLMs) and their role in natural
+language processing, multimodal reasoning, and autonomous decision-making.
+Highlighting tools such as ChatGPT, Claude, and Gemini, the discussion explores
+their applications in data analysis, model design, and optimization.
+ The integration of advanced algorithms like neural networks, reinforcement
+learning, and generative models has enhanced the capabilities of AI systems to
+process, visualize, and interpret complex datasets. Additionally, the emergence
+of technologies like edge computing and automated machine learning (AutoML)
+democratizes access to AI, empowering users across skill levels to engage with
+intelligent systems. This work also underscores the importance of ethical
+considerations, transparency, and fairness in the deployment of AI
+technologies, paving the way for responsible innovation.
+ Through practical insights into hardware configurations, software
+environments, and real-world applications, this article serves as a
+comprehensive resource for researchers and practitioners. By bridging
+theoretical underpinnings with actionable strategies, it showcases the
+potential of AI and LLMs to revolutionize big data management and drive
+meaningful advancements across domains such as healthcare, finance, and
+autonomous systems.
+
+
+
+ comment: This book contains 155 pages and 9 figures
+
+
+
+
+
+
+ ♻ ☆ Controlled Evaluation of Syntactic Knowledge in Multilingual Language
+ Models COLING 2025
+
+
+ Language models (LMs) are capable of acquiring elements of human-like
+syntactic knowledge. Targeted syntactic evaluation tests have been employed to
+measure how well they form generalizations about syntactic phenomena in
+high-resource languages such as English. However, we still lack a thorough
+understanding of LMs' capacity for syntactic generalizations in low-resource
+languages, which are responsible for much of the diversity of syntactic
+patterns worldwide. In this study, we develop targeted syntactic evaluation
+tests for three low-resource languages (Basque, Hindi, and Swahili) and use
+them to evaluate five families of open-access multilingual Transformer LMs. We
+find that some syntactic tasks prove relatively easy for LMs while others
+(agreement in sentences containing indirect objects in Basque, agreement across
+a prepositional phrase in Swahili) are challenging. We additionally uncover
+issues with publicly available Transformers, including a bias toward the
+habitual aspect in Hindi in multilingual BERT and underperformance compared to
+similar-sized models in XGLM-4.5B.
+
+
+
+ comment: LoResLM workshop at COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Exploring the Limitations of Detecting Machine-Generated Text COLING 2025
+
+
+ Recent improvements in the quality of the generations by large language
+models have spurred research into identifying machine-generated text. Such work
+often presents high-performing detectors. However, humans and machines can
+produce text in different styles and domains, yet the impact of such variation
+on machine-generated text detection systems remains unclear. In this paper, we
+audit the classification performance for detecting machine-generated text by
+evaluating on texts with varying writing styles. We find that classifiers are
+highly sensitive to stylistic changes and differences in text complexity, and
+in some cases degrade entirely to random classifiers. We further find that
+detection systems are particularly susceptible to misclassifying easy-to-read
+texts while performing well on complex texts, leading
+to concerns about the reliability of detection systems. We recommend that
+future work attends to stylistic factors and reading difficulty levels of
+human-written and machine-generated text.
+
+
+
+ comment: Accepted to COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Guiding Vision-Language Model Selection for Visual Question-Answering
+ Across Tasks, Domains, and Knowledge Types COLING
+
+
+ Visual Question-Answering (VQA) has become key to user experience,
+particularly after improved generalization capabilities of Vision-Language
+Models (VLMs). But evaluating VLMs for an application requirement using a
+standardized framework in practical settings is still challenging. This paper
+aims to solve that using an end-to-end framework. We present VQA360 - a novel
+dataset derived from established VQA benchmarks, annotated with task types,
+application domains, and knowledge types, for a comprehensive evaluation. We
+also introduce GoEval, a multimodal evaluation metric developed using GPT-4o,
+achieving a correlation factor of 56.71% with human judgments. Our experiments
+with state-of-the-art VLMs reveal that no single model excels universally,
+making the right choice a key design decision. Proprietary models such as
+Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source
+models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive
+strengths, while providing additional advantages. Our framework can also be
+extended to other tasks.
+
+
+
+ comment: Accepted at The First Workshop of Evaluation of Multi-Modal
+ Generation (EvalMG) in 31st International Conference on Computational
+ Linguistics (COLING), 2025. 8 pages + references + 6 pages of Appendix
+
+
+
+
+
+
+ ♻ ☆ Rumor Detection on Social Media with Temporal Propagation Structure
+ Optimization COLING'25
+
+
+ Traditional methods for detecting rumors on social media primarily focus on
+analyzing textual content, often struggling to capture the complexity of online
+interactions. Recent research has shifted towards leveraging graph neural
+networks to model the hierarchical conversation structure that emerges during
+rumor propagation. However, these methods tend to overlook the temporal aspect
+of rumor propagation and may disregard potential noise within the propagation
+structure. In this paper, we propose a novel approach that incorporates
+temporal information by constructing a weighted propagation tree, where the
+weight of each edge represents the time interval between connected posts.
+Drawing upon the theory of structural entropy, we transform this tree into a
+coding tree. This transformation aims to preserve the essential structure of
+rumor propagation while reducing noise. Finally, we introduce a recursive
+neural network to learn from the coding tree for rumor veracity prediction.
+Experimental results on two common datasets demonstrate the superiority of our
+approach.
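+
+ A minimal sketch of the weighted propagation tree described above, where each
+edge weight is the delay between a post and its parent (illustrative only; not
+the paper's implementation):
+
+from dataclasses import dataclass
+from typing import Optional
+
+@dataclass
+class Post:
+    pid: str
+    parent: Optional[str]
+    timestamp: float  # seconds since the source post
+
+def build_weighted_tree(posts):
+    by_id = {p.pid: p for p in posts}
+    tree = {p.pid: [] for p in posts}
+    for p in posts:
+        if p.parent is not None:
+            dt = p.timestamp - by_id[p.parent].timestamp
+            tree[p.parent].append((p.pid, max(dt, 0.0)))  # edge weight = delay
+    return tree
+
+posts = [Post("root", None, 0.0), Post("a", "root", 60.0),
+         Post("b", "root", 3600.0), Post("c", "a", 90.0)]
+print(build_weighted_tree(posts))
+# {'root': [('a', 60.0), ('b', 3600.0)], 'a': [('c', 30.0)], 'b': [], 'c': []}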
+
+
+
+ comment: COLING'25
+
+
+
+
+
+
+ ♻ ☆ VickreyFeedback: Cost-efficient Data Construction for Reinforcement
+ Learning from Human Feedback
+
+
+ This paper addresses the cost-efficiency aspect of Reinforcement Learning
+from Human Feedback (RLHF). RLHF leverages datasets of human preferences over
+outputs of large language models (LLM)s to instill human expectations into
+LLMs. Although preference annotation comes with a monetized cost, the economic
+utility of a preference dataset has not been considered thus far. What
+exacerbates this situation is that, given complex intransitive or cyclic
+relationships in preference datasets, existing algorithms for fine-tuning LLMs
+are still far from capturing comprehensive preferences. This raises severe
+cost-efficiency concerns in production environments, where preference data
+accumulate over time. In this paper, we discuss the fine-tuning of LLMs as a
+monetized economy and introduce an auction mechanism to improve the efficiency
+of preference data collection in dollar terms. We show that introducing an
+auction mechanism can play an essential role in enhancing the cost-efficiency
+of RLHF, while maintaining satisfactory model performance. Experimental results
+demonstrate that our proposed auction-based protocol is cost-effective for
+fine-tuning LLMs by concentrating on high-quality feedback.
+
+
+
+ comment: 16 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Return of EM: Entity-driven Answer Set Expansion for QA Evaluation COLING 2025
+
+
+ Recently, directly using large language models (LLMs) has been shown to be
+the most reliable method to evaluate QA models. However, it suffers from
+limited interpretability, high cost, and environmental harm. To address these,
+we propose to use soft EM with entity-driven answer set expansion. Our approach
+expands the gold answer set to include diverse surface forms, based on the
+observation that the surface forms often follow particular patterns depending
+on the entity type. The experimental results show that our method outperforms
+traditional evaluation methods by a large margin. Moreover, the reliability of
+our evaluation method is comparable to that of LLM-based ones, while offering
+the benefits of high interpretability and reduced environmental harm.
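+
+ A minimal sketch of soft exact match over an expanded answer set; the
+normalization and entity-type expansion rules below are toy examples, not the
+paper's rules:
+
+import re, string
+
+def normalize(text):
+    text = text.lower()
+    text = re.sub(r"\b(a|an|the)\b", " ", text)
+    text = text.translate(str.maketrans("", "", string.punctuation))
+    return " ".join(text.split())
+
+def expand_gold(answer, entity_type):
+    """Add entity-type-specific surface forms (illustrative rules only)."""
+    forms = {answer}
+    if entity_type == "person":
+        parts = answer.split()
+        if len(parts) > 1:
+            forms.add(parts[-1])                 # surname only
+    if entity_type == "date" and re.fullmatch(r"\d{4}", answer):
+        forms.add(f"the year {answer}")
+    return forms
+
+def soft_em(prediction, gold_answers, entity_type):
+    expanded = set()
+    for g in gold_answers:
+        expanded |= expand_gold(g, entity_type)
+    pred = normalize(prediction)
+    return any(normalize(g) in pred or pred in normalize(g) for g in expanded)
+
+print(soft_em("It was Barack Obama.", ["Barack Obama"], "person"))  # True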
+
+
+ Large Language Models (LLMs) have achieved remarkable success with their
+billion-level parameters, yet they incur high inference overheads. The
+emergence of activation sparsity in LLMs provides a natural approach to reduce
+this cost by involving only parts of the parameters for inference. However,
+existing methods only focus on utilizing this naturally formed activation
+sparsity in a post-training setting, overlooking the potential for further
+amplifying this inherent sparsity. In this paper, we hypothesize that LLMs can
+learn to be efficient by achieving more structured activation sparsity. To
+achieve this, we introduce a novel training algorithm, Learn-To-be-Efficient
+(LTE), designed to train efficiency-aware LLMs to learn to activate fewer
+neurons and achieve a better trade-off between sparsity and performance.
+Furthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based
+models, LTE can also be applied to LLMs like LLaMA using non-ReLU activations.
+Extensive evaluation on language understanding, language generation, and
+instruction tuning tasks shows that LTE consistently outperforms SOTA baselines.
+Along with our hardware-aware custom kernel implementation, LTE reduces
+LLaMA2-7B inference latency by 25% at 50% sparsity.
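+
+ To make the notion of structured activation sparsity concrete, the numpy
+sketch below keeps only the top-k hidden neurons per token in an MLP forward
+pass, so only k rows of the output projection are touched. This illustrates
+the kind of sparsity discussed above, not the LTE training algorithm or its
+custom kernel.
+
+import numpy as np
+
+def sparse_mlp_forward(x, w_in, w_out, k):
+    """x: (tokens, d), w_in: (d, hidden), w_out: (hidden, d)."""
+    h = np.maximum(x @ w_in, 0.0)                     # ReLU activations
+    idx = np.argsort(-h, axis=1)[:, :k]               # top-k neurons per token
+    out = np.zeros_like(x)
+    for t in range(x.shape[0]):
+        active = idx[t]
+        out[t] = h[t, active] @ w_out[active]         # only k rows of w_out used
+    return out
+
+rng = np.random.default_rng(0)
+d, hidden, tokens = 16, 64, 4
+x = rng.normal(size=(tokens, d))
+w_in = rng.normal(size=(d, hidden))
+w_out = rng.normal(size=(hidden, d))
+dense = np.maximum(x @ w_in, 0.0) @ w_out
+sparse = sparse_mlp_forward(x, w_in, w_out, k=16)     # keep 25% of neurons
+print(np.abs(dense - sparse).mean())                  # approximation error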
+
+
+ Text-to-Image (T2I) models have shown great performance in generating images
+based on textual prompts. However, these models are vulnerable to unsafe
+inputs that lead them to generate unsafe content such as sexual, harassment,
+and illegal-activity images. Existing studies based on image checkers, model
+fine-tuning, and embedding blocking are impractical in real-world applications.
+Hence, we propose the first universal prompt optimizer for safe T2I (POSI)
+generation in a black-box scenario. We first construct a dataset of toxic-clean
+prompt pairs using GPT-3.5 Turbo. To give the optimizer the ability to convert
+toxic prompts to clean prompts while preserving semantic information, we design a
+novel reward function measuring toxicity and text alignment of generated images
+and train the optimizer through Proximal Policy Optimization. Experiments show
+that our approach can effectively reduce the likelihood of various T2I models
+in generating inappropriate images, with no significant impact on text
+alignment. It can also be flexibly combined with other methods to achieve better
+performance. Our code is available at https://github.com/wu-zongyu/POSI.
+
+
+
+
+
+
+
+
+ Minh Le, Tien Ngoc Luu, An Nguyen The, Thanh-Thien Le, Trang Nguyen, Tung Thanh Nguyen, Linh Ngo Van, Thien Huu Nguyen
+
+
+ To address catastrophic forgetting in Continual Relation Extraction (CRE),
+many current approaches rely on memory buffers to rehearse previously learned
+knowledge while acquiring new tasks. Recently, prompt-based methods have
+emerged as potent alternatives to rehearsal-based strategies, demonstrating
+strong empirical performance. However, upon analyzing existing prompt-based
+approaches for CRE, we identified several critical limitations, such as
+inaccurate prompt selection, inadequate mechanisms for mitigating forgetting in
+shared parameters, and suboptimal handling of cross-task and within-task
+variances. To overcome these challenges, we draw inspiration from the
+relationship between prefix-tuning and mixture of experts, proposing a novel
+approach that employs a prompt pool for each task, capturing variations within
+each task while enhancing cross-task variances. Furthermore, we incorporate a
+generative model to consolidate prior knowledge within shared parameters,
+eliminating the need for explicit data storage. Extensive experiments validate
+the efficacy of our approach, demonstrating superior performance over
+state-of-the-art prompt-based and rehearsal-free methods in continual relation
+extraction.
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Trustful LLMs: Customizing and Grounding Text Generation with Knowledge
+ Bases and Dual Decoders
+
+
+ Although people are impressed by the content generation skills of large
+language models, the use of LLMs, such as ChatGPT, is limited by the domain
+grounding of the content. The correctness and groundedness of the generated
+content need to be based on a verified context, such as results from
+Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to
+a customized domain is that the generated responses are often incomplete, or
+the additions are not verified and may even be hallucinated. Prior studies on
+hallucination detection have focused on evaluation metrics, which are not
+easily adaptable to dynamic domains and can be vulnerable to attacks like
+jail-breaking. In this work, we propose 1) a post-processing algorithm that
+leverages knowledge triplets in RAG context to correct hallucinations and 2) a
+dual-decoder model that fuses RAG context to guide the generation process.
+
+
+
+
+
+
+
+
+ Pedro H. V. Valois, Lincon S. Souza, Erica K. Shimomoto, Kazuhiro Fukui
+
+
+ Interpretability is a key challenge in fostering trust for Large Language
+Models (LLMs), which stems from the complexity of extracting reasoning from
+model's parameters. We present the Frame Representation Hypothesis, a
+theoretically robust framework grounded in the Linear Representation Hypothesis
+(LRH) to interpret and control LLMs by modeling multi-token words. Prior
+research explored LRH to connect LLM representations with linguistic concepts,
+but was limited to single token analysis. As most words are composed of several
+tokens, we extend LRH to multi-token words, thereby enabling usage on any
+textual data with thousands of concepts. To this end, we propose words can be
+interpreted as frames, ordered sequences of vectors that better capture
+token-word relationships. Then, concepts can be represented as the average of
+word frames sharing a common concept. We showcase these tools through Top-k
+Concept-Guided Decoding, which can intuitively steer text generation using
+concepts of choice. We verify said ideas on Llama 3.1, Gemma 2, and Phi 3
+families, demonstrating gender and language biases, exposing harmful content,
+but also potential to remediate them, leading to safer and more transparent
+LLMs. Code is available at
+https://github.com/phvv-me/frame-representation-hypothesis.git
+
+
+
+
+
+
+
+ ♻ ☆ If You Can't Use Them, Recycle Them: Optimizing Merging at Scale
+ Mitigates Performance Tradeoffs
+
+
+
+
+
+
+
+
+ Muhammad Khalifa, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
+
+
+ Model merging has shown great promise at combining expert models, but the
+benefit of merging is unclear when merging ``generalist'' models trained on
+many tasks. We explore merging in the context of large (~100B) models, by
+recycling checkpoints that exhibit tradeoffs among different tasks. Such
+checkpoints are often created in the process of developing a frontier model,
+and many suboptimal ones are usually discarded. Given a pool of model
+checkpoints obtained from different training runs (e.g., different stages,
+objectives, hyperparameters, and data mixtures), which naturally show tradeoffs
+across different language capabilities (e.g., instruction following vs. code
+generation), we investigate whether merging can recycle such suboptimal models
+into a Pareto-optimal one. Our optimization algorithm tunes the weight of each
+checkpoint in a linear combination, resulting in a Pareto-optimal model that
+outperforms both individual models and merge-based baselines. Further analysis
+shows that good merges tend to include almost all checkpoints with non-zero
+weights, indicating that even seemingly bad initial checkpoints can contribute
+to good final merges.
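+
+ A rough sketch of merging by a tuned convex combination of checkpoints; the
+evaluation function is a placeholder and the optimizer here is simple random
+search, not the paper's method or scale:
+
+import numpy as np
+
+def merge(state_dicts, coeffs):
+    """Linear combination of parameter dictionaries (numpy arrays)."""
+    merged = {}
+    for name in state_dicts[0]:
+        merged[name] = sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
+    return merged
+
+def evaluate(model_params):
+    # Placeholder score; in practice this would run benchmark evaluations.
+    return -float(sum(np.abs(v).mean() for v in model_params.values()))
+
+def random_search(state_dicts, trials=200, seed=0):
+    rng = np.random.default_rng(seed)
+    best_coeffs, best_score = None, -np.inf
+    for _ in range(trials):
+        c = rng.dirichlet(np.ones(len(state_dicts)))   # convex combination
+        score = evaluate(merge(state_dicts, c))
+        if score > best_score:
+            best_coeffs, best_score = c, score
+    return best_coeffs, best_score
+
+ckpts = [{"w": np.random.randn(4, 4)} for _ in range(3)]
+coeffs, score = random_search(ckpts)
+print(coeffs.round(3), round(score, 4))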
+
+
+
+ comment: 13 pages, 9 figures
+
+
+
+
+
+
+ ♻ ☆ Revolutionizing Finance with LLMs: An Overview of Applications and
+ Insights
+
+
+
+
+
+
+
+
+ Huaqin Zhao, Zhengliang Liu, Zihao Wu, Yiwei Li, Tianze Yang, Peng Shu, Shaochen Xu, Haixing Dai, Lin Zhao, Hanqi Jiang, Yi Pan, Junhao Chen, Yifan Zhou, Gengchen Mai, Ninghao Liu, Tianming Liu
+
+
+ In recent years, Large Language Models (LLMs) like ChatGPT have seen
+considerable advancements and have been applied in diverse fields. Built on the
+Transformer architecture, these models are trained on extensive datasets,
+enabling them to understand and generate human language effectively. In the
+financial domain, the deployment of LLMs is gaining momentum. These models are
+being utilized for automating financial report generation, forecasting market
+trends, analyzing investor sentiment, and offering personalized financial
+advice. Leveraging their natural language processing capabilities, LLMs can
+distill key insights from vast financial data, aiding institutions in making
+informed investment choices and enhancing both operational efficiency and
+customer satisfaction. In this study, we provide a comprehensive overview of
+the emerging integration of LLMs into various financial tasks. Additionally, we
+conducted holistic tests on multiple financial tasks through the combination of
+natural language instructions. Our findings show that GPT-4 effectively follows
+prompt instructions across various financial tasks. This survey and evaluation
+of LLMs in the financial domain aim to deepen the understanding of LLMs'
+current role in finance for both financial practitioners and LLM researchers,
+identify new research and application prospects, and highlight how these
+technologies can be leveraged to solve practical challenges in the finance
+industry.
+
+
+
+
+
+
+
+ ♻ ☆ AI-Press: A Multi-Agent News Generating and Feedback Simulation System
+ Powered by Large Language Models
+
+
+ The rise of various social platforms has transformed journalism. The growing
+demand for news content has led to the increased use of large language models
+(LLMs) in news production due to their speed and cost-effectiveness. However,
+LLMs still encounter limitations in professionalism and ethical judgment in
+news generation. Additionally, predicting public feedback is usually difficult
+before news is released. To tackle these challenges, we introduce AI-Press, an
+automated news drafting and polishing system based on multi-agent collaboration
+and Retrieval-Augmented Generation. We develop a feedback simulation system
+that generates public feedback considering demographic distributions. Through
+extensive quantitative and qualitative evaluations, our system shows
+significant improvements in news-generating capabilities and verifies the
+effectiveness of public feedback simulation.
+
+
+
+ comment: 18 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark
+ for Evaluating Long-Context Large Language Models COLING 2025
+
+
+
+
+
+
+
+
+ Mingyang Song, Mao Zheng, Xuan Luo
+
+
+ Despite recent efforts to develop large language models with robust
+long-context capabilities, the lack of long-context benchmarks means that
+relatively little is known about their performance. To alleviate this gap, in
+this paper, we propose \textbf{Counting-Stars}, a multi-evidence,
+position-aware, and scalable benchmark designed to evaluate the multi-evidence
+retrieval capabilities of long-context LLMs. \textbf{Counting-Stars} comprises
+two counting-based multi-evidence retrieval tasks: searching and
+reasoning. Using Counting-Stars, we conducted experiments to evaluate several
+long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4,
+and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro
+achieves the best overall results, while GPT-4 Turbo exhibits the most stable
+performance across various tasks. Furthermore, our analysis of these LLMs,
+which have been extended to handle long-context scenarios, indicates that
+significant room for improvement remains as the length of the input context and
+the complexity of the tasks increase.
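+
+ A rough sketch of how a Counting-Stars-style probe could be constructed, with
+several pieces of counting evidence scattered at controlled positions in a long
+filler context; the exact wording and format below are made up for illustration:
+
+import random
+
+def build_sample(context_len_sents=2000, n_evidence=8, seed=0):
+    rng = random.Random(seed)
+    filler = ["The little penguin counted the sand on the beach."] * context_len_sents
+    positions = sorted(rng.sample(range(context_len_sents), n_evidence))
+    counts = [rng.randint(1, 100) for _ in range(n_evidence)]
+    for pos, c in zip(positions, counts):
+        filler[pos] = f"The little penguin counted {c} stars."
+    question = ("How many stars did the little penguin count each time? "
+                "List every number in order of appearance.")
+    return " ".join(filler), question, counts
+
+context, question, gold = build_sample()
+print(len(context.split()), question, gold)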
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for
+ Large Language Models Aligned with Human Cognitive Principles
+
+
+ Assessing the effectiveness of large language models (LLMs) in performing
+different tasks is crucial for understanding their strengths and weaknesses.
+This paper presents Hierarchical Prompting Taxonomy (HPT), grounded on human
+cognitive principles and designed to assess LLMs by examining the cognitive
+demands of various tasks. The HPT utilizes the Hierarchical Prompting Framework
+(HPF), which structures five unique prompting strategies in a hierarchical
+order based on their cognitive requirement on LLMs when compared to human
+mental capabilities. It assesses the complexity of tasks with the Hierarchical
+Prompting Index (HPI), which demonstrates the cognitive competencies of LLMs
+across diverse datasets and offers insights into the cognitive demands that
+datasets place on different LLMs. This approach enables a comprehensive
+evaluation of an LLM's problem-solving abilities and the intricacy of a dataset,
+offering a standardized metric for task complexity. Extensive experiments with
+multiple datasets and LLMs show that HPF enhances LLM performance by 2% to 63%
+compared to baseline performance, with GSM8k being the most cognitively complex
+task among the reasoning and coding tasks (average HPI of 3.20), confirming
+the effectiveness of HPT. To support future research and reproducibility in
+this domain, the implementations of HPT and HPF are available here.
+
+
+
+
+
+
+
+ ♻ ☆ Probability of Differentiation Reveals Brittleness of Homogeneity Bias
+ in GPT-4
+
+
+ Homogeneity bias in Large Language Models (LLMs) refers to their tendency to
+homogenize the representations of some groups compared to others. Previous
+studies documenting this bias have predominantly used encoder models, which may
+have inadvertently introduced biases. To address this limitation, we prompted
+GPT-4 to generate single word/expression completions associated with 18
+situation cues (specific, measurable elements of environments that influence
+how individuals perceive situations) and compared the variability of these
+completions using probability of differentiation. This approach directly
+assessed homogeneity bias from the model's outputs, bypassing encoder models.
+Across five studies, we find that homogeneity bias is highly volatile across
+situation cues and writing prompts, suggesting that the bias observed in past
+work may reflect those within encoder models rather than LLMs. Furthermore, we
+find that homogeneity bias in LLMs is brittle, as even minor and arbitrary
+changes in prompts can significantly alter the expression of biases. Future
+work should further explore how variations in syntactic features and topic
+choices in longer text generations influence homogeneity bias in LLMs.
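+
+ For reference, probability of differentiation can be computed as the chance
+that two randomly sampled completions differ, i.e. one minus the sum of squared
+completion frequencies (assuming the standard definition); a minimal sketch:
+
+from collections import Counter
+
+def probability_of_differentiation(completions):
+    counts = Counter(completions)
+    n = len(completions)
+    return 1.0 - sum((c / n) ** 2 for c in counts.values())
+
+group_a = ["friendly", "friendly", "kind", "warm", "friendly"]
+group_b = ["friendly", "kind", "warm", "curious", "reserved"]
+print(probability_of_differentiation(group_a))  # lower -> more homogenized
+print(probability_of_differentiation(group_b))  # higher -> more varied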
+
+
+
+
+
+
+
+ ♻ ☆ PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking
+ Large Language Models
+
+
+ The emergence of Large Language Models (LLMs) in the medical domain has
+underscored a compelling need for standard datasets to evaluate their
+question-answering (QA) performance. Although there have been several benchmark
+datasets for medical QA, they either cover common knowledge across different
+departments or are specific to another department rather than pediatrics.
+Moreover, some of them are limited to objective questions and do not measure
+the generation capacity of LLMs. Therefore, they cannot comprehensively assess
+the QA ability of LLMs in pediatrics. To fill this gap, we construct
+PediaBench, the first Chinese pediatric dataset for LLM evaluation.
+Specifically, it contains 4,565 objective questions and 1,632 subjective
+questions spanning 12 pediatric disease groups. It adopts an integrated scoring
+criterion based on different difficulty levels to thoroughly assess the
+proficiency of an LLM in instruction following, knowledge understanding,
+clinical case analysis, etc. Finally, we validate the effectiveness of
+PediaBench with extensive experiments on 20 open-source and commercial LLMs.
+Through an in-depth analysis of experimental results, we offer insights into
+the ability of LLMs to answer pediatric questions in the Chinese context,
+highlighting their limitations for further improvements. Our code and data are
+published at https://github.com/ACMISLab/PediaBench.
+
+
+
+ comment: 21 pages, 12 figures
+
+
+
+
+
+
+ ♻ ☆ From Generation to Judgment: Opportunities and Challenges of
+ LLM-as-a-judge
+
+
+ Assessment and evaluation have long been critical challenges in artificial
+intelligence (AI) and natural language processing (NLP). However, traditional
+methods, whether matching-based or embedding-based, often fall short of judging
+subtle attributes and delivering satisfactory results. Recent advancements in
+Large Language Models (LLMs) inspire the "LLM-as-a-judge" paradigm, where LLMs
+are leveraged to perform scoring, ranking, or selection across various tasks
+and applications. This paper provides a comprehensive survey of LLM-based
+judgment and assessment, offering an in-depth overview to advance this emerging
+field. We begin by giving detailed definitions from both input and output
+perspectives. Then we introduce a comprehensive taxonomy to explore
+LLM-as-a-judge from three dimensions: what to judge, how to judge and where to
+judge. Finally, we compile benchmarks for evaluating LLM-as-a-judge and
+highlight key challenges and promising directions, aiming to provide valuable
+insights and inspire future research in this promising research area. Paper
+list and more resources about LLM-as-a-judge can be found at
+\url{https://github.com/llm-as-a-judge/Awesome-LLM-as-a-judge} and
+\url{https://llm-as-a-judge.github.io}.
+
+
+ Automatically generating multiview illusions is a compelling challenge, where
+a single piece of visual content offers distinct interpretations from different
+viewing perspectives. Traditional methods, such as shadow art and wire art,
+create interesting 3D illusions but are limited to simple visual outputs (i.e.,
+figure-ground or line drawing), restricting their artistic expressiveness and
+practical versatility. Recent diffusion-based illusion generation methods can
+generate more intricate designs but are confined to 2D images. In this work, we
+present a simple yet effective approach for creating 3D multiview illusions
+based on user-provided text prompts or images. Our method leverages a
+pre-trained text-to-image diffusion model to optimize the textures and geometry
+of neural 3D representations through differentiable rendering. When viewed from
+multiple angles, this produces different interpretations. We develop several
+techniques to improve the quality of the generated 3D multiview illusions. We
+demonstrate the effectiveness of our approach through extensive experiments and
+showcase illusion generation with diverse 3D forms.
+
+
+ Visual diffusion models achieve remarkable progress, yet they are typically
+trained at limited resolutions due to the lack of high-resolution data and
+constrained computation resources, hampering their ability to generate
+high-fidelity images or videos at higher resolutions. Recent efforts have
+explored tuning-free strategies to unlock the untapped potential of
+pre-trained models for higher-resolution visual generation. However, these
+methods are still prone to producing low-quality visual content with repetitive
+patterns. The key obstacle lies in the inevitable increase in high-frequency
+information when the model generates visual content exceeding its training
+resolution, leading to undesirable repetitive patterns deriving from the
+accumulated errors. To tackle this challenge, we propose FreeScale, a
+tuning-free inference paradigm to enable higher-resolution visual generation
+via scale fusion. Specifically, FreeScale processes information from different
+receptive scales and then fuses it by extracting desired frequency components.
+Extensive experiments validate the superiority of our paradigm in extending the
+capabilities of higher-resolution visual generation for both image and video
+models. Notably, compared with the previous best-performing method, FreeScale
+unlocks the generation of 8k-resolution images for the first time.
+
+
+
+
+
+
+
+ ☆ Doe-1: Closed-Loop Autonomous Driving with Large World Model
+
+
+
+
+
+
+
+
+ Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
+
+
+ End-to-end autonomous driving has received increasing attention due to its
+potential to learn from large amounts of data. However, most existing methods
+are still open-loop and suffer from weak scalability, lack of high-order
+interactions, and inefficient decision-making. In this paper, we explore a
+closed-loop framework for autonomous driving and propose a large Driving wOrld
+modEl (Doe-1) for unified perception, prediction, and planning. We formulate
+autonomous driving as a next-token generation problem and use multi-modal
+tokens to accomplish different tasks. Specifically, we use free-form texts
+(i.e., scene descriptions) for perception and generate future predictions
+directly in the RGB space with image tokens. For planning, we employ a
+position-aware tokenizer to effectively encode action into discrete tokens. We
+train a multi-modal transformer to autoregressively generate perception,
+prediction, and planning tokens in an end-to-end and unified manner.
+Experiments on the widely used nuScenes dataset demonstrate the effectiveness
+of Doe-1 in various tasks including visual question-answering,
+action-conditioned video generation, and motion planning. Code:
+https://github.com/wzzheng/Doe.
+
+
+
+ comment: Code is available at: https://github.com/wzzheng/Doe
+
+
+
+
+
+
+ ☆ GenEx: Generating an Explorable World
+
+
+
+
+
+
+
+
+ Taiming Lu, Tianmin Shu, Junfei Xiao, Luoxin Ye, Jiahao Wang, Cheng Peng, Chen Wei, Daniel Khashabi, Rama Chellappa, Alan Yuille, Jieneng Chen
+
+
+ Understanding, navigating, and exploring the 3D physical real world has long
+been a central challenge in the development of artificial intelligence. In this
+work, we take a step toward this goal by introducing GenEx, a system capable of
+planning complex embodied world exploration, guided by its generative
+imagination that forms priors (expectations) about the surrounding
+environments. GenEx generates an entire 3D-consistent imaginative environment
+from as little as a single RGB image, bringing it to life through panoramic
+video streams. Leveraging scalable 3D world data curated from Unreal Engine,
+our generative model is grounded in the physical world. It captures a continuous
+360-degree environment with little effort, offering a boundless landscape for
+AI agents to explore and interact with. GenEx achieves high-quality world
+generation, robust loop consistency over long trajectories, and demonstrates
+strong 3D capabilities such as consistency and active 3D mapping. Powered by
+generative imagination of the world, GPT-assisted agents are equipped to
+perform complex embodied tasks, including both goal-agnostic exploration and
+goal-driven navigation. These agents utilize predictive expectation regarding
+unseen parts of the physical world to refine their beliefs, simulate different
+outcomes based on potential decisions, and make more informed choices. In
+summary, we demonstrate that GenEx provides a transformative platform for
+advancing embodied AI in imaginative spaces and brings potential for extending
+these capabilities to real-world exploration.
+
+
+
+ comment: Website: GenEx.world
+
+
+
+
+
+
+ ☆ OmniDrag: Enabling Motion Control for Omnidirectional Image-to-Video
+ Generation
+
+
+ As virtual reality gains popularity, the demand for controllable creation of
+immersive and dynamic omnidirectional videos (ODVs) is increasing. While
+previous text-to-ODV generation methods achieve impressive results, they
+struggle with content inaccuracies and inconsistencies due to reliance solely
+on textual inputs. Although recent motion control techniques provide
+fine-grained control for video generation, directly applying these methods to
+ODVs often results in spatial distortion and unsatisfactory performance,
+especially with complex spherical motions. To tackle these challenges, we
+propose OmniDrag, the first approach enabling both scene- and object-level
+motion control for accurate, high-quality omnidirectional image-to-video
+generation. Building on pretrained video diffusion models, we introduce an
+omnidirectional control module, which is jointly fine-tuned with temporal
+attention layers to effectively handle complex spherical motion. In addition,
+we develop a novel spherical motion estimator that accurately extracts
+motion-control signals and allows users to perform drag-style ODV generation by
+simply drawing handle and target points. We also present a new dataset, named
+Move360, addressing the scarcity of ODV data with large scene and object
+motions. Experiments demonstrate the significant superiority of OmniDrag in
+achieving holistic scene-level and fine-grained object-level control for ODV
+generation. The project page is available at
+https://lwq20020127.github.io/OmniDrag.
+
+
+
+
+
+
+
+ ☆ LoRACLR: Contrastive Adaptation for Customization of Diffusion Models
+
+
+ Recent advances in text-to-image customization have enabled high-fidelity,
+context-rich generation of personalized images, allowing specific concepts to
+appear in a variety of scenarios. However, current methods struggle with
+combining multiple personalized models, often leading to attribute entanglement
+or requiring separate training to preserve concept distinctiveness. We present
+LoRACLR, a novel approach for multi-concept image generation that merges
+multiple LoRA models, each fine-tuned for a distinct concept, into a single,
+unified model without additional individual fine-tuning. LoRACLR uses a
+contrastive objective to align and merge the weight spaces of these models,
+ensuring compatibility while minimizing interference. By enforcing distinct yet
+cohesive representations for each concept, LoRACLR enables efficient, scalable
+model composition for high-quality, multi-concept image synthesis. Our results
+highlight the effectiveness of LoRACLR in accurately merging multiple concepts,
+advancing the capabilities of personalized image generation.
+
+
+ This study seeks to automate camera movement control for filming existing
+subjects into attractive videos, contrasting with the creation of non-existent
+content by directly generating the pixels. We select drone videos as our test
+case due to their rich and challenging motion patterns, distinctive viewing
+angles, and precise controls. Existing AI videography methods struggle with
+limited appearance diversity in simulation training, high costs of recording
+expert operations, and difficulties in designing heuristic-based goals to cover
+all scenarios. To avoid these issues, we propose a scalable method that
+involves collecting real-world training data to improve diversity, extracting
+camera trajectories automatically to minimize annotation costs, and training an
+effective architecture that does not rely on heuristics. Specifically, we
+collect 99k high-quality trajectories by running 3D reconstruction on online
+videos, connecting camera poses from consecutive frames to formulate 3D camera
+paths, and using a Kalman filter to identify and remove low-quality data.
+Moreover, we introduce DVGFormer, an auto-regressive transformer that leverages
+the camera path and images from all past frames to predict camera movement in
+the next frame. We evaluate our system across 38 synthetic natural scenes and 7
+real city 3D scans. We show that our system effectively learns to perform
+challenging camera movements such as navigating through obstacles, maintaining
+low altitude to increase perceived speed, and orbiting towers and buildings,
+which are very useful for recording high-quality videos. Data and code are
+available at dvgformer.github.io.
+
+
+
+
+
+
+
+ ☆ Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
+
+
+
+
+
+
+
+
+ Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, Aleksander Holynski
+
+
+ Learning to understand dynamic 3D scenes from imagery is crucial for
+applications ranging from robotics to scene reconstruction. Yet, unlike other
+problems where large-scale supervised training has enabled rapid progress,
+directly supervising methods for recovering 3D motion remains challenging due
+to the fundamental difficulty of obtaining ground truth annotations. We present
+a system for mining high-quality 4D reconstructions from internet stereoscopic,
+wide-angle videos. Our system fuses and filters the outputs of camera pose
+estimation, stereo depth estimation, and temporal tracking methods into
+high-quality dynamic 3D reconstructions. We use this method to generate
+large-scale data in the form of world-consistent, pseudo-metric 3D point clouds
+with long-term motion trajectories. We demonstrate the utility of this data by
+training a variant of DUSt3R to predict structure and 3D motion from real-world
+image pairs, showing that training on our reconstructed data enables
+generalization to diverse real-world scenes. Project page:
+https://stereo4d.github.io
+
+
+
+
+
+
+
+ ☆ SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices
+ with Efficient Architectures and Training
+
+
+ Existing text-to-image (T2I) diffusion models face several limitations,
+including large model sizes, slow runtime, and low-quality generation on mobile
+devices. This paper aims to address all of these challenges by developing an
+extremely small and fast T2I model that generates high-resolution and
+high-quality images on mobile platforms. We propose several techniques to
+achieve this goal. First, we systematically examine the design choices of the
+network architecture to reduce model parameters and latency, while ensuring
+high-quality generation. Second, to further improve generation quality, we
+employ cross-architecture knowledge distillation from a much larger model,
+using a multi-level approach to guide the training of our model from scratch.
+Third, we enable a few-step generation by integrating adversarial guidance with
+knowledge distillation. For the first time, our model, SnapGen, demonstrates the
+generation of 1024x1024 px images on a mobile device in around 1.4 seconds. On
+ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for
+256x256 px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our
+model with merely 379M parameters, surpasses large-scale models with billions
+of parameters at a significantly smaller size (e.g., 7x smaller than SDXL, 14x
+smaller than IF-XL).
+
+
+
+
+
+
+
+ ☆ EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via
+ Multimodal LLM
+
+
+ Significant achievements in personalization of diffusion models have been
+witnessed. Conventional tuning-free methods mostly encode multiple reference
+images by averaging their image embeddings as the injection condition, but such
+an image-independent operation cannot perform interaction among images to
+capture consistent visual elements within multiple references. Although the
+tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent
+elements within multiple images through the training process, it necessitates
+specific finetuning for each distinct image group. This paper introduces
+EasyRef, a novel plug-and-play adaptation method that enables diffusion models
+to be conditioned on multiple reference images and the text prompt. To
+effectively exploit consistent visual elements within multiple images, we
+leverage the multi-image comprehension and instruction-following capabilities
+of the multimodal large language model (MLLM), prompting it to capture
+consistent visual elements based on the instruction. Besides, injecting the
+MLLM's representations into the diffusion process through adapters can easily
+generalize to unseen domains, mining the consistent visual elements within
+unseen data. To mitigate computational costs and enhance fine-grained detail
+preservation, we introduce an efficient reference aggregation strategy and a
+progressive training scheme. Finally, we introduce MRBench, a new
+multi-reference image generation benchmark. Experimental results demonstrate
+EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based
+methods like LoRA, achieving superior aesthetic quality and robust zero-shot
+generalization across diverse domains.
+
+
+
+ comment: Tech report
+
+
+
+
+
+
+ ☆ V2PE: Improving Multimodal Long-Context Capability of Vision-Language
+ Models with Variable Visual Position Encoding
+
+
+ Vision-Language Models (VLMs) have shown promising capabilities in handling
+various multimodal tasks, yet they struggle in long-context scenarios,
+particularly in tasks involving videos, high-resolution images, or lengthy
+image-text documents. In our work, we first conduct an empirical analysis of
+the long-context capabilities of VLMs using our augmented long-context
+multimodal datasets. Our findings reveal that directly applying the positional
+encoding mechanism used for textual tokens to visual tokens is suboptimal, and
+VLM performance degrades sharply when the position encoding exceeds the model's
+context window. To address this, we propose Variable Visual Position Encoding
+(V2PE), a novel positional encoding approach that employs variable and smaller
+increments for visual tokens, enabling more efficient management of long
+multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in
+enhancing VLMs' ability to understand and reason over long
+multimodal contexts. We further integrate V2PE with our augmented long-context
+multimodal datasets to fine-tune the open-source VLM, InternVL2. The fine-tuned
+model achieves strong performance on both standard and long-context multimodal
+tasks. Notably, when the sequence length of the training dataset is increased
+to 256K tokens, the model is capable of processing multimodal sequences up to
+1M tokens, highlighting its potential for real-world long-context applications.
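+
+ A minimal sketch of the variable-increment idea: visual tokens consume a
+smaller positional step than text tokens, so long interleaved sequences occupy
+fewer position indices. The step size below is illustrative, not the paper's
+setting:
+
+def assign_positions(token_types, text_step=1.0, visual_step=0.25):
+    """token_types: sequence of 'text' / 'visual'. Returns one position per token."""
+    positions, pos = [], 0.0
+    for t in token_types:
+        positions.append(pos)
+        pos += text_step if t == "text" else visual_step
+    return positions
+
+seq = ["text"] * 3 + ["visual"] * 8 + ["text"] * 2
+print(assign_positions(seq))
+# The 8 visual tokens advance the position by 0.25 each, spanning only 2 units.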
+
+
+
+ comment: The code and models will be available at
+ https://github.com/OpenGVLab/V2PE
+
+
+
+
+
+
+
+ Kavana Venkatesh, Yusuf Dalva, Ismini Lourentzou, Pinar Yanardag
+
+
+ We introduce a novel approach to enhance the capabilities of text-to-image
+models by incorporating a graph-based RAG. Our system dynamically retrieves
+detailed character information and relational data from the knowledge graph,
+enabling the generation of visually accurate and contextually rich images. This
+capability significantly improves upon the limitations of existing T2I models,
+which often struggle with the accurate depiction of complex or culturally
+specific subjects due to dataset constraints. Furthermore, we propose a novel
+self-correcting mechanism for text-to-image models to ensure consistency and
+fidelity in visual outputs, leveraging the rich context from the graph to guide
+corrections. Our qualitative and quantitative experiments demonstrate that
+Context Canvas significantly enhances the capabilities of popular models such
+as Flux, Stable Diffusion, and DALL-E, and improves the functionality of
+ControlNet for fine-grained image editing tasks. To our knowledge, Context
+Canvas represents the first application of graph-based RAG in enhancing T2I
+models, representing a significant advancement for producing high-fidelity,
+context-aware multi-faceted images.
+
+
+ Rectified flow models have emerged as a dominant approach in image
+generation, showcasing impressive capabilities in high-quality image synthesis.
+However, despite their effectiveness in visual generation, rectified flow
+models often struggle with disentangled editing of images. This limitation
+prevents the ability to perform precise, attribute-specific modifications
+without affecting unrelated aspects of the image. In this paper, we introduce
+FluxSpace, a domain-agnostic image editing method leveraging a representation
+space with the ability to control the semantics of images generated by
+rectified flow transformers, such as Flux. By leveraging the representations
+learned by the transformer blocks within the rectified flow models, we propose
+a set of semantically interpretable representations that enable a wide range of
+image editing tasks, from fine-grained image editing to artistic creation. This
+work offers a scalable and effective image editing approach, along with its
+disentanglement capabilities.
+
+
+
+
+
+
+
+ ☆ Olympus: A Universal Task Router for Computer Vision Tasks
+
+
+
+
+
+
+
+
+ Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip H. S. Torr
+
+
+ We introduce Olympus, a new approach that transforms Multimodal Large
+Language Models (MLLMs) into a unified framework capable of handling a wide
+array of computer vision tasks. Utilizing a controller MLLM, Olympus delegates
+over 20 specialized tasks across images, videos, and 3D objects to dedicated
+modules. This instruction-based routing enables complex workflows through
+chained actions without the need for training heavy generative models. Olympus
+easily integrates with existing MLLMs, expanding their capabilities with
+comparable performance. Experimental results demonstrate that Olympus achieves
+an average routing accuracy of 94.75% across 20 tasks and precision of 91.82%
+in chained action scenarios, showcasing its effectiveness as a universal task
+router that can solve a diverse range of computer vision tasks. Project page:
+https://github.com/yuanze-lin/Olympus_page
+
+
+
+ comment: Technical Report
+
+
+
+
+
+
+ ☆ PVC: Progressive Visual Token Compression for Unified Image and Video
+ Processing in Large Vision-Language Models
+
+
+
+
+
+
+
+
+ Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai
+
+
+ Large Vision-Language Models (VLMs) have been extended to understand both
+images and videos. Visual token compression is leveraged to reduce the
+considerable token length of visual inputs. To meet the needs of different
+tasks, existing high-performance models usually process images and videos
+separately with different token compression strategies, limiting the
+capabilities of combining images and videos. To this end, we extend each image
+into a "static" video and introduce a unified token compression strategy called
+Progressive Visual Token Compression (PVC), where the tokens of each frame are
+progressively encoded and adaptively compressed to supplement the information
+not extracted from previous frames. Video tokens are efficiently compressed
+by exploiting the inherent temporal redundancy. Images are repeated as static
+videos, and the spatial details can be gradually supplemented in multiple
+frames. PVC unifies the token compressing of images and videos. With a limited
+number of tokens per frame (64 tokens by default), spatial details and temporal
+changes can still be preserved. Experiments show that our model achieves
+state-of-the-art performance across various video understanding benchmarks,
+including long video tasks and fine-grained short video tasks. Meanwhile, our
+unified token compression strategy incurs no performance loss on image
+benchmarks, particularly in detail-sensitive tasks.
+
+
+
+
+
+
+
+ ☆ Representing Long Volumetric Video with Temporal Gaussian Hierarchy SIGGRAPH
+
+
+ This paper aims to address the challenge of reconstructing long volumetric
+videos from multi-view RGB videos. Recent dynamic view synthesis methods
+leverage powerful 4D representations, like feature grids or point cloud
+sequences, to achieve high-quality rendering results. However, they are
+typically limited to short (1~2s) video clips and often suffer from large
+memory footprints when dealing with longer videos. To solve this issue, we
+propose a novel 4D representation, named Temporal Gaussian Hierarchy, to
+compactly model long volumetric videos. Our key observation is that there are
+generally various degrees of temporal redundancy in dynamic scenes, which
+consist of areas changing at different speeds. Motivated by this, our approach
+builds a multi-level hierarchy of 4D Gaussian primitives, where each level
+separately describes scene regions with different degrees of content change,
+and adaptively shares Gaussian primitives to represent unchanged scene content
+over different temporal segments, thus effectively reducing the number of
+Gaussian primitives. In addition, the tree-like structure of the Gaussian
+hierarchy allows us to efficiently represent the scene at a particular moment
+with a subset of Gaussian primitives, leading to nearly constant GPU memory
+usage during the training or rendering regardless of the video length.
+Extensive experimental results demonstrate the superiority of our method over
+alternative methods in terms of training cost, rendering speed, and storage
+usage. To our knowledge, this work is the first approach capable of efficiently
+handling minutes of volumetric video data while maintaining state-of-the-art
+rendering quality. Our project page is available at:
+https://zju3dv.github.io/longvolcap.
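+
+ A rough sketch of the idea above, under the assumption of a simple tree whose
+nodes own Gaussians valid for their whole time segment (the class and field
+names are illustrative, not the paper's data structure): a query timestamp
+only touches the branch that contains it, which is why per-frame GPU memory
+stays roughly constant.
+
+```python
+from dataclasses import dataclass, field
+
+@dataclass
+class TemporalNode:
+    """One node of a toy temporal hierarchy. Each node owns the Gaussians that
+    are valid (shared) across its entire time segment."""
+    t_start: float
+    t_end: float
+    gaussian_ids: list
+    children: list = field(default_factory=list)
+
+def active_gaussians(node, t):
+    """Collect the Gaussians needed to render time t by walking only the branch
+    whose segment contains t."""
+    if not (node.t_start <= t < node.t_end):
+        return []
+    ids = list(node.gaussian_ids)            # slowly-changing content, shared by the segment
+    for child in node.children:
+        ids += active_gaussians(child, t)    # finer levels add faster-changing content
+    return ids
+
+# Toy two-level hierarchy over a 4-second video.
+root = TemporalNode(0.0, 4.0, ["static_background"])
+root.children = [
+    TemporalNode(0.0, 2.0, ["person_pose_A"]),
+    TemporalNode(2.0, 4.0, ["person_pose_B"]),
+]
+print(active_gaussians(root, 2.5))  # ['static_background', 'person_pose_B']
+```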
+
+
+
+
+
+
+
+
+ Carlos Esteves, Mohammed Suhail, Ameesh Makadia
+
+
+ Image tokenizers map images to sequences of discrete tokens, and are a
+crucial component of autoregressive transformer-based image generation. The
+tokens are typically associated with spatial locations in the input image,
+arranged in raster scan order, which is not ideal for autoregressive modeling.
+In this paper, we propose to tokenize the image spectrum instead, obtained from
+a discrete wavelet transform (DWT), such that the sequence of tokens represents
+the image in a coarse-to-fine fashion. Our tokenizer brings several advantages:
+1) it leverages that natural images are more compressible at high frequencies,
+2) it can take and reconstruct images of different resolutions without
+retraining, 3) it improves the conditioning for next-token prediction --
+instead of conditioning on a partial line-by-line reconstruction of the image,
+it takes a coarse reconstruction of the full image, 4) it enables partial
+decoding where the first few generated tokens can reconstruct a coarse version
+of the image, 5) it enables autoregressive models to be used for image
+upsampling. We evaluate the tokenizer reconstruction metrics as well as
+multiscale image generation, text-guided image upsampling and editing.
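+
+ A minimal sketch of the coarse-to-fine ordering described above, using
+PyWavelets' standard 2D DWT; the actual tokenizer also quantizes these
+coefficients into discrete tokens, which is omitted here.
+
+```python
+import numpy as np
+import pywt
+
+# Toy grayscale image; in practice this would be a natural image.
+img = np.random.rand(256, 256).astype(np.float32)
+
+# Multi-level 2D DWT: coeffs[0] is the coarsest approximation, later entries
+# hold progressively finer (higher-frequency) detail bands.
+coeffs = pywt.wavedec2(img, wavelet="haar", level=4)
+
+# Flatten into a coarse-to-fine sequence: an autoregressive model reading this
+# order conditions each step on a coarse reconstruction of the *whole* image,
+# and the first few entries alone already give a low-resolution decode.
+sequence = [coeffs[0].ravel()]
+for (cH, cV, cD) in coeffs[1:]:
+    sequence.append(np.concatenate([cH.ravel(), cV.ravel(), cD.ravel()]))
+
+print([s.size for s in sequence])
+# [256, 768, 3072, 12288, 49152]: most coefficients sit in the fine bands,
+# which natural images compress well (advantage 1 in the abstract).
+```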
+
+
+
+
+
+
+
+ ☆ Feat2GS: Probing Visual Foundation Models with Gaussian Splatting
+
+
+ Given that visual foundation models (VFMs) are trained on extensive datasets
+but often limited to 2D images, a natural question arises: how well do they
+understand the 3D world? With the differences in architecture and training
+protocols (i.e., objectives, proxy tasks), a unified framework to fairly and
+comprehensively probe their 3D awareness is urgently needed. Existing works on
+3D probing suggest single-view 2.5D estimation (e.g., depth and normal) or
+two-view sparse 2D correspondence (e.g., matching and tracking). Unfortunately,
+these tasks ignore texture awareness, and require 3D data as ground-truth,
+which limits the scale and diversity of their evaluation set. To address these
+issues, we introduce Feat2GS, which reads out 3D Gaussian attributes from VFM
+features extracted from unposed images. This allows us to probe 3D awareness
+for geometry and texture via novel view synthesis, without requiring 3D data.
+Additionally, the disentanglement of 3DGS parameters - geometry
+($\boldsymbol{x}, \alpha, \Sigma$) and texture ($\boldsymbol{c}$) - enables
+separate analysis of texture and geometry awareness. Under Feat2GS, we conduct
+extensive experiments to probe the 3D awareness of several VFMs, and
+investigate the ingredients that lead to a 3D aware VFM. Building on these
+findings, we develop several variants that achieve state-of-the-art across
+diverse datasets. This makes Feat2GS useful for probing VFMs, and as a
+simple-yet-effective baseline for novel-view synthesis. Code and data will be
+made available at https://fanegg.github.io/Feat2GS/.
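+
+ A minimal sketch of the "readout" idea, assuming frozen per-patch VFM features
+and a common 14-parameter 3DGS parameterization (3 position, 1 opacity, 3
+scale, 4 rotation, 3 color); the head below is a placeholder, not Feat2GS's
+architecture, but it shows how geometry and texture parameters can be
+predicted (and therefore probed) separately.
+
+```python
+import torch
+import torch.nn as nn
+
+class GaussianReadout(nn.Module):
+    """Illustrative per-pixel readout from frozen VFM features to 3DGS attributes."""
+    def __init__(self, feat_dim=1024):
+        super().__init__()
+        self.head = nn.Linear(feat_dim, 3 + 1 + 3 + 4 + 3)
+
+    def forward(self, feats):                     # feats: (B, H*W, feat_dim), frozen
+        xyz, alpha, scale, rot, color = self.head(feats).split([3, 1, 3, 4, 3], dim=-1)
+        return {
+            "xyz": xyz,                                            # geometry
+            "opacity": torch.sigmoid(alpha),                       # geometry
+            "scale": torch.exp(scale),                             # geometry (covariance)
+            "rotation": torch.nn.functional.normalize(rot, dim=-1),
+            "color": torch.sigmoid(color),                         # texture, probed separately
+        }
+
+feats = torch.randn(2, 37 * 37, 1024)   # e.g. patch features from a frozen encoder
+gaussians = GaussianReadout()(feats)
+print(gaussians["xyz"].shape)            # torch.Size([2, 1369, 3])
+```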
+
+
+ The remarkable success of Large Language Models (LLMs) has extended to the
+multimodal domain, achieving outstanding performance in image understanding and
+generation. Recent efforts to develop unified Multimodal Large Language Models
+(MLLMs) that integrate these capabilities have shown promising results.
+However, existing approaches often involve complex designs in model
+architecture or training pipeline, increasing the difficulty of model training
+and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful
+encoder-free MLLM capable of both image understanding and generation. To
+address challenges identified in existing encoder-free unified MLLMs, we
+introduce the token folding mechanism and the vision-expert-based progressive
+alignment pretraining strategy, which effectively support high-resolution image
+understanding while reducing training complexity. After being trained on
+large-scale mixed image-text data with a unified next-token prediction
+objective, SynerGen-VL achieves or surpasses the performance of existing
+encoder-free unified MLLMs with comparable or smaller parameter sizes, and
+narrows the gap with task-specific state-of-the-art models, highlighting a
+promising path toward future unified MLLMs. Our code and models shall be
+released.
+
+
+
+
+
+
+
+ ☆ Do Multimodal Large Language Models See Like Humans?
+
+
+
+
+
+
+
+
+ Jiaying Lin, Shuquan Ye, Rynson W. H. Lau
+
+
+ Multimodal Large Language Models (MLLMs) have achieved impressive results on
+various vision tasks, leveraging recent advancements in large language models.
+However, a critical question remains unaddressed: do MLLMs perceive visual
+information similarly to humans? Current benchmarks lack the ability to
+evaluate MLLMs from this perspective. To address this challenge, we introduce
+HVSBench, a large-scale benchmark designed to assess the alignment between
+MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror
+human vision. HVSBench comprises over 85K multimodal samples, spanning 13
+categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing,
+Free-Viewing, and Searching. Extensive experiments demonstrate the
+effectiveness of our benchmark in providing a comprehensive evaluation of
+MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models
+show significant room for improvement, with most achieving only moderate
+results. Our experiments reveal that HVSBench presents a new and significant
+challenge for cutting-edge MLLMs. We believe that HVSBench will facilitate
+research on human-aligned and explainable MLLMs, marking a key step in
+understanding how MLLMs perceive and process visual information.
+
+
+
+
+
+
+
+
+ Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, Kashyap Chitta
+
+
+ End-to-end driving systems have made rapid progress, but have so far not been
+applied to the challenging new CARLA Leaderboard 2.0. Further, while there is a
+large body of literature on end-to-end architectures and training strategies,
+the impact of the training dataset is often overlooked. In this work, we make a
+first attempt at end-to-end driving for Leaderboard 2.0. Instead of
+investigating architectures, we systematically analyze the training dataset,
+leading to new insights: (1) Expert style significantly affects downstream
+policy performance. (2) In complex data sets, the frames should not be weighted
+on the basis of simplistic criteria such as class frequencies. (3) Instead,
+estimating whether a frame changes the target labels compared to previous
+frames can reduce the size of the dataset without removing important
+information. By incorporating these findings, our model ranks first and second
+respectively on the map and sensors tracks of the 2024 CARLA Challenge, and
+sets a new state-of-the-art on the Bench2Drive test routes. Finally, we uncover
+a design flaw in the current evaluation metrics and propose a modification for
+future challenges. Our dataset, code, and pre-trained models are publicly
+available at https://github.com/autonomousvision/carla_garage.
+
+
+
+ comment: Technical report for the CVPR 2024 Workshop on Foundation Models for
+ Autonomous Systems. Runner-up of the track 'CARLA Autonomous Driving
+ Challenge' in the 2024 Autonomous Grand Challenge
+ (https://opendrivelab.com/challenge2024/)
+
+
+
+
+
+
+ ☆ TimeRefine: Temporal Grounding with Time Refining Video LLM
+
+
+ Video temporal grounding aims to localize relevant temporal boundaries in a
+video given a textual prompt. Recent work has focused on enabling Video LLMs to
+perform video temporal grounding via next-token prediction of temporal
+timestamps. However, accurately localizing timestamps in videos remains
+challenging for Video LLMs when relying solely on temporal token prediction.
+Our proposed TimeRefine addresses this challenge in two ways. First, instead of
+directly predicting the start and end timestamps, we reformulate the temporal
+grounding task as a temporal refining task: the model first makes rough
+predictions and then refines them by predicting offsets to the target segment.
+This refining process is repeated multiple times, through which the model
+progressively self-improves its temporal localization accuracy. Second, to
+enhance the model's temporal perception capabilities, we incorporate an
+auxiliary prediction head that penalizes the model more if a predicted segment
+deviates further from the ground truth, thus encouraging the model to make
+closer and more accurate predictions. Our plug-and-play method can be
+integrated into most LLM-based temporal grounding approaches. The experimental
+results demonstrate that TimeRefine achieves 3.6% and 5.0% mIoU improvements on
+the ActivityNet and Charades-STA datasets, respectively. Code and pretrained
+models will be released.
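+
+ A toy numeric sketch of the coarse-then-refine decoding described above;
+`predict_offsets` is a hypothetical stand-in for the Video LLM's offset
+predictions, used only to show how repeated refinement shrinks the
+localization error.
+
+```python
+def refine_segment(rough, predict_offsets, num_rounds=3):
+    """Iteratively refine a (start, end) prediction by adding predicted offsets."""
+    start, end = rough
+    for _ in range(num_rounds):
+        d_start, d_end = predict_offsets(start, end)   # stand-in for the Video LLM
+        start, end = start + d_start, end + d_end
+    return start, end
+
+# Toy example: the ground-truth segment is (12.0, 27.5) seconds.
+target = (12.0, 27.5)
+
+def toy_offsets(start, end):
+    # A hypothetical predictor that moves 60% of the way toward the target each round.
+    return 0.6 * (target[0] - start), 0.6 * (target[1] - end)
+
+print(refine_segment((10.0, 30.0), toy_offsets))
+# -> roughly (11.87, 27.66): each refinement round reduces the error.
+```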
+
+
+
+
+
+
+
+ ☆ Owl-1: Omni World Model for Consistent Long Video Generation
+
+
+
+
+
+
+
+
+ Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, Jiwen Lu
+
+
+ Video generation models (VGMs) have received extensive attention recently and
+serve as promising candidates for general-purpose large vision models. While
+they can only generate short videos at a time, existing methods achieve long
+video generation by iteratively calling the VGMs, using the last-frame output
+as the condition for the next-round generation. However, the last frame only
+contains short-term fine-grained information about the scene, resulting in
+inconsistency in the long horizon. To address this, we propose an Omni World
+modeL (Owl-1) to produce long-term coherent and comprehensive conditions for
+consistent long video generation. As videos are observations of the underlying
+evolving world, we propose to model the long-term developments in a latent
+space and use VGMs to film them into videos. Specifically, we represent the
+world with a latent state variable which can be decoded into explicit video
+observations. These observations serve as a basis for anticipating temporal
+dynamics which in turn update the state variable. The interaction between
+evolving dynamics and persistent state enhances the diversity and consistency
+of the long videos. Extensive experiments show that Owl-1 achieves comparable
+performance with SOTA methods on VBench-I2V and VBench-Long, validating its
+ability to generate high-quality video observations. Code:
+https://github.com/huang-yh/Owl.
+
+
+
+ comment: Code is available at: https://github.com/huang-yh/Owl
+
+
+
+
+
+
+ ☆ RatBodyFormer: Rodent Body Surface from Keypoints
+
+
+ Rat behavior modeling goes to the heart of many scientific studies, yet the
+textureless body surface evades automatic analysis as it literally has no
+keypoints that detectors can find. The movement of the body surface, however,
+is a rich source of information for deciphering the rat behavior. We introduce
+two key contributions to automatically and passively recover densely sampled
+3D rat body surface points. The first is RatDome, a novel multi-camera system
+for rat behavior capture, and a large-scale dataset captured with it that
+consists of pairs of 3D keypoints and 3D body surface points. The second is
+RatBodyFormer, a novel network to transform detected keypoints to 3D body
+surface points. RatBodyFormer is agnostic to the exact locations of the 3D body
+surface points in the training data and is trained with masked learning. We
+validate our framework in a number of real-world experiments.
+Our results collectively serve as a novel foundation for automated rat behavior
+analysis and will likely have far-reaching implications for biomedical and
+neuroscientific research.
+
+
+
+
+
+
+
+ ☆ LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video
+ Generation Priors
+
+
+ Single-image 3D reconstruction remains a fundamental challenge in computer
+vision due to inherent geometric ambiguities and limited viewpoint information.
+Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D
+priors learned from large-scale video data. However, leveraging these priors
+effectively faces three key challenges: (1) degradation in quality across large
+camera motions, (2) difficulties in achieving precise camera control, and (3)
+geometric distortions inherent to the diffusion process that damage 3D
+consistency. We address these challenges by proposing LiftImage3D, a framework
+that effectively releases LVDMs' generative priors while ensuring 3D
+consistency. Specifically, we design an articulated trajectory strategy to
+generate video frames, which decomposes video sequences with large camera
+motions into ones with controllable small motions. Then we use robust neural
+matching models, i.e. MASt3R, to calibrate the camera poses of generated frames
+and produce corresponding point clouds. Finally, we propose a distortion-aware
+3D Gaussian splatting representation, which can learn independent distortions
+between frames and output undistorted canonical Gaussians. Extensive
+experiments demonstrate that LiftImage3D achieves state-of-the-art performance
+on three challenging datasets, i.e., LLFF, DL3DV, and Tanks and Temples, and
+generalizes well to diverse in-the-wild images, from cartoon illustrations to
+complex real-world scenes.
+
+
+
+
+
+
+
+ ☆ InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for
+ Long-term Streaming Video and Audio Interactions
+
+
+
+
+
+
+
+
+ Pan Zhang, Xiaoyi Dong, Yuhang Cao, Yuhang Zang, Rui Qian, Xilin Wei, Lin Chen, Yifei Li, Junbo Niu, Shuangrui Ding, Qipeng Guo, Haodong Duan, Xin Chen, Han Lv, Zheng Nie, Min Zhang, Bin Wang, Wenwei Zhang, Xinyue Zhang, Jiaye Ge, Wei Li, Jingwen Li, Zhongying Tu, Conghui He, Xingcheng Zhang, Kai Chen, Yu Qiao, Dahua Lin, Jiaqi Wang
+
+
+ Creating AI systems that can interact with environments over long periods,
+similar to human cognition, has been a longstanding research goal. Recent
+advancements in multimodal large language models (MLLMs) have made significant
+strides in open-world understanding. However, the challenge of continuous and
+simultaneous streaming perception, memory, and reasoning remains largely
+unexplored. Current MLLMs are constrained by their sequence-to-sequence
+architecture, which limits their ability to process inputs and generate
+responses simultaneously, akin to being unable to think while perceiving.
+Furthermore, relying on long contexts to store historical data is impractical
+for long-term interactions, as retaining all information becomes costly and
+inefficient. Therefore, rather than relying on a single foundation model to
+perform all functions, this project draws inspiration from the concept of the
+Specialized Generalist AI and introduces disentangled streaming perception,
+reasoning, and memory mechanisms, enabling real-time interaction with streaming
+video and audio input. The proposed framework InternLM-XComposer2.5-OmniLive
+(IXC2.5-OL) consists of three key modules: (1) Streaming Perception Module:
+Processes multimodal information in real-time, storing key details in memory
+and triggering reasoning in response to user queries. (2) Multi-modal Long
+Memory Module: Integrates short-term and long-term memory, compressing
+short-term memories into long-term ones for efficient retrieval and improved
+accuracy. (3) Reasoning Module: Responds to queries and executes reasoning
+tasks, coordinating with the perception and memory modules. This project
+simulates human-like cognition, enabling multimodal large language models to
+provide continuous and adaptive service over time.
+
+
+ Recovering the geometry and materials of objects from a single image is
+challenging due to its under-constrained nature. In this paper, we present
+Neural LightRig, a novel framework that boosts intrinsic estimation by
+leveraging auxiliary multi-lighting conditions from 2D diffusion priors.
+Specifically, 1) we first leverage illumination priors from large-scale
+diffusion models to build our multi-light diffusion model on a synthetic
+relighting dataset with dedicated designs. This diffusion model generates
+multiple consistent images, each illuminated by point light sources in
+different directions. 2) By using these varied lighting images to reduce
+estimation uncertainty, we train a large G-buffer model with a U-Net backbone
+to accurately predict surface normals and materials. Extensive experiments
+validate that our approach significantly outperforms state-of-the-art methods,
+enabling accurate surface normal and PBR material estimation with vivid
+relighting effects. Code and dataset are available on our project page at
+https://projects.zxhezexin.com/neural-lightrig.
+
+
+
+
+
+
+
+
+ Fiona Ryan, Ajay Bati, Sangmin Lee, Daniel Bolya, Judy Hoffman, James M. Rehg
+
+
+ We address the problem of gaze target estimation, which aims to predict where
+a person is looking in a scene. Predicting a person's gaze target requires
+reasoning both about the person's appearance and the contents of the scene.
+Prior works have developed increasingly complex, hand-crafted pipelines for
+gaze target estimation that carefully fuse features from separate scene
+encoders, head encoders, and auxiliary models for signals like depth and pose.
+Motivated by the success of general-purpose feature extractors on a variety of
+visual tasks, we propose Gaze-LLE, a novel transformer framework that
+streamlines gaze target estimation by leveraging features from a frozen DINOv2
+encoder. We extract a single feature representation for the scene, and apply a
+person-specific positional prompt to decode gaze with a lightweight module. We
+demonstrate state-of-the-art performance across several gaze benchmarks and
+provide extensive analysis to validate our design choices. Our code is
+available at: http://github.com/fkryan/gazelle.
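+
+ A minimal sketch of the frozen-encoder-plus-lightweight-decoder design,
+assuming generic patch features (in the paper the frozen backbone is DINOv2)
+and one illustrative way of injecting the person-specific positional prompt;
+module shapes and the heatmap grid are placeholders.
+
+```python
+import torch
+import torch.nn as nn
+
+class GazeDecoder(nn.Module):
+    """Illustrative lightweight gaze decoder on top of a frozen scene feature map."""
+    def __init__(self, dim=768, grid=16):
+        super().__init__()
+        self.person_prompt = nn.Parameter(torch.zeros(1, dim))  # added at the head position
+        self.blocks = nn.TransformerEncoder(
+            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
+        self.to_heatmap = nn.Linear(dim, 1)
+        self.grid = grid
+
+    def forward(self, scene_feats, head_index):
+        # scene_feats: (B, grid*grid, dim) from a frozen encoder; head_index: (B,)
+        x = scene_feats.clone()
+        x[torch.arange(x.size(0)), head_index] += self.person_prompt
+        x = self.blocks(x)
+        heat = self.to_heatmap(x).squeeze(-1)          # (B, grid*grid)
+        return heat.view(-1, self.grid, self.grid)      # per-cell gaze target logits
+
+feats = torch.randn(2, 256, 768)                        # frozen features for two scenes
+heatmap = GazeDecoder()(feats, head_index=torch.tensor([10, 42]))
+print(heatmap.shape)                                     # torch.Size([2, 16, 16])
+```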
+
+
+ The standard practice for developing contemporary MLLMs is to feed features
+from vision encoder(s) into the LLM and train with natural language
+supervision. In this work, we posit an overlooked opportunity to optimize the
+intermediate LLM representations through a vision perspective (objective),
+i.e., natural language supervision alone is sub-optimal for the MLLM's visual
+understanding ability. To that end, we propose OLA-VLM, the first approach
+distilling knowledge into the LLM's hidden representations from a set of target
+visual representations. Firstly, we formulate the objective during the
+pretraining stage in MLLMs as a coupled optimization of predictive visual
+embedding and next text-token prediction. Secondly, we investigate MLLMs
+trained solely with natural language supervision and identify a positive
+correlation between the quality of visual representations within these models
+and their downstream performance. Moreover, upon probing our OLA-VLM, we
+observe improved representation quality owing to the embedding optimization.
+Thirdly, we demonstrate that our OLA-VLM outperforms the single and
+multi-encoder baselines, proving our approach's superiority over explicitly
+feeding the corresponding features to the LLM. Particularly, OLA-VLM boosts
+performance by an average margin of up to 2.5% on various benchmarks, with a
+notable improvement of 8.7% on the Depth task in CV-Bench. Our code is
+open-sourced at https://github.com/SHI-Labs/OLA-VLM .
+
+
+ This paper describes a semi-automatic pipeline to generate challenging
+question-answer-decoy sets for understanding long videos. Many existing video
+datasets and models are focused on short clips (10s-30s). While some long video
+datasets do exist, they can often be solved by powerful image models applied
+per frame (and often to very few frames) in a video, and are usually manually
+annotated at high cost. In order to mitigate both these problems, we propose a
+scalable dataset creation pipeline which leverages large models (VLMs and
+LLMs), to automatically generate dense, time-aligned video captions, as well as
+tough question answer decoy sets for video segments (up to 15 minutes in
+length). Our dataset Neptune covers a broad range of long video reasoning
+abilities and consists of a subset that emphasizes multimodal reasoning. Since
+existing metrics for open-ended question answering are either rule-based or may
+rely on proprietary models, we provide a new open source model-based metric GEM
+to score open-ended responses on Neptune. Benchmark evaluations reveal that
+most current open-source long video models perform poorly on Neptune,
+particularly on questions testing temporal ordering, counting and state
+changes. Through Neptune, we aim to spur the development of more advanced
+models capable of understanding long videos. The dataset is available at
+https://github.com/google-deepmind/neptune
+
+
+
+
+
+
+
+ ☆ FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D
+ Reconstruction
+
+
+ Existing sparse-view reconstruction models heavily rely on accurate known
+camera poses. However, deriving camera extrinsics and intrinsics from
+sparse-view images presents significant challenges. In this work, we present
+FreeSplatter, a highly scalable, feed-forward reconstruction framework capable
+of generating high-quality 3D Gaussians from uncalibrated sparse-view images
+and recovering their camera parameters in mere seconds. FreeSplatter is built
+upon a streamlined transformer architecture, comprising sequential
+self-attention blocks that facilitate information exchange among multi-view
+image tokens and decode them into pixel-wise 3D Gaussian primitives. The
+predicted Gaussian primitives are situated in a unified reference frame,
+allowing for high-fidelity 3D modeling and instant camera parameter estimation
+using off-the-shelf solvers. To cater to both object-centric and scene-level
+reconstruction, we train two model variants of FreeSplatter on extensive
+datasets. In both scenarios, FreeSplatter outperforms state-of-the-art
+baselines in terms of reconstruction quality and pose estimation accuracy.
+Furthermore, we showcase FreeSplatter's potential in enhancing the productivity
+of downstream applications, such as text/image-to-3D content creation.
+
+
+
+
+
+
+
+
+ Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
+
+
+ We explore a novel video creation experience, namely Video Creation by
+Demonstration. Given a demonstration video and a context image from a different
+scene, we generate a physically plausible video that continues naturally from
+the context image and carries out the action concepts from the demonstration.
+To enable this capability, we present $\delta$-Diffusion, a self-supervised
+training approach that learns from unlabeled videos by conditional future frame
+prediction. Unlike most existing video generation controls that are based on
+explicit signals, we adopt the form of implicit latent control for maximal
+flexibility and expressiveness required by general videos. By leveraging a
+video foundation model with an appearance bottleneck design on top, we extract
+action latents from demonstration videos for conditioning the generation
+process with minimal appearance leakage. Empirically, $\delta$-Diffusion
+outperforms related baselines in terms of both human preference and large-scale
+machine evaluations, and demonstrates potential for interactive world
+simulation. Sampled video generation results are available at
+https://delta-diffusion.github.io/.
+
+
+
+ comment: Project page at https://delta-diffusion.github.io/
+
+ Multimodal incremental learning needs to digest the information from multiple
+modalities while concurrently learning new knowledge without forgetting the
+previously learned information. There are numerous challenges for this task,
+mainly including the larger storage size of multimodal data in exemplar-based
+methods and the computational requirement of finetuning on huge multimodal
+models. In this paper, we leverage the parameter-efficient tuning scheme to
+reduce the burden of fine-tuning and propose the exemplar masking framework to
+efficiently replay old knowledge. Specifically, the non-important tokens are
+masked based on the attention weights and the correlation across different
+modalities, significantly reducing the storage size of an exemplar and
+consequently saving more exemplars under the same memory buffer. Moreover, we
+design a multimodal data augmentation technique to diversify exemplars for
+replaying prior knowledge. In experiments, we not only evaluate our method in
+existing multimodal datasets but also extend the ImageNet-R dataset to a
+multimodal dataset as a real-world application, where captions are generated by
+querying multimodal large language models (e.g., InstructBLIP). Extensive
+experiments show that our exemplar masking framework is more efficient and
+robust to catastrophic forgetting under the same limited memory buffer. Code is
+available at https://github.com/YiLunLee/Exemplar_Masking_MCIL.
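+
+ A minimal sketch of shrinking a stored exemplar by keeping only its most
+important tokens; here a single importance score per token stands in for the
+paper's combination of attention weights and cross-modal correlation.
+
+```python
+import torch
+
+def mask_exemplar(tokens, importance, keep_ratio=0.25):
+    """Keep only the most important tokens of an exemplar (illustrative version
+    of exemplar masking; `importance` could be, e.g., [CLS]-to-patch attention)."""
+    num_keep = max(1, int(tokens.size(0) * keep_ratio))
+    keep_idx = importance.topk(num_keep).indices.sort().values   # preserve token order
+    return tokens[keep_idx], keep_idx
+
+tokens = torch.randn(196, 768)          # e.g. ViT patch tokens of one stored image
+importance = torch.rand(196)            # per-token importance scores
+masked, idx = mask_exemplar(tokens, importance)
+print(masked.shape)                      # torch.Size([49, 768]) -> a 4x smaller exemplar
+```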
+
+
+
+
+
+
+
+ ☆ Meshtron: High-Fidelity, Artist-Like 3D Mesh Generation at Scale
+
+
+
+
+
+
+
+
+ Zekun Hao, David W. Romero, Tsung-Yi Lin, Ming-Yu Liu
+
+
+ Meshes are fundamental representations of 3D surfaces. However, creating
+high-quality meshes is a labor-intensive task that requires significant time
+and expertise in 3D modeling. While a delicate object often requires over
+$10^4$ faces to be accurately modeled, recent attempts at generating
+artist-like meshes are limited to $1.6$K faces and heavy discretization of
+vertex coordinates. Hence, scaling both the maximum face count and vertex
+coordinate resolution is crucial to producing high-quality meshes of realistic,
+complex 3D objects. We present Meshtron, a novel autoregressive mesh generation
+model able to generate meshes with up to 64K faces at 1024-level coordinate
+resolution -- over an order of magnitude higher face count and $8{\times}$
+higher coordinate resolution than current state-of-the-art methods. Meshtron's
+scalability is driven by four key components: (1) an hourglass neural
+architecture, (2) truncated sequence training, (3) sliding window inference,
+(4) a robust sampling strategy that enforces the order of mesh sequences. This
+results in over $50{\%}$ less training memory, $2.5{\times}$ faster throughput,
+and better consistency than existing works. Meshtron generates meshes of
+detailed, complex 3D objects at unprecedented levels of resolution and
+fidelity, closely resembling those created by professional artists, and opening
+the door to more realistic generation of detailed 3D assets for animation,
+gaming, and virtual environments.
+
+
+
+
+
+
+
+ ☆ SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
+
+
+
+
+
+
+
+
+ Xueting Li, Ye Yuan, Shalini De Mello, Gilles Daviet, Jonathan Leaf, Miles Macklin, Jan Kautz, Umar Iqbal
+
+
+ We introduce SimAvatar, a framework designed to generate simulation-ready
+clothed 3D human avatars from a text prompt. Current text-driven human avatar
+generation methods either model hair, clothing, and the human body using a
+unified geometry or produce hair and garments that are not easily adaptable for
+simulation within existing simulation pipelines. The primary challenge lies in
+representing the hair and garment geometry in a way that allows leveraging
+established prior knowledge from foundational image diffusion models (e.g.,
+Stable Diffusion) while being simulation-ready using either physics or neural
+simulators. To address this task, we propose a two-stage framework that
+combines the flexibility of 3D Gaussians with simulation-ready hair strands and
+garment meshes. Specifically, we first employ three text-conditioned 3D
+generative models to generate garment mesh, body shape and hair strands from
+the given text prompt. To leverage prior knowledge from foundational diffusion
+models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair
+strands and learn the avatar appearance through optimization. To drive the
+avatar given a pose sequence, we first apply physics simulators onto the
+garment meshes and hair strands. We then transfer the motion onto 3D Gaussians
+through carefully designed mechanisms for each body part. As a result, our
+synthesized avatars have vivid texture and realistic dynamic motion. To the
+best of our knowledge, our method is the first to produce highly realistic,
+fully simulation-ready 3D avatars, surpassing the capabilities of current
+approaches.
+
+
+
+
+
+
+
+
+ Han Wang, Yuxiang Nie, Yongjie Ye, Deng GuanYu, Yanjie Wang, Shuai Li, Haiyang Yu, Jinghui Lu, Can Huang
+
+
+ The application of Large Vision-Language Models (LVLMs) for analyzing images
+and videos is an exciting and rapidly evolving field. In recent years, we've
+seen significant growth in high-quality image-text datasets for fine-tuning
+image understanding, but there is still a lack of comparable datasets for
+videos. Additionally, many VideoLLMs are extensions of single-image VLMs, which
+may not efficiently handle the complexities of longer videos. In this study, we
+introduce a large-scale synthetic dataset created from proprietary models,
+using carefully designed prompts to tackle a wide range of questions. We also
+explore a dynamic visual token compression architecture that strikes a balance
+between computational efficiency and performance. Our proposed \model{}
+achieves state-of-the-art results across various video tasks and shows
+impressive generalization, setting new baselines in multi-image understanding.
+Notably, \model{} delivers an absolute improvement of 2.7% over
+LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at
+https://github.com/Hon-Wong/ByteVideoLLM
+
+
+
+
+
+
+
+ ☆ Can Modern LLMs Act as Agent Cores in Radiology Environments?
+
+
+ Advancements in large language models (LLMs) have paved the way for LLM-based
+agent systems that offer enhanced accuracy and interpretability across various
+domains. Radiology, with its complex analytical requirements, is an ideal field
+for the application of these agents. This paper aims to investigate the
+prerequisite question for building concrete radiology agents: "Can modern LLMs
+act as agent cores in radiology environments?" To investigate it,
+we introduce RadABench with three-fold contributions: First, we present
+RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based
+agents, generated from an extensive taxonomy encompassing 6 anatomies, 5
+imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we
+propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a
+prompt-driven workflow and the capability to simulate a wide range of radiology
+toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark
+from 5 perspectives with multiple metrics. Our findings indicate that while
+current LLMs demonstrate strong capabilities in many areas, they are still not
+sufficiently advanced to serve as the central agent core in a fully operational
+radiology agent system. Additionally, we identify key factors influencing the
+performance of LLM-based agent cores, offering insights for clinicians on how
+to apply agent systems in real-world radiology practices effectively. All of
+our code and data are open-sourced at
+https://github.com/MAGIC-AI4Med/RadABench.
+
+
+
+ comment: 22 pages, 7 figures
+
+
+
+
+
+
+ ☆ Efficient and Comprehensive Feature Extraction in Large Vision-Language
+ Model for Clinical Pathology Analysis
+
+
+ Pathological diagnosis is vital for determining disease characteristics,
+guiding treatment, and assessing prognosis, relying heavily on detailed,
+multi-scale analysis of high-resolution whole slide images (WSI). However,
+traditional pure vision models face challenges of redundant feature extraction,
+whereas existing large vision-language models (LVLMs) are limited by input
+resolution constraints, hindering their efficiency and accuracy. To overcome
+these issues, we propose two innovative strategies: the mixed task-guided
+feature enhancement, which directs feature extraction toward lesion-related
+details across scales, and the prompt-guided detail feature completion, which
+integrates coarse- and fine-grained features from WSI based on specific prompts
+without compromising inference speed. Leveraging a comprehensive dataset of
+490,000 samples from diverse pathology tasks (including cancer detection,
+grading, vascular and neural invasion identification, and so on), we trained the
+pathology-specialized LVLM, OmniPath. Extensive experiments demonstrate that
+this model significantly outperforms existing methods in diagnostic accuracy
+and efficiency, offering an interactive, clinically aligned approach for
+auxiliary diagnosis in a wide range of pathology applications.
+
+
+ As information becomes more accessible, user-generated videos are increasing
+in length, placing a burden on viewers to sift through vast content for
+valuable insights. This trend underscores the need for an algorithm to extract
+key video information efficiently. Despite significant advancements in
+highlight detection, moment retrieval, and video summarization, current
+approaches primarily focus on selecting specific time intervals, often
+overlooking the relevance between segments and the potential for segment
+arranging. In this paper, we introduce a novel task called Video Trimming (VT),
+which focuses on detecting wasted footage, selecting valuable segments, and
+composing them into a final video with a coherent story. To address this task,
+we propose Agent-based Video Trimming (AVT), structured into three phases:
+Video Structuring, Clip Filtering, and Story Composition. Specifically, we
+employ a Video Captioning Agent to convert video slices into structured textual
+descriptions, a Filtering Module to dynamically discard low-quality footage
+based on the structured information of each clip, and a Video Arrangement Agent
+to select and compile valid clips into a coherent final narrative. For
+evaluation, we develop a Video Evaluation Agent to assess trimmed videos,
+conducting assessments in parallel with human evaluations. Additionally, we
+curate a new benchmark dataset for video trimming using raw user videos from
+the internet. As a result, AVT received more favorable evaluations in user
+studies and demonstrated superior mAP and precision on the YouTube Highlights,
+TVSum, and our own dataset for the highlight detection task. The code and
+models are available at https://ylingfeng.github.io/AVT.
+
+
+
+
+
+
+
+ ☆ GEAL: Generalizable 3D Affordance Learning with Cross-Modal Consistency
+
+
+ Identifying affordance regions on 3D objects from semantic cues is essential
+for robotics and human-machine interaction. However, existing 3D affordance
+learning methods struggle with generalization and robustness due to limited
+annotated data and a reliance on 3D backbones focused on geometric encoding,
+which often lack resilience to real-world noise and data corruption. We propose
+GEAL, a novel framework designed to enhance the generalization and robustness
+of 3D affordance learning by leveraging large-scale pre-trained 2D models. We
+employ a dual-branch architecture with Gaussian splatting to establish
+consistent mappings between 3D point clouds and 2D representations, enabling
+realistic 2D renderings from sparse point clouds. A granularity-adaptive fusion
+module and a 2D-3D consistency alignment module further strengthen cross-modal
+alignment and knowledge transfer, allowing the 3D branch to benefit from the
+rich semantics and generalization capacity of 2D models. To holistically assess
+the robustness, we introduce two new corruption-based benchmarks: PIAD-C and
+LASO-C. Extensive experiments on public datasets and our benchmarks show that
+GEAL consistently outperforms existing methods across seen and novel object
+categories, as well as corrupted data, demonstrating robust and adaptable
+affordance prediction under diverse conditions. Code and corruption datasets
+have been made publicly available.
+
+
+
+
+
+
+
+ ☆ Vision Transformers for Efficient Indoor Pathloss Radio Map Prediction
+
+
+
+
+
+
+
+
+ Edvard Ghukasyan, Hrant Khachatrian, Rafayel Mkrtchyan, Theofanis P. Raptis
+
+
+ Vision Transformers (ViTs) have demonstrated remarkable success in achieving
+state-of-the-art performance across various image-based tasks and beyond. In
+this study, we employ a ViT-based neural network to address the problem of
+indoor pathloss radio map prediction. The network's generalization ability is
+evaluated across diverse settings, including unseen buildings, frequencies, and
+antennas with varying radiation patterns. By leveraging extensive data
+augmentation techniques and pretrained DINOv2 weights, we achieve promising
+results, even under the most challenging scenarios.
+
+
+
+ comment: Work partly supported by the RA Science Committee grant No. 22rl-052
+ (DISTAL) and the EU under Italian National Recovery and Resilience Plan of
+ NextGenerationEU on "Telecommunications of the Future" (PE00000001 - program
+ "RESTART")
+
+
+
+
+
+
+ ☆ Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition
+
+
+ As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond
+single-domain capabilities is essential to meet the demands for more versatile
+and efficient AI. However, previous omni-models have insufficiently explored
+speech, neglecting its integration with multi-modality. We introduce Lyra, an
+efficient MLLM that enhances multimodal abilities, including advanced
+long-speech comprehension, sound understanding, cross-modality efficiency, and
+seamless speech interaction. To achieve efficiency and speech-centric
+capabilities, Lyra employs three strategies: (1) leveraging existing
+open-source large models and a proposed multi-modality LoRA to reduce training
+costs and data requirements; (2) using a latent multi-modality regularizer and
+extractor to strengthen the relationship between speech and other modalities,
+thereby enhancing model performance; and (3) constructing a high-quality,
+extensive dataset that includes 1.5M multi-modal (language, vision, audio) data
+samples and 12K long speech samples, enabling Lyra to handle complex long
+speech inputs and achieve more robust omni-cognition. Compared to other
+omni-methods, Lyra achieves state-of-the-art performance on various
+vision-language, vision-speech, and speech-language benchmarks, while also
+using fewer computational resources and less training data.
+
+
+
+ comment: Tech report
+
+
+
+
+
+
+ ☆ Video Seal: Open and Efficient Video Watermarking
+
+
+
+
+
+
+
+
+ Pierre Fernandez, Hady Elsahar, I. Zeki Yalniz, Alexandre Mourachko
+
+
+ The proliferation of AI-generated content and sophisticated video editing
+tools has made it both important and challenging to moderate digital platforms.
+Video watermarking addresses these challenges by embedding imperceptible
+signals into videos, allowing for identification. However, the few open tools
+and methods often fall short on efficiency, robustness, and flexibility. To
+reduce these gaps, this paper introduces Video Seal, a comprehensive framework
+for neural video watermarking and a competitive open-sourced model. Our
+approach jointly trains an embedder and an extractor, while ensuring the
+watermark robustness by applying transformations in-between, e.g., video
+codecs. This training is multistage and includes image pre-training, hybrid
+post-training and extractor fine-tuning. We also introduce temporal watermark
+propagation, a technique to convert any image watermarking model to an
+efficient video watermarking model without the need to watermark every
+high-resolution frame. We present experimental results demonstrating the
+effectiveness of the approach in terms of speed, imperceptibility, and
+robustness. Video Seal achieves higher robustness compared to strong baselines
+especially under challenging distortions combining geometric transformations
+and video compression. Additionally, we provide new insights such as the impact
+of video compression during training, and how to compare methods operating on
+different payloads. Contributions in this work - including the codebase,
+models, and a public demo - are open-sourced under permissive licenses to
+foster further research and development in the field.
+
+
+
+ comment: Code available at https://github.com/facebookresearch/videoseal
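+
+ A minimal sketch of temporal watermark propagation as described above,
+assuming an arbitrary image watermarking function (`embed_image` is a
+placeholder, not the Video Seal API): the image embedder runs only on every
+`stride`-th frame and its additive residual is reused on the frames in
+between, so not every high-resolution frame is processed.
+
+```python
+import numpy as np
+
+def propagate_watermark(frames, embed_image, stride=8):
+    """Run the (expensive) image watermark embedder only every `stride` frames
+    and reuse its additive residual on the frames in between."""
+    watermarked = []
+    residual = np.zeros_like(frames[0])
+    for i, frame in enumerate(frames):
+        if i % stride == 0:
+            residual = embed_image(frame) - frame    # watermark signal for this chunk
+        watermarked.append(np.clip(frame + residual, 0.0, 1.0))
+    return watermarked
+
+# Toy example with a fake embedder that adds a small fixed pattern.
+rng = np.random.default_rng(0)
+frames = [rng.random((64, 64, 3), dtype=np.float32) for _ in range(24)]
+fake_embedder = lambda f: f + 0.01 * np.sin(np.arange(64))[None, :, None]
+out = propagate_watermark(frames, fake_embedder)
+print(len(out), out[0].shape)   # 24 (64, 64, 3)
+```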
+
+
+
+
+
+
+ ☆ New keypoint-based approach for recognising British Sign Language (BSL)
+ from sequences ICCV
+
+
+
+
+
+
+
+
+ Oishi Deb, KR Prajwal, Andrew Zisserman
+
+
+ In this paper, we present a novel keypoint-based classification model
+designed to recognise British Sign Language (BSL) words within continuous
+signing sequences. Our model's performance is assessed using the BOBSL dataset,
+revealing that the keypoint-based approach surpasses its RGB-based counterpart
+in computational efficiency and memory usage. Furthermore, it offers expedited
+training times and demands fewer computational resources. To the best of our
+knowledge, this is the inaugural application of a keypoint-based model for BSL
+word classification, rendering direct comparisons with existing works
+unavailable.
+
+
+
+ comment: International Conference on Computer Vision (ICCV) - HANDS Workshop
+
+
+
+
+
+
+ ☆ OFTSR: One-Step Flow for Image Super-Resolution with Tunable
+ Fidelity-Realism Trade-offs
+
+
+ Recent advances in diffusion and flow-based generative models have
+demonstrated remarkable success in image restoration tasks, achieving superior
+perceptual quality compared to traditional deep learning approaches. However,
+these methods either require numerous sampling steps to generate high-quality
+images, resulting in significant computational overhead, or rely on model
+distillation, which usually imposes a fixed fidelity-realism trade-off and thus
+lacks flexibility. In this paper, we introduce OFTSR, a novel flow-based
+framework for one-step image super-resolution that can produce outputs with
+tunable levels of fidelity and realism. Our approach first trains a conditional
+flow-based super-resolution model to serve as a teacher model. We then distill
+this teacher model by applying a specialized constraint. Specifically, we force
+the predictions from our one-step student model for the same input to lie on the
+same sampling ODE trajectory of the teacher model. This alignment ensures that
+the student model's single-step predictions from initial states match the
+teacher's predictions from a closer intermediate state. Through extensive
+experiments on challenging datasets including FFHQ (256$\times$256), DIV2K, and
+ImageNet (256$\times$256), we demonstrate that OFTSR achieves state-of-the-art
+performance for one-step image super-resolution, while having the ability to
+flexibly tune the fidelity-realism trade-off. Code and pre-trained models are
+available at https://github.com/yuanzhi-zhu/OFTSR and
+https://huggingface.co/Yuanzhi/OFTSR, respectively.
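+
+ A minimal sketch of the distillation constraint stated above, with the
+teacher's ODE integration reduced to a single Euler step for brevity;
+`student` and `teacher` are placeholder callables, not the released models,
+and the time convention is only illustrative.
+
+```python
+import torch
+
+def trajectory_matching_loss(student, teacher, x_init, t_init, t_mid, lr_image):
+    """The student's one-step output from (x_init, t_init) is trained to match
+    the teacher's prediction from a closer intermediate state (x_mid, t_mid)
+    on the same sampling ODE trajectory."""
+    with torch.no_grad():
+        v = teacher(x_init, t_init, lr_image)          # teacher velocity/denoising estimate
+        x_mid = x_init + (t_mid - t_init) * v           # one Euler step along the trajectory
+        target = teacher(x_mid, t_mid, lr_image)        # teacher prediction from the closer state
+    pred = student(x_init, t_init, lr_image)            # single-step student prediction
+    return torch.nn.functional.mse_loss(pred, target)
+
+# Toy shapes only; the callables stand in for conditional flow models.
+net = lambda x, t, lr: 0.9 * x
+x = torch.randn(2, 3, 64, 64)
+print(trajectory_matching_loss(net, net, x, t_init=1.0, t_mid=0.5, lr_image=None).item())
+```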
+
+
+
+
+
+
+
+ ☆ Embeddings are all you need! Achieving High Performance Medical Image
+ Classification through Training-Free Embedding Analysis
+
+
+
+
+
+
+
+
+ Raj Hansini Khoiwal, Alan B. McMillan
+
+
+ Developing artificial intelligence (AI) and machine learning (ML) models for
+medical imaging typically involves extensive training and testing on large
+datasets, consuming significant computational time, energy, and resources.
+There is a need for more efficient methods that can achieve comparable or
+superior diagnostic performance without the associated resource burden. We
+investigated the feasibility of replacing conventional training procedures with
+an embedding-based approach that leverages concise and semantically meaningful
+representations of medical images. Using pre-trained foundational
+models, specifically convolutional neural networks (CNNs) like ResNet and
+multimodal models like Contrastive Language-Image Pre-training (CLIP), we
+generated image embeddings for multi-class classification tasks. Simple linear
+classifiers were then applied to these embeddings. The approach was evaluated
+across diverse medical imaging modalities, including retinal images,
+mammography, dermatoscopic images, and chest radiographs. Performance was
+compared to benchmark models trained and tested using traditional methods. The
+embedding-based models surpassed the benchmark area under the receiver
+operating characteristic curve (AUC-ROC) scores by up to 87 percentage points in
+multi-class classification tasks across the various medical imaging modalities.
+Notably, CLIP embedding models achieved the highest AUC-ROC scores,
+demonstrating superior classification performance while significantly reducing
+computational demands. Our study indicates that leveraging embeddings from
+pre-trained foundational models can effectively replace conventional,
+resource-intensive training and testing procedures in medical image analysis.
+This embedding-based approach offers a more efficient alternative for image
+segmentation, classification, and prediction, potentially accelerating AI
+technology integration into clinical practice.
+
+
+
+ comment: 15 pages, 7 figures, 3 tables
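+
+ A minimal sketch of the training-free recipe, assuming a frozen ImageNet
+ResNet-50 from a recent torchvision as the feature extractor (a CLIP image
+encoder can be swapped in the same way) and scikit-learn's logistic regression
+as the linear classifier; the data below are random stand-ins for preprocessed
+medical images.
+
+```python
+import numpy as np
+import torch
+import torch.nn as nn
+import torchvision
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import roc_auc_score
+
+# Frozen backbone used purely as a feature extractor (no training or fine-tuning).
+backbone = torchvision.models.resnet50(weights="DEFAULT")
+backbone.fc = nn.Identity()
+backbone.eval()
+
+@torch.no_grad()
+def embed(images):                       # images: (N, 3, 224, 224), already preprocessed
+    return backbone(images).numpy()      # (N, 2048) embeddings
+
+x_train, y_train = torch.randn(32, 3, 224, 224), np.array([0, 1] * 16)
+x_test, y_test = torch.randn(16, 3, 224, 224), np.array([0, 1] * 8)
+
+clf = LogisticRegression(max_iter=1000).fit(embed(x_train), y_train)
+scores = clf.predict_proba(embed(x_test))[:, 1]
+print("AUC-ROC:", roc_auc_score(y_test, scores))
+```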
+
+
+
+
+
+
+ ☆ MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental
+ Learning AAAI 2025
+
+
+
+
+
+
+
+
+ Hai-Long Sun, Da-Wei Zhou, Hanbin Zhao, Le Gan, De-Chuan Zhan, Han-Jia Ye
+
+
+ Class-Incremental Learning (CIL) requires models to continually acquire
+knowledge of new classes without forgetting old ones. Although Pre-trained
+Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting
+still occurs as the model learns new concepts. Existing work seeks to utilize
+lightweight components to adjust the PTM, while the forgetting phenomenon still
+comes from the parameter and retrieval levels. Specifically, iterative
+updates of the model result in parameter drift, while mistakenly retrieving
+irrelevant modules leads to the mismatch during inference. To this end, we
+propose MOdel Surgery (MOS) to rescue the model from forgetting previous
+knowledge. By training task-specific adapters, we continually adjust the PTM to
+downstream tasks. To mitigate parameter-level forgetting, we present an adapter
+merging approach to learn task-specific adapters, which aims to bridge the gap
+between different components while preserving task-specific information. Besides,
+to address retrieval-level forgetting, we introduce a training-free
+self-refined adapter retrieval mechanism during inference, which leverages the
+model's inherent ability for better adapter retrieval. By jointly rectifying
+the model with those steps, MOS can robustly resist catastrophic forgetting in
+the learning process. Extensive experiments on seven benchmark datasets
+validate MOS's state-of-the-art performance. Code is available at:
+https://github.com/sun-hailong/AAAI25-MOS
+
+
+
+ comment: Accepted to AAAI 2025. Code is available at:
+ https://github.com/sun-hailong/AAAI25-MOS
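+
+ A rough sketch of the two mechanisms summarized above: averaging task-specific
+adapter weights (a simplified stand-in for the paper's merging rule) and a
+training-free retrieval step that picks an adapter by similarity to per-task
+prototypes; the self-refinement loop and LoRA details are omitted.
+
+```python
+import torch
+
+def merge_adapters(adapter_states, weights=None):
+    """Average task-specific adapter weights to limit parameter-level drift
+    (illustrative; the paper's merging preserves task-specific parts)."""
+    weights = weights or [1.0 / len(adapter_states)] * len(adapter_states)
+    return {k: sum(w * sd[k] for w, sd in zip(weights, adapter_states))
+            for k in adapter_states[0]}
+
+def retrieve_adapter(query_emb, task_prototypes):
+    """Training-free retrieval: pick the adapter whose task prototype is most
+    similar to the query embedding."""
+    sims = torch.stack([torch.nn.functional.cosine_similarity(query_emb, p, dim=0)
+                        for p in task_prototypes])
+    return int(sims.argmax())
+
+# Toy example: two adapters, each a single LoRA-style matrix, plus task prototypes.
+adapters = [{"lora_A": torch.randn(16, 768)} for _ in range(2)]
+prototypes = [torch.randn(768) for _ in range(2)]
+print(merge_adapters(adapters)["lora_A"].shape, retrieve_adapter(torch.randn(768), prototypes))
+```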
+
+ Textual-based prompt learning methods primarily employ multiple learnable
+soft prompts and hard class tokens in a cascading manner as text prompt inputs,
+aiming to align image and text (category) spaces for downstream tasks. However,
+current training is restricted to aligning images with predefined known
+categories and cannot be associated with unknown categories. In this work, we
+propose utilizing universal attributes as a bridge to enhance the alignment
+between images and unknown categories. Specifically, we introduce an
+Attribute-embedded Textual Prompt learning method for vision-language models,
+named ATPrompt. This approach expands the learning space of soft prompts from
+the original one-dimensional category level into the multi-dimensional
+attribute level by incorporating multiple universal attribute tokens into the
+learnable soft prompts. Through this modification, we transform the text prompt
+from a category-centric form to an attribute-category hybrid form. To finalize
+the attributes for downstream tasks, we propose a differentiable attribute
+search method that learns to identify representative and suitable attributes
+from a candidate pool summarized by a large language model. As an easy-to-use
+plug-in technique, ATPrompt can seamlessly replace the existing prompt format
+of textual-based methods, offering general improvements at a negligible
+computational cost. Extensive experiments on 11 datasets demonstrate the
+effectiveness of our method.
+
+
+ The dissertation presents four key contributions toward fairness and
+robustness in vision learning. First, to address the problem of large-scale
+data requirements, the dissertation presents a novel Fairness Domain Adaptation
+approach derived from two major novel research findings of Bijective Maximum
+Likelihood and Fairness Adaptation Learning. Second, to enable the capability
+of open-world modeling of vision learning, this dissertation presents a novel
+Open-world Fairness Continual Learning Framework. The success of this research
+direction is the result of two research lines, i.e., Fairness Continual
+Learning and Open-world Continual Learning. Third, since visual data are often
+captured from multiple camera views, robust vision learning methods should be
+capable of modeling invariant features across views. To achieve this desired
+goal, the research in this thesis will present a novel Geometry-based
+Cross-view Adaptation framework to learn robust feature representations across
+views. Finally, with the recent increase in large-scale videos and multimodal
+data, understanding the feature representations and improving the robustness of
+large-scale visual foundation models is critical. Therefore, this thesis will
+present novel Transformer-based approaches to improve the robust feature
+representations against multimodal and temporal data. Then, a novel Domain
+Generalization Approach will be presented to improve the robustness of visual
+foundation models. The research's theoretical analysis and experimental results
+have shown the effectiveness of the proposed approaches, demonstrating their
+superior performance compared to prior studies. The contributions in this
+dissertation have advanced the fairness and robustness of machine vision
+learning.
+
+
+
+ comment: PhD Dissertation
+
+
+
+
+
+
+ ☆ Multimodal Music Generation with Explicit Bridges and Retrieval
+ Augmentation
+
+
+
+
+
+
+
+
+ Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu
+
+
+ Multimodal music generation aims to produce music from diverse input
+modalities, including text, videos, and images. Existing methods use a common
+embedding space for multimodal fusion. Despite their effectiveness in other
+modalities, their application in multimodal music generation faces challenges
+of data scarcity, weak cross-modal alignment, and limited controllability. This
+paper addresses these issues by using explicit bridges of text and music for
+multimodal alignment. We introduce a novel method named Visuals Music Bridge
+(VMB). Specifically, a Multimodal Music Description Model converts visual
+inputs into detailed textual descriptions to provide the text bridge; a
+Dual-track Music Retrieval module combines broad and targeted retrieval
+strategies to provide the music bridge and enable user control. Finally, we
+design an Explicitly Conditioned Music Generation framework to generate music
+based on the two bridges. We conduct experiments on video-to-music,
+image-to-music, text-to-music, and controllable music generation tasks, along
+with experiments on controllability. The results demonstrate that VMB
+significantly enhances music quality, modality, and customization alignment
+compared to previous methods. VMB sets a new standard for interpretable and
+expressive multimodal music generation with applications in various multimedia
+fields. Demos and code are available at https://github.com/wbs2788/VMB.
+
+
+
+
+
+
+
+ ☆ A Plug-and-Play Algorithm for 3D Video Super-Resolution of Single-Photon
+ LiDAR data
+
+
+
+
+
+
+
+
+ Alice Ruget, Lewis Wilson, Jonathan Leach, Rachael Tobin, Aongus Mccarthy, Gerald S. Buller, Steve Mclaughlin, Abderrahim Halimi
+
+
+ Single-photon avalanche diodes (SPADs) are advanced sensors capable of
+detecting individual photons and recording their arrival times with picosecond
+resolution using time-correlated Single-Photon Counting detection techniques.
+They are used in various applications, such as LiDAR, and can capture
+high-speed sequences of binary single-photon images, offering great potential
+for reconstructing 3D environments with high motion dynamics. To complement
+single-photon data, they are often paired with conventional passive cameras,
+which capture high-resolution (HR) intensity images at a lower frame rate.
+However, 3D reconstruction from SPAD data faces challenges. Aggregating
+multiple binary measurements improves precision and reduces noise but can cause
+motion blur in dynamic scenes. Additionally, SPAD arrays often have lower
+resolution than passive cameras. To address these issues, we propose a novel
+computational imaging algorithm to improve the 3D reconstruction of moving
+scenes from SPAD data by addressing the motion blur and increasing the native
+spatial resolution. We adopt a plug-and-play approach within an optimization
+scheme alternating between guided video super-resolution of the 3D scene, and
+precise image realignment using optical flow. Experiments on synthetic data
+show significantly improved image resolutions across various signal-to-noise
+ratios and photon levels. We validate our method using real-world SPAD
+measurements in three practical situations with dynamic objects: first,
+fast-moving scenes in laboratory conditions at short range; second, very
+low-resolution imaging of people with a consumer-grade SPAD sensor from
+STMicroelectronics; and finally, HR imaging of people walking outdoors in
+daylight at a range of 325 meters under eye-safe illumination conditions using
+a short-wave infrared SPAD camera. These results demonstrate the robustness and
+versatility of our approach.
+
+
+
+
+
+
+
+
+ Dan Jacobellis, Neeraja J. Yadwadkar
+
+
+ Modern sensors produce increasingly rich streams of high-resolution data. Due
+to resource constraints, machine learning systems discard the vast majority of
+this information via resolution reduction. Compressed-domain learning allows
+models to operate on compact latent representations, allowing higher effective
+resolution for the same budget. However, existing compression systems are not
+ideal for compressed learning. Linear transform coding and end-to-end learned
+compression systems reduce bitrate, but do not uniformly reduce dimensionality;
+thus, they do not meaningfully increase efficiency. Generative autoencoders
+reduce dimensionality, but their adversarial or perceptual objectives lead to
+significant information loss. To address these limitations, we introduce WaLLoC
+(Wavelet Learned Lossy Compression), a neural codec architecture that combines
+linear transform coding with nonlinear dimensionality-reducing autoencoders.
+WaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck
+between the forward and inverse passes of an invertible wavelet packet
+transform. Across several key metrics,
+WaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion
+models. WaLLoC does not require perceptual or adversarial losses to represent
+high-frequency detail, providing compatibility with modalities beyond RGB
+images and stereo audio. WaLLoC's encoder consists almost entirely of linear
+operations, making it exceptionally efficient and suitable for mobile
+computing, remote sensing, and learning directly from compressed data. We
+demonstrate WaLLoC's capability for compressed-domain learning across several
+tasks, including image classification, colorization, document understanding,
+and music source separation. Our code, experiments, and pre-trained audio and
+image codecs are available at https://ut-sysml.org/walloc
+
+
+
+ comment: Accepted as paper to 2025 IEEE Data Compression Conference
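+
+ A minimal sketch of the architecture described above, with PyWavelets'
+standard 2D DWT standing in for the wavelet packet transform, an untrained
+linear encoder/decoder standing in for the shallow asymmetric autoencoder, and
+the entropy bottleneck omitted; shapes and the 8x reduction factor are
+illustrative.
+
+```python
+import numpy as np
+import pywt
+import torch
+import torch.nn as nn
+
+img = np.random.rand(64, 64).astype(np.float32)          # toy single-channel image
+coeffs = pywt.wavedec2(img, "haar", level=2)              # outer (invertible) transform
+flat, slices = pywt.coeffs_to_array(coeffs)               # all subbands as one array
+
+# Shallow, asymmetric autoencoder over the coefficients: a single linear encoder
+# (cheap enough for mobile or remote-sensing encoders) and a slightly deeper decoder.
+x = torch.from_numpy(flat).float().reshape(1, -1)
+enc = nn.Linear(x.numel(), x.numel() // 8)                # uniform 8x dimensionality reduction
+dec = nn.Sequential(nn.Linear(x.numel() // 8, x.numel()), nn.Tanh())
+latent = enc(x)                                            # compact latent for compressed-domain learning
+recon_coeffs = dec(latent).detach().numpy().reshape(flat.shape)
+
+# Invert the outer transform to get an image back (untrained here, so quality is poor).
+recon = pywt.waverec2(
+    pywt.array_to_coeffs(recon_coeffs, slices, output_format="wavedec2"), "haar")
+print(latent.shape, recon.shape)                           # torch.Size([1, 512]) (64, 64)
+```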
+
+
+
+
+
+
+ ☆ MultiEYE: Dataset and Benchmark for OCT-Enhanced Retinal Disease
+ Recognition from Fundus Images
+
+
+
+
+
+
+
+
+ Lehan Wang, Chongchong Qi, Chubin Ou, Lin An, Mei Jin, Xiangbin Kong, Xiaomeng Li
+
+
+ Existing multi-modal learning methods on fundus and OCT images mostly require
+both modalities to be available and strictly paired for training and testing,
+which appears less practical in clinical scenarios. To expand the scope of
+clinical applications, we formulate a novel setting, "OCT-enhanced disease
+recognition from fundus images", that allows for the use of unpaired
+multi-modal data during the training phase and relies on the widespread fundus
+photographs for testing. To benchmark this setting, we present the first large
+multi-modal multi-class dataset for eye disease diagnosis, MultiEYE, and
+propose an OCT-assisted Conceptual Distillation Approach (OCT-CoDA), which
+employs semantically rich concepts to extract disease-related knowledge from
+OCT images and leverage it in the fundus model. Specifically, we regard the
+image-concept relation as a link to distill useful knowledge from the OCT
+teacher model to the fundus student model, which considerably improves the
+diagnostic performance based on fundus images and formulates the cross-modal
+knowledge transfer as an explainable process. Through extensive experiments
+on the multi-disease classification task, our proposed OCT-CoDA demonstrates
+remarkable results and interpretability, showing great potential for clinical
+application. Our dataset and code are available at
+https://github.com/xmed-lab/MultiEYE.
+
+
+
+ comment: Accepted at IEEE TMI
+
+
+
+
+
+
+ ☆ SLAM3R: Real-Time Dense Scene Reconstruction from Monocular RGB Videos
+
+
+ In this paper, we introduce \textbf{SLAM3R}, a novel and effective monocular
+RGB SLAM system for real-time and high-quality dense 3D reconstruction. SLAM3R
+provides an end-to-end solution by seamlessly integrating local 3D
+reconstruction and global coordinate registration through feed-forward neural
+networks. Given an input video, the system first converts it into overlapping
+clips using a sliding window mechanism. Unlike traditional pose
+optimization-based methods, SLAM3R directly regresses 3D pointmaps from RGB
+images in each window and progressively aligns and deforms these local
+pointmaps to create a globally consistent scene reconstruction - all without
+explicitly solving any camera parameters. Experiments across datasets
+consistently show that SLAM3R achieves state-of-the-art reconstruction accuracy
+and completeness while maintaining real-time performance at 20+ FPS. Code and
+weights at: \url{https://github.com/PKU-VCL-3DV/SLAM3R}.
+
+
+
+
+
+
+
+ ☆ UFO: Enhancing Diffusion-Based Video Generation with a Uniform Frame
+ Organizer
+
+
+ Recently, diffusion-based video generation models have achieved significant
+success. However, existing models often suffer from issues like weak
+consistency and declining image quality over time. To overcome these
+challenges, inspired by aesthetic principles, we propose a non-invasive plug-in
+called Uniform Frame Organizer (UFO), which is compatible with any
+diffusion-based video generation model. The UFO comprises a series of adaptive
+adapters with adjustable intensities, which can significantly enhance the
+consistency between the foreground and background of videos and improve image
+quality without altering the original model parameters when integrated. The
+training for UFO is simple and efficient, requires minimal resources, and supports
+stylized training. Its modular design allows for the combination of multiple
+UFOs, enabling the customization of personalized video generation models.
+Furthermore, the UFO also supports direct transferability across different
+models of the same specification without the need for specific retraining. The
+experimental results indicate that UFO effectively enhances video generation
+quality and demonstrates its superiority in public video generation benchmarks.
+The code will be publicly available at https://github.com/Delong-liu-bupt/UFO.
+
+
+ Knowledge Distillation (KD) is essential in transferring dark knowledge from
+a large teacher to a small student network, such that the student can be much
+more efficient than the teacher but with comparable accuracy. Existing KD
+methods, however, rely on a large teacher trained specifically for the target
+task, which is both very inflexible and inefficient. In this paper, we argue
+that an SSL-pretrained model can effectively act as the teacher and that its
+dark knowledge can be captured by the coordinate system or linear subspace in
+which the features lie. We then need only one forward pass of the teacher to
+tailor the coordinate system (TCS) for the student network. Our TCS method is
+teacher-free and applies to diverse architectures, works well for KD and
+practical few-shot learning, and allows cross-architecture distillation with
+large capacity gap. Experiments show that TCS achieves significantly higher
+accuracy than state-of-the-art KD methods, while only requiring roughly half of
+their training time and GPU memory costs.
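+
+ One plausible reading of the 'coordinate system' above is a PCA subspace of
+teacher features obtained from a single forward pass, with the student
+penalised for leaving that subspace; the sketch below follows that reading and
+is not the paper's exact objective:
+
+import numpy as np
+
+def teacher_coordinate_system(teacher_feats, k=8):
+    # Estimate a k-dimensional linear subspace from teacher features via PCA.
+    centered = teacher_feats - teacher_feats.mean(axis=0, keepdims=True)
+    _, _, vt = np.linalg.svd(centered, full_matrices=False)
+    return vt[:k]                                     # (k, D) orthonormal basis
+
+def subspace_alignment_loss(student_feats, basis):
+    # Penalise the part of the student features outside the teacher subspace.
+    proj = student_feats @ basis.T @ basis
+    residual = student_feats - proj
+    return float(np.mean(residual ** 2))
+
+rng = np.random.default_rng(0)
+t_feats = rng.standard_normal((256, 64))               # teacher features, one pass
+s_feats = rng.standard_normal((256, 64))               # student features
+basis = teacher_coordinate_system(t_feats, k=8)
+print(subspace_alignment_loss(s_feats, basis))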
+
+
+ The segmentation and classification of cardiac magnetic resonance imaging are
+critical for diagnosing heart conditions, yet current approaches face
+challenges in accuracy and generalizability. In this study, we aim to further
+advance the segmentation and classification of cardiac magnetic resonance
+images by introducing a novel deep learning-based approach. Using a multi-stage
+process with U-Net and ResNet models for segmentation, followed by Gaussian
+smoothing, the method improved segmentation accuracy, achieving a Dice
+coefficient of 0.974 for the left ventricle and 0.947 for the right ventricle.
+For classification, a cascade of deep learning classifiers was employed to
+distinguish heart conditions, including hypertrophic cardiomyopathy, myocardial
+infarction, and dilated cardiomyopathy, achieving an average accuracy of 97.2%.
+The proposed approach outperformed existing models, enhancing segmentation
+accuracy and classification precision. These advancements show promise for
+clinical applications, though further validation and interpretation across
+diverse imaging protocols are necessary.
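+
+ For reference, the Dice coefficient reported above can be computed for binary
+masks as follows (a toy example, not the study's evaluation code):
+
+import numpy as np
+
+def dice_coefficient(pred, target, eps=1e-7):
+    # Dice = 2*|intersection| / (|P| + |T|) for binary masks.
+    pred = pred.astype(bool)
+    target = target.astype(bool)
+    intersection = np.logical_and(pred, target).sum()
+    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
+
+# Toy masks: a Dice of 0.974, as reported for the left ventricle, means the
+# predicted and reference masks overlap almost completely.
+a = np.zeros((8, 8), dtype=int); a[2:6, 2:6] = 1
+b = np.zeros((8, 8), dtype=int); b[2:6, 2:7] = 1
+print(round(dice_coefficient(a, b), 3))  # 0.889 for this toy example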
+
+
+
+
+
+
+
+
+ Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
+
+
+ Recent work has empirically shown that Vision-Language Models (VLMs) struggle
+to fully understand the compositional properties of the human language, usually
+modeling an image caption as a "bag of words". As a result, they perform poorly
+on compositional tasks, which require a deeper understanding of the different
+entities of a sentence (subject, verb, etc.) jointly with their mutual
+relationships in order to be solved. In this paper, we model the dependency
+relations among textual and visual tokens using a Causal Graphical Model (CGM),
+built using a dependency parser, and we train a decoder conditioned by the VLM
+visual encoder. Differently from standard autoregressive or parallel
+predictions, our decoder's generative process is partially-ordered following
+the CGM structure. This structure encourages the decoder to learn only the main
+causal dependencies in a sentence, discarding spurious correlations. Using
+extensive experiments on five compositional benchmarks, we show that our method
+significantly outperforms all the state-of-the-art compositional approaches by
+a large margin, and it also improves over methods trained using much larger
+datasets.
+
+
+
+
+
+
+
+ ☆ DisPose: Disentangling Pose Guidance for Controllable Human Image
+ Animation
+
+
+ Controllable human image animation aims to generate videos from reference
+images using driving videos. Due to the limited control signals provided by
+sparse guidance (e.g., skeleton pose), recent works have attempted to introduce
+additional dense conditions (e.g., depth map) to ensure motion alignment.
+However, such strict dense guidance impairs the quality of the generated video
+when the body shape of the reference character differs significantly from that
+of the driving video. In this paper, we present DisPose to mine more
+generalizable and effective control signals without additional dense input,
+which disentangles the sparse skeleton pose in human image animation into
+motion field guidance and keypoint correspondence. Specifically, we generate a
+dense motion field from a sparse motion field and the reference image, which
+provides region-level dense guidance while maintaining the generalization of
+the sparse pose control. We also extract diffusion features corresponding to
+pose keypoints from the reference image, and then these point features are
+transferred to the target pose to provide distinct identity information. To
+seamlessly integrate into existing models, we propose a plug-and-play hybrid
+ControlNet that improves the quality and consistency of generated videos while
+freezing the existing model parameters. Extensive qualitative and quantitative
+experiments demonstrate the superiority of DisPose compared to current methods.
+Code:
+\hyperlink{https://github.com/lihxxx/DisPose}{https://github.com/lihxxx/DisPose}.
+
+
+
+
+
+
+
+ ☆ Quantitative Evaluation of Motif Sets in Time Series
+
+
+
+
+
+
+
+
+ Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
+
+
+ Time Series Motif Discovery (TSMD), which aims at finding recurring patterns
+in time series, is an important task in numerous application domains, and many
+methods for this task exist. These methods are usually evaluated qualitatively.
+A few metrics for quantitative evaluation, where discovered motifs are compared
+to some ground truth, have been proposed, but they typically make implicit
+assumptions that limit their applicability. This paper introduces PROM, a
+broadly applicable metric that overcomes those limitations, and TSMD-Bench, a
+benchmark for quantitative evaluation of time series motif discovery.
+Experiments with PROM and TSMD-Bench show that PROM provides a more
+comprehensive evaluation than existing metrics, that TSMD-Bench is a more
+challenging benchmark than earlier ones, and that the combination can help
+understand the relative performance of TSMD methods. More generally, the
+proposed approach enables large-scale, systematic performance comparisons in
+this field.
+
+
+
+
+
+
+
+ ☆ MaskTerial: A Foundation Model for Automated 2D Material Flake Detection
+
+
+ The detection and classification of exfoliated two-dimensional (2D) material
+flakes from optical microscope images can be automated using computer vision
+algorithms. This has the potential to increase the accuracy and objectivity of
+classification and the efficiency of sample fabrication, and it allows for
+large-scale data collection. Existing algorithms often exhibit challenges in
+identifying low-contrast materials and typically require large amounts of
+training data. Here, we present a deep learning model, called MaskTerial, that
+uses an instance segmentation network to reliably identify 2D material flakes.
+The model is extensively pre-trained using a synthetic data generator that
+generates realistic microscopy images from unlabeled data. This results in a
+model that can quickly adapt to new materials with as few as 5 to 10
+images. Furthermore, an uncertainty estimation model is finally used to
+classify the predictions based on optical contrast. We evaluate our method on
+eight different datasets comprising five different 2D materials and demonstrate
+significant improvements over existing techniques in the detection of
+low-contrast materials such as hexagonal boron nitride.
+
+
+
+ comment: 9 pages, 5 figures
+
+
+
+
+
+
+ ☆ Physics-Driven Autoregressive State Space Models for Medical Image
+ Reconstruction
+
+
+ Medical image reconstruction from undersampled acquisitions is an ill-posed
+problem that involves inversion of the imaging operator linking measurement and
+image domains. In recent years, physics-driven (PD) models have gained
+prominence in learning-based reconstruction given their enhanced balance
+between efficiency and performance. For reconstruction, PD models cascade
+data-consistency modules that enforce fidelity to acquired data based on the
+imaging operator, with network modules that process feature maps to alleviate
+image artifacts due to undersampling. Success in artifact suppression
+inevitably depends on the ability of the network modules to tease apart
+artifacts from underlying tissue structures, both of which can manifest
+contextual relations over broad spatial scales. Convolutional modules that
+excel at capturing local correlations are relatively insensitive to non-local
+context. While transformers promise elevated sensitivity to non-local context,
+practical implementations often suffer from a suboptimal trade-off between
+local and non-local sensitivity due to intrinsic model complexity. Here, we
+introduce a novel physics-driven autoregressive state space model (MambaRoll)
+for enhanced fidelity in medical image reconstruction. In each cascade of an
+unrolled architecture, MambaRoll employs an autoregressive framework based on
+physics-driven state space modules (PSSM), where PSSMs efficiently aggregate
+contextual features at a given spatial scale while maintaining fidelity to
+acquired data, and autoregressive prediction of next-scale feature maps from
+earlier spatial scales enhances the capture of multi-scale contextual features.
+Demonstrations on accelerated MRI and sparse-view CT reconstructions indicate
+that MambaRoll outperforms state-of-the-art PD methods based on convolutional,
+transformer and conventional SSM modules.
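+
+ The data-consistency modules mentioned above enforce agreement with the
+acquired measurements; a common soft form for Cartesian MRI is sketched below,
+with shapes and the weighting chosen for illustration rather than taken from
+the paper:
+
+import numpy as np
+
+def data_consistency(x, y, mask, weight=1.0):
+    # Blend the k-space of the current image estimate `x` with the acquired
+    # k-space samples `y` at the sampled locations given by `mask`.
+    k = np.fft.fft2(x)
+    k_dc = np.where(mask > 0, (k + weight * y) / (1.0 + weight), k)
+    return np.fft.ifft2(k_dc)
+
+rng = np.random.default_rng(0)
+img = rng.standard_normal((32, 32))
+mask = (rng.random((32, 32)) < 0.3).astype(float)      # 30% of k-space sampled
+kspace = np.fft.fft2(img) * mask                        # simulated acquisition
+recon = data_consistency(rng.standard_normal((32, 32)), kspace, mask)
+print(recon.shape, recon.dtype)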
+
+
+
+ comment: 10 pages, 4 figures
+
+
+
+
+
+
+ ☆ Computer-Aided Osteoporosis Diagnosis Using Transfer Learning with
+ Enhanced Features from Stacked Deep Learning Modules
+
+
+
+
+
+
+
+
+ Ayesha Siddiqua, Rakibul Hasan, Anichur Rahman, Abu Saleh Musa Miah
+
+
+ Knee osteoporosis weakens the bone tissue in the knee joint, increasing
+fracture risk. Early detection through X-ray images enables timely intervention
+and improved patient outcomes. While some researchers have focused on
+diagnosing knee osteoporosis through manual radiology evaluation and
+traditional machine learning using hand-crafted features, these methods often
+struggle with performance and efficiency due to reliance on manual feature
+extraction and subjective interpretation. In this study, we propose a
+computer-aided diagnosis (CAD) system for knee osteoporosis, combining transfer
+learning with stacked feature enhancement deep learning blocks. Initially, knee
+X-ray images are preprocessed, and features are extracted using a pre-trained
+Convolutional Neural Network (CNN). These features are then enhanced through
+five sequential Conv-RELU-MaxPooling blocks. The Conv2D layers detect low-level
+features, while the ReLU activations introduce non-linearity, allowing the
+network to learn complex patterns. MaxPooling layers down-sample the features,
+retaining the most important spatial information. This sequential processing
+enables the model to capture complex, high-level features related to bone
+structure, joint deformation, and osteoporotic markers. The enhanced features
+are passed through a classification module to differentiate between healthy and
+osteoporotic knee conditions. Extensive experiments on three individual
+datasets and a combined dataset demonstrate that our model achieves 97.32%,
+98.24%, 97.27%, and 98.00% accuracy for OKX Kaggle Binary, KXO-Mendeley
+Multi-Class, OKX Kaggle Multi-Class, and the combined dataset, respectively,
+showing an improvement of around 2% over existing methods.
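+
+ A minimal PyTorch sketch of the stacked Conv-ReLU-MaxPooling enhancement
+described above; the channel widths, input feature-map size, and classifier
+head are assumptions for illustration:
+
+import torch
+import torch.nn as nn
+
+class FeatureEnhancer(nn.Module):
+    # Five sequential Conv-ReLU-MaxPooling blocks applied to backbone features.
+    def __init__(self, in_channels=512, width=256):
+        super().__init__()
+        blocks = []
+        c = in_channels
+        for _ in range(5):
+            blocks += [nn.Conv2d(c, width, kernel_size=3, padding=1),
+                       nn.ReLU(inplace=True),
+                       nn.MaxPool2d(kernel_size=2, ceil_mode=True)]
+            c = width
+        self.blocks = nn.Sequential(*blocks)
+        self.classifier = nn.Sequential(nn.Flatten(), nn.LazyLinear(2))
+
+    def forward(self, feats):
+        return self.classifier(self.blocks(feats))
+
+# Hypothetical CNN feature map (batch 1, 512 channels, 16x16 spatial grid).
+logits = FeatureEnhancer()(torch.randn(1, 512, 16, 16))
+print(logits.shape)  # torch.Size([1, 2]) -> healthy vs. osteoporotic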
+
+
+
+
+
+
+
+ ☆ Are Conditional Latent Diffusion Models Effective for Image Restoration? CVPR 2025
+
+
+ Recent advancements in image restoration increasingly employ conditional
+latent diffusion models (CLDMs). While these models have demonstrated notable
+performance improvements in recent years, this work questions their suitability
+for IR tasks. CLDMs excel in capturing high-level semantic correlations, making
+them effective for tasks like text-to-image generation with spatial
+conditioning. However, in IR, where the goal is to enhance image perceptual
+quality, these models face difficulty in modeling the relationship between
+degraded images and ground truth images using a low-level representation. To
+support our claims, we compare state-of-the-art CLDMs with traditional image
+restoration models through extensive experiments. Results reveal that despite
+the scaling advantages of CLDMs, they suffer from high distortion and semantic
+deviation, especially in cases with minimal degradation, where traditional
+methods outperform them. Additionally, we perform empirical studies to examine
+the impact of various CLDM design elements on their restoration performance. We
+hope this finding inspires a reexamination of current CLDM-based IR solutions,
+opening up more opportunities in this field.
+
+
+ The advent of stereoscopic videos has opened new horizons in multimedia,
+particularly in extended reality (XR) and virtual reality (VR) applications,
+where immersive content captivates audiences across various platforms. Despite
+its growing popularity, producing stereoscopic videos remains challenging due
+to the technical complexities involved in generating stereo parallax. This
+refers to the positional differences of objects viewed from two distinct
+perspectives and is crucial for creating depth perception. This complex process
+poses significant challenges for creators aiming to deliver convincing and
+engaging presentations. To address these challenges, this paper introduces the
+Text-driven Stereoscopic Video Generation (T-SVG) system. This innovative,
+model-agnostic, zero-shot approach streamlines video generation by using text
+prompts to create reference videos. These videos are transformed into 3D point
+cloud sequences, which are rendered from two perspectives with subtle parallax
+differences, achieving a natural stereoscopic effect. T-SVG represents a
+significant advancement in stereoscopic content creation by integrating
+state-of-the-art, training-free techniques in text-to-video generation, depth
+estimation, and video inpainting. Its flexible architecture ensures high
+efficiency and user-friendliness, allowing seamless updates with newer models
+without retraining. By simplifying the production pipeline, T-SVG makes
+stereoscopic video generation accessible to a broader audience, demonstrating
+its potential to revolutionize the field.
+
+
+ Existing few-shot medical image segmentation (FSMIS) models fail to address a
+practical issue in medical imaging: the domain shift caused by different
+imaging techniques, which limits the applicability to current FSMIS tasks. To
+overcome this limitation, we focus on the cross-domain few-shot medical image
+segmentation (CD-FSMIS) task, aiming to develop a generalized model capable of
+adapting to a broader range of medical image segmentation scenarios with
+limited labeled data from the novel target domain. Inspired by the
+characteristics of frequency domain similarity across different domains, we
+propose a Frequency-aware Matching Network (FAMNet), which includes two key
+components: a Frequency-aware Matching (FAM) module and a Multi-Spectral Fusion
+(MSF) module. The FAM module tackles two problems during the meta-learning
+phase: 1) intra-domain variance caused by the inherent support-query bias, due
+to the different appearances of organs and lesions, and 2) inter-domain
+variance caused by different medical imaging techniques. Additionally, we
+design an MSF module to integrate the different frequency features decoupled by
+the FAM module, and further mitigate the impact of inter-domain variance on the
+model's segmentation performance. Combining these two modules, our FAMNet
+surpasses existing FSMIS models and Cross-domain Few-shot Semantic Segmentation
+models on three cross-domain datasets, achieving state-of-the-art performance
+in the CD-FSMIS task.
+
+
+
+ comment: Accepted by the 39th Annual AAAI Conference on Artificial
+ Intelligence (AAAI-25)
+
+
+
+
+
+
+ ☆ Multimodal Sentiment Analysis based on Video and Audio Inputs SP
+
+
+ Despite the abundance of current research on sentiment analysis from videos
+and audios, finding the best model that gives the highest accuracy rate is
+still considered a challenge for researchers in this field. The main objective
+of this paper is to prove the usability of emotion recognition models that take
+video and audio inputs. The datasets used to train the models are the CREMA-D
+dataset for audio and the RAVDESS dataset for video. The fine-tuned models that
+have been used are Facebook/wav2vec2-large for audio and
+Google/vivit-b-16x2-kinetics400 for video. The average of the probabilities for
+each emotion generated by the two previous models is utilized in the
+decision-making framework. Because of the disparity in the results, where one
+model achieves much higher accuracy, additional fusion frameworks are created.
+The methods used are
+the Weighted Average method, the Confidence Level Threshold method, the Dynamic
+Weighting Based on Confidence method, and the Rule-Based Logic method. This
+limited approach gives encouraging results that make future research into these
+methods viable.
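+
+ A minimal sketch of two of the fusion rules listed above (weighted average and
+confidence-level threshold), with the weights and threshold being assumed
+values rather than those used in the paper:
+
+import numpy as np
+
+def fuse_predictions(p_audio, p_video, w_audio=0.4, w_video=0.6, threshold=0.7):
+    # Confidence-level threshold rule: trust the video model when very sure.
+    if p_video.max() >= threshold:
+        return int(p_video.argmax())
+    # Otherwise fall back to the weighted average of per-emotion probabilities.
+    weighted = w_audio * p_audio + w_video * p_video
+    return int(weighted.argmax())
+
+p_a = np.array([0.1, 0.2, 0.7])            # audio model probabilities
+p_v = np.array([0.3, 0.4, 0.3])            # video model probabilities
+print(fuse_predictions(p_a, p_v))          # 2 on this toy input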
+
+
+
+ comment: Presented as a full paper in the 15th International Conference on
+ Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2024) October
+ 28-30, 2024, Leuven, Belgium
+
+ Recent advancements in deep neural network performance have led to the development
+of new state-of-the-art approaches in numerous areas. However, the black-box
+nature of neural networks often prohibits their use in areas where model
+explainability and model transparency are crucial. Over the years, researchers
+proposed many algorithms to aid neural network understanding and provide
+additional information to the human expert. One of the most popular methods
+being Layer-Wise Relevance Propagation (LRP). This method assigns local
+relevance based on the pixel-wise decomposition of nonlinear classifiers. With
+the rise of attribution method research, there has emerged a pressing need to
+assess and evaluate their performance. Numerous metrics have been proposed,
+each assessing an individual property of attribution methods such as
+faithfulness, robustness or localization. Unfortunately, no single metric is
+deemed optimal for every case, and researchers often use several metrics to
+test the quality of the attribution maps. In this work, we address the
+shortcomings of the current LRP formulations and introduce a novel method for
+determining the relevance of input neurons through layer-wise relevance
+propagation. Furthermore, we apply this approach to the recently developed
+Vision Transformer architecture and evaluate its performance against existing
+methods on two image classification datasets, namely ImageNet and PascalVOC.
+Our results clearly demonstrate the advantage of our proposed method.
+Furthermore, we discuss the insufficiencies of current evaluation metrics for
+attribution-based explainability and propose a new evaluation metric that
+combines the notions of faithfulness, robustness and contrastiveness. We
+utilize this new metric to evaluate the performance of various
+attribution-based methods. Our code is available at:
+https://github.com/davor10105/relative-absolute-magnitude-propagation
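+
+ For context, the classical epsilon rule of LRP for a single linear layer is
+sketched below; this is the standard formulation, not the modified relevance
+propagation proposed in the work above:
+
+import numpy as np
+
+def lrp_epsilon_linear(a, w, b, relevance_out, eps=1e-6):
+    # Relevance flowing into output j is redistributed to inputs i in
+    # proportion to their contributions z_ij = a_i * w_ij.
+    z = a[:, None] * w                       # contributions, shape (in, out)
+    zj = z.sum(axis=0) + b                   # pre-activations
+    denom = np.where(zj >= 0, zj + eps, zj - eps)   # stabilised denominator
+    return (z / denom) @ relevance_out       # relevance of the inputs, shape (in,)
+
+rng = np.random.default_rng(0)
+a = rng.random(4)                            # activations entering the layer
+w = rng.standard_normal((4, 3))              # layer weights
+b = np.zeros(3)
+r_out = np.array([0.2, 0.5, 0.3])            # relevance assigned to the outputs
+print(lrp_epsilon_linear(a, w, b, r_out))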
+
+
+
+ comment: 30 pages, 16 figures, 13 tables, ACM Transactions on Intelligent
+ Systems and Technology
+
+
+
+
+
+
+ ☆ GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with
+ Rhythmic Poses and Realistic Expression AAAI 2025
+
+
+ Audio-driven talking head generation necessitates seamless integration of
+audio and visual data amidst the challenges posed by diverse input portraits
+and intricate correlations between audio and facial motions. In response, we
+propose a robust framework GoHD designed to produce highly realistic,
+expressive, and controllable portrait videos from any reference identity with
+any motion. GoHD innovates with three key modules: Firstly, an animation module
+utilizing latent navigation is introduced to improve the generalization ability
+across unseen input styles. This module achieves high disentanglement of motion
+and identity, and it also incorporates gaze orientation to rectify unnatural
+eye movements that were previously overlooked. Secondly, a conformer-structured
+conditional diffusion model is designed to guarantee head poses that are aware
+of prosody. Thirdly, to estimate lip-synchronized and realistic expressions
+from the input audio within limited training data, a two-stage training
+strategy is devised to decouple frequent and frame-wise lip motion distillation
+from the generation of other more temporally dependent but less audio-related
+motions, e.g., blinks and frowns. Extensive experiments validate GoHD's
+advanced generalization capabilities, demonstrating its effectiveness in
+generating realistic talking face results on arbitrary subjects.
+
+
+ Text-to-video generation has evolved rapidly in recent years, delivering
+remarkable results. Training typically relies on video-caption paired data,
+which plays a crucial role in enhancing generation performance. However,
+current video captions often suffer from insufficient details, hallucinations
+and imprecise motion depiction, affecting the fidelity and consistency of
+generated videos. In this work, we propose a novel instance-aware structured
+caption framework, termed InstanceCap, to achieve instance-level and
+fine-grained video captioning for the first time. Based on this scheme, we design
+an auxiliary model cluster to convert the original video into instances to enhance
+instance fidelity. Video instances are further used to refine dense prompts
+into structured phrases, achieving concise yet precise descriptions.
+Furthermore, a 22K InstanceVid dataset is curated for training, and an
+enhancement pipeline tailored to the InstanceCap structure is proposed for
+inference. Experimental results demonstrate that our proposed InstanceCap
+significantly outperforms previous models, ensuring high fidelity between
+captions and videos while reducing hallucinations.
+
+
+
+
+
+
+
+ ☆ Towards a Multimodal Large Language Model with Pixel-Level Insight for
+ Biomedicine AAAI2025
+
+
+ In recent years, Multimodal Large Language Models (MLLM) have achieved
+notable advancements, demonstrating the feasibility of developing an
+intelligent biomedical assistant. However, current biomedical MLLMs
+predominantly focus on image-level understanding and restrict interactions to
+textual commands, thus limiting their capability boundaries and the flexibility
+of usage. In this paper, we introduce a novel end-to-end multimodal large
+language model for the biomedical domain, named MedPLIB, which possesses
+pixel-level understanding. Excitingly, it supports visual question answering
+(VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form
+shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE)
+multi-stage training strategy, which divides MoE into separate training phases
+for a visual-language expert model and a pixel-grounding expert model, followed
+by fine-tuning using MoE. This strategy effectively coordinates multitask
+learning while maintaining the computational cost at inference equivalent to
+that of a single expert model. To advance the research of biomedical MLLMs, we
+introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA),
+which comprises an array of 8 modalities for complex medical imaging question
+answering and image region understanding. Experimental results indicate that
+MedPLIB has achieved state-of-the-art outcomes across multiple medical visual
+language tasks. More importantly, in zero-shot evaluations for the pixel
+grounding task, MedPLIB leads the best small and large models by margins of
+19.7 and 15.6 respectively on the mDice metric. The codes, data, and model
+checkpoints will be made publicly available at
+https://github.com/ShawnHuang497/MedPLIB.
+
+
+
+ comment: Accepted by AAAI2025
+
+
+
+
+
+
+ ☆ Text-Video Multi-Grained Integration for Video Moment Montage
+
+
+
+
+
+
+
+
+ Zhihui Yin, Ye Ma, Xipeng Cao, Bo Wang, Quan Chen, Peng Jiang
+
+
+ The proliferation of online short video platforms has driven a surge in user
+demand for short video editing. However, manually selecting, cropping, and
+assembling raw footage into a coherent, high-quality video remains laborious
+and time-consuming. To accelerate this process, we focus on a user-friendly new
+task called Video Moment Montage (VMM), which aims to accurately locate the
+corresponding video segments based on a pre-provided narration text and then
+arrange these video clips to create a complete video that aligns with the
+corresponding descriptions. The challenge lies in extracting precise temporal
+segments while ensuring intra-sentence and inter-sentence context consistency,
+as a single script sentence may require trimming and assembling multiple video
+clips. To address this problem, we present a novel \textit{Text-Video
+Multi-Grained Integration} method (TV-MGI) that efficiently fuses text features
+from the script with both shot-level and frame-level video features, which
+enables the global and fine-grained alignment between the video content and the
+corresponding textual descriptions in the script. To facilitate further
+research in this area, we introduce the Multiple Sentences with Shots Dataset
+(MSSD), a large-scale dataset designed explicitly for the VMM task. We conduct
+extensive experiments on the MSSD dataset to demonstrate the effectiveness of
+our framework compared to baseline methods.
+
+
+ We present LatentSync, an end-to-end lip sync framework based on audio
+conditioned latent diffusion models without any intermediate motion
+representation, diverging from previous diffusion-based lip sync methods based
+on pixel space diffusion or two-stage generation. Our framework can leverage
+the powerful capabilities of Stable Diffusion to directly model complex
+audio-visual correlations. Additionally, we found that the diffusion-based lip
+sync methods exhibit inferior temporal consistency due to the inconsistency in
+the diffusion process across different frames. We propose Temporal
+REPresentation Alignment (TREPA) to enhance temporal consistency while
+preserving lip-sync accuracy. TREPA uses temporal representations extracted by
+large-scale self-supervised video models to align the generated frames with the
+ground truth frames. Furthermore, we observe the commonly encountered SyncNet
+convergence issue and conduct comprehensive empirical studies, identifying key
+factors affecting SyncNet convergence in terms of model architecture, training
+hyperparameters, and data preprocessing methods. We significantly improve the
+accuracy of SyncNet from 91% to 94% on the HDTF test set. Since we did not
+change the overall training framework of SyncNet, our experience can also be
+applied to other lip sync and audio-driven portrait animation methods that
+utilize SyncNet. Based on the above innovations, our method outperforms
+state-of-the-art lip sync methods across various metrics on the HDTF and
+VoxCeleb2 datasets.
+
+
+
+
+
+
+
+
+ Ke Li, Di Wang, Zhangyuan Hu, Shaofeng Li, Weiping Ni, Lin Zhao, Quan Wang
+
+
+ Infrared-visible object detection (IVOD) seeks to harness the complementary
+information in infrared and visible images, thereby enhancing the performance
+of detectors in complex environments. However, existing methods often neglect
+the frequency characteristics of complementary information, such as the
+abundant high-frequency details in visible images and the valuable
+low-frequency thermal information in infrared images, thus constraining
+detection performance. To solve this problem, we introduce a novel
+Frequency-Driven Feature Decomposition Network for IVOD, called FD2-Net, which
+effectively captures the unique frequency representations of complementary
+information across multimodal visual spaces. Specifically, we propose a feature
+decomposition encoder, wherein the high-frequency unit (HFU) utilizes discrete
+cosine transform to capture representative high-frequency features, while the
+low-frequency unit (LFU) employs dynamic receptive fields to model the
+multi-scale context of diverse objects. Next, we adopt a parameter-free
+complementary strengths strategy to enhance multimodal features through
+seamless inter-frequency recoupling. Furthermore, we innovatively design a
+multimodal reconstruction mechanism that recovers image details lost during
+feature extraction, further leveraging the complementary information from
+infrared and visible images to enhance overall representational capacity.
+Extensive experiments demonstrate that FD2-Net outperforms state-of-the-art
+(SOTA) models across various IVOD benchmarks, i.e. LLVIP (96.2% mAP), FLIR
+(82.9% mAP), and M3FD (83.5% mAP).
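+
+ A small illustration of the DCT-based low/high frequency split underlying the
+HFU/LFU idea above; the cutoff fraction is an assumption, not a value from the
+paper:
+
+import numpy as np
+from scipy.fft import dctn, idctn
+
+def frequency_split(feat, cutoff=0.25):
+    # Split a feature map into low- and high-frequency parts with a 2D DCT.
+    coeffs = dctn(feat, norm="ortho")
+    h, w = feat.shape
+    mask = np.zeros_like(coeffs)
+    mask[: int(h * cutoff), : int(w * cutoff)] = 1.0   # keep low frequencies
+    low = idctn(coeffs * mask, norm="ortho")
+    high = feat - low
+    return low, high
+
+rng = np.random.default_rng(0)
+visible_feat = rng.standard_normal((32, 32))
+low, high = frequency_split(visible_feat)
+print(np.allclose(low + high, visible_feat))   # True: a lossless decomposition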
+
+
+
+ comment: This work is accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ VLMs meet UDA: Boosting Transferability of Open Vocabulary Segmentation
+ with Unsupervised Domain Adaptation
+
+
+
+
+
+
+
+
+ Roberto Alcover-Couso, Marcos Escudero-Viñolo, Juan C. SanMiguel, Jesus Bescos
+
+
+ Segmentation models are typically constrained by the categories defined
+during training. To address this, researchers have explored two independent
+approaches: adapting Vision-Language Models (VLMs) and leveraging synthetic
+data. However, VLMs often struggle with granularity, failing to disentangle
+fine-grained concepts, while synthetic data-based methods remain limited by the
+scope of available datasets.
+ This paper proposes enhancing segmentation accuracy across diverse domains by
+integrating Vision-Language reasoning with key strategies for Unsupervised
+Domain Adaptation (UDA). First, we improve the fine-grained segmentation
+capabilities of VLMs through multi-scale contextual data, robust text
+embeddings with prompt augmentation, and layer-wise fine-tuning in our proposed
+Foundational-Retaining Open Vocabulary Semantic Segmentation (FROVSS)
+framework. Next, we incorporate these enhancements into a UDA framework by
+employing distillation to stabilize training and cross-domain mixed sampling to
+boost adaptability without compromising generalization. The resulting
+UDA-FROVSS framework is the first UDA approach to effectively adapt across
+domains without requiring shared categories.
+
+
+
+
+
+
+
+ ☆ Foundation Models and Adaptive Feature Selection: A Synergistic Approach
+ to Video Question Answering
+
+
+ This paper tackles the intricate challenge of video question-answering
+(VideoQA). Despite notable progress, current methods fall short of effectively
+integrating questions with video frames and semantic object-level abstractions
+to create question-aware video representations. We introduce Local-Global
+Question Aware Video Embedding (LGQAVE), which incorporates three major
+innovations to integrate multi-modal knowledge better and emphasize semantic
+visual concepts relevant to specific questions. LGQAVE moves beyond traditional
+ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely
+identifies the most relevant frames concerning the questions. It captures the
+dynamics of objects within these frames using distinct graphs, grounding them
+in question semantics with the miniGPT model. These graphs are processed by a
+question-aware dynamic graph transformer (Q-DGT), which refines the outputs to
+develop nuanced global and local video representations. An additional
+cross-attention module integrates these local and global embeddings to generate
+the final video embeddings, which a language model uses to generate answers.
+Extensive evaluations across multiple benchmarks demonstrate that LGQAVE
+significantly outperforms existing models in delivering accurate multi-choice
+and open-ended answers.
+
+
+ We tackle the challenging problem of Open-Set Object Detection (OSOD), which
+aims to detect both known and unknown objects in unlabelled images. The main
+difficulty arises from the absence of supervision for these unknown classes,
+making it challenging to distinguish them from the background. Existing OSOD
+detectors either fail to properly exploit or inadequately leverage the abundant
+unlabeled unknown objects in training data, restricting their performance. To
+address these limitations, we propose UADet, an Uncertainty-Aware Open-Set
+Object Detector that considers appearance and geometric uncertainty. By
+integrating these uncertainty measures, UADet effectively reduces the number of
+unannotated instances incorrectly utilized or omitted by previous methods.
+Extensive experiments on OSOD benchmarks demonstrate that UADet substantially
+outperforms previous state-of-the-art (SOTA) methods in detecting both known
+and unknown objects, achieving a 1.8x improvement in unknown recall while
+maintaining high performance on known classes. When extended to Open World
+Object Detection (OWOD), our method shows significant advantages over the
+current SOTA method, with average improvements of 13.8% and 6.9% in unknown
+recall on M-OWODB and S-OWODB benchmarks, respectively. Extensive results
+validate the effectiveness of our uncertainty-aware approach across different
+open-set scenarios.
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ☆ DASK: Distribution Rehearsing via Adaptive Style Kernel Learning for
+ Exemplar-Free Lifelong Person Re-Identification AAAI
+
+
+ Lifelong person re-identification (LReID) is an important but challenging
+task that suffers from catastrophic forgetting due to significant domain gaps
+between training steps. Existing LReID approaches typically rely on data replay
+and knowledge distillation to mitigate this issue. However, data replay methods
+compromise data privacy by storing historical exemplars, while knowledge
+distillation methods suffer from limited performance due to the cumulative
+forgetting of undistilled knowledge. To overcome these challenges, we propose a
+novel paradigm that models and rehearses the distribution of the old domains to
+enhance knowledge consolidation during the new data learning, possessing a
+strong anti-forgetting capacity without storing any exemplars. Specifically, we
+introduce an exemplar-free LReID method called Distribution Rehearsing via
+Adaptive Style Kernel Learning (DASK). DASK includes a Distribution Rehearser
+Learning (DRL) mechanism that learns to transform arbitrary distribution data
+into the current data style at each learning step. To enhance the style
+transfer capacity of DRL, an Adaptive Kernel Prediction network (AKPNet) is
+explored to achieve
+an instance-specific distribution adjustment. Additionally, we design a
+Distribution Rehearsing-driven LReID Training module, which rehearses old
+distribution based on the new data via the old AKPNet model, achieving
+effective new-old knowledge accumulation under a joint knowledge consolidation
+scheme. Experimental results show our DASK outperforms the existing methods by
+3.6%-6.8% and 4.5%-6.5% on anti-forgetting and generalization capacity,
+respectively. Our code is available at
+https://github.com/zhoujiahuan1991/AAAI2025-DASK
+
+
+
+ comment: in Proceedings of the 39th AAAI Conference on Artificial Intelligence
+ (AAAI-25)
+
+ Contrastive learning has achieved great success in skeleton-based
+representation learning recently. However, the prevailing methods are
+predominantly negative-based, necessitating additional momentum encoder and
+memory bank to get negative samples, which increases the difficulty of model
+training. Furthermore, these methods primarily concentrate on learning a global
+representation for recognition and retrieval tasks, while overlooking the rich
+and detailed local representations that are crucial for dense prediction tasks.
+To alleviate these issues, we introduce a Unified Skeleton-based Dense
+Representation Learning framework based on feature decorrelation, called USDRL,
+which employs feature decorrelation across temporal, spatial, and instance
+domains in a multi-grained manner to reduce redundancy among dimensions of the
+representations and maximize information extraction from features. Additionally,
+we design a Dense Spatio-Temporal Encoder (DSTE) to capture fine-grained action
+representations effectively, thereby enhancing the performance of dense
+prediction tasks. Comprehensive experiments, conducted on the benchmarks
+NTU-60, NTU-120, PKU-MMD I, and PKU-MMD II, across diverse downstream tasks
+including action recognition, action retrieval, and action detection,
+conclusively demonstrate that our approach significantly outperforms the
+current state-of-the-art (SOTA) approaches. Our code and models are available
+at https://github.com/wengwanjiang/USDRL.
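+
+ A generic sketch of a feature-decorrelation objective of the kind described
+above, penalising off-diagonal correlations between representation dimensions;
+the paper applies this idea across temporal, spatial, and instance domains,
+which is not reproduced here:
+
+import numpy as np
+
+def decorrelation_loss(features, eps=1e-6):
+    # Penalise off-diagonal entries of the feature correlation matrix so that
+    # representation dimensions become decorrelated.
+    z = (features - features.mean(axis=0)) / (features.std(axis=0) + eps)
+    corr = (z.T @ z) / len(z)                  # (D, D) correlation matrix
+    off_diag = corr - np.diag(np.diag(corr))
+    return float((off_diag ** 2).sum())
+
+rng = np.random.default_rng(0)
+feats = rng.standard_normal((128, 16))          # e.g. pooled skeleton features
+print(decorrelation_loss(feats))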
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ Enhancing Implicit Neural Representations via Symmetric Power
+ Transformation AAAI 2025
+
+
+ We propose symmetric power transformation to enhance the capacity of Implicit
+Neural Representation~(INR) from the perspective of data transformation. Unlike
+prior work utilizing random permutation or index rearrangement, our method
+features a reversible operation that does not require additional storage
+consumption. Specifically, we first investigate the characteristics of data
+that can benefit the training of INR, proposing the Range-Defined Symmetric
+Hypothesis, which posits that specific range and symmetry can improve the
+expressive ability of INR. Based on this hypothesis, we propose a nonlinear
+symmetric power transformation to achieve both range-defined and symmetric
+properties simultaneously. We use the power coefficient to redistribute data to
+approximate symmetry within the target range. To improve the robustness of the
+transformation, we further design deviation-aware calibration and adaptive soft
+boundary to address issues of extreme deviation boosting and continuity
+breaking. Extensive experiments are conducted to verify the performance of the
+proposed method, demonstrating that our transformation can reliably improve INR
+compared with other data transformations. We also conduct 1D audio, 2D image
+and 3D video fitting tasks to demonstrate the effectiveness and applicability
+of our method.
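+
+ One plausible illustration of a sign-preserving power transform that keeps
+data symmetric about zero within a defined range; the paper's deviation-aware
+calibration and adaptive soft boundary are not reproduced here:
+
+import numpy as np
+
+def symmetric_power_transform(x, gamma=0.5):
+    # Normalise into [-1, 1], then apply a sign-preserving power function that
+    # redistributes values while keeping symmetry about zero.
+    scaled = x / (np.abs(x).max() + 1e-12)
+    return np.sign(scaled) * np.abs(scaled) ** gamma
+
+x = np.linspace(-10.0, 3.0, 7)
+y = symmetric_power_transform(x)
+print(y.min(), y.max())                         # stays within [-1, 1]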
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ eCARLA-scenes: A synthetically generated dataset for event-based optical
+ flow prediction
+
+
+
+
+
+
+
+
+ Jad Mansour, Hayat Rajani, Rafael Garcia, Nuno Gracias
+
+
+ The joint use of event-based vision and Spiking Neural Networks (SNNs) is
+expected to have a large impact in robotics in the near future, in tasks such
+as visual odometry and obstacle avoidance. While researchers have used
+real-world event datasets for optical flow prediction (mostly captured with
+Unmanned Aerial Vehicles (UAVs)), these datasets are limited in diversity,
+scalability, and are challenging to collect. Thus, synthetic datasets offer a
+scalable alternative by bridging the gap between reality and simulation. In
+this work, we address the lack of datasets by introducing eWiz, a comprehensive
+library for processing event-based data. It includes tools for data loading,
+augmentation, visualization, encoding, and generation of training data, along
+with loss functions and performance metrics. We further present a synthetic
+event-based dataset and a data generation pipeline for optical flow prediction
+tasks. Built on top of eWiz, eCARLA-scenes makes use of the CARLA simulator to
+simulate self-driving car scenarios. The ultimate goal of this dataset is the
+depiction of diverse environments while laying a foundation for advancing
+event-based camera applications in autonomous field vehicle navigation, paving
+the way for using SNNs on neuromorphic hardware such as the Intel Loihi.
+
+
+
+
+
+
+
+
+ Qiang Li, Di Liu, Jun Kong, Sen Li, Hui Xu, Jianzhong Wang
+
+
+ Temporal action localization (TAL) involves dual tasks to classify and
+localize actions within untrimmed videos. However, the two tasks often have
+conflicting requirements for features. Existing methods typically employ
+separate heads for classification and localization tasks but share the same
+input feature, leading to suboptimal performance. To address this issue, we
+propose a novel TAL method with Cross Layer Task Decoupling and Refinement
+(CLTDR). Based on the feature pyramid of the video, the CLTDR strategy
+integrates semantically strong features from higher pyramid layers and detailed
+boundary-aware features from lower pyramid layers to effectively
+disentangle the action classification and localization tasks. Moreover, the
+multiple features from cross layers are also employed to refine and align the
+disentangled classification and regression results. At last, a lightweight
+Gated Multi-Granularity (GMG) module is proposed to comprehensively extract and
+aggregate video features at instant, local, and global temporal granularities.
+Benefiting from the CLTDR and GMG modules, our method achieves state-of-the-art
+performance on five challenging benchmarks: THUMOS14, MultiTHUMOS,
+EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. Our code and pre-trained models
+are publicly available at: https://github.com/LiQiang0307/CLTDR-GMG.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+ ☆ Accuracy Improvements for Convolutional and Differential Distance
+ Function Approximations
+
+
+ Given a bounded domain, we deal with the problem of estimating the distance
+function from the internal points of the domain to the boundary of the domain.
+Convolutional and differential distance estimation schemes are considered and,
+for both the schemes, accuracy improvements are proposed and evaluated.
+Asymptotics of Laplace integrals and Taylor series extrapolations are used to
+achieve the improvements.
+
+
+
+
+
+
+
+ ☆ MVC-VPR: Mutual Learning of Viewpoint Classification and Visual Place
+ Recognition
+
+
+ Visual Place Recognition (VPR) aims to robustly identify locations by
+leveraging image retrieval based on descriptors encoded from environmental
+images. However, drastic appearance changes of images captured from different
+viewpoints at the same location pose incoherent supervision signals for
+descriptor learning, which severely hinder the performance of VPR. Previous
+work proposes classifying images based on manually defined rules or ground
+truth labels for viewpoints, followed by descriptor training based on the
+classification results. However, not all datasets have ground truth labels of
+viewpoints and manually defined rules may be suboptimal, leading to degraded
+descriptor performance. To address these challenges, we introduce the mutual
+learning of viewpoint self-classification and VPR. Starting from coarse
+classification based on geographical coordinates, we progress to finer
+classification of viewpoints using simple clustering techniques. The dataset is
+partitioned in an unsupervised manner while simultaneously training a
+descriptor extractor for place recognition. Experimental results show that this
+approach almost perfectly partitions the dataset based on viewpoints, thus
+achieving mutually reinforcing effects. Our method even outperforms state-of-the-art
+(SOTA) methods that partition datasets using ground truth labels.
+
+
+
+ comment: 8 pages
+
+
+
+
+
+
+ ☆ ExpRDiff: Short-exposure Guided Diffusion Model for Realistic Local
+ Motion Deblurring
+
+
+ Removing blur caused by moving objects is challenging, as the moving objects
+are usually significantly blurry while the static background remains clear.
+Existing methods that rely on local blur detection often suffer from
+inaccuracies and cannot generate satisfactory results when focusing solely on
+blurred regions. To overcome these problems, we first design a context-based
+local blur detection module that incorporates additional contextual information
+to improve the identification of blurry regions. Considering that modern
+smartphones are equipped with cameras capable of providing short-exposure
+images, we develop a blur-aware guided image restoration method that utilizes
+sharp structural details from short-exposure images, facilitating accurate
+reconstruction of heavily blurred regions. Furthermore, to restore images that
+are realistic and visually pleasant, we develop a short-exposure guided
+diffusion model that explores useful features from short-exposure images and
+blurred regions to better constrain the diffusion process. Finally, we
+formulate the above components into a simple yet effective network, named
+ExpRDiff. Experimental results show that ExpRDiff performs favorably against
+state-of-the-art methods.
+
+
+ Diffusion models have achieved remarkable success in image generation, with
+applications broadening across various domains. Inpainting is one such
+application that can benefit significantly from diffusion models. Existing
+methods either hijack the reverse process of a pretrained diffusion model or
+cast the problem into a larger framework, \ie, conditioned generation. However,
+these approaches often require nested loops in the generation process or
+additional components for conditioning. In this paper, we present region-aware
+diffusion models (RAD) for inpainting with a simple yet effective reformulation
+of the vanilla diffusion models. RAD utilizes a different noise schedule for
+each pixel, which allows local regions to be generated asynchronously while
+considering the global image context. A plain reverse process requires no
+additional components, enabling RAD to achieve inference time up to 100 times
+faster than the state-of-the-art approaches. Moreover, we employ low-rank
+adaptation (LoRA) to fine-tune RAD based on other pretrained diffusion models,
+reducing computational burdens in training as well. Experiments demonstrated
+that RAD provides state-of-the-art results both qualitatively and
+quantitatively, on the FFHQ, LSUN Bedroom, and ImageNet datasets.
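+
+ A small sketch of forward diffusion with a per-pixel timestep map, which is
+the core idea of region-dependent noise schedules described above; the linear
+beta schedule and mask layout are assumptions, not the paper's settings:
+
+import numpy as np
+
+def forward_diffuse(x0, t_map, T=1000, beta_min=1e-4, beta_max=0.02, rng=None):
+    # Each pixel is noised according to its own timestep from `t_map`, so the
+    # masked (inpainted) region can sit at a different noise level than the
+    # known context.
+    if rng is None:
+        rng = np.random.default_rng(0)
+    betas = np.linspace(beta_min, beta_max, T)
+    alpha_bar = np.cumprod(1.0 - betas)              # cumulative alpha per step
+    ab = alpha_bar[t_map]                            # per-pixel cumulative alpha
+    noise = rng.standard_normal(x0.shape)
+    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * noise
+
+image = np.ones((16, 16))
+t_map = np.zeros((16, 16), dtype=int)
+t_map[4:12, 4:12] = 800                              # heavily noised hole region
+print(forward_diffuse(image, t_map).shape)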
+
+
+
+
+
+
+
+ ☆ On the effectiveness of Rotation-Equivariance in U-Net: A Benchmark for
+ Image Segmentation
+
+
+
+
+
+
+
+
+ Robin Ghyselinck, Valentin Delchevalerie, Bruno Dumas, Benoît Frénay
+
+
+ Numerous studies have recently focused on incorporating different variations
+of equivariance in Convolutional Neural Networks (CNNs). In particular,
+rotation-equivariance has gathered significant attention due to its relevance
+in many applications related to medical imaging, microscopic imaging, satellite
+imaging, industrial tasks, etc. While prior research has primarily focused on
+enhancing classification tasks with rotation equivariant CNNs, their impact on
+more complex architectures, such as U-Net for image segmentation, remains
+scarcely explored. Indeed, previous work interested in integrating
+rotation-equivariance into U-Net architecture have focused on solving specific
+applications with a limited scope. In contrast, this paper aims to provide a
+more exhaustive evaluation of rotation equivariant U-Net for image segmentation
+across a broader range of tasks. We benchmark their effectiveness against
+standard U-Net architectures, assessing improvements in terms of performance
+and sustainability (i.e., computational cost). Our evaluation focuses on
+datasets in which the orientation of objects of interest is arbitrary in the
+image (e.g., Kvasir-SEG), but also on more standard segmentation datasets (such
+as COCO-Stuff), so as to explore the wider applicability of rotation
+equivariance beyond tasks that are undoubtedly concerned with rotation
+equivariance. The main
+contribution of this work is to provide insights into the trade-offs and
+advantages of integrating rotation equivariance for segmentation tasks.
+
+
+
+
+
+
+
+ ☆ Weighted Poisson-disk Resampling on Large-Scale Point Clouds AAAI 2025
+
+
+ For large-scale point cloud processing, resampling plays the important role
+of controlling the point number and density while keeping geometric
+consistency. However, current methods cannot balance such
+different requirements. Particularly with large-scale point clouds, classical
+methods often struggle with decreased efficiency and accuracy. To address such
+issues, we propose a weighted Poisson-disk (WPD) resampling method to improve
+the usability and efficiency for the processing. We first design an initial
+Poisson resampling with a voxel-based estimation strategy. It is able to
+estimate a more accurate radius of the Poisson-disk while maintaining high
+efficiency. Then, we design a weighted tangent smoothing step to further
+optimize the Voronoi diagram for each point. At the same time, sharp features
+are detected and kept in the optimized results with isotropic property.
+Finally, we obtain a resampled copy of the original point cloud with the
+specified point number, uniform density, and high-quality geometric
+consistency. Experiments show that our method significantly improves the
+performance of large-scale point cloud resampling for different applications,
+and provides a highly practical solution.
+
+
+
+ comment: Accepted to AAAI 2025
+
+
+
+
+
+
+ ☆ DECOR:Decomposition and Projection of Text Embeddings for Text-to-Image
+ Customization
+
+
+
+
+
+
+
+
+ Geonhui Jang, Jin-Hwa Kim, Yong-Hyun Park, Junho Kim, Gayoung Lee, Yonghyun Jeong
+
+
+ Text-to-image (T2I) models can effectively capture the content or style of
+reference images to perform high-quality customization. A representative
+technique for this is fine-tuning using low-rank adaptations (LoRA), which
+enables efficient model customization with reference images. However,
+fine-tuning with a limited number of reference images often leads to
+overfitting, resulting in issues such as prompt misalignment or content
+leakage. These issues prevent the model from accurately following the input
+prompt or generating undesired objects during inference. To address this
+problem, we examine the text embeddings that guide the diffusion model during
+inference. This study decomposes the text embedding matrix and conducts a
+component analysis to understand the embedding space geometry and identify the
+cause of overfitting. Based on this, we propose DECOR, which projects text
+embeddings onto a vector space orthogonal to undesired token vectors, thereby
+reducing the influence of unwanted semantics in the text embeddings.
+Experimental results demonstrate that DECOR outperforms state-of-the-art
+customization models and achieves Pareto frontier performance across text and
+visual alignment evaluation metrics. Furthermore, it generates images more
+faithful to the input prompts, showcasing its effectiveness in addressing
+overfitting and enhancing text-to-image customization.
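+
+ The core projection described above, removing the component of the text
+embeddings that lies in the span of undesired token vectors; shapes and the
+QR-based orthonormalisation are assumptions for this sketch:
+
+import numpy as np
+
+def project_out(embeddings, undesired):
+    # Project token embeddings onto the subspace orthogonal to the span of the
+    # undesired token vectors.
+    q, _ = np.linalg.qr(undesired.T)        # orthonormal basis of undesired span
+    return embeddings - embeddings @ q @ q.T
+
+rng = np.random.default_rng(0)
+text_emb = rng.standard_normal((77, 768))    # e.g. CLIP-sized prompt embeddings
+unwanted = rng.standard_normal((3, 768))     # vectors for undesired tokens
+cleaned = project_out(text_emb, unwanted)
+print(np.abs(cleaned @ unwanted.T).max() < 1e-8)   # orthogonal to unwanted tokens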
+
+
+ Generating sound effects for product-level videos, where only a small amount
+of labeled data is available for diverse scenes, requires the production of
+high-quality sounds in few-shot settings. To tackle the challenge of limited
+labeled data in real-world scenes, we introduce YingSound, a foundation model
+designed for video-guided sound generation that supports high-quality audio
+generation in few-shot settings. Specifically, YingSound consists of two major
+modules. The first module uses a conditional flow matching transformer to
+achieve effective semantic alignment in sound generation across audio and
+visual modalities. This module aims to build a learnable audio-visual
+aggregator (AVA) that integrates high-resolution visual features with
+corresponding audio features at multiple stages. The second module is developed
+with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to
+generate finer sound effects in few-shot settings. Finally, an
+industry-standard video-to-audio (V2A) dataset that encompasses various
+real-world scenarios is presented. Through automated evaluations and human
+studies, we show that YingSound effectively generates high-quality synchronized
+sounds across diverse conditional inputs. Project Page:
+\url{https://giantailab.github.io/yingsound/}
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ☆ Pinpoint Counterfactuals: Reducing social bias in foundation models via
+ localized counterfactual generation
+
+
+ Foundation models trained on web-scraped datasets propagate societal biases
+to downstream tasks. While counterfactual generation enables bias analysis,
+existing methods introduce artifacts by modifying contextual elements like
+clothing and background. We present a localized counterfactual generation
+method that preserves image context by constraining counterfactual
+modifications to specific attribute-relevant regions through automated masking
+and guided inpainting. When applied to the Conceptual Captions dataset for
+creating gender counterfactuals, our method results in higher visual and
+semantic fidelity than state-of-the-art alternatives, while maintaining the
+performance of models trained using only real data on non-human-centric tasks.
+Models fine-tuned with our counterfactuals demonstrate measurable bias
+reduction across multiple metrics, including a decrease in gender
+classification disparity and balanced person preference scores, while
+preserving ImageNet zero-shot performance. The results establish a framework
+for creating balanced datasets that enable both accurate bias profiling and
+effective mitigation.
+
+
+
+
+
+
+
+
+ Svetlana Pavlitska, Leopold Müller, J. Marius Zöllner
+
+
+ Adversarial attacks on traffic sign classification models were among the
+first successfully tried in the real world. Since then, the research in this
+area has been mainly restricted to repeating baseline models, such as LISA-CNN
+or GTSRB-CNN, and similar experiment settings, including white and black
+patches on traffic signs. In this work, we decouple model architectures from
+the datasets and additionally evaluate generic models to enable a fair
+comparison. Furthermore, we compare two attack settings, inconspicuous and
+visible, which are usually studied separately rather than compared directly.
+Our results show that standard
+baselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than
+the generic ones. We, therefore, suggest evaluating new attacks on a broader
+spectrum of baselines in the future. Our code is available at
+https://github.com/KASTEL-MobilityLab/attacks-on-traffic-sign-recognition/.
+
+
+
+ comment: Accepted for publication at ICMLA 2024
+
+
+
+
+
+
+ ☆ LVMark: Robust Watermark for latent video diffusion models
+
+
+ Rapid advancements in generative models have made it possible to create
+hyper-realistic videos. As their applicability increases, their unauthorized
+use has raised significant concerns, leading to the growing demand for
+techniques to protect the ownership of the generative model itself. While
+existing watermarking methods effectively embed watermarks into
+image-generative models, they fail to account for temporal information,
+resulting in poor performance when applied to video-generative models. To
+address this issue, we introduce a novel watermarking method called LVMark,
+which embeds watermarks into video diffusion models. A key component of LVMark
+is a selective weight modulation strategy that efficiently embeds watermark
+messages into the video diffusion model while preserving the quality of the
+generated videos. To accurately decode messages in the presence of malicious
+attacks, we design a watermark decoder that leverages spatio-temporal
+information in the 3D wavelet domain through a cross-attention module. To the
+best of our knowledge, our approach is the first to highlight the potential of
+watermarking video-generative models as a valuable tool for strengthening
+ownership protection.
+
+
+
+
+
+
+
+
+ Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo
+
+
+ Studies of the functional role of the primate ventral visual stream have
+traditionally focused on object categorization, often ignoring -- despite much
+prior evidence -- its role in estimating "spatial" latents such as object
+position and pose. Most leading ventral stream models are derived by optimizing
+networks for object categorization, which seems to imply that the ventral
+stream is also derived under such an objective. Here, we explore an alternative
+hypothesis: Might the ventral stream be optimized for estimating spatial
+latents? And a closely related question: How different -- if at all -- are
+representations learned from spatial latent estimation compared to
+categorization? To ask these questions, we leveraged synthetic image datasets
+generated by a 3D graphic engine and trained convolutional neural networks
+(CNNs) to estimate different combinations of spatial and category latents. We
+found that models trained to estimate just a few spatial latents achieve neural
+alignment scores comparable to those trained on hundreds of categories, and the
+spatial latent performance of models strongly correlates with their neural
+alignment. Spatial latent and category-trained models have very similar -- but
+not identical -- internal representations, especially in their early and middle
+layers. We provide evidence that this convergence is partly driven by
+non-target latent variability in the training data, which facilitates the
+implicit learning of representations of those non-target latents. Taken
+together, these results suggest that many training objectives, such as spatial
+latents, can lead to similar models aligned neurally with the ventral stream.
+Thus, one should not assume that the ventral stream is optimized for object
+categorization only. As a field, we need to continue to sharpen our measures of
+comparing models to brains to better understand the functional roles of the
+ventral stream.
+
+
+ Event cameras hold significant promise for high-temporal-resolution (HTR)
+motion estimation. However, estimating event-based HTR optical flow faces two
+key challenges: the absence of HTR ground-truth data and the intrinsic sparsity
+of event data. Most existing approaches rely on the flow accumulation paradigms
+to indirectly supervise intermediate flows, often resulting in accumulation
+errors and optimization difficulties. To address these challenges, we propose a
+residual-based paradigm for estimating HTR optical flow with event data. Our
+approach separates HTR flow estimation into two stages: global linear motion
+estimation and HTR residual flow refinement. The residual paradigm effectively
+mitigates the impacts of event sparsity on optimization and is compatible with
+any low-temporal-resolution (LTR) algorithm. Next, to address the challenge
+posed by the absence of HTR
+ground truth, we incorporate novel learning strategies. Specifically, we
+initially employ a shared refiner to estimate the residual flows, enabling both
+LTR supervision and HTR inference. Subsequently, we introduce regional noise to
+simulate the residual patterns of intermediate flows, facilitating the
+adaptation from LTR supervision to HTR inference. Additionally, we show that
+the noise-based strategy supports in-domain self-supervised training.
+Comprehensive experimental results demonstrate that our approach achieves
+state-of-the-art accuracy in both LTR and HTR metrics, highlighting its
+effectiveness and superiority.
+
+
+
+ comment: 10 pages, 8 figures
+
+
+
+
+
+
+ ☆ Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and
+ Method
+
+
+
+
+
+
+
+
+ Xinshuai Song, Weixing Chen, Yang Liu, Weikai Chen, Guanbin Li, Liang Lin
+
+
+ Existing Vision-Language Navigation (VLN) methods primarily focus on
+single-stage navigation, limiting their effectiveness in multi-stage and
+long-horizon tasks within complex and dynamic environments. To address these
+limitations, we propose a novel VLN task, named Long-Horizon Vision-Language
+Navigation (LH-VLN), which emphasizes long-term planning and decision
+consistency across consecutive subtasks. Furthermore, to support LH-VLN, we
+develop an automated data generation platform NavGen, which constructs datasets
+with complex task structures and improves data utility through a bidirectional,
+multi-granularity generation approach. To accurately evaluate complex tasks, we
+construct the Long-Horizon Planning and Reasoning in VLN (LHPR-VLN) benchmark
+consisting of 3,260 tasks with an average of 150 task steps, serving as the
+first dataset specifically designed for the long-horizon vision-language
+navigation task. Furthermore, we propose Independent Success Rate (ISR),
+Conditional Success Rate (CSR), and CSR weight by Ground Truth (CGT) metrics,
+to provide fine-grained assessments of task completion. To improve model
+adaptability in complex tasks, we propose a novel Multi-Granularity Dynamic
+Memory (MGDM) module that integrates short-term memory blurring with long-term
+memory retrieval to enable flexible navigation in dynamic environments. Our
+platform, benchmark and method supply LH-VLN with a robust data generation
+pipeline, comprehensive model evaluation dataset, reasonable metrics, and a
+novel VLN model, establishing a foundational framework for advancing LH-VLN.
+
+
+
+
+
+
+
+
+ Jin-Seop Lee, Noo-ri Kim, Jee-Hyong Lee
+
+
+ Self-supervised learning (SSL) methods based on the instance discrimination
+tasks with InfoNCE have achieved remarkable success. Despite their success, SSL
+models often struggle to generate effective representations for unseen-domain
+data. To address this issue, research on unsupervised domain generalization
+(UDG), which aims to develop SSL models that can generate domain-irrelevant
+features, has been conducted. Most UDG approaches utilize contrastive learning
+with InfoNCE to generate representations, and perform feature alignment based
+on strong assumptions to generalize domain-irrelevant common features from
+multi-source domains. However, existing methods that rely on instance
+discrimination tasks are not effective at extracting domain-irrelevant common
+features. This leads to the suppression of domain-irrelevant common features
+and the amplification of domain-relevant features, thereby hindering domain
+generalization. Furthermore, strong assumptions underlying feature alignment
+can lead to biased feature learning, reducing the diversity of common features.
+In this paper, we propose a novel approach, DomCLP, Domain-wise Contrastive
+Learning with Prototype Mixup. We explore how InfoNCE suppresses
+domain-irrelevant common features and amplifies domain-relevant features. Based
+on this analysis, we propose Domain-wise Contrastive Learning (DCon) to enhance
+domain-irrelevant common features. We also propose Prototype Mixup Learning
+(PMix) to generalize domain-irrelevant common features across multiple domains
+without relying on strong assumptions. The proposed method consistently
+outperforms state-of-the-art methods on the PACS and DomainNet datasets across
+various label fractions, showing significant improvements. Our code will be
+released. Our project page is available at https://github.com/jinsuby/DomCLP.
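+
+The prototype-mixup idea lends itself to a compact sketch: interpolate
+prototypes coming from different domains and pull features toward the mixed
+prototype. The PyTorch toy below, with made-up shapes and a plain cosine
+consistency term, is a guess at the general mechanism rather than the authors'
+DomCLP objective.
+
+import torch
+import torch.nn.functional as F
+
+def prototype_mixup(proto_a: torch.Tensor, proto_b: torch.Tensor, alpha: float = 1.0):
+    """Mix two sets of prototypes (one per domain) with a Beta-sampled coefficient."""
+    lam = torch.distributions.Beta(alpha, alpha).sample()
+    return lam * proto_a + (1.0 - lam) * proto_b, lam
+
+# Toy example: 10 prototypes of dimension 128 from two source domains.
+proto_dom1 = F.normalize(torch.randn(10, 128), dim=-1)
+proto_dom2 = F.normalize(torch.randn(10, 128), dim=-1)
+mixed, lam = prototype_mixup(proto_dom1, proto_dom2)
+
+# Features are pulled toward the mixed prototypes they are assigned to,
+# discouraging representations that collapse onto a single domain.
+features = F.normalize(torch.randn(32, 128), dim=-1)
+assignments = torch.randint(0, 10, (32,))       # hypothetical cluster ids
+loss = 1.0 - F.cosine_similarity(features, mixed[assignments], dim=-1).mean()
+print(float(lam), float(loss))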
+
+
+ Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen
+source domains to unseen target domains, which is crucial for evaluating the
+generalization and robustness of models. Recent studies focus on utilizing
+visual styles to bridge the domain gap between different domains. However, the
+serious dilemma of gradient instability and local optimization problem occurs
+in those style-based CD-FSL methods. This paper addresses these issues and
+proposes a novel crop-global style perturbation method, called
+Self-Versatility Adversarial Style Perturbation (SVasP), which jointly
+enhances gradient stability and escapes poor sharp minima. Specifically, SVasP
+simulates more diverse potential target domain adversarial styles via
+diversifying input patterns and aggregating localized crop style gradients, to
+serve as global style perturbation stabilizers within one image, a concept we
+refer to as self-versatility. Then a novel objective function is proposed to
+maximize visual discrepancy while maintaining semantic consistency between
+global, crop, and adversarial features. Having the stabilized global style
+perturbation in the training phase, one can obtain a flattened minima in the
+loss landscape, boosting the transferability of the model to the target
+domains. Extensive experiments on multiple benchmark datasets demonstrate that
+our method significantly outperforms existing state-of-the-art methods. Our
+codes are available at https://github.com/liwenqianSEU/SVasP.
+
+
+
+
+
+
+
+
+ Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, Seungryong Kim
+
+
+ In this work, we explore new perspectives on cross-view completion learning
+by drawing an analogy to self-supervised correspondence learning. Through our
+analysis, we demonstrate that the cross-attention map within cross-view
+completion models captures correspondence more effectively than other
+correlations derived from encoder or decoder features. We verify the
+effectiveness of the cross-attention map by evaluating on both zero-shot
+matching and learning-based geometric matching and multi-frame depth
+estimation. Project page is available at https://cvlab-kaist.github.io/ZeroCo/.
+
+
+ Image classification serves as the cornerstone of computer vision,
+traditionally achieved through discriminative models based on deep neural
+networks. Recent advancements have introduced classification methods derived
+from generative models, which offer the advantage of zero-shot classification.
+However, these methods suffer from two main drawbacks: high computational
+overhead and inferior performance compared to discriminative models. Inspired
+by the coordinated cognitive processes of rapid-slow pathway interactions in
+the human brain during visual signal recognition, we propose the
+Diffusion-Based Discriminative Model Enhancement Framework (DBMEF). This
+framework seamlessly integrates discriminative and generative models in a
+training-free manner, leveraging discriminative models for initial predictions
+and endowing deep neural networks with rethinking capabilities via diffusion
+models. Consequently, DBMEF can effectively enhance the classification accuracy
+and generalization capability of discriminative models in a plug-and-play
+manner. We have conducted extensive experiments across 17 prevalent deep model
+architectures with different training methods, including both CNN-based models
+such as ResNet and Transformer-based models like ViT, to demonstrate the
+effectiveness of the proposed DBMEF. Specifically, the framework yields a
+1.51% performance improvement for ResNet-50 on the ImageNet dataset and 3.02%
+on the ImageNet-A dataset. In conclusion, our research introduces a novel
+paradigm for image classification, demonstrating stable improvements across
+different datasets and neural networks.
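+
+A minimal sketch of the rethinking flow described above: the discriminative
+model makes a fast initial prediction, and only low-confidence samples are
+re-scored by a slower generative scorer over the top-k candidate classes. The
+generative_rescore function is a dummy stand-in for diffusion-based scoring;
+the threshold and k are arbitrary assumptions.
+
+import torch
+import torch.nn.functional as F
+
+def generative_rescore(image: torch.Tensor, candidate_classes: torch.Tensor) -> torch.Tensor:
+    """Dummy stand-in for a diffusion-based class score (e.g. negative denoising
+    loss per candidate class); returns one score per candidate."""
+    return torch.randn(len(candidate_classes))
+
+def classify_with_rethinking(logits: torch.Tensor, image: torch.Tensor,
+                             conf_threshold: float = 0.7, top_k: int = 5) -> int:
+    probs = F.softmax(logits, dim=-1)
+    conf, pred = probs.max(dim=-1)
+    if conf >= conf_threshold:                  # fast path: trust the discriminative model
+        return int(pred)
+    top_probs, top_classes = probs.topk(top_k)  # slow path: rethink the top-k classes
+    scores = generative_rescore(image, top_classes)
+    return int(top_classes[scores.argmax()])
+
+# Toy usage with random logits over 1000 ImageNet-style classes.
+logits = torch.randn(1000)
+print(classify_with_rethinking(logits, image=torch.randn(3, 224, 224)))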
+
+
+
+ comment: Accepted by AAAI2025
+
+
+
+
+
+
+ ☆ Hyperbolic-constraint Point Cloud Reconstruction from Single RGB-D
+ Images AAAI25
+
+
+ Reconstructing desired objects and scenes has long been a primary goal in 3D
+computer vision. Single-view point cloud reconstruction has become a popular
+technique due to its low cost and accurate results. However, single-view
+reconstruction methods often rely on expensive CAD models and complex geometric
+priors. Effectively utilizing prior knowledge about the data remains a
+challenge. In this paper, we introduce hyperbolic space to 3D point cloud
+reconstruction, enabling the model to represent and understand complex
+hierarchical structures in point clouds with low distortion. We build upon
+previous methods by proposing a hyperbolic Chamfer distance and a regularized
+triplet loss to enhance the relationship between partial and complete point
+clouds. Additionally, we design adaptive boundary conditions to improve the
+model's understanding and reconstruction of 3D structures. Our model
+outperforms most existing models, and ablation studies demonstrate the
+significance of our model and its components. Experimental results show that
+our method significantly improves feature extraction capabilities. Our model
+achieves outstanding performance in 3D reconstruction tasks.
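+
+To make the hyperbolic Chamfer distance concrete, the sketch below swaps the
+Euclidean metric inside a standard Chamfer loss for a Poincare-ball distance.
+The choice of the Poincare model, the curvature, and how points are scaled into
+the ball are assumptions; the paper's formulation may differ.
+
+import torch
+
+def poincare_dist(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
+    """Pairwise Poincare-ball distances between point sets x (N, d) and y (M, d)."""
+    x2 = (x * x).sum(-1, keepdim=True)          # (N, 1)
+    y2 = (y * y).sum(-1, keepdim=True).T        # (1, M)
+    xy = torch.cdist(x, y) ** 2                 # squared Euclidean, (N, M)
+    denom = (1 - x2).clamp_min(eps) * (1 - y2).clamp_min(eps)
+    return torch.acosh(1 + 2 * xy / denom + eps)
+
+def hyperbolic_chamfer(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
+    """Chamfer distance with the Euclidean metric replaced by a hyperbolic one."""
+    d = poincare_dist(p, q)
+    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()
+
+# Toy point clouds, rescaled to lie strictly inside the unit ball.
+partial  = torch.rand(256, 3) * 0.6 - 0.3
+complete = torch.rand(512, 3) * 0.6 - 0.3
+print(float(hyperbolic_chamfer(partial, complete)))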
+
+
+ Spatial contexts, such as the backgrounds and surroundings, are considered
+critical in Human-Object Interaction (HOI) recognition, especially when the
+instance-centric foreground is blurred or occluded. Recent advancements in HOI
+detectors are usually built upon detection transformer pipelines. While such an
+object-detection-oriented paradigm shows promise in localizing objects, its
+exploration of spatial context is often insufficient for accurately recognizing
+human actions. To enhance the capabilities of object detectors for HOI
+detection, we present a dual-branch framework named ContextHOI, which
+efficiently captures both object detection features and spatial contexts. In
+the context branch, we train the model to extract informative spatial context
+without requiring additional hand-craft background labels. Furthermore, we
+introduce context-aware spatial and semantic supervision to the context branch
+to filter out irrelevant noise and capture informative contexts. ContextHOI
+achieves state-of-the-art performance on the HICO-DET and v-coco benchmarks.
+For further validation, we construct a novel benchmark, HICO-ambiguous, which
+is a subset of HICO-DET that contains images with occluded or impaired instance
+cues. Extensive experiments across all benchmarks, complemented by
+visualizations, underscore the enhancements provided by ContextHOI, especially
+in recognizing interactions involving occluded or blurred instances.
+
+
+
+ comment: in proceedings of the 39th AAAI Conference on Artificial Intelligence
+ (AAAI-25)
+
+
+
+
+
+
+ ☆ Motif Guided Graph Transformer with Combinatorial Skeleton Prototype
+ Learning for Skeleton-Based Person Re-Identification AAAI 2025
+
+
+ Person re-identification (re-ID) via 3D skeleton data is a challenging task
+with significant value in many scenarios. Existing skeleton-based methods
+typically assume virtual motion relations between all joints, and adopt average
+joint or sequence representations for learning. However, they rarely explore
+key body structure and motion such as gait to focus on more important body
+joints or limbs, while lacking the ability to fully mine valuable
+spatial-temporal sub-patterns of skeletons to enhance model learning. This
+paper presents a generic Motif guided graph transformer with Combinatorial
+skeleton prototype learning (MoCos) that exploits structure-specific and
+gait-related body relations as well as combinatorial features of skeleton
+graphs to learn effective skeleton representations for person re-ID. In
+particular, motivated by the locality within joints' structure and the
+body-component collaboration in gait, we first propose the motif guided graph
+transformer (MGT) that incorporates hierarchical structural motifs and gait
+collaborative motifs, which simultaneously focuses on multi-order local joint
+correlations and key cooperative body parts to enhance skeleton relation
+learning. Then, we devise the combinatorial skeleton prototype learning (CSP)
+that leverages random spatial-temporal combinations of joint nodes and skeleton
+graphs to generate diverse sub-skeleton and sub-tracklet representations, which
+are contrasted with the most representative features (prototypes) of each
+identity to learn class-related semantics and discriminative skeleton
+representations. Extensive experiments validate the superior performance of
+MoCos over existing state-of-the-art models. We further show its generality
+under RGB-estimated skeletons, different graph modeling, and unsupervised
+scenarios.
+
+
+
+ comment: Accepted by AAAI 2025. Codes are available at
+ https://github.com/Kali-Hac/MoCos
+
+
+
+
+
+
+ ☆ DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous
+ Driving
+
+
+ Photorealistic 4D reconstruction of street scenes is essential for developing
+real-world simulators in autonomous driving. However, most existing methods
+perform this task offline and rely on time-consuming iterative processes,
+limiting their practical applications. To this end, we introduce the Large 4D
+Gaussian Reconstruction Model (DrivingRecon), a generalizable driving scene
+reconstruction model, which directly predicts 4D Gaussian from surround view
+videos. To better integrate the surround-view images, the Prune and Dilate
+Block (PD-Block) is proposed to eliminate overlapping Gaussian points between
+adjacent views and remove redundant background points. To enhance
+cross-temporal information, dynamic and static decoupling is tailored to better
+learn geometry and motion features. Experimental results demonstrate that
+DrivingRecon significantly improves scene reconstruction quality and novel view
+synthesis compared to existing methods. Furthermore, we explore applications of
+DrivingRecon in model pre-training, vehicle adaptation, and scene editing. Our
+code is available at https://github.com/EnVision-Research/DriveRecon.
+
+
+
+
+
+
+
+ ♻ ☆ LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
+
+
+
+
+
+
+
+
+ Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das
+
+
+ Current Large Language Vision Models (LLVMs) trained on web videos perform
+well in general video understanding but struggle with fine-grained details,
+complex human-object interactions (HOI), and view-invariant representation
+learning essential for Activities of Daily Living (ADL). This limitation stems
+from a lack of specialized ADL video instruction-tuning datasets and
+insufficient modality integration to capture discriminative action
+representations. To address this, we propose a semi-automated framework for
+curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS
+instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM
+integrating videos, 3D skeletons, and HOIs to model ADL's complex
+spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of
+all modalities yields suboptimal results; thus, we propose a Multimodal
+Progressive (MMPro) training strategy, incorporating modalities in stages
+following a curriculum. We also establish ADL MCQ and video description
+benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL
+achieves state-of-the-art performance across ADL benchmarks. Code and data will
+be made publicly available at: https://adl-x.github.io/.
+
+
+
+
+
+
+
+
+ Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch
+
+
+ Recent work on studying memorization in self-supervised learning (SSL)
+suggests that even though SSL encoders are trained on millions of images, they
+still memorize individual data points. While effort has been put into
+characterizing the memorized data and linking encoder memorization to
+downstream utility, little is known about where the memorization happens inside
+SSL encoders. To close this gap, we propose two metrics for localizing
+memorization in SSL encoders on a per-layer (layermem) and per-unit basis
+(unitmem). Our localization methods are independent of the downstream task, do
+not require any label information, and can be performed in a forward pass. By
+localizing memorization in various encoder architectures (convolutional and
+transformer-based) trained on diverse datasets with contrastive and
+non-contrastive SSL frameworks, we find that (1) while SSL memorization
+increases with layer depth, highly memorizing units are distributed across the
+entire encoder, (2) a significant fraction of units in SSL encoders experiences
+surprisingly high memorization of individual data points, which is in contrast
+to models trained under supervision, (3) atypical (or outlier) data points
+cause much higher layer and unit memorization than standard data points, and
+(4) in vision transformers, most memorization happens in the fully-connected
+layers. Finally, we show that localizing memorization in SSL has the potential
+to improve fine-tuning and to inform pruning strategies.
+
+
+
+ comment: Accepted at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Learning Flow Fields in Attention for Controllable Person Image
+ Generation
+
+
+
+
+
+
+
+
+ Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
+
+
+ Controllable person image generation aims to generate a person image
+conditioned on reference images, allowing precise control over the person's
+appearance or pose. However, prior methods often distort fine-grained textural
+details from the reference image, despite achieving high overall image quality.
+We attribute these distortions to inadequate attention to corresponding regions
+in the reference image. To address this, we propose learning flow
+fields in attention (Leffa), which explicitly guides the target query to attend
+to the correct reference key in the attention layer during training.
+Specifically, it is realized via a regularization loss on top of the attention
+map within a diffusion-based baseline. Our extensive experiments show that
+Leffa achieves state-of-the-art performance in controlling appearance (virtual
+try-on) and pose (pose transfer), significantly reducing fine-grained detail
+distortion while maintaining high image quality. Additionally, we show that our
+loss is model-agnostic and can be used to improve the performance of other
+diffusion models.
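+
+The attention regularization can be sketched as a loss that pushes each target
+query's attention distribution toward the reference key it should match
+according to a flow/correspondence field. The cross-entropy form and tensor
+shapes below are simplifications for illustration, not the Leffa loss itself.
+
+import torch
+import torch.nn.functional as F
+
+def attention_flow_loss(attn: torch.Tensor, target_keys: torch.Tensor) -> torch.Tensor:
+    """attn: (B, Q, K) attention maps (rows sum to 1) from target queries to
+    reference keys; target_keys: (B, Q) index of the reference key each query
+    should attend to, derived from a flow / correspondence field."""
+    log_attn = torch.log(attn.clamp_min(1e-8))
+    return F.nll_loss(log_attn.flatten(0, 1), target_keys.flatten())
+
+# Toy example: batch of 2, 64 target queries, 64 reference keys.
+attn = torch.randn(2, 64, 64).softmax(dim=-1)
+target_keys = torch.randint(0, 64, (2, 64))
+print(float(attention_flow_loss(attn, target_keys)))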
+
+
+ Breast cancer is the most common cancer type in women worldwide. Early
+detection and appropriate treatment can significantly reduce its impact. While
+histopathology examinations play a vital role in rapid and accurate diagnosis,
+they often require experienced medical experts for proper recognition and
+cancer grading. Automated image retrieval systems have the potential to assist
+pathologists in identifying cancerous tissues, thereby accelerating the
+diagnostic process. Nevertheless, proposing an accurate image retrieval model
+is challenging due to considerable variability among the tissue and cell
+patterns in histological images. In this work, we leverage the features from
+foundation models in a novel attention-based adversarially regularized
+variational graph autoencoder model for breast histological image retrieval.
+Our results confirm the superior performance of models trained with foundation
+model features compared to those using pre-trained convolutional neural
+networks (up to 7.7% and 15.5% for mAP and mMV, respectively), with the
+pre-trained general-purpose self-supervised model for computational pathology
+(UNI) delivering the best overall performance. By evaluating two publicly
+available histology image datasets of breast cancer, our top-performing model,
+trained with UNI features, achieved average mAP/mMV scores of 96.7%/91.5% and
+97.6%/94.2% for the BreakHis and BACH datasets, respectively. Our proposed
+retrieval model has the potential to be used in clinical settings to enhance
+diagnostic performance and ultimately benefit patients.
+
+
+
+
+
+
+
+
+ Hu Xu, Po-Yao Huang, Xiaoqing Ellen Tan, Ching-Feng Yeh, Jacob Kahn, Christine Jou, Gargi Ghosh, Omer Levy, Luke Zettlemoyer, Wen-tau Yih, Shang-Wen Li, Saining Xie, Christoph Feichtenhofer
+
+
+ This paper focuses on creating synthetic data to improve the quality of image
+captions. Existing works typically have two shortcomings. First, they caption
+images from scratch, ignoring existing alt-text metadata, and second, they lack
+transparency when the captioners' training data (e.g. GPT) is unknown. In this
+paper, we study a principled approach Altogether based on the key idea to edit
+and re-align existing alt-texts associated with the images. To generate
+training data, we perform human annotation where annotators start with the
+existing alt-text and re-align it to the image content in multiple rounds,
+consequently constructing captions with rich visual concepts. This differs from
+prior work that carries out human annotation as a one-time description task
+solely based on images and annotator knowledge. We train a captioner on this
+data that generalizes the process of re-aligning alt-texts at scale. Our
+results show our Altogether approach leads to richer image captions that also
+improve text-to-image generation and zero-shot image classification tasks.
+
+
+
+ comment: accepted by EMNLP 2024; Meta CLIP 1.2 Data Engine
+
+
+
+
+
+
+ ♻ ☆ Disentangling Mean Embeddings for Better Diagnostics of Image Generators NeurIPS 2024
+
+
+
+
+
+
+
+
+ Sebastian G. Gruber, Pascal Tobias Ziegler, Florian Buettner
+
+
+ The evaluation of image generators remains a challenge due to the limitations
+of traditional metrics in providing nuanced insights into specific image
+regions. This is a critical problem as not all regions of an image may be
+learned with similar ease. In this work, we propose a novel approach to
+disentangle the cosine similarity of mean embeddings into the product of cosine
+similarities for individual pixel clusters via central kernel alignment.
+Consequently, we can quantify the contribution of the cluster-wise performance
+to the overall image generation performance. We demonstrate how this enhances
+the explainability and the likelihood of identifying pixel regions of model
+misbehavior across various real-world use cases.
+
+
+
+ comment: Published at Interpretable AI: Past, Present and Future Workshop at
+ NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Liquid: Language Models are Scalable Multi-modal Generators
+
+
+
+
+
+
+
+
+ Junfeng Wu, Yi Jiang, Chuofan Ma, Yuliang Liu, Hengshuang Zhao, Zehuan Yuan, Song Bai, Xiang Bai
+
+
+ We present Liquid, an auto-regressive generation paradigm that seamlessly
+integrates visual comprehension and generation by tokenizing images into
+discrete codes and learning these code embeddings alongside text tokens within
+a shared feature space for both vision and language. Unlike previous multimodal
+large language model (MLLM), Liquid achieves this integration using a single
+large language model (LLM), eliminating the need for external pretrained visual
+embeddings such as CLIP. For the first time, Liquid uncovers a scaling law:
+the performance drop unavoidably brought by unified training of visual and
+language tasks diminishes as model size increases. Furthermore, the unified
+token space enables visual generation and comprehension tasks to enhance each
+other, effectively removing the typical interference seen in
+earlier models. We show that existing LLMs can serve as strong foundations for
+Liquid, saving 100x in training costs while outperforming Chameleon in
+multimodal capabilities and maintaining language performance comparable to
+mainstream LLMs like LLAMA2. Liquid also outperforms models like SD v2.1 and
+SD-XL (FID of 5.47 on MJHQ-30K), excelling in both vision-language and
+text-only tasks. This work demonstrates that LLMs such as LLAMA3.2 and GEMMA2
+are powerful multimodal generators, offering a scalable solution for enhancing
+both vision-language understanding and generation. The code and models will be
+released at https://github.com/FoundationVision/Liquid.
+
+
+
+
+
+
+
+
+ Imad Ali Shah, Jiarong Li, Martin Glavin, Edward Jones, Enda Ward, Brian Deegan
+
+
+ Hyperspectral Imaging (HSI) is known for its advantages over traditional RGB
+imaging in remote sensing, agriculture, and medicine. Recently, it has gained
+attention for enhancing Advanced Driving Assistance Systems (ADAS) perception.
+Several HSI datasets such as HyKo, HSI-Drive, HSI-Road, and Hyperspectral City
+have been made available. However, a comprehensive evaluation of semantic
+segmentation models (SSM) using these datasets is lacking. To address this gap,
+we evaluated the available annotated HSI datasets on four deep learning-based
+baseline SSMs: DeepLab v3+, HRNet, PSPNet, and U-Net, along with its two
+variants: Coordinate Attention (UNet-CA) and Convolutional Block-Attention
+Module (UNet-CBAM). The original model architectures were adapted to handle the
+varying spatial and spectral dimensions of the datasets. These baseline SSMs
+were trained using a class-weighted loss function for individual HSI datasets
+and evaluated using mean-based metrics such as intersection over union (IoU),
+recall, precision, F1 score, specificity, and accuracy. Our results indicate
+that UNet-CBAM, which extracts channel-wise features, outperforms other SSMs
+and shows potential to leverage spectral information for enhanced semantic
+segmentation. This study establishes a baseline SSM benchmark on available
+annotated datasets for future evaluation of HSI-based ADAS perception. However,
+limitations of current HSI datasets, such as limited dataset size, high class
+imbalance, and lack of fine-grained annotations, remain significant constraints
+for developing robust SSMs for ADAS applications.
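+
+The class-weighted loss used to train the baseline SSMs can be illustrated with
+inverse-frequency weights passed to cross-entropy, as in the PyTorch snippet
+below; the exact weighting formula is an assumption, since the abstract does
+not spell it out.
+
+import torch
+import torch.nn as nn
+
+def inverse_frequency_weights(labels: torch.Tensor, num_classes: int) -> torch.Tensor:
+    """Per-class weights proportional to 1 / class frequency (ignoring index 255)."""
+    valid = labels[labels != 255]
+    counts = torch.bincount(valid.flatten(), minlength=num_classes).float().clamp_min(1)
+    return counts.sum() / (num_classes * counts)
+
+# Toy HSI-style batch: logits (B, C, H, W) and integer label maps (B, H, W).
+num_classes = 10
+logits = torch.randn(2, num_classes, 64, 64)
+labels = torch.randint(0, num_classes, (2, 64, 64))
+criterion = nn.CrossEntropyLoss(weight=inverse_frequency_weights(labels, num_classes),
+                                ignore_index=255)
+print(float(criterion(logits, labels)))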
+
+
+
+ comment: Accepted and Presented at IEEE WHISPERS 2024
+
+
+
+
+
+
+ ♻ ☆ Distribution-Level Feature Distancing for Machine Unlearning: Towards a
+ Better Trade-off Between Model Utility and Forgetting AAAI 2025
+
+
+ With the explosive growth of deep learning applications and increasing
+privacy concerns, the right to be forgotten has become a critical requirement
+in various AI industries. For example, given a facial recognition system, some
+individuals may wish to remove their personal data that might have been used in
+the training phase. Unfortunately, deep neural networks sometimes unexpectedly
+leak personal identities, making this removal challenging. While recent machine
+unlearning algorithms aim to enable models to forget specific data, we identify
+an unintended utility drop (correlation collapse), in which the essential
+correlations between image features and true labels weaken during the
+forgetting process. To address this challenge, we propose Distribution-Level
+Feature Distancing (DLFD), a novel method that efficiently forgets instances
+while preserving task-relevant feature correlations. Our method synthesizes
+data samples by optimizing the feature distribution to be distinctly different
+from that of forget samples, achieving effective results within a single
+training epoch. Through extensive experiments on facial recognition datasets,
+we demonstrate that our approach significantly outperforms state-of-the-art
+machine unlearning methods in both forgetting performance and model utility
+preservation.
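+
+One way to picture distribution-level feature distancing is an objective that
+pushes the feature distribution of synthesized/retained samples away from that
+of the forget samples while a task loss preserves utility. The sketch below
+uses an RBF-kernel MMD purely as a stand-in distance; the paper's actual
+objective is not specified in the abstract.
+
+import torch
+
+def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
+    """Squared MMD between feature batches x (N, d) and y (M, d) with an RBF kernel."""
+    def k(a, b):
+        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
+    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
+
+# Toy unlearning step: increase the distance between retained-sample features
+# and forget-sample features (a task loss would keep the model useful).
+retain_feats = torch.randn(128, 64, requires_grad=True)
+forget_feats = torch.randn(128, 64)
+distancing_loss = -rbf_mmd(retain_feats, forget_feats)  # minimizing pushes them apart
+distancing_loss.backward()
+print(float(distancing_loss))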
+
+
+
+ comment: 10 pages, 6 figures, AAAI 2025 camera ready version
+
+
+
+
+
+
+ ♻ ☆ EVQAScore: Efficient Video Question Answering Data Evaluation
+
+
+ Video question-answering (QA) is a core task in video understanding.
+Evaluating the quality of video QA and video caption data quality for training
+video large language models (VideoLLMs) is an essential challenge. Although
+various methods have been proposed for assessing video caption quality, there
+remains a lack of dedicated evaluation methods for Video QA. To address this
+gap, we introduce EVQAScore, a reference-free method that leverages keyword
+extraction to assess both video caption and video QA data quality.
+Additionally, we incorporate frame sampling and rescaling techniques to enhance
+the efficiency and robustness of our evaluation; this enables our score to
+assess the quality of extremely long videos. Our approach achieves
+state-of-the-art (SOTA) performance (32.8 for Kendall correlation and 42.3 for
+Spearman correlation, 4.7 and 5.9 higher than the previous method PAC-S++) on
+the VATEX-EVAL benchmark for video caption evaluation. Furthermore, by using
+EVQAScore for data selection, we achieved SOTA results with only 12.5% of the
+original data volume, outperforming the previous SOTA method PAC-S trained on
+100% of the data.
+
+
+
+
+
+
+
+ ♻ ☆ On the Robustness of Kolmogorov-Arnold Networks: An Adversarial
+ Perspective
+
+
+
+
+
+
+
+
+ Tal Alter, Raz Lapid, Moshe Sipper
+
+
+ Kolmogorov-Arnold Networks (KANs) have recently emerged as a novel approach
+to function approximation, demonstrating remarkable potential in various
+domains. Despite their theoretical promise, the robustness of KANs under
+adversarial conditions has yet to be thoroughly examined. In this paper we
+explore the adversarial robustness of KANs, with a particular focus on image
+classification tasks. We assess the performance of KANs against standard
+white-box and black-box adversarial attacks, comparing their resilience to that of
+established neural network architectures. Our experimental evaluation
+encompasses a variety of standard image classification benchmark datasets and
+investigates both fully connected and convolutional neural network
+architectures, of three sizes: small, medium, and large. We conclude that
+small- and medium-sized KANs (either fully connected or convolutional) are not
+consistently more robust than their standard counterparts, but that large-sized
+KANs are, by and large, more robust. This comprehensive evaluation of KANs in
+adversarial scenarios offers the first in-depth analysis of KAN security,
+laying the groundwork for future research in this emerging field.
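+
+For readers who want to reproduce this kind of white-box comparison, a standard
+single-step FGSM evaluation loop looks roughly like the sketch below. It is
+generic (any classifier with a PyTorch-style data loader), not the authors'
+evaluation code, and the epsilon value is arbitrary.
+
+import torch
+import torch.nn.functional as F
+
+def fgsm_accuracy(model, loader, epsilon: float = 8 / 255, device: str = "cpu") -> float:
+    """Accuracy under single-step FGSM perturbations (inputs assumed in [0, 1])."""
+    model.eval()
+    correct, total = 0, 0
+    for images, labels in loader:
+        images, labels = images.to(device), labels.to(device)
+        images.requires_grad_(True)
+        loss = F.cross_entropy(model(images), labels)
+        grad = torch.autograd.grad(loss, images)[0]
+        adv = (images + epsilon * grad.sign()).clamp(0, 1)
+        with torch.no_grad():
+            correct += (model(adv).argmax(dim=1) == labels).sum().item()
+            total += labels.numel()
+    return correct / total
+
+# Usage sketch: fgsm_accuracy(my_classifier, test_loader, epsilon=8 / 255)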
+
+
+
+
+
+
+
+ ♻ ☆ Video Summarization using Denoising Diffusion Probabilistic Model AAAI2025
+
+
+ Video summarization aims to eliminate visual redundancy while retaining key
+parts of video to construct concise and comprehensive synopses. Most existing
+methods use discriminative models to predict the importance scores of video
+frames. However, these methods are susceptible to annotation inconsistency
+caused by the inherent subjectivity of different annotators when annotating the
+same video. In this paper, we introduce a generative framework for video
+summarization that learns how to generate summaries from a probability
+distribution perspective, effectively reducing the interference of subjective
+annotation noise. Specifically, we propose a novel diffusion summarization
+method based on the Denoising Diffusion Probabilistic Model (DDPM), which
+learns the probability distribution of training data through noise prediction,
+and generates summaries by iterative denoising. Our method is more resistant to
+subjective annotation noise, and is less prone to overfitting the training data
+than discriminative methods, with strong generalization ability. Moreover, to
+facilitate training DDPM with limited data, we employ an unsupervised video
+summarization model to implement the earlier denoising process. Extensive
+experiments on various datasets (TVSum, SumMe, and FPVSum) demonstrate the
+effectiveness of our method.
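+
+To make the generative formulation concrete, here is a minimal DDPM-style
+noise-prediction training step in which the diffused data is a per-frame
+importance-score sequence. The tiny MLP denoiser, linear beta schedule, and all
+shapes are illustrative assumptions rather than the paper's architecture.
+
+import torch
+import torch.nn as nn
+
+T = 1000
+betas = torch.linspace(1e-4, 0.02, T)
+alphas_bar = torch.cumprod(1.0 - betas, dim=0)
+
+denoiser = nn.Sequential(nn.Linear(128 + 1, 256), nn.ReLU(), nn.Linear(256, 128))
+opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)
+
+def train_step(scores: torch.Tensor) -> float:
+    """One DDPM step: add noise at a random timestep and predict that noise.
+    scores: (B, 128) per-frame importance sequences scaled to roughly [-1, 1]."""
+    b = scores.size(0)
+    t = torch.randint(0, T, (b,))
+    noise = torch.randn_like(scores)
+    a_bar = alphas_bar[t].unsqueeze(-1)
+    noisy = a_bar.sqrt() * scores + (1 - a_bar).sqrt() * noise
+    # Condition the denoiser on the normalized timestep; a real summarizer would
+    # also concatenate video features here.
+    inp = torch.cat([noisy, t.float().unsqueeze(-1) / T], dim=-1)
+    loss = ((denoiser(inp) - noise) ** 2).mean()
+    opt.zero_grad(); loss.backward(); opt.step()
+    return float(loss)
+
+print(train_step(torch.rand(8, 128) * 2 - 1))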
+
+
+
+ comment: Accepted by AAAI2025
+
+
+
+
+
+
+ ♻ ☆ DualPM: Dual Posed-Canonical Point Maps for 3D Shape and Pose
+ Reconstruction
+
+
+
+
+
+
+
+
+ Ben Kaye, Tomas Jakab, Shangzhe Wu, Christian Rupprecht, Andrea Vedaldi
+
+
+ The choice of data representation is a key factor in the success of deep
+learning in geometric tasks. For instance, DUSt3R has recently introduced the
+concept of viewpoint-invariant point maps, generalizing depth prediction, and
+showing that one can reduce all the key problems in the 3D reconstruction of
+static scenes to predicting such point maps. In this paper, we develop an
+analogous concept for a very different problem, namely, the reconstruction of
+the 3D shape and pose of deformable objects. To this end, we introduce the Dual
+Point Maps (DualPM), where a pair of point maps is extracted from the same
+image, one associating pixels to their 3D locations on the object, and the
+other to a canonical version of the object at rest pose. We also extend point
+maps to amodal reconstruction, seeing through self-occlusions to obtain the
+complete shape of the object. We show that 3D reconstruction and 3D pose
+estimation reduce to the prediction of the DualPMs. We demonstrate empirically
+that this representation is a good target for a deep network to predict;
+specifically, we consider modeling horses, showing that DualPMs can be trained
+purely on 3D synthetic data, consisting of a single model of a horse, while
+generalizing very well to real images. With this, we improve by a large margin
+previous methods for the 3D analysis and reconstruction of this type of
+objects.
+
+
+
+ comment: First two authors contributed equally. Project page:
+ https://dualpm.github.io
+
+
+
+
+
+
+ ♻ ☆ GaussianOcc: Fully Self-supervised and Efficient 3D Occupancy Estimation
+ with Gaussian Splatting
+
+
+ We introduce GaussianOcc, a systematic method that investigates the two
+usages of Gaussian splatting for fully self-supervised and efficient 3D
+occupancy estimation in surround views. First, traditional methods for
+self-supervised 3D occupancy estimation still require ground truth 6D poses
+from sensors during training. To address this limitation, we propose Gaussian
+Splatting for Projection (GSP) module to provide accurate scale information for
+fully self-supervised training from adjacent view projection. Additionally,
+existing methods rely on volume rendering for final 3D voxel representation
+learning using 2D signals (depth maps, semantic maps), which is both
+time-consuming and less effective. We propose Gaussian Splatting from Voxel
+space (GSV) to leverage the fast rendering properties of Gaussian splatting. As
+a result, the proposed GaussianOcc method enables fully self-supervised (no
+ground truth pose) 3D occupancy estimation in competitive performance with low
+computational cost (2.7 times faster in training and 5 times faster in
+rendering). The relevant code is available at
+https://github.com/GANWANSHUI/GaussianOcc.git.
+
+
+ The increasing frequency and severity of wildfires highlight the need for
+accurate fire and plume spread models. We introduce an approach that
+effectively isolates and tracks fire and plume behavior across various spatial
+and temporal scales and image types, identifying physical phenomena in the
+system and providing insights useful for developing and validating models. Our
+method combines image segmentation and graph theory to delineate fire fronts
+and plume boundaries. We demonstrate that the method effectively distinguishes
+fires and plumes from visually similar objects. Results demonstrate the
+successful isolation and tracking of fire and plume dynamics across various
+image sources, ranging from synoptic-scale ($10^4$-$10^5$ m) satellite images
+to sub-microscale ($10^0$-$10^1$ m) images captured close to the fire
+environment. Furthermore, the methodology leverages image inpainting and
+spatio-temporal dataset generation for use in statistical and machine learning
+models.
+
+
+
+
+
+
+
+ ♻ ☆ Perturb and Recover: Fine-tuning for Effective Backdoor Removal from
+ CLIP
+
+
+
+
+
+
+
+
+ Naman Deep Singh, Francesco Croce, Matthias Hein
+
+
+ Vision-Language models like CLIP have been shown to be highly effective at
+linking visual perception and natural language understanding, enabling
+sophisticated image-text capabilities, including strong retrieval and zero-shot
+classification performance. Their widespread use, as well as the fact that CLIP
+models are trained on image-text pairs from the web, make them both a
+worthwhile and relatively easy target for backdoor attacks. As training
+foundational models, such as CLIP, from scratch is very expensive, this paper
+focuses on cleaning potentially poisoned models via fine-tuning. We first show
+that existing cleaning techniques are not effective against simple structured
+triggers used in Blended or BadNet backdoor attacks, exposing a critical
+vulnerability for potential real-world deployment of these models. Then, we
+introduce PAR, Perturb and Recover, a surprisingly simple yet effective
+mechanism to remove backdoors from CLIP models. Through extensive experiments
+across different encoders and types of backdoor attacks, we show that PAR
+achieves high backdoor removal rate while preserving good standard performance.
+Finally, we illustrate that our approach is effective even only with synthetic
+text-image pairs, i.e. without access to real training data. The code and
+models are available at https://github.com/nmndeep/PerturbAndRecover.
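+
+The method name suggests a simple two-phase mechanism, which can be sketched
+as: add noise to the (potentially poisoned) weights, then fine-tune on clean or
+synthetic image-text pairs to recover utility. The noise scale, which
+parameters are perturbed, and the fine-tuning loop below are guesses for
+illustration; the released code defines the actual procedure.
+
+import itertools
+import torch
+
+def perturb(model: torch.nn.Module, noise_std: float = 0.01) -> None:
+    """Add small Gaussian noise to every weight, disrupting a possible backdoor."""
+    with torch.no_grad():
+        for p in model.parameters():
+            p.add_(noise_std * p.abs().mean() * torch.randn_like(p))
+
+def recover(model: torch.nn.Module, loader, loss_fn, steps: int = 1000, lr: float = 1e-5):
+    """Fine-tune the perturbed model (e.g. with a contrastive image-text loss,
+    supplied as loss_fn) on clean or synthetic pairs to restore utility."""
+    opt = torch.optim.AdamW(model.parameters(), lr=lr)
+    for batch in itertools.islice(itertools.cycle(loader), steps):
+        loss = loss_fn(model, batch)
+        opt.zero_grad(); loss.backward(); opt.step()
+    return model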
+
+
+
+
+
+
+
+ ♻ ☆ Learning Visual Generative Priors without Text
+
+
+ Although text-to-image (T2I) models have recently thrived as visual
+generative priors, their reliance on high-quality text-image pairs makes
+scaling up expensive. We argue that grasping the cross-modality alignment is
+not a necessity for a sound visual generative prior, whose focus should be on
+texture modeling. Such a philosophy inspires us to study image-to-image (I2I)
+generation, where models can learn from in-the-wild images in a self-supervised
+manner. We first develop a pure vision-based training framework, Lumos, and
+confirm the feasibility and the scalability of learning I2I models. We then
+find that, as an upstream task of T2I, our I2I model serves as a more
+foundational visual prior and achieves on-par or better performance than
+existing T2I models using only 1/10 text-image pairs for fine-tuning. We
+further demonstrate the superiority of I2I priors over T2I priors on some
+text-irrelevant visual generative tasks, like image-to-3D and image-to-video.
+Our project page is available at https://xiaomabufei.github.io/lumos.
+
+
+ Multiple object tracking (MOT) from unmanned aerial vehicle (UAV) platforms
+requires efficient motion modeling. This is because UAV-MOT faces both local
+object motion and global camera motion. Motion blur also increases the
+difficulty of detecting large moving objects. Previous UAV motion modeling
+approaches either focus only on local motion or ignore motion blurring effects,
+thus limiting their tracking performance and speed. To address these issues, we
+propose the Motion Mamba Module, which explores both local and global motion
+features through cross-correlation and bi-directional Mamba Modules for better
+motion modeling. To address the detection difficulties caused by motion blur,
+we also design motion margin loss to effectively improve the detection accuracy
+of motion blurred objects. Based on the Motion Mamba module and motion margin
+loss, our proposed MM-Tracker surpasses the state-of-the-art in two widely
+open-source UAV-MOT datasets. Code will be available.
+
+
+
+ comment: Accepted by AAAI2025
+
+
+
+
+
+
+ ♻ ☆ A Multi-Stage Framework for Joint Chest X-Ray Diagnosis and Visual
+ Attention Prediction Using Deep Learning
+
+
+ Purpose: As visual inspection is an inherent process during radiological
+screening, the associated eye gaze data can provide valuable insights into
+relevant clinical decisions. As deep learning has become the state-of-the-art
+for computer-assisted diagnosis, integrating human behavior, such as eye gaze
+data, into these systems is instrumental to help align machine predictions with
+clinical diagnostic criteria, thus enhancing the quality of automatic
+radiological diagnosis. Methods: We propose a novel deep learning framework for
+joint disease diagnosis and prediction of corresponding clinical visual
+attention maps for chest X-ray scans. Specifically, we introduce a new
+dual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a
+Residual and Squeeze-and-Excitation block-based encoder to extract diverse
+features for visual attention map prediction, and a multi-scale feature-fusion
+classifier to perform disease classification. To tackle the issue of
+asynchronous training schedules of individual tasks in multi-task learning, we
+propose a multi-stage cooperative learning strategy, with contrastive learning
+for feature encoder pretraining to boost performance. Results: Our proposed
+method is shown to significantly outperform existing techniques for chest X-ray
+diagnosis (AUC=0.93) and the quality of visual attention map prediction
+(Correlation coefficient=0.58). Conclusion: Benefiting from the proposed
+multi-task multi-stage cooperative learning, our technique demonstrates the
+benefit of integrating clinicians' eye gaze into clinical AI systems to boost
+performance and potentially explainability.
+
+
+
+
+
+
+
+ ♻ ☆ A Survey of Artificial Intelligence in Gait-Based Neurodegenerative
+ Disease Diagnosis
+
+
+ Recent years have witnessed an increasing global population affected by
+neurodegenerative diseases (NDs), which traditionally require extensive
+healthcare resources and human effort for medical diagnosis and monitoring. As
+a crucial disease-related motor symptom, human gait can be exploited to
+characterize different NDs. The current advances in artificial intelligence
+(AI) models enable automatic gait analysis for NDs identification and
+classification, opening a new avenue to facilitate faster and more
+cost-effective diagnosis of NDs. In this paper, we provide a comprehensive
+survey on recent progress of machine learning and deep learning based AI
+techniques applied to diagnosis of five typical NDs through gait. We provide an
+overview of the process of AI-assisted NDs diagnosis, and present a systematic
+taxonomy of existing gait data and AI models. Meanwhile, a novel quality
+evaluation criterion is proposed to quantitatively assess the quality of
+existing studies. Through an extensive review and analysis of 169 studies, we
+present recent technical advancements, discuss existing challenges, potential
+solutions, and future directions in this field. Finally, we envision the
+prospective utilization of 3D skeleton data for human gait representation and
+the development of more efficient AI models for NDs diagnosis.
+
+
+
+ comment: Article: 57 pages, citing 290 papers. Appendix: 30 pages. An
+ up-to-date resource (papers, data, etc.) of this survey (AI4NDD) is provided
+ at https://github.com/minlinzeng/AI4NDD-Survey
+
+
+
+
+
+
+ ♻ ☆ Humans as Checkerboards: Calibrating Camera Motion Scale for
+ World-Coordinate Human Mesh Recovery
+
+
+
+
+
+
+
+
+ Fengyuan Yang, Kerui Gu, Ha Linh Nguyen, Tze Ho Elden Tse, Angela Yao
+
+
+ Accurate camera motion estimation is essential for recovering global human
+motion in world coordinates from RGB video inputs. SLAM is widely used for
+estimating camera trajectory and point cloud, but monocular SLAM does so only
+up to an unknown scale factor. Previous works estimate the scale factor through
+optimization, but this is unreliable and time-consuming. This paper presents an
+optimization-free scale calibration framework, Human as Checkerboard (HAC). HAC
+innovatively leverages the human body predicted by human mesh recovery model as
+a calibration reference. Specifically, it uses the absolute depth of
+human-scene contact joints as references to calibrate the corresponding
+relative scene depth from SLAM. HAC benefits from geometric priors encoded in
+human mesh recovery models to estimate the SLAM scale and achieves precise
+global human motion estimation. Simple yet powerful, our method sets a new
+state-of-the-art performance for global human mesh estimation tasks, reducing
+motion errors by 50% over prior local-to-global methods while using 100x
+less inference time than optimization-based methods. Project page:
+https://martayang.github.io/HAC.
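+
+The core calibration arithmetic reduces to a ratio between metric depths of
+human-scene contact joints (from the mesh-recovery model) and the corresponding
+up-to-scale SLAM depths. The median-ratio estimator below is a simplified
+reading of that idea, not the paper's exact estimator.
+
+import numpy as np
+
+def estimate_slam_scale(human_depths: np.ndarray, slam_depths: np.ndarray) -> float:
+    """human_depths: metric depths of human-scene contact joints predicted by a
+    human mesh recovery model; slam_depths: depths of the same points from
+    monocular SLAM (known only up to scale). Returns the scale s such that
+    s * slam_depths approximates metric depth."""
+    valid = (slam_depths > 0) & (human_depths > 0)
+    ratios = human_depths[valid] / slam_depths[valid]
+    return float(np.median(ratios))   # median is robust to outlier contacts
+
+# Toy check: a ground-truth scale of 3.2 with noisy observations is recovered.
+rng = np.random.default_rng(0)
+slam = rng.uniform(0.5, 2.0, size=200)
+human = 3.2 * slam * rng.normal(1.0, 0.03, size=200)
+print(estimate_slam_scale(human, slam))   # ~3.2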
+
+
+
+ comment: 13 pages, 11 figures, 6 tables
+
+
+
+
+
+
+ ♻ ☆ Improving generative adversarial network inversion via fine-tuning GAN
+ encoders
+
+
+ Generative adversarial networks (GANs) can synthesize high-quality (HQ)
+images, and GAN inversion is a technique that discovers how to invert given
+images back to latent space. While existing methods perform on StyleGAN
+inversion, they have limited performance and are not generalized to different
+GANs. To address these issues, we proposed a self-supervised method to
+pre-train and fine-tune GAN encoders. First, we designed an adaptive block to
+fit different encoder architectures for inverting diverse GANs. Then we
+pre-train GAN encoders using synthesized images and emphasize local regions
+through cropping images. Finally, we fine-tune the pre-trained GAN encoder for
+inverting real images. Compared with state-of-the-art methods, our method
+achieved better results that reconstructed high-quality images on mainstream
+GANs. Our code and pre-trained models are available at:
+https://github.com/disanda/Deep-GAN-Encoders.
+
+
+
+
+
+
+
+ ♻ ☆ Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million
+ Images
+
+
+
+
+
+
+
+
+ Virmarie Maquiling, Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati, Enkelejda Kasneci
+
+
+ We explore the transformative potential of SAM 2, a vision foundation model,
+in advancing gaze estimation and eye tracking technologies. By significantly
+reducing annotation time, lowering technical barriers through its ease of
+deployment, and enhancing segmentation accuracy, SAM 2 addresses critical
+challenges faced by researchers and practitioners. Utilizing its zero-shot
+segmentation capabilities with minimal user input (a single click per video), we
+tested SAM 2 on over 14 million eye images from diverse datasets, including
+virtual reality setups and the world's largest unified dataset recorded using
+wearable eye trackers. Remarkably, in pupil segmentation tasks, SAM 2 matches
+the performance of domain-specific models trained solely on eye images,
+achieving competitive mean Intersection over Union (mIoU) scores of up to 93%
+without fine-tuning. Additionally, we provide our code and segmentation masks
+for these widely used datasets to promote further research.
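+
+The mIoU figures quoted above are the standard mean Intersection over Union;
+for completeness, a minimal binary pupil-mask version is sketched below
+(generic code, not the authors' evaluation script).
+
+import numpy as np
+
+def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
+    """IoU between two boolean masks; returns 1.0 when both masks are empty."""
+    inter = np.logical_and(pred, gt).sum()
+    union = np.logical_or(pred, gt).sum()
+    return 1.0 if union == 0 else inter / union
+
+def mean_iou(preds, gts) -> float:
+    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))
+
+# Toy usage with two random 64x64 pupil masks.
+rng = np.random.default_rng(0)
+preds = [rng.random((64, 64)) > 0.5 for _ in range(2)]
+gts   = [rng.random((64, 64)) > 0.5 for _ in range(2)]
+print(mean_iou(preds, gts))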
+
+
+
+ comment: Virmarie Maquiling and Sean Anthony Byrne contributed equally to this
+ paper, 8 pages, 3 figures, CHI Case Study, pre-print
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony
+ in Talking Head Generation
+
+
+ Animating still face images with deep generative models using a speech input
+signal is an active research topic and has seen important recent
+progress. However, much of the effort has been put into lip syncing and
+rendering quality while the generation of natural head motion, let alone the
+audio-visual correlation between head motion and speech, has often been
+neglected. In this work, we propose a multi-scale audio-visual synchrony loss
+and a multi-scale autoregressive GAN to better handle short and long-term
+correlation between speech and the dynamics of the head and lips. In
+particular, we train a stack of syncer models on multimodal input pyramids and
+use these models as guidance in a multi-scale generator network to produce
+audio-aligned motion unfolding over diverse time scales. Both the pyramid of
+audio-visual syncers and the generative models are trained in a low-dimensional
+space that fully preserves dynamics cues. The experiments show significant
+improvements
+over the state-of-the-art in head motion dynamics quality and especially in
+multi-scale audio-visual synchrony on a collection of benchmark datasets.
+
+
+ Iodinated contrast agents are widely utilized in numerous interventional
+procedures, yet posing substantial health risks to patients. This paper
+presents CAS-GAN, a novel GAN framework that serves as a "virtual contrast
+agent" to synthesize X-ray angiographies via disentanglement representation
+learning and vessel semantic guidance, thereby reducing the reliance on
+iodinated contrast agents during interventional procedures. Specifically, our
+approach disentangles X-ray angiographies into background and vessel
+components, leveraging medical prior knowledge. A specialized predictor then
+learns to map the interrelationships between these components. Additionally, a
+vessel semantic-guided generator and a corresponding loss function are
+introduced to enhance the visual fidelity of generated images. Experimental
+results on the XCAD dataset demonstrate the state-of-the-art performance of our
+CAS-GAN, achieving a FID of 5.87 and a MMD of 0.016. These promising results
+highlight CAS-GAN's potential for clinical applications.
+
+
+
+ comment: IEEE Symposium Series on Computational Intelligence (SSCI 2025)
+
+ Due to the challenges in acquiring paired Text-3D data and the inherent
+irregularity of 3D data structures, combined representation learning of 3D
+point clouds and text remains unexplored. In this paper, we propose a novel
+Riemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D
+retrieval. Specifically, the extracted text and point cloud features are
+refined by their respective Adaptive Feature Refiner (AFR). Furthermore, we
+introduce the innovative Riemann Local Similarity (RLS) module and the Global
+Pooling Similarity (GPS) module. Because 3D point cloud data and text data
+often possess complex geometric structures in high-dimensional space, the
+proposed RLS employs a novel Riemann Attention Mechanism to reflect the
+intrinsic geometric relationships of the data. Without explicitly defining the
+manifold, RMARN learns the manifold parameters to better represent the
+distances between text-point cloud samples. To address the challenges of
+lacking paired text-3D data, we have created the large-scale Text-3D Retrieval
+dataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud
+data. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained
+Chinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs,
+respectively. Experiments on our custom datasets demonstrate the superior
+performance of the proposed method. Our code and proposed datasets are
+available at https://github.com/liwrui/RMARN.
+
+
+
+ comment: Accepted by AAAI25
+
+
+
+
+
+
+ ♻ ☆ Game4Loc: A UAV Geo-Localization Benchmark from Game Data AAAI 2025
+
+
+ The vision-based geo-localization technology for UAV, serving as a secondary
+source of GPS information in addition to the global navigation satellite
+systems (GNSS), can still operate independently in the GPS-denied environment.
+Recent deep learning based methods formulate this as an image matching and
+retrieval task. By retrieving drone-view images from a geo-tagged satellite
+image database, approximate localization information can be obtained. However,
+due to
+high costs and privacy concerns, it is usually difficult to obtain large
+quantities of drone-view images from a continuous area. Existing drone-view
+datasets are mostly composed of small-scale aerial photography with a strong
+assumption that there exists a perfect one-to-one aligned reference image for
+any query, leaving a significant gap from the practical localization scenario.
+In this work, we construct a large-range contiguous area UAV geo-localization
+dataset named GTA-UAV, featuring multiple flight altitudes, attitudes, scenes,
+and targets using modern computer games. Based on this dataset, we introduce a
+more practical UAV geo-localization task including partial matches of
+cross-view paired data, and expand the image-level retrieval to the actual
+localization in terms of distance (meters). For the construction of drone-view
+and satellite-view pairs, we adopt a weight-based contrastive learning
+approach, which allows for effective learning while avoiding additional
+post-processing matching steps. Experiments demonstrate the effectiveness of
+our data and training method for UAV geo-localization, as well as the
+generalization capabilities to real-world scenarios.
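+
+A minimal sketch of what such a weight-based contrastive objective might look
+like is shown below. The InfoNCE form and the weighting of each pair by its
+spatial overlap are assumptions for illustration, not the paper's exact loss.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def weighted_infonce(drone_emb, sat_emb, pair_weights, temperature=0.07):
+    """Toy weight-based contrastive loss: diagonal drone/satellite pairs are
+    (partial) positives whose contribution is scaled by a weight in [0, 1],
+    e.g. derived from the spatial overlap of the two views.
+    drone_emb, sat_emb: (N, D) embeddings; pair_weights: (N,) tensor."""
+    drone_emb = F.normalize(drone_emb, dim=1)
+    sat_emb = F.normalize(sat_emb, dim=1)
+    logits = drone_emb @ sat_emb.t() / temperature     # (N, N) similarity matrix
+    targets = torch.arange(logits.size(0), device=logits.device)
+    per_pair = F.cross_entropy(logits, targets, reduction="none")
+    return (pair_weights * per_pair).mean()
+```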
+
+
+
+
+
+
+
+ ♻ ☆ Optimized 3D Point Labeling with Leaders Using the Beams Displacement
+ Method
+
+
+
+
+
+
+
+
+ Zhiwei Wei, Nai Yang, Wenjia Xu, Su Ding, Li Minmin, Li You, Guo Renzhong
+
+
+ In three-dimensional geographical scenes, adding labels with leader lines to
+point features can significantly improve their visibility. Leadered labels have
+a large degree of freedom in position configuration, but existing methods are
+mostly based on limited position candidate models, which not only fail to
+effectively utilize the map space but also make it difficult to consider the
+relative relationships between labels. Therefore, we conceptualize the dynamic
+configuration process of computing label positions as akin to solving a map
+displacement problem. We use a triangulated graph to delineate spatial
+relationships among labels and calculate the forces exerted on labels
+considering the constraints associated with point feature labels. Then we use
+the Beams Displacement Method to iteratively calculate new positions for the
+labels. Our experimental outcomes demonstrate that this method effectively
+mitigates label overlay issues while maintaining minimal average directional
+deviation between adjacent labels. Furthermore, this method is adaptable to
+various types of leader line labels. Meanwhile, we also discuss the block
+processing strategy to improve the efficiency of label configuration and
+analyze the impact of different proximity graphs.
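+
+The following toy sketch conveys the general force-on-a-proximity-graph idea
+with a plain pairwise repulsion step on a Delaunay triangulation. It is not the
+Beams Displacement Method itself, which models triangulation edges as elastic
+beams; the force model here is a simplification of our own.
+
+```python
+import numpy as np
+from scipy.spatial import Delaunay
+
+def relax_labels(points, min_dist=1.0, step=0.1, iters=50):
+    """Toy label relaxation: labels connected in a Delaunay triangulation
+    repel each other whenever they are closer than `min_dist`.
+    `points` is an (N, 2) array of initial label anchor positions, N >= 3."""
+    pts = np.asarray(points, dtype=float).copy()
+    edges = set()
+    for tri in Delaunay(pts).simplices:                 # proximity graph edges
+        for i in range(3):
+            a, b = sorted((int(tri[i]), int(tri[(i + 1) % 3])))
+            edges.add((a, b))
+    for _ in range(iters):
+        disp = np.zeros_like(pts)
+        for a, b in edges:
+            d = pts[a] - pts[b]
+            dist = np.linalg.norm(d) + 1e-9
+            if dist < min_dist:                         # too close: push apart
+                push = (min_dist - dist) * d / dist
+                disp[a] += 0.5 * push
+                disp[b] -= 0.5 * push
+        pts += step * disp
+    return pts
+```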
+
+
+
+ comment: 12 pages, in Chinese language, 10 figures
+
+
+
+
+
+
+ ♻ ☆ Golden Noise for Diffusion Models: A Learning Framework
+
+
+ The text-to-image diffusion model is a popular paradigm that synthesizes
+personalized images given a text prompt and a random Gaussian noise.
+While people observe that some noises are ``golden noises'' that can achieve
+better text-image alignment and higher human preference than others, we still
+lack a machine learning framework to obtain those golden noises. To learn
+golden noises for diffusion sampling, we mainly make three contributions in
+this paper. First, we identify a new concept termed the \textit{noise prompt},
+which aims at turning a random Gaussian noise into a golden noise by adding a
+small desirable perturbation derived from the text prompt. Following the
+concept, we first formulate the \textit{noise prompt learning} framework that
+systematically learns ``prompted'' golden noise associated with a text prompt
+for diffusion models. Second, we design a noise prompt data collection pipeline
+and collect a large-scale \textit{noise prompt dataset}~(NPD) that contains
+100k pairs of random noises and golden noises with the associated text prompts.
+With the prepared NPD as the training dataset, we trained a small \textit{noise
+prompt network}~(NPNet) that can directly learn to transform a random noise
+into a golden noise. The learned golden noise perturbation can be considered as
+a kind of prompt for noise, as it is rich in semantic information and tailored
+to the given text prompt. Third, our extensive experiments demonstrate the
+impressive effectiveness and generalization of NPNet on improving the quality
+of synthesized images across various diffusion models, including SDXL,
+DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and
+efficient controller that acts as a plug-and-play module with very limited
+additional inference and computational costs, as it just provides a golden
+noise instead of a random noise without accessing the original pipeline.
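+
+To make the idea concrete, here is a toy stand-in for a network that nudges a
+random Gaussian noise toward a "golden" one conditioned on the text embedding.
+It is our own simplification operating on flat vectors rather than full latent
+noise tensors, and the architecture and scaling factor are assumptions.
+
+```python
+import torch
+import torch.nn as nn
+
+class TinyNoisePromptNet(nn.Module):
+    """Toy noise-prompt network: predict a small perturbation of the initial
+    Gaussian noise from the noise itself and a text embedding."""
+    def __init__(self, noise_dim, text_dim, hidden=256):
+        super().__init__()
+        self.mlp = nn.Sequential(
+            nn.Linear(noise_dim + text_dim, hidden),
+            nn.SiLU(),
+            nn.Linear(hidden, noise_dim),
+        )
+
+    def forward(self, noise, text_emb):
+        delta = self.mlp(torch.cat([noise, text_emb], dim=-1))
+        return noise + 0.1 * delta   # "golden" noise = noise + small learned shift
+
+# Example: golden = TinyNoisePromptNet(64, 32)(torch.randn(4, 64), torch.randn(4, 32))
+```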
+
+
+ Fashion design is a challenging and complex process. Recent works on fashion
+generation and editing are all agnostic of the actual fashion design process,
+which limits their usage in practice. In this paper, we propose a novel
+hierarchical diffusion-based framework tailored for fashion design, coined
+HieraFashDiff. Our model is designed to mimic the practical fashion design
+workflow by unraveling the denoising process into two successive stages: 1) an
+ideation stage that generates design proposals given high-level concepts and 2)
+an iteration stage that continuously refines the proposals using low-level
+attributes. Our model supports fashion design generation and fine-grained local
+editing in a single framework. To train our model, we contribute a new dataset
+of full-body fashion images annotated with hierarchical text descriptions.
+Extensive evaluations show that, as compared to prior approaches, our method
+can generate fashion designs and edited results with higher fidelity and better
+prompt adherence, showing its promising potential to augment the practical
+fashion design workflow. Code and Dataset are available at
+https://github.com/haoli-zbdbc/hierafashdiff.
+
+
+
+
+
+
+
+ ♻ ☆ Veri-Car: Towards Open-world Vehicle Information Retrieval
+
+
+ Many industrial and service sectors require tools to extract vehicle
+characteristics from images. This is a complex task, not only because of the
+variety of noise and the large number of classes, but also because new vehicle
+models are constantly introduced to the market. In this paper, we present
+Veri-Car, an information retrieval integrated approach designed to help with
+this task. It
+leverages supervised learning techniques to accurately identify the make, type,
+model, year, color, and license plate of cars. The approach also addresses the
+challenge of handling open-world problems, where new car models and variations
+frequently emerge, by employing a sophisticated combination of pre-trained
+models, and a hierarchical multi-similarity loss. Veri-Car demonstrates robust
+performance, achieving high precision and accuracy in classifying both seen and
+unseen data. Additionally, it integrates an ensemble license plate detector
+and an OCR model to extract license plate numbers with impressive accuracy.
+
+
+
+
+
+
+
+
+ Tim Selig, Thomas März, Martin Storath, Andreas Weinmann
+
+
+ Computed tomography from a low radiation dose (LDCT) is challenging due to
+high noise in the projection data. Popular approaches for LDCT image
+reconstruction are two-stage methods, typically consisting of the filtered
+backprojection (FBP) algorithm followed by a neural network for LDCT image
+enhancement. Two-stage methods are attractive for their simplicity and
+potential for computational efficiency, typically requiring only a single FBP
+and a neural network forward pass for inference. However, the best
+reconstruction quality is currently achieved by unrolled iterative methods
+(Learned Primal-Dual and ItNet), which are more complex and thus have a higher
+computational cost for training and inference. We propose a method combining
+the simplicity and efficiency of two-stage methods with state-of-the-art
+reconstruction quality. Our strategy utilizes a neural network pretrained for
+Gaussian noise removal from natural grayscale images, fine-tuned for LDCT image
+enhancement. We call this method FBP-DTSGD (Domain and Task Shifted Gaussian
+Denoisers) as the fine-tuning is a task shift from Gaussian denoising to
+enhancing LDCT images and a domain shift from natural grayscale to LDCT images.
+An ablation study with three different pretrained Gaussian denoisers indicates
+that the performance of FBP-DTSGD does not depend on a specific denoising
+architecture, suggesting future advancements in Gaussian denoising could
+benefit the method. The study also shows that pretraining on natural images
+enhances LDCT reconstruction quality, especially with limited training data.
+Notably, pretraining involves no additional cost, as existing pretrained models
+are used. The proposed method currently holds the top mean position in the
+LoDoPaB-CT challenge.
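+
+A minimal sketch of the two-stage inference pipeline described above is shown
+below, assuming a sinogram in scikit-image's radon layout and an arbitrary
+pretrained PyTorch denoiser supplied by the caller; both are illustrative
+assumptions, not the paper's implementation.
+
+```python
+import numpy as np
+import torch
+from skimage.transform import iradon
+
+def reconstruct_ldct(sinogram, angles_deg, denoiser):
+    """Two-stage LDCT sketch: filtered backprojection followed by one forward
+    pass of a pretrained-then-fine-tuned denoising network.
+    sinogram: (detectors, num_angles) array as produced by skimage's radon;
+    denoiser: any torch module mapping (1, 1, H, W) -> (1, 1, H, W)."""
+    fbp = iradon(sinogram, theta=angles_deg, filter_name="ramp")   # stage 1: FBP
+    x = torch.from_numpy(fbp.astype(np.float32))[None, None]       # to NCHW
+    with torch.no_grad():
+        enhanced = denoiser(x)                                     # stage 2: enhance
+    return enhanced[0, 0].numpy()
+```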
+
+
+
+
+
+
+
+
+ Kyusik Cho, Dong Yeop Kim, Euntai Kim
+
+
+ We present a novel, training-free approach to scene change detection. Our
+method leverages tracking models, which inherently perform change detection
+between consecutive frames of video by identifying common objects and detecting
+new or missing objects. Specifically, our method takes advantage of the change
+detection effect of the tracking model by inputting reference and query images
+instead of consecutive frames. Furthermore, we focus on the content gap and
+style gap between two input images in change detection, and address both issues
+by proposing adaptive content threshold and style bridging layers,
+respectively. Finally, we extend our approach to video, leveraging rich
+temporal information to enhance the performance of scene change detection. We
+compare our approach and baselines through various experiments. While existing
+training-based baselines tend to specialize only in the trained domain, our
+method shows consistent performance across various domains, proving the
+competitiveness of our approach.
+
+
+
+ comment: AAAI 2025. Code available at: https://github.com/kyusik-cho/ZSSCD
+
+
+
+
+
+
+ ♻ ☆ Uncovering Hidden Subspaces in Video Diffusion Models Using
+ Re-Identification
+
+
+ Latent Video Diffusion Models can easily deceive casual observers and domain
+experts alike thanks to the produced image quality and temporal consistency.
+Beyond entertainment, this creates opportunities around safe data sharing of
+fully synthetic datasets, which are crucial in healthcare, as well as other
+domains relying on sensitive personal information. However, privacy concerns
+with this approach have not fully been addressed yet, and models trained on
+synthetic data for specific downstream tasks still perform worse than those
+trained on real data. This discrepancy may be partly due to the sampling space
+being a subspace of the training videos, effectively reducing the training data
+size for downstream models. Additionally, the reduced temporal consistency when
+generating long videos could be a contributing factor.
+ In this paper, we first show that training privacy-preserving models in
+latent space is computationally more efficient and generalizes better.
+Furthermore, to investigate downstream degradation factors, we propose to use a
+re-identification model, previously employed as a privacy preservation filter.
+We demonstrate that it is sufficient to train this model on the latent space of
+the video generator. Subsequently, we use these models to evaluate the subspace
+covered by synthetic video datasets and thus introduce a new way to measure the
+faithfulness of generative machine learning models. We focus on a specific
+application in healthcare echocardiography to illustrate the effectiveness of
+our novel methods. Our findings indicate that only up to 30.8% of the training
+videos are learned in latent video diffusion models, which could explain the
+lack of performance when training downstream tasks on synthetic data.
+
+
+
+
+
+
+
+ ♻ ☆ Image Generation Diversity Issues and How to Tame Them
+
+
+
+
+
+
+
+
+ Mischa Dombrowski, Weitong Zhang, Sarah Cechnicka, Hadrien Reynaud, Bernhard Kainz
+
+
+ Generative methods now produce outputs nearly indistinguishable from real
+data but often fail to fully capture the data distribution. Unlike quality
+issues, diversity limitations in generative models are hard to detect visually,
+requiring specific metrics for assessment. In this paper, we draw attention to
+the current lack of diversity in generative models and the inability of common
+metrics to measure this. We achieve this by framing diversity as an image
+retrieval problem, where we measure how many real images can be retrieved using
+synthetic data as queries. This yields the Image Retrieval Score (IRS), an
+interpretable, hyperparameter-free metric that quantifies the diversity of a
+generative model's output. IRS requires only a subset of synthetic samples and
+provides a statistical measure of confidence. Our experiments indicate that
+current feature extractors commonly used in generative model assessment are
+inadequate for evaluating diversity effectively. Consequently, we perform an
+extensive search for the best feature extractors to assess diversity.
+Evaluation reveals that current diffusion models converge to limited subsets of
+the real distribution, with no current state-of-the-art model surpassing 77%
+of the diversity of the training data. To address this limitation, we introduce
+Diversity-Aware Diffusion Models (DiADM), a novel approach that improves the
+diversity of unconditional diffusion models without loss of image quality. We
+do this by disentangling diversity from image quality using a diversity-aware
+module that takes pseudo-unconditional features as input. We provide a Python
+package offering unified feature extraction and metric computation to further
+facilitate the evaluation of generative models:
+https://github.com/MischaD/beyondfid.
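+
+The core retrieval-coverage idea behind such a score can be sketched in a few
+lines. This omits the paper's statistical confidence treatment and assumes
+precomputed feature vectors; it is an illustration, not the official metric
+implementation.
+
+```python
+import numpy as np
+
+def image_retrieval_score(real_feats, synth_feats):
+    """Toy IRS-style estimate: use each synthetic sample as a query, retrieve
+    its nearest real neighbour by cosine similarity, and report the fraction
+    of distinct real images retrieved at least once. Inputs are (N, D) arrays."""
+    r = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
+    s = synth_feats / np.linalg.norm(synth_feats, axis=1, keepdims=True)
+    nearest = (s @ r.T).argmax(axis=1)        # index of the closest real image
+    return len(np.unique(nearest)) / len(real_feats)
+```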
+
+
+ Current collaborative perception methods often rely on fully annotated
+datasets, which can be expensive to obtain in practical situations. To reduce
+annotation costs, some works adopt sparsely supervised learning techniques and
+generate pseudo labels for the missing instances. However, these methods fail
+to achieve an optimal confidence threshold that harmonizes the quality and
+quantity of pseudo labels. To address this issue, we propose an end-to-end
+Collaborative perception Dual Teacher-Student framework (CoDTS), which employs
+adaptive complementary learning to produce both high-quality and high-quantity
+pseudo labels. Specifically, the Main Foreground Mining (MFM) module generates
+high-quality pseudo labels based on the prediction of the static teacher.
+Subsequently, the Supplement Foreground Mining (SFM) module ensures a balance
+between the quality and quantity of pseudo labels by adaptively identifying
+missing instances based on the prediction of the dynamic teacher. Additionally,
+the Neighbor Anchor Sampling (NAS) module is incorporated to enhance the
+representation of pseudo labels. To promote the adaptive complementary
+learning, we implement a staged training strategy that trains the student and
+dynamic teacher in a mutually beneficial manner. Extensive experiments
+demonstrate that the CoDTS effectively ensures an optimal balance of pseudo
+labels in both quality and quantity, establishing a new state-of-the-art in
+sparsely supervised collaborative perception.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ R2G: Reasoning to Ground in 3D Scenes
+
+
+ We propose Reasoning to Ground (R2G), a neural symbolic model that grounds
+the target objects within 3D scenes in a reasoning manner. In contrast to prior
+works, R2G explicitly models the 3D scene with a semantic concept-based scene
+graph; recurrently simulates the attention transferring across object entities;
+thus makes the process of grounding the target objects with the highest
+probability interpretable. Specifically, we respectively embed multiple object
+properties within the graph nodes and spatial relations among entities within
+the edges, utilizing a predefined semantic vocabulary. To guide attention
+transferring, we employ learning or prompting-based methods to analyze the
+referential utterance and convert it into reasoning instructions within the
+same semantic space. In each reasoning round, R2G either (1) merges current
+attention distribution with the similarity between the instruction and embedded
+entity properties or (2) shifts the attention across the scene graph based on
+the similarity between the instruction and embedded spatial relations. The
+experiments on Sr3D/Nr3D benchmarks show that R2G achieves results comparable
+to prior works while offering improved interpretability, breaking a
+new path for 3D language grounding.
+
+
+
+
+
+
+
+ ♻ ☆ Advancing Extended Reality with 3D Gaussian Splatting: Innovations and
+ Prospects
+
+
+ 3D Gaussian Splatting (3DGS) has attracted significant attention for its
+potential to revolutionize 3D representation, rendering, and interaction.
+Despite the rapid growth of 3DGS research, its direct application to Extended
+Reality (XR) remains underexplored. Although many studies recognize the
+potential of 3DGS for XR, few have explicitly focused on or demonstrated its
+effectiveness within XR environments. In this paper, we aim to synthesize
+innovations in 3DGS that show specific potential for advancing XR research and
+development. We conduct a comprehensive review of publicly available 3DGS
+papers, with a focus on those referencing XR-related concepts. Additionally, we
+perform an in-depth analysis of innovations explicitly relevant to XR and
+propose a taxonomy to highlight their significance. Building on these insights,
+we propose several prospective XR research areas where 3DGS can make promising
+contributions yet remains largely unexplored. By investigating the intersection of
+3DGS and XR, this paper provides a roadmap to push the boundaries of XR using
+cutting-edge 3DGS techniques.
+
+
+
+ comment: IEEE AIxVR 2025
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts
+
+
+ Machine learning methods strive to acquire a robust model during the training
+process that can effectively generalize to test samples, even in the presence
+of distribution shifts. However, these methods often suffer from performance
+degradation due to unknown test distributions. Test-time adaptation (TTA), an
+emerging paradigm, has the potential to adapt a pre-trained model to unlabeled
+data during testing, before making predictions. Recent progress in this
+paradigm has highlighted the significant benefits of using unlabeled data to
+train self-adapted models prior to inference. In this survey, we categorize TTA
+into several distinct groups based on the form of test data, namely, test-time
+domain adaptation, test-time batch adaptation, and online test-time adaptation.
+For each category, we provide a comprehensive taxonomy of advanced algorithms
+and discuss various learning scenarios. Furthermore, we analyze relevant
+applications of TTA and discuss open challenges and promising areas for future
+research. For a comprehensive list of TTA methods, kindly refer to
+\url{https://github.com/tim-learn/awesome-test-time-adaptation}.
+
+
+
+ comment: Discussions, comments, and questions are all welcomed in
+ \url{https://github.com/tim-learn/awesome-test-time-adaptation}
+
+
+
+
+
+
+ ♻ ☆ Swin2-MoSE: A New Single Image Super-Resolution Model for Remote Sensing
+
+
+
+
+
+
+
+
+ Leonardo Rossi, Vittorio Bernuzzi, Tomaso Fontanini, Massimo Bertozzi, Andrea Prati
+
+
+ Due to the limitations of current optical and sensor technologies and the
+high cost of updating them, the spectral and spatial resolution of satellites
+may not always meet desired requirements. For these reasons, Remote-Sensing
+Single-Image Super-Resolution (RS-SISR) techniques have gained significant
+interest. In this paper, we propose Swin2-MoSE model, an enhanced version of
+Swin2SR. Our model introduces MoE-SM, an enhanced Mixture-of-Experts (MoE) to
+replace the feed-forward layer inside every Transformer block. MoE-SM is
+designed with Smart-Merger, a new layer for merging the outputs of individual
+experts, and with a new way to split the work between experts, defining a new
+per-example strategy instead of the commonly used per-token one. Furthermore,
+we analyze
+how positional encodings interact with each other, demonstrating that
+per-channel bias and per-head bias can positively cooperate. Finally, we
+propose to use a combination of Normalized-Cross-Correlation (NCC) and
+Structural Similarity Index Measure (SSIM) losses, to avoid typical MSE loss
+limitations. Experimental results demonstrate that Swin2-MoSE outperforms
+Swin-derived models by up to 0.377 - 0.958 dB (PSNR) on the task of 2x, 3x and
+4x resolution upscaling (Sen2Venus and OLI2MSI datasets). It also outperforms
+SOTA models by a good margin, proving to be competitive and with excellent
+potential, especially for complex tasks. Additionally, an analysis of
+computational costs is also performed. Finally, we show the efficacy of
+Swin2-MoSE by applying it to a semantic segmentation task (SeasoNet dataset).
+Code and pretrained models are available at
+https://github.com/IMPLabUniPr/swin2-mose/tree/official_code
+
+
+
+
+
+
+
+ ♻ ☆ Good Grasps Only: A data engine for self-supervised fine-tuning of pose
+ estimation using grasp poses for verification
+
+
+ In this paper, we present a novel method for self-supervised fine-tuning of
+pose estimation. Leveraging zero-shot pose estimation, our approach enables the
+robot to automatically obtain training data without manual labeling. After pose
+estimation the object is grasped, and in-hand pose estimation is used for data
+validation. Our pipeline allows the system to fine-tune while the process is
+running, removing the need for a learning phase. The motivation behind our work
+lies in the need for rapid setup of pose estimation solutions. Specifically, we
+address the challenging task of bin picking, which plays a pivotal role in
+flexible robotic setups. Our method is implemented on a robotics work-cell, and
+tested with four different objects. For all objects, our method increases the
+performance and outperforms a state-of-the-art method trained on the CAD model
+of the objects. Project page available at gogoengine.github.io
+
+
+ Recent advancements in Text-to-image (T2I) generation have witnessed a shift
+from adapting text to fixed backgrounds to creating images around text.
+Traditional approaches are often limited to generating layouts within static
+images for effective text placement. Our proposed approach, TextCenGen,
+introduces a dynamic adaptation of the blank region for text-friendly image
+generation, emphasizing text-centric design and visual harmony generation. Our
+method employs force-directed attention guidance in T2I models to generate
+images that strategically reserve whitespace for pre-defined text areas, even
+for text or icons at the golden ratio. Observing how cross-attention maps
+affect object placement, we detect and repel conflicting objects using a
+force-directed graph approach, combined with a Spatial Excluding
+Cross-Attention Constraint for smooth attention in whitespace areas. As a novel
+task in graphic design, experiments indicate that TextCenGen outperforms
+existing methods with more harmonious compositions. Furthermore, our method
+significantly enhances T2I model outcomes on our specially collected prompt
+datasets, catering to varied text positions. These results demonstrate the
+efficacy of TextCenGen in creating more harmonious and integrated text-image
+compositions.
+
+
+
+ comment: 7 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ QueSTMaps: Queryable Semantic Topological Maps for 3D Scene
+ Understanding IROS
+
+
+ Robotic tasks such as planning and navigation require a hierarchical semantic
+understanding of a scene, which could include multiple floors and rooms.
+Current methods primarily focus on object segmentation for 3D scene
+understanding. However, such methods struggle to segment out topological
+regions like "kitchen" in the scene. In this work, we introduce a two-step
+pipeline to solve this problem. First, we extract a topological map, i.e.,
+floorplan of the indoor scene using a novel multi-channel occupancy
+representation. Then, we generate CLIP-aligned features and semantic labels for
+every room instance based on the objects it contains using a self-attention
+transformer. Our language-topology alignment supports natural language
+querying, e.g., a "place to cook" locates the "kitchen". We outperform the
+current state-of-the-art on room segmentation by ~20% and room classification
+by ~12%. Our detailed qualitative analysis and ablation studies provide
+insights into the problem of joint structural and semantic 3D scene
+understanding. Project Page: quest-maps.github.io
+
+
+
+ comment: Accepted at 2024 IEEE/RSJ International Conference on Intelligent
+ Robots and Systems (IROS) as Oral Presentation. Also presented at the 2nd
+ Workshop on Open-Vocabulary 3D Scene Understanding (OpenSUN3D) at CVPR 2024
+
+
+
+
+
+
+ ♻ ☆ A simple thinking about the application of the attention mechanism in
+ medical ultrasound image segmentation task
+
+
+ The AI-based assisted diagnosis programs have been widely investigated on
+medical ultrasound images. The complex scenarios of ultrasound imaging, in
+which the coupled interference of internal and external factors is severe,
+bring a unique challenge for localizing the object region automatically and
+precisely in ultrasound images. In this study, we propose a more general and
+robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors in
+segmenting or diagnosing lesions and tissues in ultrasound images more quickly
+and accurately.
+Different from existing attention schemes, the BAAF consists of a parallel
+hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM).
+Specifically, BAAF first coarsely calibrates the input features from the
+channel and spatial dimensions, and then adaptively selects more robust lesion
+or tissue characterizations from the coarse-calibrated feature maps. The design
+of BAAF further optimizes the "what" and "where" focus and selection problems
+in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in
+medical ultrasound images. The method is evaluated on four medical ultrasound
+segmentation tasks, and the adequate experimental results demonstrate the
+remarkable performance improvement over existing state-of-the-art methods. In
+addition, the comparison with existing attention mechanisms also demonstrates
+the superiority of BAAF. This work provides the possibility for automated
+medical ultrasound assisted diagnosis and reduces reliance on human accuracy
+and precision.
+
+
+
+ comment: 10 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ Archaeoscape: Bringing Aerial Laser Scanning Archaeology to the Deep
+ Learning Era NeurIPS 2024
+
+
+ Airborne Laser Scanning (ALS) technology has transformed modern archaeology
+by unveiling hidden landscapes beneath dense vegetation. However, the lack of
+expert-annotated, open-access resources has hindered the analysis of ALS data
+using advanced deep learning techniques. We address this limitation with
+Archaeoscape (available at https://archaeoscape.ai/data/2024/), a novel
+large-scale archaeological ALS dataset spanning 888 km$^2$ in Cambodia with
+31,141 annotated archaeological features from the Angkorian period.
+Archaeoscape is over four times larger than comparable datasets, and the first
+ALS archaeology resource with open-access data, annotations, and models.
+ We benchmark several recent segmentation models to demonstrate the benefits
+of modern vision techniques for this problem and highlight the unique
+challenges of discovering subtle human-made structures under dense jungle
+canopies. By making Archaeoscape available in open access, we hope to bridge
+the gap between traditional archaeology and modern computer vision methods.
+
+
+ Current methods commonly utilize three-branch structures of inversion,
+reconstruction, and editing to tackle the consistent image editing task. However,
+these methods lack control over the generation position of the edited object
+and have issues with background preservation. To overcome these limitations, we
+propose a tuning-free method with only two branches: inversion and editing.
+This approach allows users to simultaneously edit the object's action and
+control the generation position of the edited object. Additionally, it achieves
+improved background preservation. Specifically, we transfer the edited object
+information to the target area and repair or preserve the background of other
+areas during the inversion process at a specific time step. In the editing
+stage, we use the image features in self-attention to query the key and value
+of the corresponding time step in the inversion to achieve consistent image
+editing. Impressive image editing results and quantitative evaluation
+demonstrate the effectiveness of our method. The code is available at
+https://github.com/mobiushy/move-act.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+
+
+
+ Information Retrieval 13
+
+
+
+
+
+ ☆ Foundational Large Language Models for Materials Research
+
+
+
+
+
+
+
+
+ Vaibhav Mishra, Somaditya Singh, Dhruv Ahlawat, Mohd Zaki, Vaibhav Bihani, Hargun Singh Grover, Biswajit Mishra, Santiago Miret, Mausam, N. M. Anoop Krishnan
+
+
+ Materials discovery and development are critical for addressing global
+challenges. Yet, the exponential growth in materials science literature
+comprising vast amounts of textual data has created significant bottlenecks in
+knowledge extraction, synthesis, and scientific reasoning. Large Language
+Models (LLMs) offer unprecedented opportunities to accelerate materials
+research through automated analysis and prediction. Still, their effective
+deployment requires domain-specific adaptation for understanding and solving
+domain-relevant tasks. Here, we present LLaMat, a family of foundational models
+for materials science developed through continued pretraining of LLaMA models
+on an extensive corpus of materials literature and crystallographic data.
+Through systematic evaluation, we demonstrate that LLaMat excels in
+materials-specific NLP and structured information extraction while maintaining
+general linguistic capabilities. The specialized LLaMat-CIF variant
+demonstrates unprecedented capabilities in crystal structure generation,
+predicting stable crystals with high coverage across the periodic table.
+Intriguingly, despite LLaMA-3's superior performance in comparison to LLaMA-2,
+we observe that LLaMat-2 demonstrates unexpectedly enhanced domain-specific
+performance across diverse materials science tasks, including structured
+information extraction from text and tables and, more particularly, crystal
+structure generation, suggesting a potential adaptation rigidity in overtrained
+LLMs.
+Altogether, the present work demonstrates the effectiveness of domain
+adaptation towards developing practically deployable LLM copilots for materials
+research. Beyond materials science, our findings reveal important
+considerations for domain adaptation of LLMs, such as model selection, training
+methodology, and domain-specific performance, which may influence the
+development of specialized scientific AI systems.
+
+
+
+
+
+
+
+ ☆ SPRec: Leveraging Self-Play to Debias Preference Alignment for Large
+ Language Model-based Recommendations
+
+
+ Large language models (LLMs) have attracted significant attention in
+recommendation systems. Current LLM-based recommender systems primarily rely on
+supervised fine-tuning (SFT) to train the model for recommendation tasks.
+However, relying solely on positive samples limits the model's ability to align
+with user satisfaction and expectations. To address this, researchers have
+introduced Direct Preference Optimization (DPO), which explicitly aligns
+recommendations with user preferences using offline preference ranking data.
+Despite its advantages, our theoretical analysis reveals that DPO inherently
+biases the model towards a few items, exacerbating the filter bubble issue and
+ultimately degrading user experience. In this paper, we propose SPRec, a novel
+self-play recommendation framework designed to mitigate over-recommendation and
+improve fairness without requiring additional data or manual intervention. In
+each self-play iteration, the model undergoes an SFT step followed by a DPO
+step, treating offline interaction data as positive samples and the predicted
+outputs from the previous iteration as negative samples. This effectively
+re-weights the DPO loss function using the model's logits, adaptively
+suppressing biased items. Extensive experiments on multiple real-world datasets
+demonstrate SPRec's effectiveness in enhancing recommendation accuracy and
+addressing fairness concerns.
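+
+For illustration, the DPO component of one self-play iteration could look like
+the sketch below, where the positive comes from offline interactions and the
+negative from the previous iteration's own output. The interleaved SFT step and
+the logit-based re-weighting are left out, so this is a simplified view rather
+than the full SPRec objective.
+
+```python
+import torch.nn.functional as F
+
+def self_play_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
+    """One DPO term of a self-play iteration: offline interactions act as the
+    preferred items (pos) and the previous iteration's own recommendations act
+    as the dispreferred items (neg). Inputs are per-sample summed log-prob
+    tensors under the current policy and the frozen reference model."""
+    margin = (logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg)
+    return -F.logsigmoid(beta * margin).mean()
+```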
+
+
+
+
+
+
+
+ ☆ When Text Embedding Meets Large Language Model: A Comprehensive Survey
+
+
+ Text embedding has become a foundational technology in natural language
+processing (NLP) during the deep learning era, driving advancements across a
+wide array of downstream tasks. While many natural language understanding
+challenges can now be modeled using generative paradigms and leverage the
+robust generative and comprehension capabilities of large language models
+(LLMs), numerous practical applications, such as semantic matching, clustering,
+and information retrieval, continue to rely on text embeddings for their
+efficiency and effectiveness. In this survey, we categorize the interplay
+between LLMs and text embeddings into three overarching themes: (1)
+LLM-augmented text embedding, enhancing traditional embedding methods with
+LLMs; (2) LLMs as text embedders, utilizing their innate capabilities for
+embedding generation; and (3) Text embedding understanding with LLMs,
+leveraging LLMs to analyze and interpret embeddings. By organizing these
+efforts based on interaction patterns rather than specific downstream
+applications, we offer a novel and systematic overview of contributions from
+various research and application domains in the era of LLMs. Furthermore, we
+highlight the unresolved challenges that persisted in the pre-LLM era with
+pre-trained language models (PLMs) and explore the emerging obstacles brought
+forth by LLMs. Building on this analysis, we outline prospective directions for
+the evolution of text embedding, addressing both theoretical and practical
+opportunities in the rapidly advancing landscape of NLP.
+
+
+
+ comment: Work in progress
+
+
+
+
+
+
+ ☆ Predicting Quality of Video Gaming Experience Using Global-Scale
+ Telemetry Data and Federated Learning
+
+
+ Frames Per Second (FPS) significantly affects the gaming experience.
+Providing players with accurate FPS estimates prior to purchase benefits both
+players and game developers. However, we have a limited understanding of how to
+predict a game's FPS performance on a specific device. In this paper, we first
+conduct a comprehensive analysis of a wide range of factors that may affect
+game FPS on a global-scale dataset to identify the determinants of FPS. This
+includes player-side and game-side characteristics, as well as country-level
+socio-economic statistics. Furthermore, recognizing that accurate FPS
+predictions require extensive user data, which raises privacy concerns, we
+propose a federated learning-based model to ensure user privacy. Each player
+and game is assigned a unique learnable knowledge kernel that gradually
+extracts latent features for improved accuracy. We also introduce a novel
+training and prediction scheme that allows these kernels to be dynamically
+plug-and-play, effectively addressing cold start issues. To train this model
+with minimal bias, we collected a large telemetry dataset from 224 countries
+and regions, 100,000 users, and 835 games. Our model achieved a mean
+Wasserstein distance of 0.469 between predicted and ground truth FPS
+distributions, outperforming all baseline methods.
+
+
+
+ comment: 22 pages, 11 figures, 6 tables
+
+
+
+
+
+
+ ☆ A Flexible Plug-and-Play Module for Generating Variable-Length
+ Hash Codes
+
+
+ Deep supervised hashing has become a pivotal technique in large-scale image
+retrieval, offering significant benefits in terms of storage and search
+efficiency. However, existing deep supervised hashing models predominantly
+focus on generating fixed-length hash codes. This approach fails to address the
+inherent trade-off between efficiency and effectiveness when using hash codes
+of varying lengths. To determine the optimal hash code length for a specific
+task, multiple models must be trained for different lengths, leading to
+increased training time and computational overhead. Furthermore, the current
+paradigm overlooks the potential relationships between hash codes of different
+lengths, limiting the overall effectiveness of the models. To address these
+challenges, we propose the Nested Hash Layer (NHL), a plug-and-play module
+designed for existing deep supervised hashing models. The NHL framework
+introduces a novel mechanism to simultaneously generate hash codes of varying
+lengths in a nested manner. To tackle the optimization conflicts arising from
+the multiple learning objectives associated with different code lengths, we
+further propose an adaptive weights strategy that dynamically monitors and
+adjusts gradients during training. Additionally, recognizing that the
+structural information in longer hash codes can provide valuable guidance for
+shorter hash codes, we develop a long-short cascade self-distillation method
+within the NHL to enhance the overall quality of the generated hash codes.
+Extensive experiments demonstrate that NHL not only accelerates the training
+process but also achieves superior retrieval performance across various deep
+hashing models. Our code is publicly available at
+https://github.com/hly1998/NHL.
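+
+A toy version of the nested idea is sketched below, where shorter codes are
+simply prefixes of the longest one so that all lengths come from a single head.
+The adaptive weights strategy and cascade self-distillation are omitted, and
+the tanh relaxation is an assumption for illustration.
+
+```python
+import torch
+import torch.nn as nn
+
+class NestedHashHead(nn.Module):
+    """Toy nested hash head: a single projection emits the longest code and
+    shorter codes are simply its prefixes, so every length is trained jointly."""
+    def __init__(self, feat_dim, code_lengths=(16, 32, 64)):
+        super().__init__()
+        self.code_lengths = sorted(code_lengths)
+        self.proj = nn.Linear(feat_dim, self.code_lengths[-1])
+
+    def forward(self, feats):
+        full = torch.tanh(self.proj(feats))          # relaxed bits in (-1, 1)
+        return {length: full[:, :length] for length in self.code_lengths}
+
+# Example: codes = NestedHashHead(512)(torch.randn(8, 512))  # 16/32/64-bit codes
+```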
+
+
+ Multi-objective learning endeavors to concurrently optimize multiple
+objectives using a single model, aiming to achieve high and balanced
+performance across these diverse objectives. However, it often involves a more
+complex optimization problem, particularly when navigating potential conflicts
+between objectives, leading to solutions with higher memory requirements and
+computational complexity. This paper introduces a Multi-Objective
+Goal-Conditioned Supervised Learning (MOGCSL) framework for automatically
+learning to achieve multiple objectives from offline sequential data. MOGCSL
+extends the conventional Goal-Conditioned Supervised Learning (GCSL) method to
+multi-objective scenarios by redefining goals from one-dimensional scalars to
+multi-dimensional vectors. The need for complex architectures and optimization
+constraints can be naturally eliminated. MOGCSL benefits from filtering out
+uninformative or noisy instances that do not achieve desirable long-term
+rewards. It also incorporates a novel goal-choosing algorithm to model and
+select "high" achievable goals for inference.
+ While MOGCSL is quite general, we focus on its application to the next action
+prediction problem in commercial-grade recommender systems. In this context,
+any viable solution needs to be reasonably scalable and also be robust to large
+amounts of noisy data that is characteristic of this application space. We show
+that MOGCSL performs admirably on both counts. Specifically, extensive
+experiments conducted on real-world recommendation datasets validate its
+efficacy and efficiency. Also, analysis and experiments are included to explain
+its strength in discounting the noisier portions of training data in
+recommender systems.
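+
+A minimal sketch of a goal-conditioned policy with vector-valued goals is shown
+below; the shapes and architecture are hypothetical, and training is assumed to
+be ordinary cross-entropy on offline data with the goal set to the observed
+multi-objective return.
+
+```python
+import torch
+import torch.nn as nn
+
+class GoalConditionedPolicy(nn.Module):
+    """Toy MOGCSL-style policy: predict the next action conditioned on the
+    state and a multi-dimensional goal vector (one entry per objective)."""
+    def __init__(self, state_dim, goal_dim, num_actions, hidden=128):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(state_dim + goal_dim, hidden),
+            nn.ReLU(),
+            nn.Linear(hidden, num_actions),
+        )
+
+    def forward(self, state, goal):
+        return self.net(torch.cat([state, goal], dim=-1))   # action logits
+
+# Training reduces to cross-entropy on offline sequences, with each goal set to
+# the observed multi-objective return of its trajectory.
+```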
+
+
+
+
+
+
+
+ ☆ MOPI-HFRS: A Multi-objective Personalized Health-aware Food
+ Recommendation System with LLM-enhanced Interpretation
+
+
+
+
+
+
+
+
+ Zheyuan Zhang, Zehong Wang, Tianyi Ma, Varun Sameer Taneja, Sofia Nelson, Nhi Ha Lan Le, Keerthiram Murugesan, Mingxuan Ju, Nitesh V Chawla, Chuxu Zhang, Yanfang Ye
+
+
+ The prevalence of unhealthy eating habits has become an increasingly
+concerning issue in the United States. However, major food recommendation
+platforms (e.g., Yelp) continue to prioritize users' dietary preferences over
+the healthiness of their choices. Although efforts have been made to develop
+health-aware food recommendation systems, the personalization of such systems
+based on users' specific health conditions remains under-explored. In addition,
+little research focuses on the interpretability of these systems, which hinders
+users from assessing the reliability of recommendations and impedes the
+practical deployment of these systems. In response to this gap, we first
+establish two large-scale personalized health-aware food recommendation
+benchmarks at the first attempt. We then develop a novel framework,
+Multi-Objective Personalized Interpretable Health-aware Food Recommendation
+System (MOPI-HFRS), which provides food recommendations by jointly optimizing
+three objectives: user preference, personalized healthiness and nutritional
+diversity, along with a large language model (LLM)-enhanced reasoning module
+to promote healthy dietary knowledge through the interpretation of recommended
+results. Specifically, this holistic graph learning framework first utilizes
+two structure learning modules and a structure pooling module to leverage both
+descriptive features and health data. Then it employs Pareto optimization to
+achieve the designed multi-facet objectives. Finally, to further promote
+healthy dietary knowledge and awareness, we exploit an LLM via knowledge
+infusion, prompting it with knowledge obtained from the recommendation model
+to generate interpretations.
+
+
+
+
+
+
+
+ ♻ ☆ HGCH: A Hyperbolic Graph Convolution Network Model for Heterogeneous
+ Collaborative Graph Recommendation CIKM '24
+
+
+ User-item interaction data in collaborative filtering and graph modeling
+tasks often exhibit power-law characteristics, which suggest the suitability of
+hyperbolic space modeling. Hyperbolic Graph Convolution Neural Networks (HGCNs)
+are a novel technique that leverages the advantages of GCN and hyperbolic
+space, and then achieves remarkable results. However, existing HGCN methods
+have several drawbacks: they fail to fully leverage hyperbolic space properties
+due to arbitrary embedding initialization and imprecise tangent space
+aggregation; they overlook auxiliary information that could enrich the
+collaborative graph; and their training convergence is slow due to margin
+ranking loss and random negative sampling. To overcome these challenges, we
+propose Hyperbolic Graph Collaborative for Heterogeneous Recommendation (HGCH),
+an enhanced HGCN-based model for collaborative filtering that integrates
+diverse side information into a heterogeneous collaborative graph and improves
+training convergence speed. HGCH first preserves the long-tailed nature of the
+graph by initializing node embeddings with a power-law prior; then it
+aggregates neighbors in hyperbolic space using the gyromidpoint method for
+accurate computation; finally, it fuses multiple embeddings from different
+hyperbolic spaces by gate fusion with a prior. Moreover, HGCH employs hyperbolic
+user-specific negative sampling to speed up convergence. We evaluate HGCH on
+four real datasets, and the results show that HGCH achieves competitive results
+and outperforms leading baselines, including HGCNs. Extensive ablation studies
+further confirm its effectiveness.
+
+
+
+ comment: Proceedings of the 33rd ACM International Conference on Information
+ and Knowledge Management (CIKM '24)
+
+
+
+
+
+
+ ♻ ☆ Large language models as oracles for instantiating ontologies with
+ domain-specific knowledge
+
+
+
+
+
+
+
+
+ Giovanni Ciatto, Andrea Agiollo, Matteo Magnini, Andrea Omicini
+
+
+ Background. Endowing intelligent systems with semantic data commonly requires
+designing and instantiating ontologies with domain-specific knowledge.
+Especially in the early phases, those activities are typically performed
+manually by human experts possibly leveraging on their own experience. The
+resulting process is therefore time-consuming, error-prone, and often biased by
+the personal background of the ontology designer. Objective. To mitigate that
+issue, we propose a novel domain-independent approach to automatically
+instantiate ontologies with domain-specific knowledge, by leveraging on large
+language models (LLMs) as oracles. Method. Starting from (i) an initial schema
+composed by inter-related classes and properties and (ii) a set of query
+templates, our method queries the LLM multiple times, and generates instances
+for both classes and properties from its replies. Thus, the ontology is
+automatically filled with domain-specific knowledge, compliant to the initial
+schema. As a result, the ontology is quickly and automatically enriched with
+manifold instances, which experts may consider to keep, adjust, discard, or
+complement according to their own needs and expertise. Contribution. We
+formalise our method in a general way and instantiate it over various LLMs, as
+well as on a concrete case study. We report experiments rooted in the
+nutritional domain where an ontology of food meals and their ingredients is
+automatically instantiated from scratch, starting from a categorisation of
+meals and their relationships. There, we analyse the quality of the generated
+ontologies and compare ontologies attained by exploiting different LLMs.
+Experimentally, our approach achieves a quality metric that is up to five times
+higher than the state-of-the-art, while reducing erroneous entities and
+relations by up to ten times. Finally, we provide a SWOT analysis of the
+proposed method.
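+
+The query-template loop at the heart of such an approach can be sketched
+generically. In the snippet below, `ask_llm` is a placeholder for whatever
+model endpoint is used and the comma-separated reply format is an assumption;
+this is not the paper's implementation.
+
+```python
+def instantiate_ontology(classes, templates, ask_llm):
+    """Toy instantiation loop: for every (class, query template) pair, ask the
+    LLM oracle for instances and collect its comma-separated reply.
+    `ask_llm` is any callable str -> str; the actual model is left abstract."""
+    ontology = {}
+    for cls in classes:
+        instances = set()
+        for template in templates:
+            reply = ask_llm(template.format(cls=cls))
+            instances.update(item.strip() for item in reply.split(",") if item.strip())
+        ontology[cls] = sorted(instances)
+    return ontology
+
+# Example template: "List typical ingredients of a {cls}, comma separated."
+```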
+
+
+
+
+
+
+
+ ♻ ☆ Writing Style Matters: An Examination of Bias and Fairness in
+ Information Retrieval Systems WSDM 25
+
+
+ The rapid advancement of Language Model technologies has opened new
+opportunities, but also introduced new challenges related to bias and fairness.
+This paper explores the uncharted territory of potential biases in
+state-of-the-art universal text embedding models towards specific document and
+query writing styles within Information Retrieval (IR) systems. Our
+investigation reveals that different embedding models exhibit different
+preferences of document writing style, while more informal and emotive styles
+are less favored by most embedding models. In terms of query writing styles,
+many embedding models tend to match the style of the query with the style of
+the retrieved documents, but some show a consistent preference for specific
+styles. Text embedding models fine-tuned on synthetic data generated by LLMs
+display a consistent preference for certain style of generated data. These
+biases in text embedding based IR systems can inadvertently silence or
+marginalize certain communication styles, thereby posing a significant threat
+to fairness in information retrieval. Finally, we also compare the answer
+styles of Retrieval Augmented Generation (RAG) systems based on different LLMs
+and find out that most text embedding models are biased towards LLM's answer
+styles when used as evaluation metrics for answer correctness. This study sheds
+light on the critical issue of writing style based bias in IR systems, offering
+valuable insights for the development of more fair and robust models.
+
+
+
+ comment: In Proceedings of the Eighteenth ACM International Conference on Web
+ Search and Data Mining (WSDM 25)
+
+
+
+
+
+
+
+ Xinyu Li, Chuang Zhao, Hongke Zhao, Likang Wu, Ming HE
+
+
+ In recent years, Large Language Models (LLMs) have demonstrated remarkable
+proficiency in comprehending and generating natural language, with a growing
+prevalence in the domain of recommendation systems. However, LLMs still face a
+significant challenge called prompt sensitivity: they are highly susceptible to
+the influence of prompt wording. This inconsistency in
+response to minor alterations in prompt input may compromise the accuracy and
+resilience of recommendation models. To address this issue, this paper proposes
+GANPrompt, a multi-dimensional LLMs prompt diversity framework based on
+Generative Adversarial Networks (GANs). The framework enhances the model's
+adaptability and stability to diverse prompts by integrating GANs generation
+techniques with the deep semantic understanding capabilities of LLMs. GANPrompt
+first trains a generator capable of producing diverse prompts by analysing
+multidimensional user behavioural data. These diverse prompts are then used to
+train the LLM to improve its performance in the face of unseen prompts.
+Furthermore, to ensure a high degree of diversity and relevance of the prompts,
+this study introduces a mathematical theory-based diversity constraint
+mechanism that optimises the generated prompts to ensure that they are not only
+superficially distinct, but also semantically cover a wide range of user
+intentions. Through extensive experiments on multiple datasets, we demonstrate
+the effectiveness of the proposed framework, especially in improving the
+adaptability and robustness of recommendation systems in complex and dynamic
+environments. The experimental results demonstrate that GANPrompt yields
+substantial enhancements in accuracy and robustness relative to existing
+state-of-the-art methodologies.
+
+
+
+
+
+
+
+ ♻ ☆ Task-level Distributionally Robust Optimization for Large Language
+ Model-based Dense Retrieval AAAI25
+
+
+
+
+
+
+
+
+ Guangyuan Ma, Yongliang Ma, Xing Wu, Zhenpeng Su, Ming Zhou, Songlin Hu
+
+
+ Large Language Model-based Dense Retrieval (LLM-DR) optimizes over numerous
+heterogeneous fine-tuning collections from different domains. However, the
+discussion about its training data distribution is still minimal. Previous
+studies rely on empirically assigned dataset choices or sampling ratios, which
+inevitably lead to sub-optimal retrieval performances. In this paper, we
+propose a new task-level Distributionally Robust Optimization (tDRO) algorithm
+for LLM-DR fine-tuning, targeted at improving the universal domain
+generalization ability by end-to-end reweighting the data distribution of each
+task. The tDRO parameterizes the domain weights and updates them with scaled
+domain gradients. The optimized weights are then transferred to the LLM-DR
+fine-tuning to train more robust retrievers. Experiments show optimal
+improvements on large-scale retrieval benchmarks and a reduction of up to 30%
+in dataset usage after applying our optimization algorithm to a series of
+different-sized LLM-DR models.
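+
+A generic DRO-style reweighting step conveys the flavor of such an update. The
+actual tDRO update uses scaled domain gradients; the loss-ratio "advantage"
+below is an assumption made purely for illustration.
+
+```python
+import numpy as np
+
+def update_domain_weights(weights, domain_losses, baseline_losses, lr=0.1):
+    """Toy DRO reweighting step: raise the sampling weight of fine-tuning
+    domains whose loss exceeds a baseline, via a mirror-ascent/softmax update.
+    All inputs are 1-D arrays with one entry per domain."""
+    advantage = np.asarray(domain_losses) / (np.asarray(baseline_losses) + 1e-9)
+    logits = np.log(np.asarray(weights) + 1e-12) + lr * advantage
+    new_w = np.exp(logits - logits.max())
+    return new_w / new_w.sum()
+```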
+
+
+
+ comment: Accepted by AAAI25. Source code is available at
+ https://github.com/tdro-llm/tdro
+
+
+
+
+
+
+ ♻ ☆ The Informational Role of Online Recommendations: Evidence from a Field
+ Experiment
+
+
+
+
+
+
+
+
+ Guy Aridor, Duarte Goncalves, Daniel Kluver, Ruoyan Kong, Joseph Konstan
+
+
+ We conduct a field experiment on a movie-recommendation platform to
+investigate whether and how online recommendations influence consumption
+choices. Using a within-subjects design, our experiment measures the causal
+effect of recommendations on consumption and decomposes the relative importance
+of two economic mechanisms: expanding consumers' consideration sets and
+providing information about their idiosyncratic match value. We find that the
+informational component exerts a stronger influence - recommendations shape
+consumer beliefs, which in turn drive consumption, particularly among less
+experienced consumers. Our findings and experimental design provide valuable
+insights for the economic evaluation and optimisation of online recommendation
+systems.
+
+
+
+
+
+
+
+
+
+
+ Machine Learning 150
+
+
+
+
+
+ ☆ Doe-1: Closed-Loop Autonomous Driving with Large World Model
+
+
+
+
+
+
+
+
+ Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
+
+
+ End-to-end autonomous driving has received increasing attention due to its
+potential to learn from large amounts of data. However, most existing methods
+are still open-loop and suffer from weak scalability, lack of high-order
+interactions, and inefficient decision-making. In this paper, we explore a
+closed-loop framework for autonomous driving and propose a large Driving wOrld
+modEl (Doe-1) for unified perception, prediction, and planning. We formulate
+autonomous driving as a next-token generation problem and use multi-modal
+tokens to accomplish different tasks. Specifically, we use free-form texts
+(i.e., scene descriptions) for perception and generate future predictions
+directly in the RGB space with image tokens. For planning, we employ a
+position-aware tokenizer to effectively encode action into discrete tokens. We
+train a multi-modal transformer to autoregressively generate perception,
+prediction, and planning tokens in an end-to-end and unified manner.
+Experiments on the widely used nuScenes dataset demonstrate the effectiveness
+of Doe-1 in various tasks including visual question-answering,
+action-conditioned video generation, and motion planning. Code:
+https://github.com/wzzheng/Doe.
+
+
+
+ comment: Code is available at: https://github.com/wzzheng/Doe
+
+
+
+
+
+
+ ☆ Spectral Image Tokenizer
+
+
+
+
+
+
+
+
+ Carlos Esteves, Mohammed Suhail, Ameesh Makadia
+
+
+ Image tokenizers map images to sequences of discrete tokens, and are a
+crucial component of autoregressive transformer-based image generation. The
+tokens are typically associated with spatial locations in the input image,
+arranged in raster scan order, which is not ideal for autoregressive modeling.
+In this paper, we propose to tokenize the image spectrum instead, obtained from
+a discrete wavelet transform (DWT), such that the sequence of tokens represents
+the image in a coarse-to-fine fashion. Our tokenizer brings several advantages:
+1) it leverages the fact that natural images are more compressible at high frequencies,
+2) it can take and reconstruct images of different resolutions without
+retraining, 3) it improves the conditioning for next-token prediction --
+instead of conditioning on a partial line-by-line reconstruction of the image,
+it takes a coarse reconstruction of the full image, 4) it enables partial
+decoding where the first few generated tokens can reconstruct a coarse version
+of the image, 5) it enables autoregressive models to be used for image
+upsampling. We evaluate the tokenizer reconstruction metrics as well as
+multiscale image generation, text-guided image upsampling and editing.
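+
+The coarse-to-fine ordering falls out naturally from a multi-level DWT, as the
+toy sketch below shows using PyWavelets, with naive uniform quantization in
+place of a learned codebook. Both are simplifications of the actual tokenizer
+and are included only to illustrate the ordering idea.
+
+```python
+import numpy as np
+import pywt
+
+def coarse_to_fine_tokens(image, levels=3, num_bins=256):
+    """Toy spectral tokenization of a 2-D grayscale image: a multi-level DWT
+    already orders coefficients coarse-to-fine, so flattening in that order
+    yields a token sequence whose prefix describes a low-resolution version of
+    the image. Naive uniform quantization stands in for a learned codebook."""
+    coeffs = pywt.wavedec2(image, "haar", level=levels)   # [coarse, ..., finest]
+    bands = [coeffs[0]] + [band for detail in coeffs[1:] for band in detail]
+    flat = np.concatenate([np.ravel(band) for band in bands])
+    lo, hi = flat.min(), flat.max()
+    tokens = (flat - lo) / (hi - lo + 1e-9) * (num_bins - 1)
+    return np.round(tokens).astype(np.int64)
+```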
+
+
+
+
+
+
+
+
+ Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, Kashyap Chitta
+
+
+ End-to-end driving systems have made rapid progress, but have so far not been
+applied to the challenging new CARLA Leaderboard 2.0. Further, while there is a
+large body of literature on end-to-end architectures and training strategies,
+the impact of the training dataset is often overlooked. In this work, we make a
+first attempt at end-to-end driving for Leaderboard 2.0. Instead of
+investigating architectures, we systematically analyze the training dataset,
+leading to new insights: (1) Expert style significantly affects downstream
+policy performance. (2) In complex data sets, the frames should not be weighted
+on the basis of simplistic criteria such as class frequencies. (3) Instead,
+estimating whether a frame changes the target labels compared to previous
+frames can reduce the size of the dataset without removing important
+information. By incorporating these findings, our model ranks first and second
+respectively on the map and sensors tracks of the 2024 CARLA Challenge, and
+sets a new state-of-the-art on the Bench2Drive test routes. Finally, we uncover
+a design flaw in the current evaluation metrics and propose a modification for
+future challenges. Our dataset, code, and pre-trained models are publicly
+available at https://github.com/autonomousvision/carla_garage.
+
+
+
+ comment: Technical report for the CVPR 2024 Workshop on Foundation Models for
+ Autonomous Systems. Runner-up of the track 'CARLA Autonomous Driving
+ Challenge' in the 2024 Autonomous Grand Challenge
+ (https://opendrivelab.com/challenge2024/)
+
+
+
+
+
+
+ ☆ Owl-1: Omni World Model for Consistent Long Video Generation
+
+
+
+
+
+
+
+
+ Yuanhui Huang, Wenzhao Zheng, Yuan Gao, Xin Tao, Pengfei Wan, Di Zhang, Jie Zhou, Jiwen Lu
+
+
+ Video generation models (VGMs) have received extensive attention recently and
+serve as promising candidates for general-purpose large vision models. While
+they can only generate short videos each time, existing methods achieve long
+video generation by iteratively calling the VGMs, using the last-frame output
+as the condition for the next-round generation. However, the last frame only
+contains short-term fine-grained information about the scene, resulting in
+inconsistency in the long horizon. To address this, we propose an Omni World
+modeL (Owl-1) to produce long-term coherent and comprehensive conditions for
+consistent long video generation. As videos are observations of the underlying
+evolving world, we propose to model the long-term developments in a latent
+space and use VGMs to film them into videos. Specifically, we represent the
+world with a latent state variable which can be decoded into explicit video
+observations. These observations serve as a basis for anticipating temporal
+dynamics which in turn update the state variable. The interaction between
+evolving dynamics and persistent state enhances the diversity and consistency
+of the long videos. Extensive experiments show that Owl-1 achieves comparable
+performance with SOTA methods on VBench-I2V and VBench-Long, validating its
+ability to generate high-quality video observations. Code:
+https://github.com/huang-yh/Owl.
+
+
+
+ comment: Code is available at: https://github.com/huang-yh/Owl
+
+
+
+
+
+
+ ☆ Wait-Less Offline Tuning and Re-solving for Online Decision Making
+
+
+
+
+
+
+
+
+ Jingruo Sun, Wenzhi Gao, Ellen Vitercik, Yinyu Ye
+
+
+ Online linear programming (OLP) has found broad applications in revenue
+management and resource allocation. State-of-the-art OLP algorithms achieve low
+regret by repeatedly solving linear programming (LP) subproblems that
+incorporate updated resource information. However, LP-based methods are
+computationally expensive and often inefficient for large-scale applications.
+In contrast, recent first-order OLP algorithms are more computationally
+efficient but typically suffer from worse regret guarantees. To address these
+shortcomings, we propose a new algorithm that combines the strengths of
+LP-based and first-order OLP methods. The algorithm re-solves the LP
+subproblems periodically at a predefined frequency $f$ and uses the latest dual
+prices to guide online decision-making. In addition, a first-order method runs
+in parallel during each interval between LP re-solves, smoothing resource
+consumption. Our algorithm achieves $\mathscr{O}(\log (T/f) + \sqrt{f})$
+regret, delivering a "wait-less" online decision-making process that balances
+the computational efficiency of first-order methods and the superior regret
+guarantee of LP-based methods.
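+
+A schematic sketch of the "re-solve every f steps, first-order updates in
+between" pattern for online LP (illustrative only: the LP formulation, step
+size, and function names here are assumptions, not the paper's algorithm):
+
+import numpy as np
+from scipy.optimize import linprog
+
+def solve_dual_prices(A_seen, r_seen, b_rate):
+    # Dual of  max r^T x  s.t.  A x <= t * b_rate, 0 <= x <= 1  (t = #seen):
+    #   min  t*b_rate^T p + 1^T s   s.t.  A^T p + s >= r,  p, s >= 0.
+    t, m = A_seen.shape[1], A_seen.shape[0]
+    c = np.concatenate([t * b_rate, np.ones(t)])
+    A_ub = np.hstack([-A_seen.T, -np.eye(t)])
+    res = linprog(c, A_ub=A_ub, b_ub=-r_seen,
+                  bounds=[(0, None)] * (m + t), method="highs")
+    return res.x[:m]
+
+def wait_less_olp(rewards, costs, budget, f=50, eta=0.01):
+    # rewards: (T,), costs: (m, T), budget: (m,)
+    T, m = len(rewards), len(budget)
+    b_rate, remaining = budget / T, budget.astype(float)
+    p, decisions = np.zeros(m), np.zeros(T)
+    for t in range(T):
+        if t > 0 and t % f == 0:          # periodic LP re-solve
+            p = solve_dual_prices(costs[:, :t], rewards[:t], b_rate)
+        accept = rewards[t] > p @ costs[:, t] and np.all(remaining >= costs[:, t])
+        decisions[t] = float(accept)
+        if accept:
+            remaining -= costs[:, t]
+        # first-order (projected subgradient) dual update between re-solves
+        p = np.maximum(0.0, p + eta * (decisions[t] * costs[:, t] - b_rate))
+    return decisions
+
+rng = np.random.default_rng(0)
+d = wait_less_olp(rng.random(1000), rng.random((2, 1000)), np.array([300.0, 300.0]))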
+
+
+
+
+
+
+
+ ☆ Neptune: The Long Orbit to Benchmarking Long Video Understanding
+
+
+ This paper describes a semi-automatic pipeline to generate challenging
+question-answer-decoy sets for understanding long videos. Many existing video
+datasets and models are focused on short clips (10s-30s). While some long video
+datasets do exist, they can often be solved by powerful image models applied
+per frame (and often to very few frames) in a video, and are usually manually
+annotated at high cost. In order to mitigate both these problems, we propose a
+scalable dataset creation pipeline which leverages large models (VLMs and
+LLMs), to automatically generate dense, time-aligned video captions, as well as
+tough question-answer-decoy sets for video segments (up to 15 minutes in
+length). Our dataset Neptune covers a broad range of long video reasoning
+abilities and consists of a subset that emphasizes multimodal reasoning. Since
+existing metrics for open-ended question answering are either rule-based or may
+rely on proprietary models, we provide a new open source model-based metric GEM
+to score open-ended responses on Neptune. Benchmark evaluations reveal that
+most current open-source long video models perform poorly on Neptune,
+particularly on questions testing temporal ordering, counting and state
+changes. Through Neptune, we aim to spur the development of more advanced
+models capable of understanding long videos. The dataset is available at
+https://github.com/google-deepmind/neptune
+
+
+
+
+
+
+
+ ☆ A Theoretical Analysis of Soft-Label vs Hard-Label Training in Neural
+ Networks
+
+
+ Knowledge distillation, where a small student model learns from a pre-trained
+large teacher model, has achieved substantial empirical success since the
+seminal work of \citep{hinton2015distilling}. Despite prior theoretical studies
+exploring the benefits of knowledge distillation, an important question remains
+unanswered: why does soft-label training from the teacher require significantly
+fewer neurons than directly training a small neural network with hard labels?
+To address this, we first present motivating experimental results using simple
+neural network models on a binary classification problem. These results
+demonstrate that soft-label training consistently outperforms hard-label
+training in accuracy, with the performance gap becoming more pronounced as the
+dataset becomes increasingly difficult to classify. We then substantiate these
+observations with a theoretical contribution based on two-layer neural network
+models. Specifically, we show that soft-label training using gradient descent
+requires only $O\left(\frac{1}{\gamma^2 \epsilon}\right)$ neurons to achieve a
+classification loss averaged over epochs smaller than some $\epsilon > 0$,
+where $\gamma$ is the separation margin of the limiting kernel. In contrast,
+hard-label training requires $O\left(\frac{1}{\gamma^4} \cdot
+\ln\left(\frac{1}{\epsilon}\right)\right)$ neurons, as derived from an adapted
+version of the gradient descent analysis in \citep{ji2020polylogarithmic}. This
+implies that when $\gamma \leq \epsilon$, i.e., when the dataset is challenging
+to classify, the neuron requirement for soft-label training can be
+significantly lower than that for hard-label training. Finally, we present
+experimental results on deep neural networks, further validating these
+theoretical findings.
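+
+A toy comparison of the two training signals (a hedged sketch: the true class
+probability stands in for a teacher's soft labels, and the network, data, and
+hyperparameters are arbitrary choices, not the paper's setup):
+
+import torch
+import torch.nn as nn
+
+torch.manual_seed(0)
+X = torch.randn(2000, 10)
+w = torch.randn(10)
+p_true = torch.sigmoid(3 * (X @ w))        # stand-in for teacher probabilities
+y_hard = (p_true > 0.5).float()            # hard labels
+
+def train(targets, epochs=200):
+    net = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1))
+    opt = torch.optim.SGD(net.parameters(), lr=0.5)
+    loss_fn = nn.BCEWithLogitsLoss()       # accepts soft targets in [0, 1]
+    for _ in range(epochs):
+        opt.zero_grad()
+        loss_fn(net(X).squeeze(-1), targets).backward()
+        opt.step()
+    with torch.no_grad():
+        return ((net(X).squeeze(-1) > 0) == y_hard.bool()).float().mean().item()
+
+print("hard-label acc:", train(y_hard), " soft-label acc:", train(p_true))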
+
+
+
+ comment: Main Body of the Paper is under Review at L4DC 2025
+
+
+
+
+
+
+ ☆ JuStRank: Benchmarking LLM Judges for System Ranking
+
+
+ Given the rapid progress of generative AI, there is a pressing need to
+systematically compare and choose between the numerous models and
+configurations available. The scale and versatility of such evaluations make
+the use of LLM-based judges a compelling solution for this challenge.
+Crucially, this approach first requires validating the quality of the LLM
+judge itself. Previous work has focused on instance-based assessment of LLM
+judges, where a judge is evaluated over a set of responses, or response pairs,
+while being agnostic to their source systems. We argue that this setting
+overlooks critical factors affecting system-level ranking, such as a judge's
+positive or negative bias towards certain systems. To address this gap, we
+conduct the first large-scale study of LLM judges as system rankers. System
+scores are generated by aggregating judgment scores over multiple system
+outputs, and the judge's quality is assessed by comparing the resulting system
+ranking to a human-based ranking. Beyond overall judge assessment, our analysis
+provides a fine-grained characterization of judge behavior, including their
+decisiveness and bias.
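+
+The system-level evaluation protocol can be pictured as follows (a small
+sketch with hypothetical scores and system names; aggregation by mean and
+Kendall's tau are common choices, not necessarily the paper's exact ones):
+
+import numpy as np
+from scipy.stats import kendalltau
+
+judge_scores = {"sys_a": [0.9, 0.7, 0.8],    # per-response judge scores
+                "sys_b": [0.4, 0.6, 0.5],
+                "sys_c": [0.65, 0.7, 0.6]}
+human_ranking = ["sys_a", "sys_c", "sys_b"]  # best to worst, human-based
+
+# aggregate instance-level judgments into one score per system, then rank
+system_scores = {s: float(np.mean(v)) for s, v in judge_scores.items()}
+judge_ranking = sorted(system_scores, key=system_scores.get, reverse=True)
+
+# agreement between the judge-induced ranking and the human ranking
+systems = list(judge_scores)
+human_pos = {s: i for i, s in enumerate(human_ranking)}
+judge_pos = {s: i for i, s in enumerate(judge_ranking)}
+tau, _ = kendalltau([human_pos[s] for s in systems],
+                    [judge_pos[s] for s in systems])
+print(judge_ranking, "Kendall tau vs humans:", tau)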
+
+
+
+
+
+
+
+
+ Luke Bailey, Alex Serrano, Abhay Sheshadri, Mikhail Seleznyov, Jordan Taylor, Erik Jenner, Jacob Hilton, Stephen Casper, Carlos Guestrin, Scott Emmons
+
+
+ Recent latent-space monitoring techniques have shown promise as defenses
+against LLM attacks. These defenses act as scanners that seek to detect harmful
+activations before they lead to undesirable actions. This prompts the question:
+Can models execute harmful behavior via inconspicuous latent states? Here, we
+study such obfuscated activations. We show that state-of-the-art latent-space
+defenses -- including sparse autoencoders, representation probing, and latent
+OOD detection -- are all vulnerable to obfuscated activations. For example,
+against probes trained to classify harmfulness, our attacks can often reduce
+recall from 100% to 0% while retaining a 90% jailbreaking rate. However,
+obfuscation has limits: we find that on a complex task (writing SQL code),
+obfuscation reduces model performance. Together, our results demonstrate that
+neural activations are highly malleable: we can reshape activation patterns in
+a variety of ways, often while preserving a network's behavior. This poses a
+fundamental challenge to latent-space defenses.
+
+
+ Cable broadband networks are one of the few "last-mile" broadband
+technologies widely available in the U.S. Unfortunately, they have poor
+reliability after decades of deployment. The cable industry proposed a
+framework called Proactive Network Maintenance (PNM) to diagnose the cable
+networks. However, there is little public knowledge or systematic study on how
+to use these data to detect and localize cable network problems. Existing tools
+in the public domain have prohibitively high false-positive rates. In this paper,
+we propose CableMon, the first public-domain system that applies machine
+learning techniques to PNM data to improve the reliability of cable broadband
+networks. CableMon tackles two key challenges faced by cable ISPs: accurately
+detecting failures, and distinguishing whether a failure occurs within a
+network or at a subscriber's premises. CableMon uses statistical models to
+generate features from time series data and uses customer trouble tickets as
+hints to infer abnormal/failure thresholds for these generated features.
+Further, CableMon employs an unsupervised learning model to group cable devices
+sharing similar anomalous patterns and to effectively distinguish impairments
+that occur inside a cable network from impairments that occur at a
+subscriber's premises,
+as these two different faults require different types of technical personnel to
+repair them. We use eight months of PNM data and customer trouble tickets from
+an ISP and experimental deployment to evaluate CableMon's performance. Our
+evaluation results show that CableMon can effectively detect and distinguish
+failures from PNM data and outperforms existing public-domain tools.
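+
+A loose stand-in for the pipeline sketched above (synthetic telemetry, ad-hoc
+features, and a percentile-style threshold; CableMon's actual features,
+thresholding, and clustering models are not reproduced here):
+
+import numpy as np
+from sklearn.cluster import KMeans
+
+rng = np.random.default_rng(0)
+snr = rng.normal(36.0, 1.0, size=(20, 500))   # one SNR series per device
+snr[3, 300:] -= 4.0                           # simulate an impaired device
+ticketed = {3}                                # devices with trouble tickets
+
+# simple statistical features per device
+feats = np.stack([snr.mean(axis=1), snr.std(axis=1), snr.min(axis=1)], axis=1)
+
+# tickets as hints: derive a "normal" threshold from devices without tickets
+healthy = feats[[i for i in range(len(feats)) if i not in ticketed]]
+threshold = healthy[:, 1].mean() + 3 * healthy[:, 1].std()
+flagged = np.where(feats[:, 1] > threshold)[0]
+
+# group devices with similar anomalous patterns (shared clusters would hint at
+# in-network faults, isolated ones at subscriber-premises faults)
+labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
+print("flagged devices:", flagged, "clusters:", labels)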
+
+
+
+ comment: 15 pages including reference. Submitted to IEEE/ACM Transactions on
+ Networking. Partly published in NSDI'20, this is the extended version
+
+
+
+
+
+
+ ★ Does Representation Matter? Exploring Intermediate Layers in Large
+ Language Models
+
+
+ Understanding what defines a good representation in large language models
+(LLMs) is fundamental to both theoretical understanding and practical
+applications. In this paper, we investigate the quality of intermediate
+representations in various LLM architectures, including Transformers and State
+Space Models (SSMs). We find that intermediate layers often yield more
+informative representations for downstream tasks than the final layers. To
+measure the representation quality, we adapt and apply a suite of metrics -
+such as prompt entropy, curvature, and augmentation-invariance - originally
+proposed in other contexts. Our empirical study reveals significant
+architectural differences, how representations evolve throughout training, and
+how factors like input randomness and prompt length affect each layer. Notably,
+we observe a bimodal pattern in the entropy of some intermediate layers and
+consider potential explanations tied to training data. Overall, our results
+illuminate the internal mechanics of LLMs and guide strategies for
+architectural optimization and training.
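+
+One plausible instantiation of an entropy-style representation metric (the
+paper's exact "prompt entropy" definition may differ; this sketch scores how
+evenly a layer's token representations spread across directions):
+
+import numpy as np
+
+def representation_entropy(H):
+    # H: (tokens, hidden). Entropy of the normalized squared singular values
+    # of the centered representation matrix; higher = less collapsed.
+    s = np.linalg.svd(H - H.mean(axis=0), compute_uv=False)
+    p = (s ** 2) / np.sum(s ** 2)
+    p = p[p > 0]
+    return float(-np.sum(p * np.log(p)))
+
+rng = np.random.default_rng(0)
+spread = rng.normal(size=(128, 64))
+collapsed = np.outer(rng.normal(size=128), rng.normal(size=64))
+collapsed = collapsed + 0.01 * rng.normal(size=(128, 64))
+print(representation_entropy(spread), ">", representation_entropy(collapsed))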
+
+
+
+ comment: Accepted to the 2024 NeurIPS Workshop on Machine Learning and Compression
+
+
+
+
+
+
+ ☆ Experimental Machine Learning with Classical and Quantum Data via NMR
+ Quantum Kernels
+
+
+ Kernel methods map data into high-dimensional spaces, enabling linear
+algorithms to learn nonlinear functions without explicitly storing the feature
+vectors. Quantum kernel methods promise efficient learning by encoding feature
+maps into exponentially large Hilbert spaces inherent in quantum systems. In
+this work we implement quantum kernels on a 10-qubit star-topology register in
+a nuclear magnetic resonance (NMR) platform. We experimentally encode classical
+data in the evolution of multiple quantum coherence orders using data-dependent
+unitary transformations and then demonstrate one-dimensional regression and
+two-dimensional classification tasks. By extending the register to a
+double-layered star configuration, we propose an extended quantum kernel to
+handle non-parametrized operator inputs. By numerically simulating the extended
+quantum kernel, we show classification of entangling and nonentangling
+unitaries. These results confirm that quantum kernels exhibit strong
+capabilities in classical as well as quantum machine learning tasks.
+
+
+
+ comment: 8 pages, 5 figures
+
+
+
+
+
+
+ ☆ Enhancing Convergence of Decentralized Gradient Tracking under the KL
+ Property
+
+
+ We study decentralized multiagent optimization over networks, modeled as
+undirected graphs. The optimization problem consists of minimizing a nonconvex
+smooth function plus a convex extended-value function, which enforces
+constraints or extra structure on the solution (e.g., sparsity, low-rank). We
+further assume that the objective function satisfies the Kurdyka-{\L}ojasiewicz
+(KL) property, with given exponent $\theta\in [0,1)$. The KL property is
+satisfied by several (nonconvex) functions of practical interest, e.g., arising
+from machine learning applications; in the centralized setting, it makes it
+possible to achieve strong convergence guarantees. Here we establish
+convergence of the same type for the well-known decentralized
+gradient-tracking-based algorithm SONATA. Specifically, $\textbf{(i)}$ when
+$\theta\in (0,1/2]$, the sequence generated by SONATA converges to a
+stationary solution of the problem at an R-linear rate; $\textbf{(ii)}$ when
+$\theta\in (1/2,1)$, a sublinear rate is certified; and finally
+$\textbf{(iii)}$ when $\theta=0$, the iterates either converge in a finite
+number of steps or converge at an R-linear rate. This
+matches the convergence behavior of centralized proximal-gradient algorithms
+except when $\theta=0$. Numerical results validate our theoretical findings.
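+
+For context, the standard Kurdyka-{\L}ojasiewicz inequality behind the
+exponent $\theta$ used above (a textbook statement, not quoted from the
+paper): $F$ satisfies the KL property at $x^\ast$ with exponent $\theta$ if,
+for some $c>0$ and all $x$ near $x^\ast$ with $F(x)>F(x^\ast)$,
+$$\operatorname{dist}\bigl(0,\partial F(x)\bigr)\;\ge\;c\,\bigl(F(x)-F(x^\ast)\bigr)^{\theta},\qquad \theta\in[0,1),$$
+which explains why smaller exponents ($\theta\le 1/2$) translate into the
+faster, R-linear regimes listed in cases $\textbf{(i)}$ and $\textbf{(iii)}$.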
+
+
+
+ comment: 25 pages, 4 figures
+
+
+
+
+
+
+ ☆ SimAvatar: Simulation-Ready Avatars with Layered Hair and Clothing
+
+
+
+
+
+
+
+
+ Xueting Li, Ye Yuan, Shalini De Mello, Gilles Daviet, Jonathan Leaf, Miles Macklin, Jan Kautz, Umar Iqbal
+
+
+ We introduce SimAvatar, a framework designed to generate simulation-ready
+clothed 3D human avatars from a text prompt. Current text-driven human avatar
+generation methods either model hair, clothing, and the human body using a
+unified geometry or produce hair and garments that are not easily adaptable for
+simulation within existing simulation pipelines. The primary challenge lies in
+representing the hair and garment geometry in a way that allows leveraging
+established prior knowledge from foundational image diffusion models (e.g.,
+Stable Diffusion) while being simulation-ready using either physics or neural
+simulators. To address this task, we propose a two-stage framework that
+combines the flexibility of 3D Gaussians with simulation-ready hair strands and
+garment meshes. Specifically, we first employ three text-conditioned 3D
+generative models to generate garment mesh, body shape and hair strands from
+the given text prompt. To leverage prior knowledge from foundational diffusion
+models, we attach 3D Gaussians to the body mesh, garment mesh, as well as hair
+strands and learn the avatar appearance through optimization. To drive the
+avatar given a pose sequence, we first apply physics simulators onto the
+garment meshes and hair strands. We then transfer the motion onto 3D Gaussians
+through carefully designed mechanisms for each body part. As a result, our
+synthesized avatars have vivid texture and realistic dynamic motion. To the
+best of our knowledge, our method is the first to produce highly realistic,
+fully simulation-ready 3D avatars, surpassing the capabilities of current
+approaches.
+
+
+ Aligning AI systems with human preferences typically suffers from the
+infamous reward hacking problem, where optimization of an imperfect reward
+model leads to undesired behaviors. In this paper, we investigate reward
+hacking in offline preference optimization, which aims to improve an initial
+model using a preference dataset. We identify two types of reward hacking
+stemming from statistical fluctuations in the dataset: Type I Reward Hacking
+due to subpar choices appearing more favorable, and Type II Reward Hacking due
+to decent choices appearing less favorable. We prove that many (mainstream or
+theoretical) preference optimization methods suffer from both types of reward
+hacking. To mitigate Type I Reward Hacking, we propose POWER, a new preference
+optimization method that combines Guiasu's weighted entropy with a robust
+reward maximization objective. POWER enjoys finite-sample guarantees under
+general function approximation, competing with the best covered policy in the
+data. To mitigate Type II Reward Hacking, we analyze the learning dynamics of
+preference optimization and develop a novel technique that dynamically updates
+preference labels toward certain "stationary labels", resulting in diminishing
+gradients for untrustworthy samples. Empirically, POWER with dynamic labels
+(POWER-DL) consistently outperforms state-of-the-art methods on alignment
+benchmarks, achieving improvements of up to 13.0 points on AlpacaEval 2.0 and
+11.5 points on Arena-Hard over DPO, while also improving or maintaining
+performance on downstream tasks such as mathematical reasoning. Strong
+theoretical guarantees and empirical results demonstrate the promise of
+POWER-DL in mitigating reward hacking.
+
+
+
+ comment: 46 pages, 3 figures
+
+
+
+
+
+
+ ☆ Capturing the Temporal Dependence of Training Data Influence
+
+
+
+
+
+
+
+
+ Jiachen T. Wang, Dawn Song, James Zou, Prateek Mittal, Ruoxi Jia
+
+
+ Traditional data influence estimation methods, like influence function,
+assume that learning algorithms are permutation-invariant with respect to
+training data. However, modern training paradigms, especially for foundation
+models using stochastic algorithms and multi-stage curricula, are sensitive to
+data ordering, thus violating this assumption. This mismatch renders influence
+functions inadequate for answering a critical question in machine learning: How
+can we capture the dependence of data influence on the optimization trajectory
+during training? To address this gap, we formalize the concept of
+trajectory-specific leave-one-out (LOO) influence, which quantifies the impact
+of removing a data point from a specific iteration during training, accounting
+for the exact sequence of data encountered and the model's optimization
+trajectory. However, exactly evaluating the trajectory-specific LOO presents a
+significant computational challenge. To address this, we propose data value
+embedding, a novel technique enabling efficient approximation of
+trajectory-specific LOO. Specifically, we compute a training data embedding
+that encapsulates the cumulative interactions between data and the evolving
+model parameters. The LOO can then be efficiently approximated through a simple
+dot-product between the data value embedding and the gradient of the given test
+data. As data value embedding captures training data ordering, it offers
+valuable insights into model training dynamics. In particular, we uncover
+distinct phases of data influence, revealing that data points in the early and
+late stages of training exert a greater impact on the final model. These
+insights translate into actionable strategies for managing the computational
+overhead of data selection by strategically timing the selection process,
+potentially opening new avenues in data curation research.
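+
+Once the embeddings exist, the approximation itself is just a dot product; a
+minimal sketch with placeholder arrays (computing the embeddings, which is
+the paper's contribution, is not shown):
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+d, n_train = 32, 1000
+data_value_embeddings = rng.normal(size=(n_train, d))  # one per training point
+test_gradient = rng.normal(size=d)                     # gradient on a test example
+
+# trajectory-aware LOO influence approximated by a dot product, then ranked
+influence = data_value_embeddings @ test_gradient
+print("top-10 most influential points:", np.argsort(-influence)[:10])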
+
+
+
+ comment: Correspondence to Jiachen T. Wang and Ruoxi Jia
+
+
+
+
+
+
+ ☆ GainAdaptor: Learning Quadrupedal Locomotion with Dual Actors for
+ Adaptable and Energy-Efficient Walking on Various Terrains
+
+
+
+
+
+
+
+
+ Mincheol Kim, Nahyun Kwon, Jung-Yup Kim
+
+
+ Deep reinforcement learning (DRL) has emerged as an innovative solution for
+controlling legged robots in challenging environments using minimalist
+architectures. Traditional control methods for legged robots, such as inverse
+dynamics, either directly manage joint torques or use proportional-derivative
+(PD) controllers to regulate joint positions at a higher level. In the case of DRL,
+direct torque control presents significant challenges, leading to a preference
+for joint position control. However, this approach necessitates careful
+adjustment of joint PD gains, which can limit both adaptability and efficiency.
+In this paper, we propose GainAdaptor, an adaptive gain control framework that
+autonomously tunes joint PD gains to enhance terrain adaptability and energy
+efficiency. The framework employs a dual-actor algorithm to dynamically adjust
+the PD gains based on varying ground conditions. By utilizing a divided action
+space, GainAdaptor efficiently learns stable and energy-efficient locomotion.
+We validate the effectiveness of the proposed method through experiments
+conducted on a Unitree Go1 robot, demonstrating improved locomotion performance
+across diverse terrains.
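+
+The control interface being tuned is an ordinary joint-space PD law; a tiny
+sketch (the gain values, joint count, and the idea of one actor emitting
+targets and another emitting gains are illustrative assumptions):
+
+import numpy as np
+
+def pd_torque(q, dq, q_des, kp, kd):
+    # per-joint torques; the gain actor chooses kp/kd for the current terrain
+    return kp * (q_des - q) - kd * dq
+
+q, dq = np.zeros(12), np.zeros(12)            # 12-joint quadruped state
+q_des = 0.1 * np.ones(12)                     # from the position actor
+kp, kd = np.full(12, 40.0), np.full(12, 1.0)  # from the gain actor
+tau = pd_torque(q, dq, q_des, kp, kd)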
+
+
+
+ comment: 8 pages, 6 figures
+
+
+
+
+
+
+ ☆ Loss function to optimise signal significance in particle physics NeurIPS 2024
+
+
+ We construct a surrogate loss to directly optimise the significance metric
+used in particle physics. We evaluate our loss function for a simple event
+classification task using a linear model and show that it produces decision
+boundaries that change according to the cross sections of the processes
+involved. We find that the models trained with the new loss have higher signal
+efficiency for similar values of estimated signal significance compared to ones
+trained with a cross-entropy loss, showing promise to improve sensitivity of
+particle physics searches at colliders.
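+
+A hedged sketch of a significance-style surrogate: the usual asymptotic
+approximation $Z = S/\sqrt{S+B}$ with a sigmoid in place of a hard selection
+cut and per-event cross-section weights (the paper's exact surrogate and
+constants are not reproduced here):
+
+import numpy as np
+
+def soft_significance(scores, is_signal, w_sig, w_bkg, cut=0.5, k=50.0):
+    sel = 1.0 / (1.0 + np.exp(-k * (scores - cut)))   # differentiable selection
+    S = np.sum(sel * is_signal * w_sig)               # expected selected signal
+    B = np.sum(sel * (1.0 - is_signal) * w_bkg)       # expected selected background
+    return S / np.sqrt(S + B + 1e-9)
+
+# training would minimize the negative significance
+loss = -soft_significance(np.array([0.9, 0.2, 0.7]),
+                          np.array([1.0, 0.0, 1.0]), w_sig=0.1, w_bkg=10.0)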
+
+
+
+ comment: 9 pages, 4 figures. Appeared in the Machine Learning for Physical
+ Sciences (ML4PS) workshop in NeurIPS 2024 conference
+
+
+
+
+
+
+ ☆ A novel ML-fuzzy control system for optimizing PHEV fuel efficiency and
+ extending electric range under diverse driving conditions
+
+
+
+
+
+
+
+
+ Mehrdad Raeesi, Saba Mansour, Sina Changizian
+
+
+ Aiming for a greener transportation future, this study introduces an
+innovative control system for plug-in hybrid electric vehicles (PHEVs) that
+utilizes machine learning (ML) techniques to forecast energy usage in the pure
+electric mode of the vehicle and optimize power allocation across different
+operational modes, including pure electric, series hybrid, parallel hybrid, and
+internal combustion operation. The fuzzy logic decision-making process governs
+the vehicle control system. The performance was assessed under various driving
+conditions. Key findings include a significant enhancement in pure electric
+mode efficiency, achieving an extended full-electric range of approximately 84
+kilometers when using 80% of a 20-kWh battery pack. During the WLTC
+driving cycle, the control system reduced fuel consumption to 2.86 L/100km,
+representing a 20% reduction in gasoline-equivalent fuel consumption.
+Evaluations of vehicle performance at discrete driving speeds highlighted
+effective energy management, with the vehicle battery charging at lower speeds
+and discharging at higher speeds, showing optimized energy recovery and
+consumption strategies. Initial battery charge levels notably influenced
+vehicle performance. A 90% initial charge enabled prolonged all-electric
+operation, minimizing fuel consumption to 2 L/100km less than that of the base
+control system. Real-world driving pattern analysis revealed significant
+variations, with shorter, slower cycles requiring lower fuel consumption due to
+prioritized electric propulsion, while longer, faster cycles increased internal
+combustion engine usage. The control system also adapted to different battery
+state of health (SOH) conditions, with higher SOH facilitating extended
+electric mode usage, reducing total fuel consumption by up to 2.87 L/100km.
+
+
+
+ comment: 29 pages, 13 figures
+
+
+
+
+
+
+ ☆ Regression and Classification with Single-Qubit Quantum Neural Networks
+
+
+
+
+
+
+
+
+ Leandro C. Souza, Bruno C. Guingo, Gilson Giraldi, Renato Portugal
+
+
+ Since classical machine learning has become a powerful tool for developing
+data-driven algorithms, quantum machine learning is expected to similarly
+impact the development of quantum algorithms. The literature reflects a
+mutually beneficial relationship between machine learning and quantum
+computing, where progress in one field frequently drives improvements in the
+other. Motivated by the fertile connection between machine learning and quantum
+computing enabled by parameterized quantum circuits, we use a
+resource-efficient and scalable Single-Qubit Quantum Neural Network (SQQNN) for
+both regression and classification tasks. The SQQNN leverages parameterized
+single-qubit unitary operators and quantum measurements to achieve efficient
+learning. To train the model, we use gradient descent for regression tasks. For
+classification, we introduce a novel training method inspired by the Taylor
+series, which can efficiently find a global minimum in a single step. This
+approach significantly accelerates training compared to iterative methods.
+Evaluated across various applications, the SQQNN exhibits virtually error-free
+and strong performance in regression and classification tasks, including the
+MNIST dataset. These results demonstrate the versatility, scalability, and
+suitability of the SQQNN for deployment on near-term quantum devices.
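+
+The single-qubit building block is easy to emulate classically; a numpy
+sketch of a data re-uploading style circuit trained by (finite-difference)
+gradient descent for regression (an illustration, not the SQQNN architecture
+or its Taylor-series training method):
+
+import numpy as np
+
+def ry(a):                                   # single-qubit Y rotation
+    return np.array([[np.cos(a / 2), -np.sin(a / 2)],
+                     [np.sin(a / 2),  np.cos(a / 2)]])
+
+Z = np.diag([1.0, -1.0])
+
+def predict(x, theta):
+    state = ry(theta[0] + theta[1] * x) @ np.array([1.0, 0.0])  # U(x)|0>
+    return state @ Z @ state                 # <Z> = cos(theta0 + theta1*x)
+
+xs = np.linspace(-1, 1, 50)
+ys = np.cos(0.7 + 2.0 * xs)                  # toy regression target
+theta, lr, eps = np.array([0.1, 0.5]), 0.1, 1e-5
+for _ in range(500):
+    grad = np.zeros(2)
+    for i in range(2):
+        d = np.zeros(2); d[i] = eps
+        lp = np.mean((np.array([predict(x, theta + d) for x in xs]) - ys) ** 2)
+        lm = np.mean((np.array([predict(x, theta - d) for x in xs]) - ys) ** 2)
+        grad[i] = (lp - lm) / (2 * eps)
+    theta -= lr * grad
+print("fitted parameters:", theta)           # target curve uses (0.7, 2.0)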
+
+
+
+ comment: 21 pages, 7 figures, 6 tables
+
+
+
+
+
+
+ ☆ Early Detection of At-Risk Students Using Machine Learning
+
+
+ This research presents preliminary work to address the challenge of
+identifying at-risk students using supervised machine learning and three unique
+data categories: engagement, demographics, and performance data collected from
+Fall 2023 using Canvas and the California State University, Fullerton
+dashboard. We aim to tackle the persistent challenges of higher education
+retention and student dropout rates by screening for at-risk students and
+building a high-risk identification system. By focusing on previously
+overlooked behavioral factors alongside traditional metrics, this work aims to
+address educational gaps, enhance student outcomes, and significantly boost
+student success across disciplines at the University. Pre-processing steps take
+place to establish a target variable, anonymize student information, manage
+missing data, and identify the most significant features. Given the mixed data
+types in the datasets and the binary classification nature of this study, this
+work considers several machine learning models, including Support Vector
+Machines (SVM), Naive Bayes, K-nearest neighbors (KNN), Decision Trees,
+Logistic Regression, and Random Forest. These models predict at-risk students
+and identify critical periods of the semester when student performance is most
+vulnerable. We will use validation techniques such as train-test split and
+k-fold cross-validation to ensure the reliability of the models. Our analysis
+indicates that all algorithms generate an acceptable outcome for at-risk
+student predictions, while Naive Bayes performs best overall.
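+
+A minimal scikit-learn sketch of the evaluation protocol described above
+(synthetic stand-in features; the real engagement/demographics/performance
+data and preprocessing are not reproduced):
+
+import numpy as np
+from sklearn.model_selection import train_test_split, cross_val_score
+from sklearn.naive_bayes import GaussianNB
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(500, 8))                          # encoded student features
+y = (X[:, 0] + 0.5 * X[:, 3]
+     + rng.normal(scale=0.5, size=500) < -0.5).astype(int)   # 1 = at risk
+
+X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
+                                          random_state=0, stratify=y)
+clf = GaussianNB().fit(X_tr, y_tr)
+print("hold-out accuracy:", clf.score(X_te, y_te))
+print("5-fold CV accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())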
+
+
+
+
+
+
+
+ ☆ Bayesian Optimization via Continual Variational Last Layer Training
+
+
+
+
+
+
+
+
+ Paul Brunzema, Mikkel Jordahn, John Willes, Sebastian Trimpe, Jasper Snoek, James Harrison
+
+
+ Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate
+models for Bayesian optimization (BO) due to their ability to model
+uncertainty, their strong performance on tasks where correlations are easily
+captured (such as those defined by Euclidean metrics), and their ability to
+be updated efficiently online. However, the performance of GPs depends on the
+choice of kernel, and
+kernel selection for complex correlation structures is often difficult or must
+be made bespoke. While Bayesian neural networks (BNNs) are a promising
+direction for higher capacity surrogate models, they have so far seen limited
+use due to poor performance on some problem types. In this paper, we propose an
+approach which shows competitive performance on many problem types, including
+some that BNNs typically struggle with. We build on variational Bayesian last
+layers (VBLLs), and connect training of these models to exact conditioning in
+GPs. We exploit this connection to develop an efficient online training
+algorithm that interleaves conditioning and optimization. Our findings suggest
+that VBLL networks significantly outperform GPs and other BNN architectures on
+tasks with complex input correlations, and match the performance of well-tuned
+GPs on established benchmark tasks.
+
+
+
+
+
+
+
+ ☆ A Novel Ensemble-Based Deep Learning Model with Explainable AI for
+ Accurate Kidney Disease Diagnosis
+
+
+ Chronic Kidney Disease (CKD) represents a significant global health
+challenge, characterized by the progressive decline in renal function, leading
+to the accumulation of waste products and disruptions in fluid balance within
+the body. Given its pervasive impact on public health, there is a pressing need
+for effective diagnostic tools to enable timely intervention. Our study delves
+into the application of cutting-edge transfer learning models for the early
+detection of CKD. Leveraging a comprehensive and publicly available dataset, we
+meticulously evaluate the performance of several state-of-the-art models,
+including EfficientNetV2, InceptionNetV2, MobileNetV2, and the Vision
+Transformer (ViT) technique. Remarkably, our analysis demonstrates superior
+accuracy rates, surpassing the 90% threshold with MobileNetV2 and achieving
+91.5% accuracy with ViT. Moreover, to enhance predictive capabilities further,
+we integrate these individual methodologies through ensemble modeling,
+resulting in our ensemble model exhibiting a remarkable 96% accuracy in the
+early detection of CKD. This significant advancement holds immense promise for
+improving clinical outcomes and underscores the critical role of machine
+learning in addressing complex medical challenges.
+
+
+ Cornish (2024) recently gave a general theory of neural network
+symmetrisation in the abstract context of Markov categories. We give a
+high-level overview of these results, and their concrete implications for the
+symmetrisation of deterministic functions and of Markov kernels.
+
+
+
+
+
+
+
+ ☆ STORM: A Spatio-Temporal Factor Model Based on Dual Vector Quantized
+ Variational Autoencoders for Financial Trading
+
+
+ In financial trading, factor models are widely used to price assets and
+capture excess returns from mispricing. Recently, we have witnessed the rise of
+variational autoencoder-based latent factor models, which learn latent factors
+self-adaptively. While these models focus on modeling overall market
+conditions, they often fail to effectively capture the temporal patterns of
+individual stocks. Additionally, representing multiple factors as single values
+simplifies the model but limits its ability to capture complex relationships
+and dependencies. As a result, the learned factors are of low quality and lack
+diversity, reducing their effectiveness and robustness across different trading
+periods. To address these issues, we propose a Spatio-Temporal factOR Model
+based on dual vector quantized variational autoencoders, named STORM, which
+extracts features of stocks from temporal and spatial perspectives, then fuses
+and aligns these features at the fine-grained and semantic level, and
+represents the factors as multi-dimensional embeddings. The discrete codebooks
+cluster similar factor embeddings, ensuring orthogonality and diversity, which
+helps distinguish between different factors and enables factor selection in
+financial trading. To show the performance of the proposed factor model, we
+apply it to two downstream experiments: portfolio management on two stock
+datasets and individual trading tasks on six specific stocks. The extensive
+experiments demonstrate STORM's flexibility in adapting to downstream tasks and
+superior performance over baseline models.
+
+
+
+
+
+
+
+ ☆ Finite-PINN: A Physics-Informed Neural Network Architecture for Solving
+ Solid Mechanics Problems with General Geometries
+
+
+
+
+
+
+
+
+ Haolin Li, Yuyang Miao, Zahra Sharif Khodaei, M. H. Aliabadi
+
+
+ PINN models have demonstrated impressive capabilities in addressing fluid PDE
+problems, and their potential in solid mechanics is beginning to emerge. This
+study identifies two key challenges when using PINN to solve general solid
+mechanics problems. These challenges become evident when comparing the
+limitations of PINN with the well-established numerical methods commonly used
+in solid mechanics, such as the finite element method (FEM). Specifically: a)
+PINN models generate solutions over an infinite domain, which conflicts with
+the finite boundaries typical of most solid structures; and b) the solution
+space utilised by PINN is Euclidean, which is inadequate for addressing the
+complex geometries often present in solid structures.
+ This work proposes a PINN architecture used for general solid mechanics
+problems, termed the Finite-PINN model. The proposed model aims to effectively
+address these two challenges while preserving as much of the original
+implementation of PINN as possible. The unique architecture of the Finite-PINN
+model addresses these challenges by separating the approximation of stress and
+displacement fields, and by transforming the solution space from the
+traditional Euclidean space to a Euclidean-topological joint space. Several
+case studies presented in this paper demonstrate that the Finite-PINN model
+provides satisfactory results for a variety of problem types, including both
+forward and inverse problems, in both 2D and 3D contexts. The developed
+Finite-PINN model offers a promising tool for addressing general solid
+mechanics problems, particularly those not yet well-explored in current
+research.
+
+
+
+
+
+
+
+ ☆ Search Strategy Generation for Branch and Bound Using Genetic
+ Programming AAAI 2025
+
+
+ Branch-and-Bound (B\&B) is an exact method in integer programming that
+recursively divides the search space into a tree. During the resolution
+process, determining the next subproblem to explore within the tree-known as
+the search strategy-is crucial. Hand-crafted heuristics are commonly used, but
+none are effective over all problem classes. Recent approaches utilizing neural
+networks claim to make more intelligent decisions but are computationally
+expensive. In this paper, we introduce GP2S (Genetic Programming for Search
+Strategy), a novel machine learning approach that automatically generates a
+B\&B search strategy heuristic, aiming to make intelligent decisions while
+being computationally lightweight. We define a policy as a function that
+evaluates the quality of a B\&B node by combining features from the node and
+the problem; the search strategy policy is then defined by a best-first search
+based on this node ranking. The policy space is explored using a genetic
+programming algorithm, and the policy that achieves the best performance on a
+training set is selected. We compare our approach with the standard method of
+the SCIP solver, a recent graph neural network-based method, and handcrafted
+heuristics. Our first evaluation includes three types of primal hard problems,
+tested on instances similar to the training set and on larger instances. Our
+method is at most 2\% slower than the best baseline and consistently
+outperforms SCIP, achieving an average speedup of 11.3\%. Additionally, GP2S is
+tested on the MIPLIB 2017 dataset, generating multiple heuristics from
+different subsets of instances. It exceeds SCIP's average performance in 7 out
+of 10 cases across 15 times more instances and under a time limit 15 times
+longer, with some GP2S methods leading on most experiments in terms of the
+number of feasible solutions or optimality gap.
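+
+The policy-driven search can be pictured as a scored best-first loop; a
+generic sketch (the scoring expression, feature names, and expand callback
+are hypothetical, not the evolved GP2S policies):
+
+import heapq
+
+def evolved_policy(node_features, problem_features):
+    # stand-in for an expression evolved by genetic programming
+    return (node_features["dual_bound"]
+            - 0.3 * node_features["depth"] * problem_features["density"])
+
+def best_first_search(root, expand, problem_features, budget=1000):
+    # explore B&B nodes in order of decreasing policy score
+    frontier = [(-evolved_policy(root["features"], problem_features), 0, root)]
+    counter, best = 1, None
+    while frontier and counter < budget:
+        _, _, node = heapq.heappop(frontier)
+        children, incumbent = expand(node)      # user-supplied B&B step
+        best = incumbent if incumbent is not None else best
+        for child in children:
+            score = evolved_policy(child["features"], problem_features)
+            heapq.heappush(frontier, (-score, counter, child))
+            counter += 1
+    return best
+
+root = {"features": {"dual_bound": 10.0, "depth": 0}}
+print(best_first_search(root, lambda n: ([], 42.0), {"density": 0.2}))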
+
+
+
+ comment: Accepted at AAAI 2025
+
+
+
+
+
+
+ ☆ MOS: Model Surgery for Pre-Trained Model-Based Class-Incremental
+ Learning AAAI 2025
+
+
+
+
+
+
+
+
+ Hai-Long Sun, Da-Wei Zhou, Hanbin Zhao, Le Gan, De-Chuan Zhan, Han-Jia Ye
+
+
+ Class-Incremental Learning (CIL) requires models to continually acquire
+knowledge of new classes without forgetting old ones. Although Pre-trained
+Models (PTMs) have shown excellent performance in CIL, catastrophic forgetting
+still occurs as the model learns new concepts. Existing work seeks to utilize
+lightweight components to adjust the PTM, while the forgetting phenomenon still
+comes from {\em parameter and retrieval} levels. Specifically, iterative
+updates of the model result in parameter drift, while mistakenly retrieving
+irrelevant modules leads to a mismatch during inference. To this end, we
+propose MOdel Surgery (MOS) to rescue the model from forgetting previous
+knowledge. By training task-specific adapters, we continually adjust the PTM to
+downstream tasks. To mitigate parameter-level forgetting, we present an adapter
+merging approach to learn task-specific adapters, which aims to bridge the gap
+between different components while preserving task-specific information. In
+addition,
+to address retrieval-level forgetting, we introduce a training-free
+self-refined adapter retrieval mechanism during inference, which leverages the
+model's inherent ability for better adapter retrieval. By jointly rectifying
+the model with those steps, MOS can robustly resist catastrophic forgetting in
+the learning process. Extensive experiments on seven benchmark datasets
+validate MOS's state-of-the-art performance. Code is available at:
+https://github.com/sun-hailong/AAAI25-MOS
+
+
+
+ comment: Accepted to AAAI 2025. Code is available at:
+ https://github.com/sun-hailong/AAAI25-MOS
+
+
+
+
+
+
+ ☆ Data Efficient Prediction of excited-state properties using Quantum
+ Neural Networks
+
+
+
+
+
+
+
+
+ Manuel Hagelüken, Marco F. Huber, Marco Roth
+
+
+ Understanding the properties of excited states of complex molecules is
+crucial for many chemical and physical processes. Calculating these properties
+is often significantly more resource-intensive than calculating their ground
+state counterparts. We present a quantum machine learning model that predicts
+excited-state properties from the molecular ground state for different
+geometric configurations. The model comprises a symmetry-invariant quantum
+neural network and a conventional neural network and is able to provide
+accurate predictions with only a few training data points. The proposed
+procedure is fully NISQ compatible. This is achieved by using a quantum circuit
+that requires a number of parameters linearly proportional to the number of
+molecular orbitals, along with a parameterized measurement observable, thereby
+reducing the number of necessary measurements. We benchmark the algorithm on
+three different molecules by evaluating its performance in predicting excited
+state transition energies and transition dipole moments. We show that, in many
+instances, the procedure is able to outperform various classical models that
+rely solely on classical features.
+
+
+
+ comment: 10 + 4 pages, 7 + 3 figures
+
+
+
+
+
+
+ ☆ Mixture of neural fields for heterogeneous reconstruction in cryo-EM
+
+
+
+
+
+
+
+
+ Axel Levy, Rishwanth Raghu, David Shustin, Adele Rui-Yang Peng, Huan Li, Oliver Biggs Clarke, Gordon Wetzstein, Ellen D. Zhong
+
+
+ Cryo-electron microscopy (cryo-EM) is an experimental technique for protein
+structure determination that images an ensemble of macromolecules in
+near-physiological contexts. While recent advances enable the reconstruction of
+dynamic conformations of a single biomolecular complex, current methods do not
+adequately model samples with mixed conformational and compositional
+heterogeneity. In particular, datasets containing mixtures of multiple proteins
+require the joint inference of structure, pose, compositional class, and
+conformational states for 3D reconstruction. Here, we present Hydra, an
+approach that models both conformational and compositional heterogeneity fully
+ab initio by parameterizing structures as arising from one of K neural fields.
+We employ a new likelihood-based loss function and demonstrate the
+effectiveness of our approach on synthetic datasets composed of mixtures of
+proteins with large degrees of conformational variability. We additionally
+demonstrate Hydra on an experimental dataset of a cellular lysate containing a
+mixture of different protein complexes. Hydra expands the expressivity of
+heterogeneous reconstruction methods and thus broadens the scope of cryo-EM to
+increasingly complex samples.
+
+
+
+
+
+
+
+ ☆ Reinforcement Learning Within the Classical Robotics Stack: A Case Study
+ in Robot Soccer ICRA 2025
+
+
+
+
+
+
+
+
+ Adam Labiosa, Zhihan Wang, Siddhant Agarwal, William Cong, Geethika Hemkumar, Abhinav Narayan Harish, Benjamin Hong, Josh Kelle, Chen Li, Yuhao Li, Zisen Shao, Peter Stone, Josiah P. Hanna
+
+
+ Robot decision-making in partially observable, real-time, dynamic, and
+multi-agent environments remains a difficult and unsolved challenge. Model-free
+reinforcement learning (RL) is a promising approach to learning decision-making
+in such domains; however, end-to-end RL in complex environments is often
+intractable. To address this challenge in the RoboCup Standard Platform League
+(SPL) domain, we developed a novel architecture integrating RL within a
+classical robotics stack, while employing a multi-fidelity sim2real approach
+and decomposing behavior into learned sub-behaviors with heuristic selection.
+Our architecture led to victory in the 2024 RoboCup SPL Challenge Shield
+Division. In this work, we fully describe our system's architecture and
+empirically analyze key design decisions that contributed to its success. Our
+approach demonstrates how RL-based behaviors can be integrated into complete
+robot behavior architectures.
+
+
+
+
+
+
+
+
+ Dan Jacobellis, Neeraja J. Yadwadkar
+
+
+ Modern sensors produce increasingly rich streams of high-resolution data. Due
+to resource constraints, machine learning systems discard the vast majority of
+this information via resolution reduction. Compressed-domain learning allows
+models to operate on compact latent representations, allowing higher effective
+resolution for the same budget. However, existing compression systems are not
+ideal for compressed learning. Linear transform coding and end-to-end learned
+compression systems reduce bitrate, but do not uniformly reduce dimensionality;
+thus, they do not meaningfully increase efficiency. Generative autoencoders
+reduce dimensionality, but their adversarial or perceptual objectives lead to
+significant information loss. To address these limitations, we introduce WaLLoC
+(Wavelet Learned Lossy Compression), a neural codec architecture that combines
+linear transform coding with nonlinear dimensionality-reducing autoencoders.
+WaLLoC sandwiches a shallow, asymmetric autoencoder and entropy bottleneck
+between the forward and inverse stages of an invertible wavelet packet
+transform. Across several key metrics,
+WaLLoC outperforms the autoencoders used in state-of-the-art latent diffusion
+models. WaLLoC does not require perceptual or adversarial losses to represent
+high-frequency detail, providing compatibility with modalities beyond RGB
+images and stereo audio. WaLLoC's encoder consists almost entirely of linear
+operations, making it exceptionally efficient and suitable for mobile
+computing, remote sensing, and learning directly from compressed data. We
+demonstrate WaLLoC's capability for compressed-domain learning across several
+tasks, including image classification, colorization, document understanding,
+and music source separation. Our code, experiments, and pre-trained audio and
+image codecs are available at https://ut-sysml.org/walloc
+
+
+
+ comment: Accepted as paper to 2025 IEEE Data Compression Conference
+
+
+
+
+
+
+ ☆ Opinion de-polarization of social networks with GNNs
+
+
+ Nowadays, social media are a primary arena for political debate and the
+exchange of opinions. A significant body of research suggests that social
+media are highly polarized. A commonly observed phenomenon is the echo
+chamber structure, where users are organized into polarized communities and
+form connections only with like-minded individuals, limiting themselves to
+consuming specific content. In this paper we explore a way to decrease the
+polarization of networks with two echo chambers. Particularly, we observe that
+if some users adopt a moderate opinion about a topic, the polarization of the
+network decreases. Based on this observation, we propose an efficient algorithm
+to identify a good set of K users, such that if they adopt a moderate stance
+around a topic, the polarization is minimized. Our algorithm employs a Graph
+Neural Network and can thus handle large graphs more effectively than other
+approaches.
+
+
+
+
+
+
+
+ ☆ A Geometry-Aware Message Passing Neural Network for Modeling
+ Aerodynamics over Airfoils
+
+
+
+
+
+
+
+
+ Jacob Helwig, Xuan Zhang, Haiyang Yu, Shuiwang Ji
+
+
+ Computational modeling of aerodynamics is a key problem in aerospace
+engineering, often involving flows interacting with solid objects such as
+airfoils. Deep surrogate models have emerged as purely data-driven approaches
+that learn direct mappings from simulation conditions to solutions based on
+either simulation or experimental data. Here, we consider modeling of
+incompressible flows over solid objects, wherein geometric structures are a key
+factor in determining aerodynamics. To effectively incorporate geometries, we
+propose a message passing scheme that efficiently and expressively integrates
+the airfoil shape with the mesh representation. Under this framework, we first
+obtain a representation of the geometry in the form of a latent graph on the
+airfoil surface. We subsequently propagate this representation to all
+collocation points through message passing on a directed, bipartite graph. We
+demonstrate that this framework supports efficient training by downsampling the
+solution mesh while avoiding distribution shifts at test time when evaluated on
+the full mesh. To enable our model to be able to distinguish between distinct
+spatial regimes of dynamics relative to the airfoil, we represent mesh points
+in both a leading edge and trailing edge coordinate system. We further enhance
+the expressiveness of our coordinate system representations by embedding our
+hybrid Polar-Cartesian coordinates using sinusoidal and spherical harmonics
+bases. We additionally find that a change of basis to canonicalize input
+representations with respect to inlet velocity substantially improves
+generalization. Altogether, these design choices lead to a purely data-driven
+machine learning framework known as GeoMPNN, which won the Best Student
+Submission award at the NeurIPS 2024 ML4CFD Competition, placing 4th overall.
+Our code is publicly available as part of the AIRS library
+(https://github.com/divelab/AIRS).
+
+
+ The segmentation and classification of cardiac magnetic resonance imaging are
+critical for diagnosing heart conditions, yet current approaches face
+challenges in accuracy and generalizability. In this study, we aim to further
+advance the segmentation and classification of cardiac magnetic resonance
+images by introducing a novel deep learning-based approach. Using a multi-stage
+process with U-Net and ResNet models for segmentation, followed by Gaussian
+smoothing, the method improved segmentation accuracy, achieving a Dice
+coefficient of 0.974 for the left ventricle and 0.947 for the right ventricle.
+For classification, a cascade of deep learning classifiers was employed to
+distinguish heart conditions, including hypertrophic cardiomyopathy, myocardial
+infarction, and dilated cardiomyopathy, achieving an average accuracy of 97.2%.
+The proposed approach outperformed existing models, enhancing segmentation
+accuracy and classification precision. These advancements show promise for
+clinical applications, though further validation and interpretation across
+diverse imaging protocols are necessary.
+
+
+ Protein inverse folding is a fundamental problem in bioinformatics, aiming to
+recover the amino acid sequences from a given protein backbone structure.
+Despite the success of existing methods, they struggle to fully capture the
+intricate inter-residue relationships critical for accurate sequence
+prediction. We propose a novel method that leverages diffusion models with
+representation alignment (DMRA), which enhances diffusion-based inverse folding
+by (1) proposing a shared center that aggregates contextual information from
+the entire protein structure and selectively distributes it to each residue;
+and (2) aligning noisy hidden representations with clean semantic
+representations during the denoising process. This is achieved by predefined
+semantic representations for amino acid types and a representation alignment
+method that utilizes type embeddings as semantic feedback to normalize each
+residue. In experiments, we conduct extensive evaluations on the CATH4.2
+dataset to demonstrate that DMRA outperforms leading methods, achieving
+state-of-the-art performance and exhibiting strong generalization capabilities
+on the TS50 and TS500 datasets.
+
+
+ Graph-based representations for samples of computational mechanics-related
+datasets can prove instrumental when dealing with problems like irregular
+domains or molecular structures of materials, etc. To effectively analyze and
+process such datasets, deep learning offers Graph Neural Networks (GNNs) that
+utilize techniques like message-passing within their architecture. The issue,
+however, is that as the individual graph scales and/ or GNN architecture
+becomes increasingly complex, the increased energy budget of the overall deep
+learning model makes it unsustainable and restricts its applications in
+applications like edge computing. To overcome this, we propose in this paper
+Hybrid Variable Spiking Graph Neural Networks (HVS-GNNs) that utilize Variable
+Spiking Neurons (VSNs) within their architecture to promote sparse
+communication and hence reduce the overall energy budget. VSNs, while promoting
+sparse event-driven computations, also perform well for regression tasks, which
+are often encountered in computational mechanics applications and are the main
+target of this paper. Three examples dealing with the prediction of mechanical
+properties of materials based on microscale/mesoscale structures are shown to
+test the performance of the proposed HVS-GNNs in regression tasks. We have also
+compared the performance of HVS-GNN architectures with the performance of
+vanilla GNNs and GNNs utilizing leaky integrate and fire neurons. The results
+produced show that HVS-GNNs perform well for regression tasks, all while
+promoting sparse communication and, hence, energy efficiency.
+
+
+
+
+
+
+
+ ☆ A comprehensive interpretable machine learning framework for Mild
+ Cognitive Impairment and Alzheimer's disease diagnosis
+
+
+ An interpretable machine learning (ML) framework is introduced to enhance the
+diagnosis of Mild Cognitive Impairment (MCI) and Alzheimer's disease (AD) by
+ensuring robustness of the ML models' interpretations. The dataset used
+comprises volumetric measurements from brain MRI and genetic data from healthy
+individuals and patients with MCI/AD, obtained through the Alzheimer's Disease
+Neuroimaging Initiative. The existing class imbalance is addressed by an
+ensemble learning approach, while various attribution-based and
+counterfactual-based interpretability methods are leveraged towards producing
+diverse explanations related to the pathophysiology of MCI/AD. A unification
+method combining SHAP with counterfactual explanations assesses the
+interpretability techniques' robustness. The best performing model yielded
+87.5% balanced accuracy and 90.8% F1-score. The attribution-based
+interpretability methods highlighted significant volumetric and genetic
+features related to MCI/AD risk. The unification method provided useful
+insights regarding those features' necessity and sufficiency, further
+showcasing their significance in MCI/AD diagnosis.
+
+
+
+ comment: This preprint has not been peer-reviewed yet but has been submitted
+ to a journal
+
+
+
+
+
+
+ ☆ Distribution free uncertainty quantification in neuroscience-inspired
+ deep operators
+
+
+ Energy-efficient deep learning algorithms are essential for a sustainable
+future and feasible edge computing setups. Spiking neural networks (SNNs),
+inspired from neuroscience, are a positive step in the direction of achieving
+the required energy efficiency. However, in a bid to lower the energy
+requirements, accuracy is marginally sacrificed. Hence, predictions of such
+deep learning algorithms require an uncertainty measure that can inform users
+regarding the bounds of a certain output. In this paper, we introduce the
+Conformalized Randomized Prior Operator (CRP-O) framework that leverages
+Randomized Prior (RP) networks and Split Conformal Prediction (SCP) to quantify
+uncertainty in both conventional and spiking neural operators. To further
+enable zero-shot super-resolution in UQ, we propose an extension incorporating
+Gaussian Process Regression. This enhanced super-resolution-enabled CRP-O
+framework is integrated with the recently developed Variable Spiking Wavelet
+Neural Operator (VSWNO). To test the performance of the obtained calibrated
+uncertainty bounds, we discuss four different examples covering both
+one-dimensional and two-dimensional partial differential equations. Results
+demonstrate that the uncertainty bounds produced by the conformalized RP-VSWNO
+significantly enhance UQ estimates compared to vanilla RP-VSWNO, Quantile WNO
+(Q-WNO), and Conformalized Quantile WNO (CQ-WNO). These findings underscore the
+potential of the proposed approach for practical applications.
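+
+Split conformal prediction, one of the ingredients named above, is short
+enough to show in full for the scalar-output case (generic textbook recipe,
+not the CRP-O implementation):
+
+import numpy as np
+
+def split_conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
+    # distribution-free intervals with ~(1 - alpha) marginal coverage
+    n = len(cal_true)
+    scores = np.abs(cal_true - cal_pred)             # nonconformity scores
+    q_level = np.ceil((n + 1) * (1 - alpha)) / n     # finite-sample correction
+    q = np.quantile(scores, min(q_level, 1.0), method="higher")
+    return test_pred - q, test_pred + q
+
+rng = np.random.default_rng(0)
+cal_true = rng.normal(size=200)
+cal_pred = cal_true + rng.normal(0.0, 0.3, size=200)
+lo, hi = split_conformal_interval(cal_pred, cal_true,
+                                  test_pred=np.array([0.1, 1.2]))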
+
+
+
+
+
+
+
+ ☆ Quantitative Evaluation of Motif Sets in Time Series
+
+
+
+
+
+
+
+
+ Daan Van Wesenbeeck, Aras Yurtman, Wannes Meert, Hendrik Blockeel
+
+
+ Time Series Motif Discovery (TSMD), which aims at finding recurring patterns
+in time series, is an important task in numerous application domains, and many
+methods for this task exist. These methods are usually evaluated qualitatively.
+A few metrics for quantitative evaluation, where discovered motifs are compared
+to some ground truth, have been proposed, but they typically make implicit
+assumptions that limit their applicability. This paper introduces PROM, a
+broadly applicable metric that overcomes those limitations, and TSMD-Bench, a
+benchmark for quantitative evaluation of time series motif discovery.
+Experiments with PROM and TSMD-Bench show that PROM provides a more
+comprehensive evaluation than existing metrics, that TSMD-Bench is a more
+challenging benchmark than earlier ones, and that the combination can help
+understand the relative performance of TSMD methods. More generally, the
+proposed approach enables large-scale, systematic performance comparisons in
+this field.
+
+
+
+
+
+
+
+ ☆ Diffusion Predictive Control with Constraints
+
+
+
+
+
+
+
+
+ Ralf Römer, Alexander von Rohr, Angela P. Schoellig
+
+
+ Diffusion models have recently gained popularity for policy learning in
+robotics due to their ability to capture high-dimensional and multimodal
+distributions. However, diffusion policies are inherently stochastic and
+typically trained offline, limiting their ability to handle unseen and dynamic
+conditions where novel constraints not represented in the training data must be
+satisfied. To overcome this limitation, we propose diffusion predictive control
+with constraints (DPCC), an algorithm for diffusion-based control with explicit
+state and action constraints that can deviate from those in the training data.
+DPCC uses constraint tightening and incorporates model-based projections into
+the denoising process of a trained trajectory diffusion model. This allows us
+to generate constraint-satisfying, dynamically feasible, and goal-reaching
+trajectories for predictive control. We show through simulations of a robot
+manipulator that DPCC outperforms existing methods in satisfying novel
+test-time constraints while maintaining performance on the learned control
+task.
+
+
+ Time series forecasting (TSF) is essential in various domains, and recent
+advancements in diffusion-based TSF models have shown considerable promise.
+However, these models typically adopt traditional diffusion patterns, treating
+TSF as a noise-based conditional generation task. This approach neglects the
+inherent continuous sequential nature of time series, leading to a fundamental
+misalignment between diffusion mechanisms and the TSF objective, thereby
+severely impairing performance. To bridge this misalignment, and inspired by
+the classic Auto-Regressive Moving Average (ARMA) theory, which views time
+series as continuous sequential progressions evolving from previous data
+points, we propose a novel Auto-Regressive Moving Diffusion (ARMD) model, the
+first to achieve continuous sequential diffusion-based TSF. Unlike previous
+methods that start from white Gaussian noise, our model employs chain-based
+diffusion with priors, accurately modeling the evolution of time series and
+leveraging intermediate state information to improve forecasting accuracy and
+stability. Specifically, our approach reinterprets the diffusion process by
+considering future series as the initial state and historical series as the
+final state, with intermediate series generated using a sliding-based technique
+during the forward process. This design aligns the diffusion model's sampling
+procedure with the forecasting objective, resulting in an unconditional,
+continuous sequential diffusion TSF model. Extensive experiments conducted on
+seven widely used datasets demonstrate that our model achieves state-of-the-art
+performance, significantly outperforming existing diffusion-based TSF models.
+Our code is available on GitHub: https://github.com/daxin007/ARMD.
+
+
+
+ comment: no comment
+
+
+
+
+
+
+ ☆ Dynamic Prompt Allocation and Tuning for Continual Test-Time Adaptation
+
+
+ Continual test-time adaptation (CTTA) has recently emerged to adapt a
+pre-trained source model to continuously evolving target distributions, which
+accommodates the dynamic nature of real-world environments. To mitigate the
+risk of catastrophic forgetting in CTTA, existing methods typically incorporate
+explicit regularization terms to constrain the variation of model parameters.
+However, they cannot fundamentally resolve catastrophic forgetting because they
+rely on a single shared model to adapt across all target domains, which
+inevitably leads to severe inter-domain interference. In this paper, we
+introduce learnable domain-specific prompts that guide the model to adapt to
+corresponding target domains, thereby partially disentangling the parameter
+space of different domains. In the absence of domain identity for target
+samples, we propose a novel dynamic Prompt AllocatIon aNd Tuning (PAINT)
+method, which utilizes a query mechanism to dynamically determine whether the
+current samples come from a known domain or an unexplored one. For known
+domains, the corresponding domain-specific prompt is directly selected, while
+for previously unseen domains, a new prompt is allocated. Prompt tuning is
+subsequently performed using mutual information maximization along with
+structural regularization. Extensive experiments on three benchmark datasets
+demonstrate the effectiveness of our PAINT method for CTTA. We have released
+our code at https://github.com/Cadezzyr/PAINT.
+
+
+
+ comment: 21 pages, 5 figures, and 6 tables
+
+
+
+
+
+
+ ☆ Transfer Learning of RSSI to Improve Indoor Localisation Performance
+
+
+
+
+
+
+
+
+ Thanaphon Suwannaphong, Ryan McConville, Ian Craddock
+
+
+ With the growing demand for health monitoring systems, in-home localisation
+is essential for tracking patient conditions. The unique spatial
+characteristics of each house require annotated data for a Bluetooth Low Energy
+(BLE) Received Signal Strength Indicator (RSSI)-based monitoring system.
+However, collecting annotated training data is time-consuming, particularly for
+patients with limited health conditions. To address this, we propose
+Conditional Generative Adversarial Networks (ConGAN)-based augmentation,
+combined with our transfer learning framework (T-ConGAN), to enable the
+transfer of generic RSSI information between different homes, even when data is
+collected using different experimental protocols. This enhances the performance
+and scalability of such intelligent systems by reducing the need for annotation
+in each home. We are the first to demonstrate that BLE RSSI data can be shared
+across different homes, and that shared information can improve the indoor
+localisation performance. Our T-ConGAN enhances the macro F1 score of
+room-level indoor localisation by up to 12.2%, with a remarkable 51%
+improvement in challenging areas such as stairways or outside spaces. This
+state-of-the-art RSSI augmentation model significantly enhances the robustness
+of in-home health monitoring systems.
+
+
+
+
+
+
+
+ ☆ Optimising TinyML with Quantization and Distillation of Transformer and
+ Mamba Models for Indoor Localisation on Edge Devices
+
+
+
+
+
+
+
+
+ Thanaphon Suwannaphong, Ferdian Jovan, Ian Craddock, Ryan McConville
+
+
+ This paper proposes small and efficient machine learning models (TinyML) for
+resource-constrained edge devices, specifically for on-device indoor
+localisation. Typical approaches for indoor localisation rely on centralised
+remote processing of data transmitted from lower powered devices such as
+wearables. However, there are several benefits to moving this to the edge
+device itself, including increased battery life, enhanced privacy, reduced
+latency and lowered operational costs, all of which are key for common
+applications such as health monitoring. The work focuses on model compression
+techniques, including quantization and knowledge distillation, to significantly
+reduce the model size while maintaining high predictive performance. We base
+our work on a large state-of-the-art transformer-based model and seek to deploy
+it within low-power MCUs. We also propose a state-space-based architecture
+using Mamba as a more compact alternative to the transformer. Our results show
+that the quantized transformer model performs well within a 64 KB RAM
+constraint, achieving an effective balance between model size and localisation
+precision. Additionally, the compact Mamba model has strong performance under
+even tighter constraints, such as 32 KB of RAM, without the need for model
+compression, making it a viable option for more resource-limited environments.
+We demonstrate that, through our framework, it is feasible to deploy advanced
+indoor localisation models onto low-power MCUs with restricted memory
+limitations. The application of these TinyML models in healthcare has the
+potential to revolutionize patient monitoring by providing accurate, real-time
+location data while minimizing power consumption, increasing data privacy,
+improving latency and reducing infrastructure costs.
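+
+ The post-training quantization step underlying such TinyML deployments is, in
+its simplest form, an affine mapping from float32 weights to int8. The sketch
+below is a framework-agnostic illustration, not the authors' exact compression
+pipeline:
+
+```python
+import numpy as np
+
+def quantize_int8(w):
+    """Symmetric per-tensor int8 quantization of a float32 weight array."""
+    scale = float(np.max(np.abs(w))) / 127.0 or 1.0       # avoid a zero scale
+    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
+    return q, scale
+
+def dequantize(q, scale):
+    return q.astype(np.float32) * scale
+
+w = np.random.randn(64, 32).astype(np.float32)            # a toy layer weight
+q, scale = quantize_int8(w)
+print("max abs error:", np.max(np.abs(w - dequantize(q, scale))))  # ~scale / 2
+```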
+
+
+ Current robot learning algorithms for acquiring novel skills often rely on
+demonstration datasets or environment interactions, resulting in high labor
+costs and potential safety risks. To address these challenges, this study
+proposes a skill-learning framework that enables robots to acquire novel skills
+from natural language instructions. The proposed pipeline leverages
+vision-language models to generate demonstration videos of novel skills, which
+are processed by an inverse dynamics model to extract actions from the
+unlabeled demonstrations. These actions are subsequently mapped to
+environmental contexts via imitation learning, enabling robots to learn new
+skills effectively. Experimental evaluations in the MetaWorld simulation
+environments demonstrate the pipeline's capability to generate high-fidelity
+and reliable demonstrations. Using the generated demonstrations, various skill
+learning algorithms achieve an accomplishment rate three times that of the original on
+novel tasks. These results highlight a novel approach to robot learning,
+offering a foundation for the intuitive and intelligent acquisition of novel
+robotic skills.
+
+
+
+
+
+
+
+ ☆ CRVQ: Channel-relaxed Vector Quantization for Extreme Compression of
+ LLMs
+
+
+
+
+
+
+
+
+ Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, Wanxiang Che
+
+
+ Powerful large language models (LLMs) are increasingly expected to be
+deployed with lower computational costs, enabling their capabilities on
+resource-constrained devices. Post-training quantization (PTQ) has emerged as a
+leading approach to achieving this goal, with the best methods compressing weights
+to less than 2 bits on average. In this paper, we propose Channel-Relaxed Vector
+Quantization (CRVQ), a novel technique that significantly improves the
+performance of PTQ baselines at the cost of only minimal additional bits. This
+state-of-the-art extreme compression method achieves its results through two
+key innovations: (1) carefully selecting and reordering a very small subset of
+critical weight channels, and (2) leveraging multiple codebooks to relax the
+constraint of critical channels. With our method, we demonstrate a 38.9%
+improvement over the current strongest sub-2-bit PTQ baseline, enabling
+near-lossless 1-bit compression. Furthermore, our approach offers flexible
+customization of quantization bit-width and performance, providing a wider
+range of deployment options for diverse hardware platforms.
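+
+ Vector quantization of weight groups against a learned codebook, the family of
+methods CRVQ builds on, can be illustrated with a plain k-means codebook. The
+sketch below omits CRVQ's channel selection, reordering, and multi-codebook
+relaxation and is only a generic baseline:
+
+```python
+import numpy as np
+from sklearn.cluster import KMeans
+
+def vq_compress(weights, group_size=8, codebook_bits=8):
+    """Quantize groups of `group_size` weights to indices into a learned codebook."""
+    groups = weights.reshape(-1, group_size)
+    km = KMeans(n_clusters=2 ** codebook_bits, random_state=0).fit(groups)
+    return km.cluster_centers_.astype(np.float32), km.labels_.astype(np.uint8)
+
+def vq_decompress(codebook, indices, shape):
+    return codebook[indices].reshape(shape)
+
+w = np.random.randn(256, 64).astype(np.float32)
+codebook, idx = vq_compress(w)                 # 256 x 8 codebook + 1 byte per group
+w_hat = vq_decompress(codebook, idx, w.shape)
+print("reconstruction MSE:", float(np.mean((w - w_hat) ** 2)))
+```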
+
+
+
+ comment: 5 figures, 4 tables
+
+
+
+
+
+
+ ☆ Score and Distribution Matching Policy: Advanced Accelerated Visuomotor
+ Policies via Matched Distillation
+
+
+ Visual-motor policy learning has advanced with architectures like
+diffusion-based policies, known for modeling complex robotic trajectories.
+However, their prolonged inference times hinder high-frequency control tasks
+requiring real-time feedback. While consistency distillation (CD) accelerates
+inference, it introduces errors that compromise action quality. To address
+these limitations, we propose the Score and Distribution Matching Policy (SDM
+Policy), which transforms diffusion-based policies into single-step generators
+through a two-stage optimization process: score matching ensures alignment with
+true action distributions, and distribution matching minimizes KL divergence
+for consistency. A dual-teacher mechanism integrates a frozen teacher for
+stability and an unfrozen teacher for adversarial training, enhancing
+robustness and alignment with target distributions. Evaluated on a 57-task
+simulation benchmark, SDM Policy achieves a 6x inference speedup while maintaining
+state-of-the-art action quality, providing an efficient and reliable framework
+for high-frequency robotic tasks.
+
+
+
+
+
+
+
+
+ Qingqiang Sun, Chaoqi Chen, Ziyue Qiao, Xubin Zheng, Kai Wang
+
+
+ Most graph contrastive learning (GCL) methods heavily rely on cross-view
+contrast, thus facing several concomitant challenges, such as the complexity of
+designing effective augmentations, the potential for information loss between
+views, and increased computational costs. To mitigate reliance on cross-view
+contrasts, we propose SIGNA, a novel single-view graph contrastive
+learning framework. Regarding the inconsistency between structural connection
+and semantic similarity of neighborhoods, we resort to soft neighborhood
+awareness for GCL. Specifically, we leverage dropout to obtain
+structurally-related yet randomly-noised embedding pairs for neighbors, which
+serve as potential positive samples. At each epoch, the role of partial
+neighbors is switched from positive to negative, leading to a probabilistic
+neighborhood contrastive learning effect. Furthermore, we propose a normalized
+Jensen-Shannon divergence estimator for a better effect of contrastive
+learning. Surprisingly, experiments on diverse node-level tasks demonstrate
+that our simple single-view GCL framework consistently outperforms existing
+methods by margins of up to 21.74% (PPI). In particular, with soft neighborhood
+awareness, SIGNA can adopt MLPs instead of complicated GCNs as the encoder to
+generate representations in transductive learning tasks, thus speeding up its
+inference process by 109 times to 331 times. The source code is available at
+https://github.com/sunisfighting/SIGNA.
+
+
+
+ comment: Accepted by AAAI2025; full version including appendix
+
+
+
+
+
+
+ ☆ When Can Memorization Improve Fairness?
+
+
+
+
+
+
+
+
+ Bob Pepin, Christian Igel, Raghavendra Selvan
+
+
+ We study to which extent additive fairness metrics (statistical parity, equal
+opportunity and equalized odds) can be influenced in a multi-class
+classification problem by memorizing a subset of the population. We give
+explicit expressions for the bias resulting from memorization in terms of the
+label and group membership distribution of the memorized dataset and the
+classifier bias on the unmemorized dataset. We also characterize the memorized
+datasets that eliminate the bias for all three metrics considered. Finally we
+provide upper and lower bounds on the total probability mass in the memorized
+dataset that is necessary for the complete elimination of these biases.
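+
+ The additive fairness metrics analyzed above have simple empirical estimators;
+for the binary-group, binary-label case they reduce to differences of
+group-wise rates, as in the generic sketch below (illustrative synthetic data,
+not the paper's multi-class setting):
+
+```python
+import numpy as np
+
+def statistical_parity_gap(y_pred, group):
+    """Difference in positive-prediction rates between group 1 and group 0."""
+    return y_pred[group == 1].mean() - y_pred[group == 0].mean()
+
+def equal_opportunity_gap(y_pred, y_true, group):
+    """Difference in true-positive rates between group 1 and group 0."""
+    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
+    return tpr(1) - tpr(0)
+
+rng = np.random.default_rng(1)
+y_true = rng.integers(0, 2, 1000)
+group = rng.integers(0, 2, 1000)
+y_pred = (rng.random(1000) < 0.4 + 0.2 * group).astype(int)   # biased toward group 1
+print(statistical_parity_gap(y_pred, group), equal_opportunity_gap(y_pred, y_true, group))
+```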
+
+
+ Fine-tuning large language models (LLMs) is computationally intensive because
+it requires updating all parameters. Low-Rank Adaptation (LoRA) improves
+efficiency by modifying only a subset of weights but introduces a trade-off
+between expressivity and computational cost: lower ranks reduce resources but
+limit expressiveness, while higher ranks enhance expressivity at increased
+cost. Despite recent advances in adaptive LoRA techniques, existing methods
+fail to provide a theoretical basis for optimizing the trade-off between model
+performance and efficiency. We propose Geometric Low-Rank Adaptation (GeLoRA),
+a novel framework that computes the intrinsic dimensionality of hidden state
+representations to adaptively select LoRA ranks. We demonstrate that the
+intrinsic dimension provides a lower bound for the optimal rank of LoRA
+matrices, allowing for a principled selection that balances efficiency and
+expressivity. GeLoRA dynamically adjusts the rank for each layer based on the
+intrinsic dimensionality of its input and output representations, recognizing
+that not all model parameters equally impact fine-tuning. Empirical validation
+on multiple tasks shows that GeLoRA consistently outperforms recent baselines
+within the same parameter budget.
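+
+ The intrinsic dimensionality used above to lower-bound LoRA ranks can be
+estimated in several ways; one common choice is the TwoNN estimator, sketched
+below on toy activations (GeLoRA's exact estimator and rank rule may differ):
+
+```python
+import numpy as np
+from scipy.spatial.distance import cdist
+
+def twonn_intrinsic_dim(x):
+    """TwoNN estimator: MLE from ratios of 2nd to 1st nearest-neighbour distances."""
+    d = cdist(x, x)
+    np.fill_diagonal(d, np.inf)
+    two_nearest = np.sort(d, axis=1)[:, :2]
+    mu = two_nearest[:, 1] / two_nearest[:, 0]
+    return x.shape[0] / np.sum(np.log(mu))
+
+rng = np.random.default_rng(0)
+latent = rng.normal(size=(500, 10))                     # data on a ~10-d manifold
+hidden = np.tanh(latent @ rng.normal(size=(10, 768)))   # embedded in 768-d activations
+rank_lower_bound = int(np.ceil(twonn_intrinsic_dim(hidden)))
+print(rank_lower_bound)                                 # roughly 10
+```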
+
+
+
+
+
+
+
+ ☆ Uplift modeling with continuous treatments: A predict-then-optimize
+ approach
+
+
+
+
+
+
+
+
+ Simon De Vos, Christopher Bockel-Rickermann, Stefan Lessmann, Wouter Verbeke
+
+
+ The goal of uplift modeling is to recommend actions that optimize specific
+outcomes by determining which entities should receive treatment. One common
+approach involves two steps: first, an inference step that estimates
+conditional average treatment effects (CATEs), and second, an optimization step
+that ranks entities based on their CATE values and assigns treatment to the top
+k within a given budget. While uplift modeling typically focuses on binary
+treatments, many real-world applications are characterized by continuous-valued
+treatments, i.e., a treatment dose. This paper presents a predict-then-optimize
+framework to allow for continuous treatments in uplift modeling. First, in the
+inference step, conditional average dose responses (CADRs) are estimated from
+data using causal machine learning techniques. Second, in the optimization
+step, we frame the assignment task of continuous treatments as a
+dose-allocation problem and solve it using integer linear programming (ILP).
+This approach allows decision-makers to efficiently and effectively allocate
+treatment doses while balancing resource availability, with the possibility of
+adding extra constraints like fairness considerations or adapting the objective
+function to take into account instance-dependent costs and benefits to maximize
+utility. The experiments compare several CADR estimators and illustrate the
+trade-offs between policy value and fairness, as well as the impact of an
+adapted objective function. This showcases the framework's advantages and
+flexibility across diverse applications in healthcare, lending, and human
+resource management. All code is available on github.com/SimonDeVos/UMCT.
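+
+ The optimization step described above, assigning one dose level per entity
+under a budget, can be written as a small allocation program. The sketch below
+uses a linear-programming relaxation with synthetic CADR values and costs,
+rather than the authors' exact ILP formulation:
+
+```python
+import numpy as np
+from scipy.optimize import linprog
+
+rng = np.random.default_rng(0)
+n_entities, n_doses = 50, 4
+cadr = rng.random((n_entities, n_doses))       # estimated dose responses (step 1)
+cost = np.array([0.0, 1.0, 2.0, 3.0])          # cost of each dose level
+budget = 60.0
+
+# x[i, j] in [0, 1]: (fractional) assignment of dose level j to entity i.
+c = -cadr.ravel()                                           # linprog minimizes
+A_eq = np.kron(np.eye(n_entities), np.ones((1, n_doses)))   # exactly one level each
+b_eq = np.ones(n_entities)
+A_ub = np.tile(cost, n_entities)[None, :]                   # stay within the budget
+b_ub = np.array([budget])
+res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
+allocation = res.x.reshape(n_entities, n_doses).argmax(axis=1)   # round to a dose
+```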
+
+
+
+
+
+
+
+ ☆ On the Generation and Removal of Speaker Adversarial Perturbation for
+ Voice-Privacy Protection
+
+
+ Neural networks are commonly known to be vulnerable to adversarial attacks
+mounted through subtle perturbation on the input data. Recent development in
+voice-privacy protection has shown the positive use cases of the same technique
+to conceal speaker's voice attribute with additive perturbation signal
+generated by an adversarial network. This paper examines the reversibility
+property where an entity generating the adversarial perturbations is authorized
+to remove them and restore the original speech (e.g., the speaker him/herself). A
+similar technique could also be used by an investigator to deanonymize a
+voice-protected speech to restore criminals' identities in security and
+forensic analysis. In this setting, the perturbation generative module is
+assumed to be known in the removal process. To this end, a joint training of
+perturbation generation and removal modules is proposed. Experimental results
+on the LibriSpeech dataset demonstrated that the subtle perturbations added to
+the original speech can be predicted from the anonymized speech while achieving
+the goal of privacy protection. By removing these perturbations from the
+anonymized sample, the original speech can be restored. Audio samples can be
+found in \url{https://voiceprivacy.github.io/Perturbation-Generation-Removal/}.
+
+
+
+ comment: 6 pages, 3 figures, published to IEEE SLT Workshop 2024
+
+
+
+
+
+
+ ☆ Dimensionality Reduction Techniques for Global Bayesian Optimisation NeurIPS 2024
+
+
+ Bayesian Optimisation (BO) is a state-of-the-art global optimisation
+technique for black-box problems where derivative information is unavailable,
+and sample efficiency is crucial. However, improving the general scalability of
+BO has proved challenging. Here, we explore Latent Space Bayesian Optimisation
+(LSBO), that applies dimensionality reduction to perform BO in a
+reduced-dimensional subspace. While early LSBO methods used (linear) random
+projections (Wang et al., 2013), we employ Variational Autoencoders (VAEs) to
+manage more complex data structures and general DR tasks. Building on Grosnit
+et al. (2021), we analyse the VAE-based LSBO framework, focusing on VAE
+retraining and deep metric loss. We suggest a few key corrections in their
+implementation, originally designed for tasks such as molecule generation, and
+reformulate the algorithm for broader optimisation purposes. Our numerical
+results show that structured latent manifolds improve BO performance.
+Additionally, we examine the use of the Matérn-5/2 kernel for
+Gaussian Processes in this LSBO context. We also integrate Sequential Domain
+Reduction (SDR), a standard global optimization efficiency strategy, into BO.
+SDR is included in a GPU-based environment using BoTorch, both in the
+original and VAE-generated latent spaces, marking the first application of SDR
+within LSBO.
+
+
+
+ comment: Accepted at NeurIPS 2024 Workshop OPT for ML: Optimization for
+ Machine Learning (Submission Number:67)
+
+ As data-privacy requirements are becoming increasingly stringent and
+statistical models based on sensitive data are being deployed and used more
+routinely, protecting data-privacy becomes pivotal. Partial Least Squares (PLS)
+regression is the premier tool for building such models in analytical
+chemistry, yet it does not inherently provide privacy guarantees, leaving
+sensitive (training) data vulnerable to privacy attacks. To address this gap,
+we propose an $(\epsilon, \delta)$-differentially private PLS (edPLS)
+algorithm, which integrates well-studied and theoretically motivated Gaussian
+noise-adding mechanisms into the PLS algorithm to ensure the privacy of the
+data underlying the model. Our approach involves adding carefully calibrated
+Gaussian noise to the outputs of four key functions in the PLS algorithm: the
+weights, scores, $X$-loadings, and $Y$-loadings. The noise variance is
+determined based on the global sensitivity of each function, ensuring that the
+privacy loss is controlled according to the $(\epsilon, \delta)$-differential
+privacy framework. Specifically, we derive the sensitivity bounds for each
+function and use these bounds to calibrate the noise added to the model
+components. Experimental results demonstrate that edPLS effectively neutralizes
+privacy attacks aimed at recovering unique sources of variability in the
+training data. Application of edPLS to the NIR corn benchmark
+dataset shows that the root mean squared error of prediction (RMSEP) remains
+competitive even at strong privacy levels (i.e., $\epsilon=1$), given proper
+pre-processing of the corresponding spectra. These findings highlight the
+practical utility of edPLS in creating privacy-preserving multivariate
+calibrations and for the analysis of their privacy-utility trade-offs.
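+
+ The noise calibration above rests on the standard Gaussian mechanism: for a
+function with $L_2$ sensitivity $\Delta$, adding noise with
+$\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\epsilon$ yields $(\epsilon,
+\delta)$-differential privacy. A minimal sketch of that mechanism (not the full
+edPLS algorithm, which perturbs weights, scores, and both loading sets):
+
+```python
+import numpy as np
+
+def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
+    """Release `value` with (epsilon, delta)-DP using calibrated Gaussian noise."""
+    rng = np.random.default_rng() if rng is None else rng
+    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
+    return value + rng.normal(scale=sigma, size=np.shape(value))
+
+weights = np.random.randn(200)                       # e.g., one PLS weight vector
+private_weights = gaussian_mechanism(weights, l2_sensitivity=0.5,
+                                     epsilon=1.0, delta=1e-5)
+```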
+
+
+
+
+
+
+
+
+ Svetlana Pavlitska, Leopold Müller, J. Marius Zöllner
+
+
+ Adversarial attacks on traffic sign classification models were among the
+first successfully tried in the real world. Since then, the research in this
+area has been mainly restricted to repeating baseline models, such as LISA-CNN
+or GTSRB-CNN, and similar experiment settings, including white and black
+patches on traffic signs. In this work, we decouple model architectures from
+the datasets and evaluate on further generic models to make a fair comparison.
+Furthermore, we compare two attack settings, inconspicuous and visible, which
+are usually studied separately, without direct comparison. Our results show that standard
+baselines like LISA-CNN or GTSRB-CNN are significantly more susceptible than
+the generic ones. We, therefore, suggest evaluating new attacks on a broader
+spectrum of baselines in the future. Our code is available at
+\url{https://github.com/KASTEL-MobilityLab/attacks-on-traffic-sign-recognition/}.
+
+
+
+ comment: Accepted for publication at ICMLA 2024
+
+ Imitation learning with a privileged teacher has proven effective for
+learning complex control behaviors from high-dimensional inputs, such as
+images. In this framework, a teacher is trained with privileged task
+information, while a student tries to predict the actions of the teacher with
+more limited observations, e.g., in a robot navigation task, the teacher might
+have access to distances to nearby obstacles, while the student only receives
+visual observations of the scene. However, privileged imitation learning faces
+a key challenge: the student might be unable to imitate the teacher's behavior
+due to partial observability. This problem arises because the teacher is
+trained without considering if the student is capable of imitating the learned
+behavior. To address this teacher-student asymmetry, we propose a framework for
+joint training of the teacher and student policies, encouraging the teacher to
+learn behaviors that can be imitated by the student despite the latter's
+limited access to information and its partial observability. Based on the
+performance bound in imitation learning, we add (i) the approximated action
+difference between teacher and student as a penalty term to the reward function
+of the teacher, and (ii) a supervised teacher-student alignment step. We
+motivate our method with a maze navigation task and demonstrate its
+effectiveness on complex vision-based quadrotor flight and manipulation tasks.
+
+
+
+
+
+
+
+ ☆ A Brief Discussion on KPI Development in Public Administration
+
+
+ Efficient and effective service delivery in Public Administration (PA) relies
+on the development and utilization of key performance indicators (KPIs) for
+evaluating and measuring performance. This paper presents an innovative
+framework for KPI construction within performance evaluation systems,
+leveraging Random Forest algorithms and variable importance analysis. The
+proposed approach identifies key variables that significantly influence PA
+performance, offering valuable insights into the critical factors driving
+organizational success. By integrating variable importance analysis with expert
+consultation, relevant KPIs can be systematically developed, ensuring that
+improvement strategies address performance-critical areas. The framework
+incorporates continuous monitoring mechanisms and adaptive phases to refine
+KPIs in response to evolving administrative needs. This study aims to enhance
+PA performance through the application of machine learning techniques,
+fostering a more agile and results-driven approach to public administration.
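+
+ The variable-importance step of such a framework maps directly onto standard
+tooling; a minimal sketch with synthetic performance data and illustrative
+column names (not the study's actual indicators):
+
+```python
+import numpy as np
+import pandas as pd
+from sklearn.ensemble import RandomForestRegressor
+
+rng = np.random.default_rng(0)
+X = pd.DataFrame({
+    "processing_time": rng.random(300),
+    "staff_count": rng.random(300),
+    "digital_requests_share": rng.random(300),
+})
+y = 2 * X["digital_requests_share"] - X["processing_time"] + 0.1 * rng.random(300)
+
+rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
+importance = pd.Series(rf.feature_importances_, index=X.columns)
+print(importance.sort_values(ascending=False))   # candidate KPI drivers for expert review
+```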
+
+
+
+
+
+
+
+ ☆ Enhancing Modality Representation and Alignment for Multimodal
+ Cold-start Active Learning
+
+
+
+
+
+
+
+
+ Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See
+
+
+ Training multimodal models requires a large amount of labeled data. Active
+learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start
+approaches, which rely on sufficient labeled data to train a well-calibrated
+model that can assess the uncertainty and diversity of unlabeled data. However,
+when assembling a dataset, labeled data are often scarce initially, leading to
+a cold-start problem. Additionally, most AL methods seldom address multimodal
+data, highlighting a research gap in this field. Our research addresses these
+issues by developing a two-stage method for Multi-Modal Cold-Start Active
+Learning (MMCSAL).
+ Firstly, we observe the modality gap, a significant distance between the
+centroids of representations from different modalities, when only using
+cross-modal pairing information as self-supervision signals. This modality gap
+affects the data selection process, as we calculate both uni-modal and cross-modal
+distances. To address this, we introduce uni-modal prototypes to bridge the
+modality gap. Secondly, conventional AL methods often falter in multimodal
+scenarios where alignment between modalities is overlooked. Therefore, we
+propose enhancing cross-modal alignment through regularization, thereby
+improving the quality of selected multimodal data pairs in AL. Finally, our
+experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs
+across three multimodal datasets.
+
+
+
+ comment: 11 pages, ACMMM Asia 2024, Oral Presentation
+
+
+
+
+
+
+ ☆ MMD-OPT : Maximum Mean Discrepancy Based Sample Efficient Collision Risk
+ Minimization for Autonomous Driving
+
+
+ We propose MMD-OPT: a sample-efficient approach for minimizing the risk of
+collision under arbitrary prediction distribution of the dynamic obstacles.
+MMD-OPT is based on embedding distribution in Reproducing Kernel Hilbert Space
+(RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how these two
+concepts can be used to define a sample efficient surrogate for collision risk
+estimate. We perform extensive simulations to validate the effectiveness of
+MMD-OPT on both synthetic and real-world datasets. Importantly, we show that
+trajectory optimization with our MMD-based collision risk surrogate leads to
+safer trajectories at low sample regimes than popular alternatives based on
+Conditional Value at Risk (CVaR).
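+
+ The Maximum Mean Discrepancy at the core of the surrogate above has a simple
+empirical estimator under an RBF kernel; the generic sketch below computes it
+for two point clouds and is not the MMD-OPT planner itself:
+
+```python
+import numpy as np
+
+def rbf_mmd2(x, y, sigma=1.0):
+    """Biased empirical MMD^2 between samples x and y under an RBF kernel."""
+    def k(a, b):
+        d2 = np.sum(a ** 2, 1)[:, None] + np.sum(b ** 2, 1)[None, :] - 2.0 * a @ b.T
+        return np.exp(-d2 / (2.0 * sigma ** 2))
+    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()
+
+obstacle_samples = np.random.randn(200, 2)          # predicted obstacle positions
+trajectory_samples = np.random.randn(200, 2) + 3.0  # candidate ego-trajectory points
+print(rbf_mmd2(obstacle_samples, trajectory_samples))  # larger => distributions differ more
+```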
+
+
+
+
+
+
+
+ ☆ The Utility and Complexity of In- and Out-of-Distribution Machine
+ Unlearning
+
+
+ Machine unlearning, the process of selectively removing data from trained
+models, is increasingly crucial for addressing privacy concerns and knowledge
+gaps post-deployment. Despite this importance, existing approaches are often
+heuristic and lack formal guarantees. In this paper, we analyze the fundamental
+utility, time, and space complexity trade-offs of approximate unlearning,
+providing rigorous certification analogous to differential privacy. For
+in-distribution forget data -- data similar to the retain set -- we show that a
+surprisingly simple and general procedure, empirical risk minimization with
+output perturbation, achieves tight unlearning-utility-complexity trade-offs,
+addressing a previous theoretical gap on the separation from unlearning "for
+free" via differential privacy, which inherently facilitates the removal of
+such data. However, such techniques fail with out-of-distribution forget data
+-- data significantly different from the retain set -- where unlearning time
+complexity can exceed that of retraining, even for a single sample. To address
+this, we propose a new robust and noisy gradient descent variant that provably
+amortizes unlearning time complexity without compromising utility.
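+
+ The "empirical risk minimization with output perturbation" baseline above can
+be illustrated compactly: fit the model on the retain set and release noisy
+parameters. The sketch below uses ridge regression and an illustrative noise
+scale rather than the paper's calibrated one:
+
+```python
+import numpy as np
+from sklearn.linear_model import Ridge
+
+def unlearn_by_output_perturbation(X_retain, y_retain, noise_scale, seed=0):
+    """Fit the ERM on the retain set and release Gaussian-perturbed parameters."""
+    rng = np.random.default_rng(seed)
+    model = Ridge(alpha=1.0).fit(X_retain, y_retain)
+    model.coef_ = model.coef_ + rng.normal(scale=noise_scale, size=model.coef_.shape)
+    return model
+
+rng = np.random.default_rng(0)
+X = rng.normal(size=(1000, 20))
+y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)
+keep = np.ones(1000, dtype=bool)
+keep[:50] = False                                   # the forget set: first 50 rows
+released = unlearn_by_output_perturbation(X[keep], y[keep], noise_scale=0.05)
+```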
+
+
+
+
+
+
+
+ ☆ An Algorithm-Centered Approach To Model Streaming Data
+
+
+
+
+
+
+
+
+ Fabian Hinder, Valerie Vaquet, David Komnick, Barbara Hammer
+
+
+ Besides the classical offline setup of machine learning, stream learning
+constitutes a well-established setup where data arrives over time in
+potentially non-stationary environments. Concept drift, the phenomenon that the
+underlying distribution changes over time, poses a significant challenge. Yet,
+despite high practical relevance, there is little to no foundational theory for
+learning in the drifting setup comparable to classical statistical learning
+theory in the offline setting. This can be attributed to the lack of an
+underlying object comparable to a probability distribution as in the classical
+setup. While there exist approaches to transfer ideas to the streaming setup,
+these start from a data perspective rather than an algorithmic one. In this
+work, we suggest a new model of data over time that is aimed at the algorithm's
+perspective. Instead of defining the setup using time points, we utilize a
+window-based approach that resembles the inner workings of most stream learning
+algorithms. We compare our framework to others from the literature on a
+theoretical basis, showing that in many cases both model the same situation.
+Furthermore, we perform a numerical evaluation and showcase an application in
+the domain of critical infrastructure.
+
+
+
+ comment: This manuscript is currently under review at the Symposium on
+ Intelligent Data Analysis (IDA 2025)
+
+
+
+
+
+
+ ☆ How to Re-enable PDE Loss for Physical Systems Modeling Under Partial
+ Observation AAAI2025
+
+
+ In science and engineering, machine learning techniques are increasingly
+successful in physical systems modeling (predicting future states of physical
+systems). Effectively integrating PDE loss as a constraint of system transition
+can improve the model's prediction by overcoming generalization issues due to
+data scarcity, especially when data acquisition is costly. However, in many
+real-world scenarios, due to sensor limitations, the data we can obtain is
+often only a partial observation, making the calculation of the PDE loss seem
+infeasible, as the PDE loss heavily relies on high-resolution states. We
+carefully study this problem and propose a novel framework named Re-enable PDE
+Loss under Partial Observation (RPLPO). The key idea is that although using the
+PDE loss alone to constrain the system transition is infeasible, we can re-enable
+the PDE loss by reconstructing a learnable high-resolution state and constraining
+the system transition simultaneously. Specifically, RPLPO combines an encoding
+module for reconstructing learnable high-resolution states with a transition
+module for predicting future states. The two modules are jointly trained by
+data and PDE loss. We conduct experiments in various physical systems to
+demonstrate that RPLPO has significant improvement in generalization, even when
+observation is sparse, irregular, noisy, and PDE is inaccurate. The code is
+available on GitHub: RPLPO.
+
+
+
+
+
+
+
+
+ Yudi Xie, Weichen Huang, Esther Alter, Jeremy Schwartz, Joshua B. Tenenbaum, James J. DiCarlo
+
+
+ Studies of the functional role of the primate ventral visual stream have
+traditionally focused on object categorization, often ignoring -- despite much
+prior evidence -- its role in estimating "spatial" latents such as object
+position and pose. Most leading ventral stream models are derived by optimizing
+networks for object categorization, which seems to imply that the ventral
+stream is also derived under such an objective. Here, we explore an alternative
+hypothesis: Might the ventral stream be optimized for estimating spatial
+latents? And a closely related question: How different -- if at all -- are
+representations learned from spatial latent estimation compared to
+categorization? To ask these questions, we leveraged synthetic image datasets
+generated by a 3D graphic engine and trained convolutional neural networks
+(CNNs) to estimate different combinations of spatial and category latents. We
+found that models trained to estimate just a few spatial latents achieve neural
+alignment scores comparable to those trained on hundreds of categories, and the
+spatial latent performance of models strongly correlates with their neural
+alignment. Spatial latent and category-trained models have very similar -- but
+not identical -- internal representations, especially in their early and middle
+layers. We provide evidence that this convergence is partly driven by
+non-target latent variability in the training data, which facilitates the
+implicit learning of representations of those non-target latents. Taken
+together, these results suggest that many training objectives, such as spatial
+latents, can lead to similar models aligned neurally with the ventral stream.
+Thus, one should not assume that the ventral stream is optimized for object
+categorization only. As a field, we need to continue to sharpen our measures of
+comparing models to brains to better understand the functional roles of the
+ventral stream.
+
+
+ Offline preference-based reinforcement learning (PbRL) typically operates in
+two phases: first, use human preferences to learn a reward model and annotate
+rewards for a reward-free offline dataset; second, learn a policy by optimizing
+the learned reward via offline RL. However, accurately modeling step-wise
+rewards from trajectory-level preference feedback presents inherent challenges.
+The reward bias introduced, particularly the overestimation of predicted
+rewards, leads to optimistic trajectory stitching, which undermines the
+pessimism mechanism critical to the offline RL phase. To address this
+challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for
+offline PbRL, which leverages conditional sequence modeling to mitigate the
+risk of learning inaccurate trajectory stitching under reward bias.
+Specifically, DTR employs Decision Transformer and TD-Learning to strike a
+balance between maintaining fidelity to the behavior policy with high
+in-dataset trajectory returns and selecting optimal actions based on high
+reward labels. Additionally, we introduce an ensemble normalization technique
+that effectively integrates multiple reward models, balancing the tradeoff
+between reward differentiation and accuracy. Empirical evaluations on various
+benchmarks demonstrate the superiority of DTR over other state-of-the-art
+baselines.
+
+
+
+ comment: 7 pages, Proceedings of the 39th AAAI Conference on Artificial
+ Intelligence (AAAI-25)
+
+
+
+
+
+
+ ☆ Filter-then-Generate: Large Language Models with Structure-Text Adapter
+ for Knowledge Graph Completion COLING 2025
+
+
+
+
+
+
+
+
+ Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng
+
+
+ Large Language Models (LLMs) present massive inherent knowledge and superior
+semantic comprehension capability, which have revolutionized various tasks in
+natural language processing. Despite their success, a critical gap remains in
+enabling LLMs to perform knowledge graph completion (KGC). Empirical evidence
+suggests that LLMs consistently perform worse than conventional KGC approaches,
+even through sophisticated prompt design or tailored instruction-tuning.
+Fundamentally, applying LLMs on KGC introduces several critical challenges,
+including a vast set of entity candidates, hallucination issue of LLMs, and
+under-exploitation of the graph structure. To address these challenges, we
+propose a novel instruction-tuning-based method, namely FtG. Specifically, we
+present a \textit{filter-then-generate} paradigm and formulate the KGC task
+into a multiple-choice question format. In this way, we can harness the
+capability of LLMs while mitigating the issue caused by hallucinations.
+Moreover, we devise a flexible ego-graph serialization prompt and employ a
+structure-text adapter to couple structure and text information in a
+contextualized manner. Experimental results demonstrate that FtG achieves
+substantial performance gain compared to existing state-of-the-art methods. The
+instruction dataset and code are available at
+\url{https://github.com/LB0828/FtG}.
+
+
+
+ comment: COLING 2025 Main Conference
+
+
+
+
+
+
+ ☆ Integrated trucks assignment and scheduling problem with mixed service
+ mode docks: A Q-learning based adaptive large neighborhood search algorithm
+
+
+
+
+
+
+
+
+ Yueyi Li, Mehrdad Mohammadi, Xiaodong Zhang, Yunxing Lan, Willem van Jaarsveld
+
+
+ Mixed service mode docks enhance efficiency by flexibly handling both loading
+and unloading trucks in warehouses. However, existing research often
+predetermines the number and location of these docks prior to planning truck
+assignment and sequencing. This paper proposes a new model integrating dock
+mode decision, truck assignment, and scheduling, thus enabling adaptive dock
+mode arrangements. Specifically, we introduce a Q-learning-based adaptive large
+neighborhood search (Q-ALNS) algorithm to address the integrated problem. The
+algorithm adjusts dock modes via perturbation operators, while truck assignment
+and scheduling are solved using destroy and repair local search operators.
+Q-learning adaptively selects these operators based on their performance
+history and future gains, employing the epsilon-greedy strategy. Extensive
+experimental results and statistical analysis indicate that the Q-ALNS benefits
+from efficient operator combinations and its adaptive mechanism, consistently
+outperforming benchmark algorithms in terms of optimality gap and Pareto front
+discovery. In comparison to the predetermined service mode, our adaptive
+strategy results in lower average tardiness and makespan, highlighting its
+superior adaptability to varying demands.
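+
+ The adaptive operator-selection loop described above can be sketched
+generically as an epsilon-greedy value update over a pool of operators. The
+code below is a stateless (bandit-style) simplification with placeholder
+operator names and rewards, not the paper's full Q-ALNS:
+
+```python
+import random
+
+def epsilon_greedy_operator_selection(operators, evaluate, episodes=1000,
+                                      alpha=0.1, epsilon=0.2):
+    """Keep a value estimate per operator and select operators epsilon-greedily.
+
+    `evaluate(op)` should apply the operator to the incumbent solution and
+    return a reward, e.g. the improvement in tardiness or makespan.
+    """
+    q = {op: 0.0 for op in operators}
+    for _ in range(episodes):
+        op = random.choice(operators) if random.random() < epsilon else max(q, key=q.get)
+        reward = evaluate(op)
+        q[op] += alpha * (reward - q[op])          # incremental value update
+    return q
+
+operators = ["switch_dock_mode", "destroy_random_trucks", "repair_greedy"]
+q_values = epsilon_greedy_operator_selection(operators, evaluate=lambda op: random.random())
+print(q_values)
+```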
+
+
+ We introduce two convolutional neural network (CNN) architectures, inspired
+by the Merriman-Bence-Osher (MBO) algorithm and by cellular automatons, to
+model and learn threshold dynamics for front evolution from video data. The
+first model, termed the (single-dynamics) MBO network, learns a specific kernel
+and threshold for each input video without adapting to new dynamics, while the
+second, a meta-learning MBO network, generalizes across diverse threshold
+dynamics by adapting its parameters per input. Both models are evaluated on
+synthetic and real-world videos (ice melting and fire front propagation), with
+performance metrics indicating effective reconstruction and extrapolation of
+evolving boundaries, even under noisy conditions. Empirical results highlight
+the robustness of both networks across varied synthetic and real-world
+dynamics.
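+
+ The classical MBO scheme that inspires these networks alternates diffusion by
+a kernel with re-thresholding of the interface indicator. A minimal
+NumPy/SciPy sketch of one hand-coded threshold-dynamics step (not the learned
+network) is:
+
+```python
+import numpy as np
+from scipy.ndimage import gaussian_filter
+
+def mbo_step(mask, sigma=2.0, threshold=0.5):
+    """One Merriman-Bence-Osher step: diffuse the indicator, then re-threshold."""
+    return (gaussian_filter(mask.astype(float), sigma=sigma) > threshold).astype(float)
+
+# A circular front shrinks under repeated steps (mean-curvature-like motion).
+yy, xx = np.mgrid[:128, :128]
+front = ((xx - 64) ** 2 + (yy - 64) ** 2 < 40 ** 2).astype(float)
+for _ in range(10):
+    front = mbo_step(front)
+print(front.sum())   # enclosed area decreases over the iterations
+```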
+
+
+ Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from seen
+source domains to unseen target domains, which is crucial for evaluating the
+generalization and robustness of models. Recent studies focus on utilizing
+visual styles to bridge the domain gap between different domains. However, the
+serious dilemma of gradient instability and local optimization problem occurs
+in those style-based CD-FSL methods. This paper addresses these issues and
+proposes a novel crop-global style perturbation method, called
+Self-Versatility Adversarial Style Perturbation (SVasP), which enhances the gradient
+stability and escapes from poor sharp minima jointly. Specifically, SVasP
+simulates more diverse potential target domain adversarial styles via
+diversifying input patterns and aggregating localized crop style gradients, to
+serve as global style perturbation stabilizers within one image, a concept we
+refer to as self-versatility. Then a novel objective function is proposed to
+maximize visual discrepancy while maintaining semantic consistency between
+global, crop, and adversarial features. Having the stabilized global style
+perturbation in the training phase, one can obtain a flattened minima in the
+loss landscape, boosting the transferability of the model to the target
+domains. Extensive experiments on multiple benchmark datasets demonstrate that
+our method significantly outperforms existing state-of-the-art methods. Our
+codes are available at https://github.com/liwenqianSEU/SVasP.
+
+
+
+
+
+
+
+ ☆ Multi-view Clustering via Unified Multi-kernel Learning and Matrix
+ Factorization
+
+
+ Multi-view clustering has become increasingly important due to the
+multi-source character of real-world data. Among existing multi-view clustering
+methods, multi-kernel clustering and matrix factorization-based multi-view
+clustering have gained widespread attention as mainstream approaches. However,
+multi-kernel clustering tends to learn an optimal kernel and then perform
+eigenvalue decomposition on it, which leads to high computational complexity.
+Matrix factorization-based multi-view clustering methods impose orthogonal
+constraints on individual views. This overly emphasizes the accuracy of
+clustering structures within single views and restricts the learning of
+individual views. Based on this analysis, we propose a multi-view clustering
+method that integrates multi-kernel learning with matrix factorization. This
+approach combines the advantages of both multi-kernel learning and matrix
+factorization. It removes the orthogonal constraints on individual views and
+imposes orthogonal constraints on the consensus matrix, resulting in an
+accurate final clustering structure. Ultimately, the method is unified into a
+simple form of multi-kernel clustering, but avoids learning an optimal kernel,
+thus reducing the time complexity. Furthermore, we propose an efficient
+three-step optimization algorithm to achieve a locally optimal solution.
+Experiments on widely-used real-world datasets demonstrate the effectiveness of
+our proposed method.
+
+
+
+
+
+
+
+ ☆ Go With the Flow: Fast Diffusion for Gaussian Mixture Models
+
+
+
+
+
+
+
+
+ George Rapakoulias, Ali Reza Pedram, Panagiotis Tsiotras
+
+
+ Schr\"{o}dinger Bridges (SB) are diffusion processes that steer, in finite
+time, a given initial distribution to another final one while minimizing a
+suitable cost functional. Although various methods for computing SBs have
+recently been proposed in the literature, most of these approaches require
+computationally expensive training schemes, even for solving low-dimensional
+problems. In this work, we propose an analytic parametrization of a set of
+feasible policies for steering the distribution of a dynamical system from one
+Gaussian Mixture Model (GMM) to another. Instead of relying on standard
+non-convex optimization techniques, the optimal policy within the set can be
+approximated as the solution of a low-dimensional linear program whose
+dimension scales linearly with the number of components in each mixture.
+Furthermore, our method generalizes naturally to more general classes of
+dynamical systems such as controllable Linear Time-Varying systems that cannot
+currently be solved using traditional neural SB approaches. We showcase the
+potential of this approach in low-to-moderate dimensional problems such as
+image-to-image translation in the latent space of an autoencoder, and various
+other examples. We also benchmark our approach on an Entropic Optimal Transport
+(EOT) problem and show that it outperforms state-of-the-art methods in cases
+where the boundary distributions are mixture models while requiring virtually
+no training.
+
+
+
+
+
+
+
+ ☆ Safe Active Learning for Gaussian Differential Equations
+
+
+
+
+
+
+
+
+ Leon Glass, Katharina Ensinger, Christoph Zimmer
+
+
+ Gaussian Process differential equations (GPODE) have recently gained momentum
+due to their ability to capture the dynamic behavior of systems and also represent
+uncertainty in predictions. Prior work has described the process of training
+the hyperparameters and, thereby, calibrating GPODE to data. How to design
+efficient algorithms to collect data for training GPODE models is still an open
+field of research. Nevertheless, high-quality training data is key for model
+performance. Furthermore, data collection incurs time and financial costs and
+might in some settings even be safety-critical to the system under test.
+Therefore, algorithms for safe and efficient data collection are central for
+building high quality GPODE models. Our novel Safe Active Learning (SAL) for
+GPODE algorithm addresses this challenge by suggesting a mechanism to propose
+efficient and non-safety-critical data to collect. SAL GPODE does so by
+sequentially suggesting new data, measuring it and updating the GPODE model
+with the new data. In this way, subsequent data points are iteratively
+suggested. The core of our SAL GPODE algorithm is a constrained optimization
+problem maximizing information of new data for GPODE model training constrained
+by the safety of the underlying system. We demonstrate our novel SAL GPODE's
+superiority compared to a standard, non-active way of measuring new data on two
+relevant examples.
+
+
+ The discovery of customer intention from dialogue plays an important role in
+automated support systems. However, traditional text clustering methods are
+poorly aligned with human perceptions due to the shift from embedding distance
+to semantic distance, and existing quantitative metrics for text clustering may
+not accurately reflect the true quality of intent clusters. In this paper, we
+leverage the superior language understanding capabilities of Large Language
+Models (LLMs) for designing better-calibrated intent clustering algorithms. We
+first establish the foundation by verifying the robustness of fine-tuned LLM
+utility in semantic coherence evaluation and cluster naming, resulting in an
+accuracy of 97.50% and 94.40%, respectively, when compared to the human-labeled
+ground truth. Then, we propose an iterative clustering algorithm that
+facilitates cluster-level refinement and the continuous discovery of
+high-quality intent clusters. Furthermore, we present several LLM-in-the-loop
+semi-supervised clustering techniques tailored for intent discovery from
+customer service dialogue. Experiments on a large-scale industrial dataset
+comprising 1,507 intent clusters demonstrate the effectiveness of the proposed
+techniques. The methods outperformed existing counterparts, achieving 6.25%
+improvement in quantitative metrics and 12% enhancement in application-level
+performance when constructing an intent classifier.
+
+
+
+
+
+
+
+ ☆ Beyond Confusion: A Fine-grained Dialectical Examination of Human
+ Activity Recognition Benchmark Datasets
+
+
+
+
+
+
+
+
+ Daniel Geissler, Dominique Nshimyimana, Vitor Fortes Rey, Sungho Suh, Bo Zhou, Paul Lukowicz
+
+
+ The research of machine learning (ML) algorithms for human activity
+recognition (HAR) has made significant progress with publicly available
+datasets. However, most research prioritizes statistical metrics over examining
+negative sample details. While recent models like transformers have been
+applied to HAR datasets with limited success according to benchmark metrics, their
+counterparts have effectively solved problems on similar levels with near 100%
+accuracy. This raises questions about the limitations of current approaches.
+This paper aims to address these open questions by conducting a fine-grained
+inspection of six popular HAR benchmark datasets. We identified parts of the
+data that none of the six chosen state-of-the-art ML methods can correctly
+classify, denoted as the intersect of false classifications (IFC). Analysis of
+the IFC reveals several underlying problems, including ambiguous annotations,
+irregularities during recording execution, and misaligned transition periods.
+We contribute to the field by quantifying and characterizing annotated data
+ambiguities, providing a trinary categorization mask for dataset patching, and
+stressing potential improvements for future data collections.
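+
+ The intersect of false classifications used above is straightforward to
+compute once per-model predictions are collected; a small sketch with
+placeholder arrays:
+
+```python
+import numpy as np
+
+def intersect_of_false_classifications(y_true, model_preds):
+    """Indices of samples misclassified by every model in `model_preds`."""
+    all_wrong = np.logical_and.reduce([pred != y_true for pred in model_preds])
+    return np.flatnonzero(all_wrong)
+
+y_true = np.array([0, 1, 2, 1, 0, 2])
+preds = [np.array([0, 1, 1, 0, 0, 2]),   # model A
+         np.array([0, 2, 1, 0, 0, 2]),   # model B
+         np.array([0, 1, 1, 0, 1, 2])]   # model C
+print(intersect_of_false_classifications(y_true, preds))   # -> [2 3]
+```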
+
+
+
+
+
+
+
+ ☆ Pulling the Carpet Below the Learner's Feet: Genetic Algorithm To Learn
+ Ensemble Machine Learning Model During Concept Drift
+
+
+ Data-driven models, in general, and machine learning (ML) models, in
+particular, have gained popularity over recent years with an increased usage of
+such models across the scientific and engineering domains. When using ML models
+in realistic and dynamic environments, users need to often handle the challenge
+of concept drift (CD). In this study, we explore the application of genetic
+algorithms (GAs) to address the challenges posed by CD in such settings. We
+propose a novel two-level ensemble ML model, which combines a global ML model
+with a CD detector, operating as an aggregator for a population of ML pipeline
+models, each with its own adjusted CD detector responsible for
+re-training its ML model. In addition, we show one can further improve the
+proposed model by utilizing off-the-shelf automatic ML methods. Through
+extensive synthetic dataset analysis, we show that the proposed model
+outperforms a single ML pipeline with a CD algorithm, particularly in scenarios
+with unknown CD characteristics. Overall, this study highlights the potential
+of ensemble ML and CD models obtained through a heuristic and adaptive
+optimization process such as the GA one to handle complex CD events.
+
+
+
+
+
+
+
+ ☆ RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell
+ Property Prediction AAAI 2025
+
+
+ Organic Solar Cells (OSCs) are a promising technology for sustainable energy
+production. However, the identification of molecules with desired OSC
+properties typically involves laborious experimental research. To accelerate
+progress in the field, it is crucial to develop machine learning models capable
+of accurately predicting the properties of OSC molecules. While graph
+representation learning has demonstrated success in molecular property
+prediction, it remains underexplored for OSC-specific tasks. Existing methods
+fail to capture the unique structural features of OSC molecules, particularly
+the intricate ring systems that critically influence OSC properties, leading to
+suboptimal performance. To fill the gap, we present RingFormer, a novel graph
+transformer framework specially designed to capture both atom and ring level
+structural patterns in OSC molecules. RingFormer constructs a hierarchical
+graph that integrates atomic and ring structures and employs a combination of
+local message passing and global attention mechanisms to generate expressive
+graph representations for accurate OSC property prediction. We evaluate
+RingFormer's effectiveness on five curated OSC molecule datasets through
+extensive experiments. The results demonstrate that RingFormer consistently
+outperforms existing methods, achieving a 22.77% relative improvement over the
+nearest competitor on the CEPDB dataset.
+
+
+
+ comment: 12 pages, 4 figures. This is the extended version of the paper
+ accepted at AAAI 2025, which includes all technical appendices and additional
+ experimental details
+
+
+
+
+
+
+ ☆ Learning and Current Prediction of PMSM Drive via Differential Neural
+ Networks
+
+
+
+
+
+
+
+
+ Wenjie Mei, Xiaorui Wang, Yanrong Lu, Ke Yu, Shihua Li
+
+
+ Learning models for dynamical systems in continuous time is significant for
+understanding complex phenomena and making accurate predictions. This study
+presents a novel approach utilizing differential neural networks (DNNs) to
+model nonlinear systems, specifically permanent magnet synchronous motors
+(PMSMs), and to predict their current trajectories. The efficacy of our
+approach is validated through experiments conducted under various load
+disturbances and no-load conditions. The results demonstrate that our method
+effectively and accurately reconstructs the original systems, showcasing strong
+short-term and long-term prediction capabilities and robustness. This study
+provides valuable insights into learning the inherent dynamics of complex
+dynamical data and holds potential for further applications in fields such as
+weather forecasting, robotics, and collective behavior analysis.
+
+
+
+
+
+
+
+ ☆ Training Physical Neural Networks for Analog In-Memory Computing
+
+
+ In-memory computing (IMC) architectures mitigate the von Neumann bottleneck
+encountered in traditional deep learning accelerators. Its energy efficiency
+can enable deep learning-based edge applications. However, because IMC is
+implemented using analog circuits, inherent non-idealities in the hardware pose
+significant challenges. This paper presents physical neural networks (PNNs) for
+constructing physical models of IMC. PNNs can address the synaptic current's
+dependence on membrane potential, a challenge in charge-domain IMC systems. The
+proposed model is mathematically equivalent to spiking neural networks with
+reversal potentials. With a novel technique called differentiable spike-time
+discretization, the PNNs are efficiently trained. We show that hardware
+non-idealities traditionally viewed as detrimental can enhance the model's
+learning performance. This bottom-up methodology was validated by designing an
+IMC circuit with non-ideal characteristics using the sky130 process. When
+employing this bottom-up approach, the modeling error reduced by an order of
+magnitude compared to conventional top-down methods in post-layout simulations.
+
+
+
+ comment: 53 pages, 20 figures
+
+
+
+
+
+
+ ☆ A physics-informed transformer neural operator for learning generalized
+ solutions of initial boundary value problems
+
+
+ Initial boundary value problems arise commonly in applications with
+engineering and natural systems governed by nonlinear partial differential
+equations (PDEs). Operator learning is an emerging field for solving these
+equations by using a neural network to learn a map between infinite dimensional
+input and output function spaces. These neural operators are trained using a
+combination of data (observations or simulations) and PDE-residuals
+(physics-loss). A major drawback of existing neural approaches is the
+requirement to retrain with new initial/boundary conditions, and the necessity
+for a large amount of simulation data for training. We develop a
+physics-informed transformer neural operator (named PINTO) that efficiently
+generalizes to unseen initial and boundary conditions, trained in a
+simulation-free setting using only physics loss. The main innovation lies in
+our new iterative kernel integral operator units, implemented using
+cross-attention, to transform the PDE solution's domain points into an
+initial/boundary condition-aware representation vector, enabling efficient
+learning of the solution function for new scenarios. The PINTO architecture is
+applied to simulate the solutions of important equations used in engineering
+applications: advection, Burgers, and steady and unsteady Navier-Stokes
+equations (three flow scenarios). For these five test cases, we show that the
+relative errors during testing under challenging conditions of unseen
+initial/boundary conditions are only one-fifth to one-third of those of other
+leading physics-informed operator learning methods. Moreover, our PINTO model
+is able
+to accurately solve the advection and Burgers equations at time steps that are
+not included in the training collocation points. The code is available at
+https://github.com/quest-lab-iisc/PINTO
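+
+ A minimal, illustrative sketch of the cross-attention kernel integral idea
+described above (not the authors' PINTO code; the embedding sizes, layer names,
+and input layout below are assumptions):
+
+    import torch
+    import torch.nn as nn
+
+    class CrossAttentionConditionUnit(nn.Module):
+        """Toy unit: domain query points attend to encoded initial/boundary
+        condition samples to form a condition-aware representation."""
+        def __init__(self, dim=64, heads=4):
+            super().__init__()
+            self.embed_xy = nn.Linear(2, dim)    # (x, t) query points
+            self.embed_bc = nn.Linear(3, dim)    # (x, t, u) condition samples
+            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
+            self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(),
+                                      nn.Linear(dim, 1))
+
+        def forward(self, query_pts, bc_pts):
+            q = self.embed_xy(query_pts)         # (B, Nq, dim)
+            kv = self.embed_bc(bc_pts)           # (B, Nb, dim)
+            ctx, _ = self.attn(q, kv, kv)        # condition-aware representation
+            return self.head(ctx)                # predicted solution values
+
+    unit = CrossAttentionConditionUnit()
+    u = unit(torch.rand(8, 100, 2), torch.rand(8, 32, 3))
+    print(u.shape)  # torch.Size([8, 100, 1])
+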
+
+
+
+ comment: 29 pages, 11 figures, 4 tables
+
+
+
+
+
+
+ ☆ Motor Imagery Classification for Asynchronous EEG-Based Brain-Computer
+ Interfaces
+
+
+ Motor imagery (MI) based brain-computer interfaces (BCIs) enable the direct
+control of external devices through the imagined movements of various body
+parts. Unlike previous systems that used fixed-length EEG trials for MI
+decoding, asynchronous BCIs aim to detect the user's MI without explicit
+triggers. They are challenging to implement because the algorithm needs to
+first distinguish between resting states and MI trials, and then classify the
+MI trials into the correct task, all without any triggers. This paper proposes
+a sliding window prescreening and classification (SWPC) approach for MI-based
+asynchronous BCIs, which consists of two modules: a prescreening module to
+screen MI trials out of the resting-state, and a classification module for MI
+classification. Both modules are trained with supervised learning followed by
+self-supervised learning, which refines the feature extractors. Within-subject
+and cross-subject asynchronous MI classifications on four different EEG
+datasets validated the effectiveness of SWPC, i.e., it always achieved the
+highest average classification accuracy, and outperformed the best
+state-of-the-art baseline on each dataset by about 2%.
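+
+ A minimal sketch of the two-stage sliding-window decoding described above;
+the window length, stride, and model interfaces are placeholders, not the
+authors' SWPC implementation:
+
+    import numpy as np
+
+    def swpc_decode(eeg, prescreener, classifier, win=250, stride=50):
+        """Slide a window over continuous EEG (channels x samples); for each
+        window, first screen MI vs. resting state, then classify the MI task."""
+        decisions = []
+        for start in range(0, eeg.shape[1] - win + 1, stride):
+            window = eeg[:, start:start + win]
+            if prescreener(window):                # stage 1: is this MI at all?
+                decisions.append((start, classifier(window)))  # stage 2: task
+            else:
+                decisions.append((start, "rest"))
+        return decisions
+
+    # toy run with placeholder models
+    rng = np.random.default_rng(0)
+    eeg = rng.standard_normal((22, 1000))          # 22 channels, 1000 samples
+    prescreen = lambda w: w.var() > 1.0            # dummy resting-state screen
+    classify = lambda w: "left" if w.mean() < 0 else "right"
+    print(swpc_decode(eeg, prescreen, classify)[:3])
+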
+
+
+
+
+
+
+
+ ☆ Stellar parameter prediction and spectral simulation using machine
+ learning
+
+
+ We applied machine learning to the entire data history of ESO's High Accuracy
+Radial Velocity Planet Searcher (HARPS) instrument. Our primary goal was to
+recover the physical properties of the observed objects, with a secondary
+emphasis on simulating spectra. We systematically investigated the impact of
+various factors on the accuracy and fidelity of the results, including the use
+of simulated data, the effect of varying amounts of real training data, network
+architectures, and learning paradigms. Our approach integrates supervised and
+unsupervised learning techniques within autoencoder frameworks. Our methodology
+leverages an existing simulation model, which combines a library of stellar
+spectra whose emerging flux is computed from physical first principles with a
+HARPS instrument model, to generate simulated spectra comparable to
+observational data. We trained standard and variational
+autoencoders on HARPS data to predict spectral parameters and generate spectra.
+Our models excel at predicting spectral parameters and compressing real
+spectra, and they achieved a mean prediction error of approximately 50 K for
+effective temperatures, making them relevant for most astrophysical
+applications. Furthermore, the models predict metallicity ([M/H]) and surface
+gravity (log g) with an accuracy of approximately 0.03 dex and 0.04 dex,
+respectively, underscoring their broad applicability in astrophysical research.
+The models' computational efficiency, with processing times of 779.6 ms on CPU
+and 3.97 ms on GPU, makes them valuable for high-throughput applications like
+massive spectroscopic surveys and large archival studies. By achieving accuracy
+comparable to classical methods with significantly reduced computation time,
+our methodology enhances the scope and efficiency of spectroscopic analysis.
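+
+ A minimal sketch of an autoencoder that jointly compresses a spectrum and
+regresses stellar parameters (Teff, [M/H], log g), in the spirit of the setup
+above; the layer sizes and training details are assumptions:
+
+    import torch
+    import torch.nn as nn
+
+    class SpectrumAutoencoder(nn.Module):
+        """Sketch: encode a spectrum to a latent code, reconstruct it, and
+        regress stellar parameters (Teff, [M/H], log g) from the same code."""
+        def __init__(self, n_pixels=4096, latent=32):
+            super().__init__()
+            self.encoder = nn.Sequential(nn.Linear(n_pixels, 512), nn.ReLU(),
+                                         nn.Linear(512, latent))
+            self.decoder = nn.Sequential(nn.Linear(latent, 512), nn.ReLU(),
+                                         nn.Linear(512, n_pixels))
+            self.param_head = nn.Linear(latent, 3)   # Teff, [M/H], log g
+
+        def forward(self, flux):
+            z = self.encoder(flux)
+            return self.decoder(z), self.param_head(z)
+
+    model = SpectrumAutoencoder()
+    flux = torch.rand(16, 4096)                      # toy normalized spectra
+    recon, params = model(flux)
+    loss = nn.functional.mse_loss(recon, flux) + nn.functional.mse_loss(
+        params, torch.zeros(16, 3))                  # placeholder targets
+    loss.backward()
+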
+
+
+
+ comment: Accepted for publication in Astronomy & Astrophysics
+
+
+
+
+
+
+ ☆ Predicting Emergency Department Visits for Patients with Type II
+ Diabetes
+
+
+
+
+
+
+
+
+ Javad M Alizadeh, Jay S Patel, Gabriel Tajeu, Yuzhou Chen, Ilene L Hollin, Mukesh K Patel, Junchao Fei, Huanmei Wu
+
+
+ Over 30 million Americans are affected by Type II diabetes (T2D), a treatable
+condition with significant health risks. This study aims to develop and
+validate predictive models using machine learning (ML) techniques to estimate
+emergency department (ED) visits among patients with T2D. Data for these
+patients was obtained from the HealthShare Exchange (HSX), focusing on
+demographic details, diagnoses, and vital signs. Our sample contained 34,151
+patients diagnosed with T2D, accounting for 703,065 visits overall between
+2017 and 2021. A workflow integrated electronic medical record (EMR) data with
+social determinants of health (SDoH) for ML predictions. A
+total of 87 out of 2,555 features were selected for model construction. Various
+machine learning algorithms, including CatBoost, Ensemble Learning, K-nearest
+Neighbors (KNN), Support Vector Classification (SVC), Random Forest, and
+Extreme Gradient Boosting (XGBoost), were employed with tenfold
+cross-validation to predict whether a patient is at risk of an ED visit. The
+areas under the ROC curves (AUCs) for Random Forest, XGBoost, Ensemble
+Learning, CatBoost, KNN, and SVC were 0.82, 0.82, 0.82, 0.81, 0.72, and 0.68,
+respectively. Ensemble Learning
+and Random Forest models demonstrated superior predictive performance in terms
+of discrimination, calibration, and clinical applicability. These models are
+reliable tools for predicting risk of ED visits among patients with T2D. They
+can estimate future ED demand and assist clinicians in identifying critical
+factors associated with ED utilization, enabling early interventions to reduce
+such visits. The top five important features were age, the difference between
+visitation gaps, visitation gaps, the ICD-10 code R10 (abdominal and pelvic
+pain), and the Index of Concentration at the Extremes (ICE) for income.
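+
+ An illustrative scikit-learn sketch of the evaluation protocol described
+above (tenfold cross-validated AUC for a Random Forest); the synthetic data
+merely stands in for the HSX features, which are not public:
+
+    from sklearn.datasets import make_classification
+    from sklearn.ensemble import RandomForestClassifier
+    from sklearn.model_selection import cross_val_score
+
+    # synthetic stand-in for the 87 selected EMR + SDoH features
+    X, y = make_classification(n_samples=5000, n_features=87, n_informative=20,
+                               weights=[0.8, 0.2], random_state=0)
+
+    clf = RandomForestClassifier(n_estimators=300, random_state=0)
+    aucs = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
+    print(f"10-fold AUC: {aucs.mean():.2f} +/- {aucs.std():.2f}")
+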
+
+
+
+ comment: This manuscript has been accepted and presented at AI-PHSS 2024: The
+ 2024 International Workshop on AI Applications in Public Health and Social
+ Services in conjunction with the 22nd International Conference of Artificial
+ Intelligence in Medicine (AIME 2024)
+
+
+
+
+
+
+ ☆ A Wander Through the Multimodal Landscape: Efficient Transfer Learning
+ via Low-rank Sequence Multimodal Adapter AAAI 2025
+
+
+
+
+
+
+
+
+ Zirun Guo, Xize Cheng, Yangyang Wu, Tao Jin
+
+
+ Efficient transfer learning methods such as adapter-based methods have shown
+great success in unimodal models and vision-language models. However, existing
+methods have two main challenges in fine-tuning multimodal models. Firstly,
+they are designed for vision-language tasks and fail to extend to situations
+where there are more than two modalities. Secondly, they exhibit limited
+exploitation of interactions between modalities and lack efficiency. To address
+these issues, in this paper, we propose the loW-rank sequence multimodal
+adapter (Wander). We first use the outer product to effectively fuse
+information from different modalities in an element-wise way. For efficiency,
+we use
+CP decomposition to factorize tensors into rank-one components and achieve
+substantial parameter reduction. Furthermore, we implement a token-level
+low-rank decomposition to extract more fine-grained features and sequence
+relationships between modalities. With these designs, Wander enables
+token-level interactions between sequences of different modalities in a
+parameter-efficient way. We conduct extensive experiments on datasets with
+different numbers of modalities, where Wander outperforms state-of-the-art
+efficient transfer learning methods consistently. The results fully demonstrate
+the effectiveness, efficiency and universality of Wander.
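+
+ A simplified sketch of CP-factorized (low-rank) outer-product fusion for an
+arbitrary number of modalities, in the spirit of the adapter described above;
+this is not the authors' Wander code, and the token-level decomposition is
+omitted:
+
+    import torch
+    import torch.nn as nn
+
+    class LowRankFusion(nn.Module):
+        """Fuse M modality features via an implicit outer product, factorized
+        as a sum of rank-one components: project each modality to
+        (rank, out_dim), multiply element-wise, then sum over the rank axis."""
+        def __init__(self, in_dims, out_dim, rank=8):
+            super().__init__()
+            self.factors = nn.ModuleList(
+                [nn.Linear(d, rank * out_dim, bias=False) for d in in_dims])
+            self.rank, self.out_dim = rank, out_dim
+
+        def forward(self, feats):                    # list of (B, d_m) tensors
+            fused = None
+            for x, proj in zip(feats, self.factors):
+                h = proj(x).view(-1, self.rank, self.out_dim)
+                fused = h if fused is None else fused * h   # element-wise product
+            return fused.sum(dim=1)                  # sum of rank-one components
+
+    fusion = LowRankFusion([128, 64, 32], out_dim=256)
+    z = fusion([torch.rand(4, 128), torch.rand(4, 64), torch.rand(4, 32)])
+    print(z.shape)  # torch.Size([4, 256])
+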
+
+
+
+ comment: Accepted at AAAI 2025
+
+
+
+
+
+
+ ☆ Enhancing Facial Consistency in Conditional Video Generation via Facial
+ Landmark Transformation
+
+
+ Landmark-guided character animation generation is an important field.
+Generating character animations with facial features consistent with a
+reference image remains a significant challenge in conditional video
+generation, especially involving complex motions like dancing. Existing methods
+often fail to maintain facial feature consistency due to mismatches between the
+facial landmarks extracted from source videos and the target facial features in
+the reference image. To address this problem, we propose a facial landmark
+transformation method based on the 3D Morphable Model (3DMM). We obtain
+transformed landmarks that align with the target facial features by
+reconstructing 3D faces from the source landmarks and adjusting the 3DMM
+parameters to match the reference image. Our method improves the facial
+consistency between the generated videos and the reference images, effectively
+mitigating the facial feature mismatch problem.
+
+
+
+
+
+
+
+ ☆ Deep Learning Model Security: Threats and Defenses
+
+
+ Deep learning has transformed AI applications but faces critical security
+challenges, including adversarial attacks, data poisoning, model theft, and
+privacy leakage. This survey examines these vulnerabilities, detailing their
+mechanisms and impact on model integrity and confidentiality. Practical
+implementations, including adversarial examples, label flipping, and backdoor
+attacks, are explored alongside defenses such as adversarial training,
+differential privacy, and federated learning, highlighting their strengths and
+limitations.
+ Advanced methods like contrastive and self-supervised learning are presented
+for enhancing robustness. The survey concludes with future directions,
+emphasizing automated defenses, zero-trust architectures, and the security
+challenges of large AI models. A balanced approach to performance and security
+is essential for developing reliable deep learning systems.
+
+
+
+
+
+
+
+ ☆ Belted and Ensembled Neural Network for Linear and Nonlinear Sufficient
+ Dimension Reduction
+
+
+ We introduce a unified, flexible, and easy-to-implement framework of
+sufficient dimension reduction that can accommodate both linear and nonlinear
+dimension reduction, and both the conditional distribution and the conditional
+mean as the targets of estimation. This unified framework is achieved by a
+specially structured neural network -- the Belted and Ensembled Neural Network
+(BENN) -- that consists of a narrow latent layer, which we call the belt, and a
+family of transformations of the response, which we call the ensemble. By
+strategically placing the belt at different layers of the neural network, we
+can achieve linear or nonlinear sufficient dimension reduction, and by choosing
+the appropriate transformation families, we can achieve dimension reduction for
+the conditional distribution or the conditional mean. Moreover, thanks to the
+advantage of the neural network, the method is very fast to compute, overcoming
+a computation bottleneck of the traditional sufficient dimension reduction
+estimators, which involves the inversion of a matrix of dimension either p or
+n. We develop the algorithm and convergence rate of our method, compare it with
+existing sufficient dimension reduction methods, and apply it to two data
+examples.
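+
+ A schematic sketch of the belt-and-ensemble idea described above: a narrow
+'belt' layer followed by a head predicting an ensemble of response
+transformations; the architecture choices below are assumptions, not the
+authors' specification:
+
+    import torch
+    import torch.nn as nn
+
+    class BENNSketch(nn.Module):
+        """Sketch: predictors pass through a narrow 'belt' layer of width d
+        (the reduction), then a trunk predicts an ensemble of transformations
+        of the response, e.g. (y, y^2, sin y)."""
+        def __init__(self, p, d=2, n_ensemble=3, hidden=64):
+            super().__init__()
+            self.pre_belt = nn.Sequential(nn.Linear(p, hidden), nn.ReLU())
+            self.belt = nn.Linear(hidden, d)            # sufficient reduction
+            self.post_belt = nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
+                                           nn.Linear(hidden, n_ensemble))
+
+        def forward(self, x):
+            reduction = self.belt(self.pre_belt(x))     # keep after training
+            return reduction, self.post_belt(reduction)
+
+    x, y = torch.rand(32, 10), torch.rand(32)
+    targets = torch.stack([y, y**2, torch.sin(y)], dim=1)  # response ensemble
+    model = BENNSketch(p=10)
+    reduction, preds = model(x)
+    loss = nn.functional.mse_loss(preds, targets)
+    loss.backward()
+
+ Placing the belt immediately after the input, with no nonlinear layers before
+it, would correspond to the linear reduction case described above.
+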
+
+
+
+ comment: 35 pages, 5 figures, 2 tables
+
+
+
+
+
+
+ ☆ Stochastic Learning of Non-Conjugate Variational Posterior for Image
+ Classification
+
+
+ Large-scale Bayesian nonparametric (BNP) learners such as stochastic
+variational inference (SVI) can handle datasets with many classes and large
+training sizes at a fraction of the cost. Like its predecessor, SVI relies on
+the assumption of a conjugate variational posterior to approximate the true
+posterior. A more challenging problem is large-scale learning on a
+non-conjugate posterior. Recent works in this direction mostly use Monte Carlo
+methods to approximate the learner. However, these works are usually
+demonstrated on non-BNP tasks and less complex models such as logistic
+regression, due to higher computational complexity. To overcome the issue
+faced by SVI, we develop a novel approach based on the recently proposed
+variational maximization-maximization (VMM) learner that allows large-scale
+learning on a non-conjugate posterior. Unlike SVI, our VMM learner does not
+require closed-form expressions for the variational posterior expectations.
+Our only requirement is that the variational posterior is differentiable. To
+ensure convergence in stochastic settings, SVI relies on decaying step-sizes
+to slow its learning. Inspired by SVI and Adam, we propose the novel use of
+decaying step-sizes on both the gradient and the ascent direction in our VMM
+to significantly improve its learning. We show that our proposed method is
+compatible with ResNet features when applied to datasets with many classes,
+such as MIT67 and SUN397. Finally, we compare our proposed learner with
+several recent works such as deep clustering algorithms and show that it
+matches or outperforms state-of-the-art methods in terms of clustering
+measures.
+
+
+ Recently, LoRA has emerged as a crucial technique for fine-tuning large
+pre-trained models, yet its performance in multi-task learning scenarios often
+falls short. In contrast, the MoE architecture presents a natural solution to
+this issue. However, it introduces challenges such as mutual interference of
+data across multiple domains and knowledge forgetting of various tasks.
+Additionally, MoE significantly increases the number of parameters, posing a
+computational cost challenge. Therefore, in this paper, we propose MoSLD, a
+mixture-of-shared-LoRAs model with a dropout strategy. MoSLD addresses these
+challenges by sharing the upper projection matrix in LoRA among different
+experts, encouraging the model to learn general knowledge across tasks, while
+still allowing the lower projection matrix to focus on the unique features of
+each task. The application of dropout alleviates the imbalanced updating of
+the parameter matrices and mitigates parameter overfitting in LoRA. Extensive
+experiments demonstrate that our model exhibits excellent performance in both
+single-task and multi-task scenarios, with robust out-of-domain generalization
+capabilities.
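+
+ A rough sketch of LoRA experts that share one projection matrix while keeping
+the other expert-specific, with dropout on the low-rank path; which matrix is
+shared and where dropout is applied are assumptions based on the description
+above:
+
+    import torch
+    import torch.nn as nn
+
+    class SharedLoRAMoE(nn.Module):
+        """Sketch: mixture of LoRA experts where the up-projection B is shared
+        across experts and each expert keeps its own down-projection A_e."""
+        def __init__(self, d_in, d_out, rank=8, n_experts=4, p_drop=0.1):
+            super().__init__()
+            self.base = nn.Linear(d_in, d_out)           # frozen pretrained layer
+            for p in self.base.parameters():
+                p.requires_grad_(False)
+            self.A = nn.ParameterList(
+                [nn.Parameter(torch.randn(rank, d_in) * 0.01)
+                 for _ in range(n_experts)])
+            self.B_shared = nn.Parameter(torch.zeros(d_out, rank))
+            self.drop = nn.Dropout(p_drop)
+            self.gate = nn.Linear(d_in, n_experts)
+
+        def forward(self, x):
+            gates = torch.softmax(self.gate(x), dim=-1)           # (B, E)
+            out = self.base(x)
+            for e, A_e in enumerate(self.A):
+                low_rank = self.drop(x @ A_e.T) @ self.B_shared.T  # (B, d_out)
+                out = out + gates[:, e:e + 1] * low_rank
+            return out
+
+    layer = SharedLoRAMoE(d_in=64, d_out=64)
+    print(layer(torch.rand(4, 64)).shape)  # torch.Size([4, 64])
+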
+
+
+
+ comment: Accept by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living
+
+
+
+
+
+
+
+
+ Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das
+
+
+ Current Large Language Vision Models (LLVMs) trained on web videos perform
+well in general video understanding but struggle with fine-grained details,
+complex human-object interactions (HOI), and view-invariant representation
+learning essential for Activities of Daily Living (ADL). This limitation stems
+from a lack of specialized ADL video instruction-tuning datasets and
+insufficient modality integration to capture discriminative action
+representations. To address this, we propose a semi-automated framework for
+curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS
+instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM
+integrating videos, 3D skeletons, and HOIs to model ADL's complex
+spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of
+all modalities yields suboptimal results; thus, we propose a Multimodal
+Progressive (MMPro) training strategy, incorporating modalities in stages
+following a curriculum. We also establish ADL MCQ and video description
+benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL
+achieves state-of-the-art performance across ADL benchmarks. Code and data will
+be made publicly available at: https://adl-x.github.io/.
+
+
+
+
+
+
+
+
+ Wenhao Wang, Adam Dziedzic, Michael Backes, Franziska Boenisch
+
+
+ Recent work on studying memorization in self-supervised learning (SSL)
+suggests that even though SSL encoders are trained on millions of images, they
+still memorize individual data points. While effort has been put into
+characterizing the memorized data and linking encoder memorization to
+downstream utility, little is known about where the memorization happens inside
+SSL encoders. To close this gap, we propose two metrics for localizing
+memorization in SSL encoders on a per-layer (layermem) and per-unit basis
+(unitmem). Our localization methods are independent of the downstream task, do
+not require any label information, and can be performed in a forward pass. By
+localizing memorization in various encoder architectures (convolutional and
+transformer-based) trained on diverse datasets with contrastive and
+non-contrastive SSL frameworks, we find that (1) while SSL memorization
+increases with layer depth, highly memorizing units are distributed across the
+entire encoder, (2) a significant fraction of units in SSL encoders experiences
+surprisingly high memorization of individual data points, which is in contrast
+to models trained under supervision, (3) atypical (or outlier) data points
+cause much higher layer and unit memorization than standard data points, and
+(4) in vision transformers, most memorization happens in the fully-connected
+layers. Finally, we show that localizing memorization in SSL has the potential
+to improve fine-tuning and to inform pruning strategies.
+
+
+
+ comment: Accepted at NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ From Imitation to Refinement -- Residual RL for Precise Assembly
+
+
+
+
+
+
+
+
+ Lars Ankile, Anthony Simeonov, Idan Shenfeld, Marcel Torne, Pulkit Agrawal
+
+
+ Recent advances in Behavior Cloning (BC) have made it easy to teach robots
+new tasks. However, we find that the ease of teaching comes at the cost of
+unreliable performance that saturates with increasing data for tasks requiring
+precision. The performance saturation can be attributed to two critical
+factors: (a) distribution shift resulting from the use of offline data and (b)
+the lack of closed-loop corrective control caused by action chunking
+(predicting a set of future actions executed open-loop), which is critical for
+BC performance. Our key insight is that by predicting action chunks, BC policies
+function more like trajectory "planners" than closed-loop controllers necessary
+for reliable execution. To address these challenges, we devise a simple yet
+effective method, ResiP (Residual for Precise Manipulation), that overcomes the
+reliability problem while retaining BC's ease of teaching and long-horizon
+capabilities. ResiP augments a frozen, chunked BC model with a fully
+closed-loop residual policy trained with reinforcement learning (RL) that
+addresses distribution shifts and introduces closed-loop corrections over
+open-loop execution of action chunks predicted by the BC trajectory planner.
+Videos, code, and data: https://residual-assembly.github.io.
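+
+ A schematic sketch of executing a frozen, chunked BC policy with a per-step
+residual correction, as described above; the model interfaces and observation
+format are placeholders:
+
+    import torch
+
+    def execute_chunk_with_residual(env_obs, bc_policy, residual_policy, env_step):
+        """Sketch: the frozen BC model predicts a chunk of future actions
+        open-loop; a small RL-trained residual policy corrects each action
+        closed-loop before it is executed."""
+        with torch.no_grad():
+            chunk = bc_policy(env_obs)                 # (horizon, action_dim)
+        obs = env_obs
+        for t in range(chunk.shape[0]):
+            correction = residual_policy(obs, chunk[t])    # closed-loop term
+            obs = env_step(chunk[t] + correction)          # corrected action
+        return obs
+
+    # toy run with placeholder callables
+    bc = lambda o: torch.zeros(8, 4)                       # 8-step chunk, 4-dim actions
+    residual = lambda o, a: 0.05 * torch.tanh(a - o[:4])   # small bounded correction
+    step = lambda a: torch.rand(10)                        # fake environment
+    print(execute_chunk_with_residual(torch.rand(10), bc, residual, step))
+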
+
+
+
+
+
+
+
+ ♻ ☆ Disentangling Mean Embeddings for Better Diagnostics of Image Generators NeurIPS 2024
+
+
+
+
+
+
+
+
+ Sebastian G. Gruber, Pascal Tobias Ziegler, Florian Buettner
+
+
+ The evaluation of image generators remains a challenge due to the limitations
+of traditional metrics in providing nuanced insights into specific image
+regions. This is a critical problem as not all regions of an image may be
+learned with similar ease. In this work, we propose a novel approach to
+disentangle the cosine similarity of mean embeddings into the product of cosine
+similarities for individual pixel clusters via central kernel alignment.
+Consequently, we can quantify the contribution of the cluster-wise performance
+to the overall image generation performance. We demonstrate how this enhances
+the explainability and the likelihood of identifying pixel regions of model
+misbehavior across various real-world use cases.
+
+
+
+ comment: Published at Interpretable AI: Past, Present and Future Workshop at
+ NeurIPS 2024
+
+
+
+
+
+
+ ♻ ☆ Addressing common misinterpretations of KART and UAT in neural network
+ literature
+
+
+ This note addresses the Kolmogorov-Arnold Representation Theorem (KART) and
+the Universal Approximation Theorem (UAT), focusing on their common
+misinterpretations in some papers related to neural network approximation. Our
+remarks aim to support a more accurate understanding of KART and UAT among
+neural network specialists.
+
+
+
+ comment: 10 pages; a section, two theorems and several references added
+
+
+
+
+
+
+ ♻ ☆ Non-IID data in Federated Learning: A Survey with Taxonomy, Metrics,
+ Methods, Frameworks and Future Directions
+
+
+
+
+
+
+
+
+ Daniel M. Jimenez G., David Solans, Mikko Heikkila, Andrea Vitaletti, Nicolas Kourtellis, Aris Anagnostopoulos, Ioannis Chatzigiannakis
+
+
+ Recent advances in machine learning have highlighted Federated Learning (FL)
+as a promising approach that enables multiple distributed users (so-called
+clients) to collectively train ML models without sharing their private data.
+While this privacy-preserving method shows potential, it struggles when data
+across clients is not independent and identically distributed (non-IID). This
+remains an unsolved challenge that can result in poorer model
+performance and slower training times. Despite the significance of non-IID data
+in FL, there is a lack of consensus among researchers about its classification
+and quantification. This technical survey aims to fill that gap by providing a
+detailed taxonomy for non-IID data, partition protocols, and metrics to
+quantify data heterogeneity. Additionally, we describe popular solutions to
+address non-IID data and standardized frameworks employed in FL with
+heterogeneous data. Based on our state-of-the-art survey, we present key
+lessons learned and suggest promising future research directions.
+
+
+
+
+
+
+
+ ♻ ☆ BEACON: Benchmark for Comprehensive RNA Tasks and Language Models NeurIPS 2024
+
+
+ RNA plays a pivotal role in translating genetic instructions into functional
+outcomes, underscoring its importance in biological processes and disease
+mechanisms. Despite the emergence of numerous deep learning approaches for RNA,
+particularly universal RNA language models, there remains a significant lack of
+standardized benchmarks to assess the effectiveness of these methods. In this
+study, we introduce the first comprehensive RNA benchmark BEACON
+(BEnchmArk for COmprehensive RNA Task and
+Language Models). First, BEACON comprises 13 distinct tasks derived from
+extensive previous work covering structural analysis, functional studies, and
+engineering applications, enabling a comprehensive assessment of the
+performance of methods on various RNA understanding tasks. Second, we examine a
+range of models, including traditional approaches like CNNs, as well as
+advanced RNA foundation models based on language models, offering valuable
+insights into the task-specific performances of these models. Third, we
+investigate the vital RNA language model components from the tokenizer and
+positional encoding aspects. Notably, our findings emphasize the superiority of
+single nucleotide tokenization and the effectiveness of Attention with Linear
+Biases (ALiBi) over traditional positional encoding methods. Based on these
+insights, a simple yet strong baseline called BEACON-B is proposed, which can
+achieve outstanding performance with limited data and computational resources.
+The datasets and source code of our benchmark are available at
+https://github.com/terry-r123/RNABenchmark.
+
+
+
+ comment: Accepted by NeurIPS 2024 Dataset and Benchmark Track
+
+
+
+
+
+
+ ♻ ☆ Achieving Constant Regret in Linear Markov Decision Processes NeurIPS 2024
+
+
+ We study the constant regret guarantees in reinforcement learning (RL). Our
+objective is to design an algorithm that incurs only finite regret over
+infinite episodes with high probability. We introduce an algorithm,
+Cert-LSVI-UCB, for misspecified linear Markov decision processes (MDPs) where
+both the transition kernel and the reward function can be approximated by some
+linear function up to misspecification level $\zeta$. At the core of
+Cert-LSVI-UCB is an innovative method that facilitates a fine-grained
+concentration analysis for multi-phase value-targeted regression, enabling us
+to establish an instance-dependent regret bound that is constant w.r.t. the
+number of episodes. Specifically, we demonstrate that for a linear MDP
+characterized by a minimal suboptimality gap $\Delta$, Cert-LSVI-UCB has a
+cumulative regret of $\tilde{\mathcal{O}}(d^3H^5/\Delta)$ with high
+probability, provided that the misspecification level $\zeta$ is below
+$\tilde{\mathcal{O}}(\Delta / (\sqrt{d}H^2))$. Here $d$ is the dimension of the
+feature space and $H$ is the horizon. Remarkably, this regret bound is
+independent of the number of episodes $K$. To the best of our knowledge,
+Cert-LSVI-UCB is the first algorithm to achieve a constant, instance-dependent,
+high-probability regret bound in RL with linear function approximation without
+relying on prior distribution assumptions.
+
+
+
+ comment: 45 pages, 3 tables, 2 figures, in 38th Conference on Neural
+ Information Processing Systems (NeurIPS 2024)
+
+
+
+
+
+
+ ♻ ☆ The rate of convergence of Bregman proximal methods: Local geometry vs.
+ regularity vs. sharpness
+
+
+ We examine the last-iterate convergence rate of Bregman proximal methods -
+from mirror descent to mirror-prox and its optimistic variants - as a function
+of the local geometry induced by the prox-mapping defining the method. For
+generality, we focus on local solutions of constrained, non-monotone
+variational inequalities, and we show that the convergence rate of a given
+method depends sharply on its associated Legendre exponent, a notion that
+measures the growth rate of the underlying Bregman function (Euclidean,
+entropic, or other) near a solution. In particular, we show that boundary
+solutions exhibit a stark separation of regimes between methods with a zero and
+non-zero Legendre exponent: the former converge at a linear rate, while the
+latter converge, in general, sublinearly. This dichotomy becomes even more
+pronounced in linearly constrained problems where methods with entropic
+regularization achieve a linear convergence rate along sharp directions,
+compared to convergence in a finite number of steps under Euclidean
+regularization.
+
+
+ We consider maximizing an unknown monotonic, submodular set function $f:
+2^{[n]} \rightarrow [0,1]$ with cardinality constraint under stochastic bandit
+feedback. At each time $t=1,\dots,T$ the learner chooses a set $S_t \subset
+[n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + \eta_t$ where $\eta_t$
+is mean-zero sub-Gaussian noise. The objective is to minimize the learner's
+regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| =
+k$, obtained through robust greedy maximization of $f$. To date, the best
+regret bound in the literature scales as $k n^{1/3} T^{2/3}$, and by trivially
+treating every set as a unique arm, one deduces that $\sqrt{{n \choose k} T}$
+is also achievable using standard multi-armed bandit algorithms. In this work,
+we establish the first minimax lower bound for this setting that scales like
+$\tilde{\Omega}(\min_{L \le k}(L^{1/3}n^{1/3}T^{2/3} + \sqrt{{n \choose k -
+L}T}))$. For a slightly restricted algorithm class, we prove a stronger regret
+lower bound of $\tilde{\Omega}(\min_{L \le k}(Ln^{1/3}T^{2/3} + \sqrt{{n
+\choose k - L}T}))$. Moreover, we propose an algorithm Sub-UCB that achieves
+regret $\tilde{\mathcal{O}}(\min_{L \le k}(Ln^{1/3}T^{2/3} + \sqrt{{n \choose k
+- L}T}))$ capable of matching the lower bound on regret for the restricted
+class up to logarithmic factors.
+
+
+
+
+
+
+
+ ♻ ☆ Training Free Guided Flow Matching with Optimal Control
+
+
+
+
+
+
+
+
+ Luran Wang, Chaoran Cheng, Yizhen Liao, Yanru Qu, Ge Liu
+
+
+ Controlled generation with pre-trained Diffusion and Flow Matching models has
+vast applications. One strategy for guiding ODE-based generative models is
+through optimizing a target loss $R(x_1)$ while staying close to the prior
+distribution. Along this line, some recent work showed the effectiveness of
+guiding flow model by differentiating through its ODE sampling process. Despite
+the superior performance, the theoretical understanding of this line of methods
+is still preliminary, leaving space for algorithm improvement. Moreover,
+existing methods predominantly focus on Euclidean data manifolds, and there is a
+compelling need for guided flow methods on complex geometries such as SO(3),
+which prevails in high-stake scientific applications like protein design. We
+present OC-Flow, a general and theoretically grounded training-free framework
+for guided flow matching using optimal control. Building upon advances in
+optimal control theory, we develop effective and practical algorithms for
+solving optimal control in guided ODE-based generation and provide a systematic
+theoretical analysis of the convergence guarantee in both Euclidean and SO(3).
+We show that existing backprop-through-ODE methods can be interpreted as
+special cases of Euclidean OC-Flow. OC-Flow achieved superior performance in
+extensive experiments on text-guided image manipulation, conditional molecule
+generation, and all-atom peptide design.
+
+
+
+
+
+
+
+ ♻ ☆ Autonomous Goal Detection and Cessation in Reinforcement Learning: A
+ Case Study on Source Term Estimation
+
+
+ Reinforcement Learning has revolutionized decision-making processes in
+dynamic environments, yet it often struggles with autonomously detecting and
+achieving goals without clear feedback signals. For example, in a Source Term
+Estimation problem, the lack of precise environmental information makes it
+challenging to provide clear feedback signals and to define and evaluate how
+the source's location is determined. To address this challenge, the Autonomous
+Goal Detection and Cessation (AGDC) module was developed, enhancing various RL
+algorithms by incorporating a self-feedback mechanism for autonomous goal
+detection and cessation upon task completion. Our method effectively identifies
+and ceases undefined goals by approximating the agent's belief, significantly
+enhancing the capabilities of RL algorithms in environments with limited
+feedback. To validate the effectiveness of our approach, we integrated AGDC with
+deep Q-Network, proximal policy optimization, and deep deterministic policy
+gradient algorithms, and evaluated its performance on the Source Term
+Estimation problem. The experimental results showed that AGDC-enhanced RL
+algorithms significantly outperformed traditional statistical methods such as
+infotaxis, entrotaxis, and dual control for exploitation and exploration, as
+well as a non-statistical random action selection method. These improvements
+were evident in terms of success rate, mean traveled distance, and search time,
+highlighting AGDC's effectiveness and efficiency in complex, real-world
+scenarios.
+
+
+
+
+
+
+
+ ♻ ☆ Evaluating GPT-4 at Grading Handwritten Solutions in Math Exams
+
+
+
+
+
+
+
+
+ Adriana Caraeni, Alexander Scarlatos, Andrew Lan
+
+
+ Recent advances in generative artificial intelligence (AI) have shown promise
+in accurately grading open-ended student responses. However, few prior works
+have explored grading handwritten responses due to a lack of data and the
+challenge of combining visual and textual information. In this work, we
+leverage state-of-the-art multi-modal AI models, in particular GPT-4o, to
+automatically grade handwritten responses to college-level math exams. Using
+real student responses to questions in a probability theory exam, we evaluate
+GPT-4o's alignment with ground-truth scores from human graders using various
+prompting techniques. We find that while providing rubrics improves alignment,
+the model's overall accuracy is still too low for real-world settings, showing
+there is significant room for growth in this task.
+
+
+
+ comment: Published in LAK 2025: The 15th International Learning Analytics and
+ Knowledge Conference
+
+
+
+
+
+
+ ♻ ☆ Differential learning kinetics govern the transition from memorization
+ to generalization during in-context learning
+
+
+ Transformers exhibit in-context learning (ICL): the ability to use novel
+information presented in the context without additional weight updates. Recent
+work shows that ICL emerges when models are trained on a sufficiently diverse
+set of tasks and the transition from memorization to generalization is sharp
+with increasing task diversity. One interpretation is that a network's limited
+capacity to memorize favors generalization. Here, we examine the mechanistic
+underpinnings of this transition using a small transformer applied to a
+synthetic ICL task. Using theory and experiment, we show that the sub-circuits
+that memorize and generalize can be viewed as largely independent. The relative
+rates at which these sub-circuits learn explains the transition from
+memorization to generalization, rather than capacity constraints. We uncover a
+memorization scaling law, which determines the task diversity threshold at
+which the network generalizes. The theory quantitatively explains a variety of
+other ICL-related phenomena, including the long-tailed distribution of when ICL
+is acquired, the bimodal behavior of solutions close to the task diversity
+threshold, the influence of contextual and data distributional statistics on
+ICL, and the transient nature of ICL.
+
+
+
+
+
+
+
+
+ Angelica Chen, Samuel D. Stanton, Robert G. Alberstein, Andrew M. Watkins, Richard Bonneau, Vladimir Gligorijević, Kyunghyun Cho, Nathan C. Frey
+
+
+ Large language models (LLMs) have recently shown significant potential in
+various biological tasks such as protein engineering and molecule design. These
+tasks typically involve black-box discrete sequence optimization, where the
+challenge lies in generating sequences that are not only biologically feasible
+but also adhere to hard fine-grained constraints. However, LLMs often struggle
+with such constraints, especially in biological contexts where verifying
+candidate solutions is costly and time-consuming. In this study, we explore the
+possibility of employing LLMs as highly-constrained bilevel optimizers through
+a methodology we refer to as Language Model Optimization with Margin
+Expectation (LLOME). This approach combines both offline and online
+optimization, utilizing limited oracle evaluations to iteratively enhance the
+sequences generated by the LLM. We additionally propose a novel training
+objective -- Margin-Aligned Expectation (MargE) -- that trains the LLM to
+smoothly interpolate between the reward and reference distributions. Lastly, we
+introduce a synthetic test suite that bears strong geometric similarity to real
+biophysical problems and enables rapid evaluation of LLM optimizers without
+time-consuming lab validation. Our findings reveal that, in comparison to
+genetic algorithm baselines, LLMs achieve significantly lower regret solutions
+while requiring fewer test function evaluations. However, we also observe that
+LLMs exhibit moderate miscalibration, are susceptible to generator collapse,
+and have difficulty finding the optimal solution when no explicit ground truth
+rewards are available.
+
+
+
+ comment: Supersedes arXiv:2407.00236v1
+
+
+
+
+
+
+ ♻ ☆ Model Developmental Safety: A Retention-Centric Method and Applications
+ in Vision-Language Models
+
+
+
+
+
+
+
+
+ Gang Li, Wendi Yu, Yao Yao, Wei Tong, Yingbin Liang, Qihang Lin, Tianbao Yang
+
+
+ In the real world, a learning-enabled system usually undergoes multiple
+cycles of model development to enhance the system's ability to handle difficult
+or emerging tasks. This continual model development process raises a
+significant issue that the model development for acquiring new or improving
+existing capabilities may inadvertently lose capabilities of the old model,
+also known as catastrophic forgetting. Existing continual learning studies
+focus on mitigating catastrophic forgetting by trading off performance on
+previous tasks and new tasks to ensure good average performance. However, they
+are inadequate for many applications especially in safety-critical domains, as
+failure to strictly preserve the good performance of the old model not only
+introduces safety risks and uncertainties but also imposes substantial expenses
+in the re-improving and re-validation of existing properties. To address this
+issue, we introduce model developmental safety as a guarantee of a learning
+system such that in the model development process the new model should strictly
+preserve the existing protected capabilities of the old model while improving
+its performance on target tasks. To ensure the model developmental safety, we
+present a retention-centric framework by formulating the model developmental
+safety as data-dependent constraints. Under this framework, we study how to
+develop a pretrained vision-language model, specifically the CLIP model, for
+acquiring new capabilities or improving existing capabilities of image
+classification. We propose an efficient constrained optimization algorithm with
+theoretical guarantee and use its insights to finetune a CLIP model with
+task-dependent heads for promoting the model developmental safety. Our
+experiments on improving vision perception capabilities on autonomous driving
+and scene recognition datasets demonstrate the efficacy of the proposed
+approach.
+
+
+
+ comment: 43 pages, 7 figures
+
+
+
+
+
+
+ ♻ ☆ STARC: A General Framework For Quantifying Differences Between Reward
+ Functions
+
+
+ In order to solve a task using reinforcement learning, it is necessary to
+first formalise the goal of that task as a reward function. However, for many
+real-world tasks, it is very difficult to manually specify a reward function
+that never incentivises undesirable behaviour. As a result, it is increasingly
+popular to use reward learning algorithms, which attempt to learn a reward
+function from data. However, the theoretical foundations of reward learning are
+not yet well-developed. In particular, it is typically not known when a given
+reward learning algorithm with high probability will learn a reward function
+that is safe to optimise. This means that reward learning algorithms generally
+must be evaluated empirically, which is expensive, and that their failure modes
+are difficult to anticipate in advance. One of the roadblocks to deriving
+better theoretical guarantees is the lack of good methods for quantifying the
+difference between reward functions. In this paper we provide a solution to
+this problem, in the form of a class of pseudometrics on the space of all
+reward functions that we call STARC (STAndardised Reward Comparison) metrics.
+We show that STARC metrics induce both an upper and a lower bound on worst-case
+regret, which implies that our metrics are tight, and that any metric with the
+same properties must be bilipschitz equivalent to ours. Moreover, we also
+identify a number of issues with reward metrics proposed by earlier works.
+Finally, we evaluate our metrics empirically, to demonstrate their practical
+efficacy. STARC metrics can be used to make both theoretical and empirical
+analysis of reward learning algorithms both easier and more principled.
+
+
+
+
+
+
+
+
+ Konstantin Burlachenko, Peter Richtárik
+
+
+ Federated Learning (FL) is an emerging paradigm that enables intelligent
+agents to collaboratively train Machine Learning (ML) models in a distributed
+manner, eliminating the need for sharing their local data. The recent work
+(arXiv:2106.02969) introduces a family of Federated Newton Learn (FedNL)
+algorithms, marking a significant step towards applying second-order methods to
+FL and large-scale optimization. However, the reference FedNL prototype
+exhibits three serious practical drawbacks: (i) it requires 4.8 hours to launch
+a single experiment on a server-grade workstation; (ii) the prototype only
+simulates the multi-node setting; (iii) integrating the prototype into
+resource-constrained applications is challenging. To bridge the gap between
+theory and practice, we present a self-contained implementation of FedNL,
+FedNL-LS, FedNL-PP for single-node and multi-node settings. Our work resolves
+the aforementioned issues and reduces the wall clock time by x1000. With this
+FedNL outperforms alternatives for training logistic regression in a
+single-node -- CVXPY (arXiv:1603.00943), and in a multi-node -- Apache Spark
+(arXiv:1505.06807), Ray/Scikit-Learn (arXiv:1712.05889). Finally, we propose
+two practice-oriented compressors for FedNL - adaptive TopLEK and
+cache-aware RandSeqK, which fulfill the theory of FedNL.
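+
+ For background on the compression step, a plain Top-K sparsifier looks as
+follows; the adaptive TopLEK and cache-aware RandSeqK compressors proposed here
+refine this basic idea, and their exact rules are not reproduced:
+
+    import numpy as np
+
+    def top_k_compress(vec, k):
+        """Keep the k largest-magnitude entries of a vector and drop the rest;
+        returns (indices, values), i.e. what a client would transmit."""
+        idx = np.argpartition(np.abs(vec), -k)[-k:]
+        return idx, vec[idx]
+
+    def decompress(idx, vals, dim):
+        out = np.zeros(dim)
+        out[idx] = vals
+        return out
+
+    rng = np.random.default_rng(0)
+    g = rng.standard_normal(1000)              # e.g. a flattened model update
+    idx, vals = top_k_compress(g, k=50)
+    g_hat = decompress(idx, vals, g.size)
+    print("kept coords:", idx.size, "relative error:",
+          np.linalg.norm(g - g_hat) / np.linalg.norm(g))
+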
+
+
+
+ comment: 55 pages, 12 figures, 12 tables
+
+
+
+
+
+
+ ♻ ☆ Perturb and Recover: Fine-tuning for Effective Backdoor Removal from
+ CLIP
+
+
+
+
+
+
+
+
+ Naman Deep Singh, Francesco Croce, Matthias Hein
+
+
+ Vision-Language models like CLIP have been shown to be highly effective at
+linking visual perception and natural language understanding, enabling
+sophisticated image-text capabilities, including strong retrieval and zero-shot
+classification performance. Their widespread use, as well as the fact that CLIP
+models are trained on image-text pairs from the web, make them both a
+worthwhile and relatively easy target for backdoor attacks. As training
+foundational models, such as CLIP, from scratch is very expensive, this paper
+focuses on cleaning potentially poisoned models via fine-tuning. We first show
+that existing cleaning techniques are not effective against simple structured
+triggers used in Blended or BadNet backdoor attacks, exposing a critical
+vulnerability for potential real-world deployment of these models. Then, we
+introduce PAR, Perturb and Recover, a surprisingly simple yet effective
+mechanism to remove backdoors from CLIP models. Through extensive experiments
+across different encoders and types of backdoor attacks, we show that PAR
+achieves high backdoor removal rate while preserving good standard performance.
+Finally, we illustrate that our approach is effective even only with synthetic
+text-image pairs, i.e. without access to real training data. The code and
+models are available at https://github.com/nmndeep/PerturbAndRecover.
+
+
+
+
+
+
+
+ ♻ ☆ Parallel simulation for sampling under isoperimetry and score-based
+ diffusion models
+
+
+ In recent years, there has been a surge of interest in proving discretization
+bounds for sampling under isoperimetry and for diffusion models. As data size
+grows, reducing the iteration cost becomes an important goal. Inspired by the
+great success of the parallel simulation of the initial value problem in
+scientific computation, we propose parallel Picard methods for sampling tasks.
+Rigorous theoretical analysis reveals that our algorithm achieves better
+dependence on dimension $d$ than prior works in iteration complexity (i.e.,
+reduced from $\widetilde{O}(\log^2 d)$ to $\widetilde{O}(\log d)$), which is
+even optimal for sampling under isoperimetry with specific iteration
+complexity. Our work highlights the potential advantages of simulation methods
+in scientific computation for dynamics-based sampling and diffusion models.
+
+
+
+
+
+
+
+ ♻ ☆ FedAA: A Reinforcement Learning Perspective on Adaptive Aggregation for
+ Fair and Robust Federated Learning AAAI 2025
+
+
+ Federated Learning (FL) has emerged as a promising approach for
+privacy-preserving model training across decentralized devices. However, it
+faces challenges such as statistical heterogeneity and susceptibility to
+adversarial attacks, which can impact model robustness and fairness.
+Personalized FL attempts to provide some relief by customizing models for
+individual clients. However, it falls short in addressing server-side
+aggregation vulnerabilities. We introduce a novel method called \textbf{FedAA},
+which optimizes client contributions via \textbf{A}daptive \textbf{A}ggregation
+to enhance model robustness against malicious clients and ensure fairness
+across participants in non-identically distributed settings. To achieve this
+goal, we propose an approach involving a Deep Deterministic Policy
+Gradient-based algorithm for continuous control of aggregation weights, an
+innovative client selection method based on model parameter distances, and a
+reward mechanism guided by validation set performance. Empirically, extensive
+experiments demonstrate that, in terms of robustness, \textbf{FedAA}
+outperforms the state-of-the-art methods, while maintaining comparable levels
+of fairness, offering a promising solution to build resilient and fair
+federated systems. Our code is available at https://github.com/Gp1g/FedAA.
+
+
+
+ comment: AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Scikit-fingerprints: easy and efficient computation of molecular
+ fingerprints in Python
+
+
+ In this work, we present scikit-fingerprints, a Python package for
+computation of molecular fingerprints for applications in chemoinformatics. Our
+library offers an industry-standard scikit-learn interface, allowing intuitive
+usage and easy integration with machine learning pipelines. It is also highly
+optimized, featuring parallel computation that enables efficient processing of
+large molecular datasets. Currently, scikit-fingerprints stands as the most
+feature-rich library in the open source Python ecosystem, offering over 30
+molecular fingerprints. Our library simplifies chemoinformatics tasks based on
+molecular fingerprints, including molecular property prediction and virtual
+screening. It is also flexible, highly efficient, and fully open source.
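+
+ A hypothetical usage sketch of the scikit-learn-style interface described
+above; the skfp import path, the ECFPFingerprint class name, and the n_jobs
+argument are assumptions and may differ between versions:
+
+    # Hypothetical sketch of a scikit-learn-style fingerprint pipeline; the
+    # `skfp` import path and `ECFPFingerprint` class name are assumptions.
+    from sklearn.ensemble import RandomForestClassifier
+    from sklearn.pipeline import make_pipeline
+
+    from skfp.fingerprints import ECFPFingerprint  # assumed API
+
+    smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CCN"]
+    labels = [0, 1, 0, 1]
+
+    pipe = make_pipeline(
+        ECFPFingerprint(n_jobs=-1),   # parallel fingerprint computation (assumed)
+        RandomForestClassifier(random_state=0),
+    )
+    pipe.fit(smiles, labels)
+    print(pipe.predict(["CCC"]))
+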
+
+
+ In this paper we consider a nonconvex unconstrained optimization problem
+minimizing a twice differentiable objective function with H\"older continuous
+Hessian. Specifically, we first propose a Newton-conjugate gradient (Newton-CG)
+method for finding an approximate first- and second-order stationary point of
+this problem, assuming the associated H\"older parameters are explicitly
+known. Then we develop a parameter-free Newton-CG method without requiring any
+prior knowledge of these parameters. To the best of our knowledge, this method
+is the first parameter-free second-order method achieving the best-known
+iteration and operation complexity for finding an approximate first- and
+second-order stationary point of this problem. Finally, we present preliminary
+numerical results to demonstrate the superior practical performance of our
+parameter-free Newton-CG method over a well-known regularized Newton method.
+
+
+
+ comment: arXiv admin note: text overlap with arXiv:2301.03139
+
+
+
+
+
+
+ ♻ ☆ Injectivity of ReLU networks: perspectives from statistical physics
+
+
+
+
+
+
+
+
+ Antoine Maillard, Afonso S. Bandeira, David Belius, Ivan Dokmanić, Shuta Nakajima
+
+
+ When can the input of a ReLU neural network be inferred from its output? In
+other words, when is the network injective? We consider a single layer, $x
+\mapsto \mathrm{ReLU}(Wx)$, with a random Gaussian $m \times n$ matrix $W$, in
+a high-dimensional setting where $n, m \to \infty$. Recent work connects this
+problem to spherical integral geometry giving rise to a conjectured sharp
+injectivity threshold for $\alpha = \frac{m}{n}$ by studying the expected Euler
+characteristic of a certain random set. We adopt a different perspective and
+show that injectivity is equivalent to a property of the ground state of the
+spherical perceptron, an important spin glass model in statistical physics. By
+leveraging the (non-rigorous) replica symmetry-breaking theory, we derive
+analytical equations for the threshold whose solution is at odds with that from
+the Euler characteristic. Furthermore, we use Gordon's min--max theorem to
+prove that a replica-symmetric upper bound refutes the Euler characteristic
+prediction. Along the way we aim to give a tutorial-style introduction to key
+ideas from statistical physics in an effort to make the exposition accessible
+to a broad audience. Our analysis establishes a connection between spin glasses
+and integral geometry but leaves open the problem of explaining the
+discrepancies.
+
+
+
+ comment: 62 pages ; Changes to match the published version (v2), in particular
+ Appendix A.7 was added, and Appendix G was re-worked as an alternative proof
+ of Theorem 1.8
+
+
+
+
+
+
+ ♻ ☆ Personalized Coupled Tensor Decomposition for Multimodal Data Fusion:
+ Uniqueness and Algorithms
+
+
+
+
+
+
+
+
+ Ricardo Augusto Borsoi, Konstantin Usevich, David Brie, Tülay Adali
+
+
+ Coupled tensor decompositions (CTDs) perform data fusion by linking factors
+from different datasets. Although many CTDs have been already proposed, current
+works do not address important challenges of data fusion, where: 1) the
+datasets are often heterogeneous, constituting different "views" of a given
+phenomena (multimodality); and 2) each dataset can contain personalized or
+dataset-specific information, constituting distinct factors that are not
+coupled with other datasets. In this work, we introduce a personalized CTD
+framework tackling these challenges. A flexible model is proposed where each
+dataset is represented as the sum of two components, one related to a common
+tensor through a multilinear measurement model, and another specific to each
+dataset. Both the common and distinct components are assumed to admit a
+polyadic decomposition. This generalizes several existing CTD models. We
+provide conditions for specific and generic uniqueness of the decomposition
+that are easy to interpret. These conditions employ uni-mode uniqueness of
+different individual datasets and properties of the measurement model. Two
+algorithms are proposed to compute the common and distinct components: a
+semi-algebraic one and a coordinate-descent optimization method. Experimental
+results illustrate the advantage of the proposed framework compared with the
+state of the art approaches.
+
+
+
+
+
+
+
+ ♻ ☆ A Multi-Stage Framework for Joint Chest X-Ray Diagnosis and Visual
+ Attention Prediction Using Deep Learning
+
+
+ Purpose: As visual inspection is an inherent process during radiological
+screening, the associated eye gaze data can provide valuable insights into
+relevant clinical decisions. As deep learning has become the state-of-the-art
+for computer-assisted diagnosis, integrating human behavior, such as eye gaze
+data, into these systems is instrumental to help align machine predictions with
+clinical diagnostic criteria, thus enhancing the quality of automatic
+radiological diagnosis. Methods: We propose a novel deep learning framework for
+joint disease diagnosis and prediction of corresponding clinical visual
+attention maps for chest X-ray scans. Specifically, we introduce a new
+dual-encoder multi-task UNet, which leverages both a DenseNet201 backbone and a
+Residual and Squeeze-and-Excitation block-based encoder to extract diverse
+features for visual attention map prediction, and a multi-scale feature-fusion
+classifier to perform disease classification. To tackle the issue of
+asynchronous training schedules of individual tasks in multi-task learning, we
+propose a multi-stage cooperative learning strategy, with contrastive learning
+for feature encoder pretraining to boost performance. Results: Our proposed
+method is shown to significantly outperform existing techniques for chest X-ray
+diagnosis (AUC=0.93) and the quality of visual attention map prediction
+(Correlation coefficient=0.58). Conclusion: Benefiting from the proposed
+multi-task multi-stage cooperative learning, our technique demonstrates the
+benefit of integrating clinicians' eye gaze into clinical AI systems to boost
+performance and potentially explainability.
+
+
+ A quaternion contains one real part and three imaginary parts, providing a
+more expressive hypercomplex space for learning knowledge graph embeddings.
+Existing
+quaternion embedding models measure the plausibility of a triplet either
+through semantic matching or geometric distance scoring functions. However, it
+appears that semantic matching diminishes the separability of entities, while
+the distance scoring function weakens the semantics of entities. To address
+this issue, we propose a novel quaternion knowledge graph embedding model. Our
+model combines semantic matching with entity's geometric distance to better
+measure the plausibility of triplets. Specifically, in the quaternion space, we
+perform a right rotation on head entity and a reverse rotation on tail entity
+to learn rich semantic features. Then, we utilize distance adaptive
+translations to learn geometric distance between entities. Furthermore, we
+provide mathematical proofs to demonstrate our model can handle complex logical
+relationships. Extensive experimental results and analyses show our model
+significantly outperforms previous models on well-known knowledge graph
+completion benchmark datasets. Our code is available at
+https://github.com/llqy123/DaBR.
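+
+ For intuition about the quaternion rotations mentioned above, a small numpy
+sketch of the Hamilton product and a rotation-style triple score follows; the
+way the semantic-matching and distance terms are combined is a simplification,
+not the authors' exact scoring function:
+
+    import numpy as np
+
+    def hamilton(p, q):
+        """Hamilton product of quaternions stored as (..., 4) arrays (w, x, y, z)."""
+        w1, x1, y1, z1 = np.moveaxis(p, -1, 0)
+        w2, x2, y2, z2 = np.moveaxis(q, -1, 0)
+        return np.stack([w1*w2 - x1*x2 - y1*y2 - z1*z2,
+                         w1*x2 + x1*w2 + y1*z2 - z1*y2,
+                         w1*y2 - x1*z2 + y1*w2 + z1*x2,
+                         w1*z2 + x1*y2 - y1*x2 + z1*w2], axis=-1)
+
+    def unit(q):
+        return q / np.linalg.norm(q, axis=-1, keepdims=True)
+
+    rng = np.random.default_rng(0)
+    h, r, t = rng.normal(size=(3, 4))               # one quaternion each
+    h_rot = hamilton(h[None], unit(r[None]))[0]     # rotate head by unit relation
+    semantic = float(np.dot(h_rot, t))              # semantic-matching term
+    distance = float(np.linalg.norm(h_rot - t))     # geometric-distance term
+    score = semantic - distance                     # simplified combination (assumption)
+    print(score)
+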
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ♻ ☆ A Survey of Artificial Intelligence in Gait-Based Neurodegenerative
+ Disease Diagnosis
+
+
+ Recent years have witnessed an increasing global population affected by
+neurodegenerative diseases (NDs), which traditionally require extensive
+healthcare resources and human effort for medical diagnosis and monitoring. As
+a crucial disease-related motor symptom, human gait can be exploited to
+characterize different NDs. The current advances in artificial intelligence
+(AI) models enable automatic gait analysis for NDs identification and
+classification, opening a new avenue to facilitate faster and more
+cost-effective diagnosis of NDs. In this paper, we provide a comprehensive
+survey on recent progress of machine learning and deep learning based AI
+techniques applied to diagnosis of five typical NDs through gait. We provide an
+overview of the process of AI-assisted NDs diagnosis, and present a systematic
+taxonomy of existing gait data and AI models. Meanwhile, a novel quality
+evaluation criterion is proposed to quantitatively assess the quality of
+existing studies. Through an extensive review and analysis of 169 studies, we
+present recent technical advancements, discuss existing challenges, potential
+solutions, and future directions in this field. Finally, we envision the
+prospective utilization of 3D skeleton data for human gait representation and
+the development of more efficient AI models for NDs diagnosis.
+
+
+
+ comment: Article: 57 pages, citing 290 papers. Appendix: 30 pages. An
+ up-to-date resource (papers, data, etc.) of this survey (AI4NDD) is provided
+ at https://github.com/minlinzeng/AI4NDD-Survey
+
+
+
+
+
+
+ ♻ ☆ Biology-inspired joint distribution neurons based on Hierarchical
+ Correlation Reconstruction allowing for multidirectional neural networks
+
+
+ Biological neural networks seem qualitatively superior (e.g. in learning,
+flexibility, robustness) to current artificial ones like the Multi-Layer
+Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast to them,
+biological networks have fundamentally multidirectional signal propagation
+\cite{axon}, also of probability distributions, e.g. for uncertainty
+estimation, and are believed to be unable to use standard backpropagation
+training \cite{backprop}. We propose novel artificial neurons based on HCR
+(Hierarchical Correlation Reconstruction) that remove the above low-level
+differences: neurons contain a local joint distribution model (of their
+connections), representing the joint density of normalized variables as a
+linear combination of orthonormal polynomials $(f_\mathbf{j})$:
+$\rho(\mathbf{x})=\sum_{\mathbf{j}\in B} a_\mathbf{j} f_\mathbf{j}(\mathbf{x})$
+for $\mathbf{x} \in [0,1]^d$ and some chosen basis $B\subset \mathbb{N}^d$. By
+various index summations of the $(a_\mathbf{j})_{\mathbf{j}\in B}$ tensor used
+as neuron parameters, we get simple formulas for e.g. conditional expected
+values for propagation in any direction, like $E[x|y,z]$, $E[y|x]$, which
+degenerate to a KAN-like parametrization when restricted to pairwise
+dependencies. Such an HCR network can also propagate probability distributions
+(also joint) like $\rho(y,z|x)$. It also allows for additional training
+approaches, like direct estimation of $(a_\mathbf{j})$, tensor decomposition,
+or a more biologically plausible information bottleneck training: layers
+directly influence only their neighbors, optimizing their content to maximize
+information about the next layer and minimize information about the previous
+one, to remove noise and extract crucial information.
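+
+ A small numpy sketch of the HCR construction above for $d=2$: estimate the
+coefficients $a_\mathbf{j}$ as sample means of basis products and propagate a
+conditional expectation $E[y|x]$ numerically; the rescaled Legendre basis and
+the grid-based normalization are implementation conveniences:
+
+    import numpy as np
+
+    # orthonormal (rescaled Legendre) basis on [0, 1]
+    basis = [lambda u: np.ones_like(u),
+             lambda u: np.sqrt(3.0) * (2 * u - 1),
+             lambda u: np.sqrt(5.0) * (6 * u**2 - 6 * u + 1)]
+
+    rng = np.random.default_rng(0)
+    x = rng.uniform(size=5000)
+    y = np.clip(x + 0.1 * rng.standard_normal(5000), 0.0, 1.0)  # dependent pair
+
+    # coefficient tensor a[j1, j2] = mean of f_{j1}(x) * f_{j2}(y)
+    a = np.array([[np.mean(f(x) * g(y)) for g in basis] for f in basis])
+
+    def cond_mean(x0, grid=np.linspace(0.0, 1.0, 201)):
+        """E[y | x = x0] from the modeled joint density rho = sum_j a_j f_j."""
+        fx = np.array([f(np.asarray(x0, dtype=float)) for f in basis])  # (3,)
+        fy = np.array([g(grid) for g in basis])                         # (3, n)
+        rho = np.clip(fx @ a @ fy, 1e-9, None)     # density values on the y grid
+        w = rho / rho.sum()                        # normalize to rho(y | x0)
+        return float(np.sum(grid * w))
+
+    print(cond_mean(0.2), cond_mean(0.8))          # tracks y ~ x on the toy data
+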
+
+
+ In this paper, we consider a class of non-convex and non-smooth sparse
+optimization problems, which encompass most existing nonconvex
+sparsity-inducing terms. We show that the second-order optimality conditions
+depend only on the nonzero entries of the stationary points. We propose two
+damped iterative reweighted algorithms, the damped iteratively reweighted
+$\ell_1$ algorithm (DIRL$_1$) and the damped iteratively reweighted $\ell_2$
+algorithm (DIRL$_2$), to solve these problems. For DIRL$_1$, we show that the
+reweighted $\ell_1$ subproblem has a support identification property, so that
+DIRL$_1$ locally reverts to a gradient descent algorithm around a stationary
+point. For DIRL$_2$, we show that the solution map of the reweighted $\ell_2$
+subproblem is differentiable and Lipschitz continuous everywhere. Therefore,
+the maps of DIRL$_1$ and DIRL$_2$ and their inverses are Lipschitz continuous,
+and the strict saddle points are their unstable fixed points. By applying the
+stable manifold theorem, these algorithms are shown to converge only to local
+minimizers under random initialization when the strict saddle point property
+is assumed.
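+
+ As a generic illustration of the reweighting scheme (not the paper's exact
+algorithm: the log-type penalty, step size, damping factor theta, and problem
+sizes below are assumptions), a nonconvex penalty is replaced by a weighted
+$\ell_1$ surrogate whose weights are refreshed, with damping, from the current
+iterate, and each outer step applies one proximal-gradient update to a
+least-squares data term.
+
+import numpy as np
+
+rng = np.random.default_rng(1)
+A = rng.normal(size=(100, 200))
+x_true = np.zeros(200); x_true[:5] = 3.0
+b = A @ x_true + 0.01 * rng.normal(size=100)
+
+lam, eps, theta = 0.1, 0.1, 0.5           # penalty weight, smoothing, damping
+step = 1.0 / np.linalg.norm(A, 2) ** 2    # step size for grad of ||Ax-b||^2/2
+
+x = np.zeros(200)
+w = np.full(200, lam / eps)               # initial reweighting weights
+for _ in range(500):
+    # damped reweighting: w_i tracks the penalty derivative at |x_i|
+    w = (1 - theta) * w + theta * lam / (eps + np.abs(x))
+    # one proximal-gradient step on ||Ax - b||^2 / 2 + sum_i w_i |x_i|
+    z = x - step * (A.T @ (A @ x - b))
+    x = np.sign(z) * np.maximum(np.abs(z) - step * w, 0.0)
+
+print("nonzeros in final iterate:", int(np.sum(np.abs(x) > 1e-3)))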
+
+
+
+ comment: 24 pages
+
+
+
+
+
+
+ ♻ ☆ PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU SOSP 2024
+
+
+ This paper introduces PowerInfer, a high-speed Large Language Model (LLM)
+inference engine on a personal computer (PC) equipped with a single
+consumer-grade GPU. The key principle underlying the design of PowerInfer is
+exploiting the high locality inherent in LLM inference, characterized by a
+power-law distribution in neuron activation. This distribution indicates that a
+small subset of neurons, termed hot neurons, are consistently activated across
+inputs, while the majority, cold neurons, vary based on specific inputs.
+PowerInfer exploits such an insight to design a GPU-CPU hybrid inference
+engine: hot-activated neurons are preloaded onto the GPU for fast access, while
+cold-activated neurons are computed on the CPU, thus significantly reducing GPU
+memory demands and CPU-GPU data transfers. PowerInfer further integrates
+adaptive predictors and neuron-aware sparse operators, optimizing the
+efficiency of neuron activation and computational sparsity. The evaluation
+shows that PowerInfer significantly outperforms llama.cpp by up to 11.69x while
+retaining model accuracy across various LLMs (including OPT-175B) on a single
+NVIDIA RTX 4090 GPU. For the OPT-30B model, PowerInfer achieves performance
+comparable to that of a high-end server-grade A100 GPU, reaching 82% of its
+token generation rate on a single consumer-grade RTX 4090 GPU.
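+
+ A toy sketch of the hot/cold split behind the hybrid engine (the profiling
+data, the 20% GPU budget, and the layer size are assumptions; this is not
+PowerInfer's code): profile how often each FFN neuron activates, pin the most
+frequently activated ones to the GPU, and leave the rest to the CPU.
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+n_neurons, n_samples = 11008, 512
+# Synthetic profiling pass with power-law-like activation frequencies.
+freq = rng.pareto(a=1.2, size=n_neurons)
+activations = rng.random((n_samples, n_neurons)) < (freq / freq.max())
+
+counts = activations.sum(axis=0)            # how often each neuron fired
+order = np.argsort(-counts)
+gpu_budget = int(0.2 * n_neurons)           # assume 20% of neurons fit on GPU
+hot = order[:gpu_budget]                    # preloaded onto the GPU
+cold = order[gpu_budget:]                   # computed on the CPU on demand
+
+covered = activations[:, hot].sum() / activations.sum()
+print(f"{len(hot)} GPU-resident neurons cover {covered:.1%} of activations")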
+
+
+
+ comment: SOSP 2024
+
+
+
+
+
+
+ ♻ ☆ CommonPower: A Framework for Safe Data-Driven Smart Grid Control
+
+
+ The growing complexity of power system management has led to an increased
+interest in reinforcement learning (RL). However, vanilla RL controllers cannot
+themselves ensure satisfaction of system constraints. Therefore, combining them
+with formally correct safeguarding mechanisms is an important aspect when
+studying RL for power system management. Integrating safeguarding into complex
+use cases requires tool support. To address this need, we introduce the Python
+tool CommonPower. CommonPower's unique contribution lies in its symbolic
+modeling approach, which enables flexible, model-based safeguarding of RL
+controllers. Moreover, CommonPower offers a unified interface for single-agent
+RL, multi-agent RL, and optimal control, with seamless integration of different
+forecasting methods. This allows users to validate the effectiveness of safe RL
+controllers across a large variety of case studies and investigate the
+influence of specific aspects on overall performance. We demonstrate
+CommonPower's versatility through a numerical case study that compares RL
+agents featuring different safeguards with a model predictive controller in the
+context of building energy management.
+
+
+
+ comment: For the corresponding code repository, see
+ https://github.com/TUMcps/commonpower
+
+
+
+
+
+
+ ♻ ☆ PowerInfer-2: Fast Large Language Model Inference on a Smartphone
+
+
+ Large language models (LLMs) on smartphones enable real-time AI assistance
+and privacy-preserving, offline operation. However, resource constraints of
+smartphones limit current deployments to small language models (SLMs),
+significantly compromising their capabilities. This paper introduces
+PowerInfer-2, a smartphone-based framework that enables fast inference for
+LLMs exceeding the device's memory capacity. The key insight is decomposing
+matrix operations into neuron clusters as the basic processing unit, which
+enables flexible scheduling and efficient I/O-computation pipelining.
+PowerInfer-2 leverages this neuron-cluster-based design in both computation and
+storage. For computation, neuron clusters with dense activations are processed
+on the NPU, while sparse clusters are computed on the CPU. The storage engine
+provides a fine-grained pipeline
+mechanism that coordinates cluster-level computation and I/O operations,
+enhanced by a segmented neuron cache to reduce I/O activities. PowerInfer-2
+achieves up to a 27.8x speed increase compared to state-of-the-art frameworks.
+PowerInfer-2 is the first system to serve a 47B LLM on a smartphone, achieving
+11.68 tokens/s. Notably, these performance improvements preserve model quality
+with negligible accuracy degradation.
+
+
+
+
+
+
+
+ ♻ ☆ A second-order-like optimizer with adaptive gradient scaling for deep
+ learning
+
+
+
+
+
+
+
+
+ Jérôme Bolte, Ryan Boustany, Edouard Pauwels, Andrei Purica
+
+
+ In this empirical article, we introduce INNAprop, an optimization algorithm
+that combines the INNA method with the RMSprop adaptive gradient scaling. It
+leverages second-order information and rescaling while keeping the memory
+requirements of standard DL methods such as AdamW or SGD with momentum. After
+giving geometrical insights, we evaluate INNAprop on CIFAR-10, Food101, and
+ImageNet with ResNets, VGG, DenseNet, and ViT, and on GPT-2 (OpenWebText)
+trained from scratch and with LoRA fine-tuning (E2E). INNAprop consistently
+matches or
+outperforms AdamW both in training speed and accuracy, with minimal
+hyperparameter tuning in large-scale settings. Our code is publicly available
+at \url{https://github.com/innaprop/innaprop}.
+
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Multi-scale Approach for Speech and Dynamics Synchrony
+ in Talking Head Generation
+
+
+ Animating still face images with deep generative models using a speech input
+signal is an active research topic and has seen important recent progress.
+However, much of the effort has been put into lip syncing and rendering quality
+while the generation of natural head motion, let alone the audio-visual
+correlation between head motion and speech, has often been neglected. In this
+work, we propose a multi-scale audio-visual synchrony loss and a multi-scale
+autoregressive GAN to better handle short and long-term correlation between
+speech and the dynamics of the head and lips. In particular, we train a stack
+of syncer models on multimodal input pyramids and use these models as guidance
+in a multi-scale generator network to produce audio-aligned motion unfolding
+over diverse time scales. Both the pyramid of audio-visual syncers and the
+generative models are trained in a low-dimensional space that fully preserves
+dynamics cues. The experiments show significant improvements over the
+state-of-the-art in head motion dynamics quality and especially in multi-scale
+audio-visual synchrony on a collection of benchmark datasets.
+
+
+
+
+
+
+
+ ♻ ☆ How Likely Do LLMs with CoT Mimic Human Reasoning? COLING 2025
+
+
+ Chain-of-thought emerges as a promising technique for eliciting reasoning
+capabilities from Large Language Models (LLMs). However, it does not always
+improve task performance or accurately represent reasoning processes, leaving
+unresolved questions about its usage. In this paper, we diagnose the underlying
+mechanism by comparing the reasoning process of LLMs with humans, using causal
+analysis to understand the relationships between the problem instruction,
+reasoning, and the answer in LLMs. Our empirical study reveals that LLMs often
+deviate from the ideal causal chain, resulting in spurious correlations and
+potential consistency errors (inconsistent reasoning and answers). We also
+examine various factors influencing the causal structure, finding that
+in-context learning with examples strengthens it, while post-training
+techniques like supervised fine-tuning and reinforcement learning on human
+feedback weaken it. To our surprise, the causal structure cannot be
+strengthened by enlarging the model size only, urging research on new
+techniques. We hope that this preliminary study will shed light on
+understanding and improving the reasoning process in LLMs.
+
+
+
+ comment: COLING 2025 Camera Version (8 pages, 3 figures, 18 tables)
+
+
+
+
+
+
+ ♻ ☆ ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity
+ within Large Language Models
+
+
+ Activation sparsity refers to the existence of a considerable number of
+weakly contributing elements among activation outputs. As a prevalent property
+of models using the ReLU activation function, activation sparsity has been
+proven a promising paradigm for boosting model inference efficiency. Nevertheless,
+most large language models (LLMs) adopt activation functions without intrinsic
+activation sparsity (e.g., GELU and Swish). Some recent efforts have explored
+introducing ReLU or its variants as the substitutive activation function to
+help LLMs achieve activation sparsity and inference acceleration, but few can
+simultaneously obtain high sparsity and comparable model performance. This
+paper introduces a simple and effective sparsification method named "ProSparse"
+to push LLMs for higher activation sparsity while maintaining comparable
+performance. Specifically, after substituting the activation function of LLMs
+with ReLU, ProSparse adopts progressive sparsity regularization with a factor
+smoothly increasing along multi-stage sine curves. This can enhance
+activation sparsity and mitigate performance degradation by avoiding radical
+shifts in activation distributions. With ProSparse, we obtain high sparsity of
+89.32% for LLaMA2-7B, 88.80% for LLaMA2-13B, and 87.89% for end-size
+MiniCPM-1B, respectively, achieving comparable performance to their original
+Swish-activated versions. These present the most sparsely activated models
+among open-source LLaMA versions and competitive end-size models, considerably
+surpassing ReluLLaMA-7B (66.98%) and ReluLLaMA-13B (71.56%). Our inference
+acceleration experiments further demonstrate the significant practical
+acceleration potential of LLMs with higher activation sparsity, obtaining up to
+4.52$\times$ inference speedup.
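+
+ A small sketch of a progressively increasing regularization factor with
+smooth, sine-shaped ramps per stage (the stage boundaries, target values, and
+the exact curve are illustrative assumptions, not ProSparse's released
+schedule):
+
+import numpy as np
+
+def sparsity_factor(step, stage_bounds, stage_targets):
+    # Rise smoothly (quarter-sine) within each stage toward that stage's target.
+    prev_target = 0.0
+    for (start, end), target in zip(stage_bounds, stage_targets):
+        if step < end:
+            frac = np.clip((step - start) / max(end - start, 1), 0.0, 1.0)
+            return prev_target + (target - prev_target) * np.sin(0.5 * np.pi * frac)
+        prev_target = target
+    return stage_targets[-1]
+
+bounds = [(0, 1000), (1000, 3000), (3000, 6000)]
+targets = [1e-6, 1e-5, 5e-5]
+for s in (0, 500, 1000, 2000, 3000, 4500, 6000):
+    print(s, sparsity_factor(s, bounds, targets))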
+
+
+
+ comment: 19 pages, 4 figures, 9 tables
+
+
+
+
+
+
+ ♻ ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission
+ of the Global South
+
+
+ Recent advances in generative AI have sparked renewed interest and expanded
+possibilities for music generation. However, the performance and versatility of
+these systems across musical genres are heavily influenced by the availability
+of training data. We conducted an extensive analysis of over one million hours
+of audio datasets used in AI music generation research and manually reviewed
+more than 200 papers from eleven prominent AI and music conferences and
+organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,
+NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and
+inclusion of the musical genres of the Global South in AI research. Our
+findings reveal a stark imbalance: approximately 86% of the total dataset hours
+and over 93% of researchers focus primarily on music from the Global North.
+However, while around 40% of these datasets include some form of non-Western
+music, genres from the Global South account for only 14.6% of the data. Furthermore,
+approximately 51% of the papers surveyed concentrate on symbolic music
+generation, a method that often fails to capture the cultural nuances inherent
+in music from regions such as South Asia, the Middle East, and Africa. As AI
+increasingly shapes the creation and dissemination of music, the significant
+underrepresentation of music genres in datasets and research presents a serious
+threat to global musical diversity. We also propose some important steps to
+mitigate these risks and foster a more inclusive future for AI-driven music
+generation.
+
+
+
+ comment: Submitted to CACM, 12 pages, 2 figures
+
+
+
+
+
+
+ ♻ ☆ Large language models as oracles for instantiating ontologies with
+ domain-specific knowledge
+
+
+
+
+
+
+
+
+ Giovanni Ciatto, Andrea Agiollo, Matteo Magnini, Andrea Omicini
+
+
+ Background. Endowing intelligent systems with semantic data commonly requires
+designing and instantiating ontologies with domain-specific knowledge.
+Especially in the early phases, those activities are typically performed
+manually by human experts, possibly leveraging their own experience. The
+resulting process is therefore time-consuming, error-prone, and often biased by
+the personal background of the ontology designer. Objective. To mitigate that
+issue, we propose a novel domain-independent approach to automatically
+instantiate ontologies with domain-specific knowledge, by leveraging large
+language models (LLMs) as oracles. Method. Starting from (i) an initial schema
+composed of inter-related classes and properties and (ii) a set of query
+templates, our method queries the LLM multiple times, and generates instances
+for both classes and properties from its replies. Thus, the ontology is
+automatically filled with domain-specific knowledge, compliant to the initial
+schema. As a result, the ontology is quickly and automatically enriched with
+manifold instances, which experts may consider to keep, adjust, discard, or
+complement according to their own needs and expertise. Contribution. We
+formalise our method in a general way and instantiate it over various LLMs, as
+well as on a concrete case study. We report experiments rooted in the
+nutritional domain where an ontology of food meals and their ingredients is
+automatically instantiated from scratch, starting from a categorisation of
+meals and their relationships. There, we analyse the quality of the generated
+ontologies and compare ontologies attained by exploiting different LLMs.
+Experimentally, our approach achieves a quality metric that is up to five times
+higher than the state-of-the-art, while reducing erroneous entities and
+relations by up to ten times. Finally, we provide a SWOT analysis of the
+proposed method.
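+
+ A schematic sketch of the query-template loop (the templates, the ask_llm
+placeholder, and the dict-based "ontology" container are hypothetical
+stand-ins, not the authors' implementation): instances are requested per
+class, then properties are filled according to the schema.
+
+def ask_llm(prompt: str) -> list:
+    # Placeholder for a call to an LLM oracle returning short textual answers.
+    return ["example answer"]
+
+def instantiate(schema: dict, templates: dict, seed_classes: list) -> dict:
+    ontology = {"individuals": {}, "relations": []}
+    for cls in seed_classes:
+        # 1) class population: ask the oracle for instances of the class
+        for name in ask_llm(templates["instances"].format(cls=cls)):
+            ontology["individuals"].setdefault(cls, set()).add(name)
+            # 2) property filling: ask for related individuals per schema property
+            for prop, target_cls in schema.get(cls, {}).items():
+                for obj in ask_llm(templates["property"].format(
+                        subject=name, prop=prop, cls=target_cls)):
+                    ontology["relations"].append((name, prop, obj))
+    return ontology
+
+templates = {"instances": "List 10 common instances of the class '{cls}'.",
+             "property": "For '{subject}', list values of '{prop}' (a '{cls}')."}
+schema = {"Meal": {"hasIngredient": "Ingredient"}}
+print(instantiate(schema, templates, ["Meal"]))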
+
+
+
+
+
+
+
+ ♻ ☆ Watermarking Training Data of Music Generation Models
+
+
+
+
+
+
+
+
+ Pascal Epple, Igor Shilov, Bozhidar Stevanoski, Yves-Alexandre de Montjoye
+
+
+ Generative Artificial Intelligence (Gen-AI) models are increasingly used to
+produce content across domains, including text, images, and audio. While these
+models represent a major technical breakthrough, they gain their generative
+capabilities from being trained on enormous amounts of human-generated content,
+which often includes copyrighted material. In this work, we investigate whether
+audio watermarking techniques can be used to detect an unauthorized usage of
+content to train a music generation model. We compare outputs generated by a
+model trained on watermarked data to a model trained on non-watermarked data.
+We study factors that impact the model's generation behaviour: the watermarking
+technique, the proportion of watermarked samples in the training set, and the
+robustness of the watermarking technique against the model's tokenizer. Our
+results show that audio watermarking techniques, including some that are
+imperceptible to humans, can lead to noticeable shifts in the model's outputs.
+We also study the robustness of a state-of-the-art watermarking technique to
+removal techniques.
+
+
+
+
+
+
+
+ ♻ ☆ Golden Noise for Diffusion Models: A Learning Framework
+
+
+ The text-to-image diffusion model is a popular paradigm that synthesizes
+personalized images from a text prompt and a random Gaussian noise.
+While people observe that some noises are ``golden noises'' that can achieve
+better text-image alignment and higher human preference than others, we still
+lack a machine learning framework to obtain those golden noises. To learn
+golden noises for diffusion sampling, we mainly make three contributions in
+this paper. First, we identify a new concept termed the \textit{noise prompt},
+which aims at turning a random Gaussian noise into a golden noise by adding a
+small desirable perturbation derived from the text prompt. Following the
+concept, we first formulate the \textit{noise prompt learning} framework that
+systematically learns ``prompted'' golden noise associated with a text prompt
+for diffusion models. Second, we design a noise prompt data collection pipeline
+and collect a large-scale \textit{noise prompt dataset}~(NPD) that contains
+100k pairs of random noises and golden noises with the associated text prompts.
+With the prepared NPD as the training dataset, we train a small \textit{noise
+prompt network}~(NPNet) that can directly learn to transform a random noise
+into a golden noise. The learned golden noise perturbation can be considered as
+a kind of prompt for noise, as it is rich in semantic information and tailored
+to the given text prompt. Third, our extensive experiments demonstrate the
+impressive effectiveness and generalization of NPNet on improving the quality
+of synthesized images across various diffusion models, including SDXL,
+DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and
+efficient controller that acts as a plug-and-play module with very limited
+additional inference and computational costs, as it just provides a golden
+noise instead of a random noise without accessing the original pipeline.
+
+
+
+
+
+
+
+
+ Núria Armengol Urpí, Marco Bagatella, Marin Vlastelica, Georg Martius
+
+
+ Offline data are both valuable and practical resources for teaching robots
+complex behaviors. Ideally, learning agents should not be constrained by the
+scarcity of available demonstrations, but rather generalize beyond the training
+distribution. However, the complexity of real-world scenarios typically
+requires huge amounts of data to prevent neural network policies from picking
+up on spurious correlations and learning non-causal relationships. We propose
+CAIAC, a data augmentation method that can create feasible synthetic
+transitions from a fixed dataset without having access to online environment
+interactions. By utilizing principled methods for quantifying causal influence,
+we are able to perform counterfactual reasoning by swapping
+$\it{action}$-unaffected parts of the state-space between independent
+trajectories in the dataset. We empirically show that this leads to a
+substantial increase in robustness of offline learning algorithms against
+distributional shift.
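+
+ A minimal sketch of the counterfactual swap (the influence mask is simply
+given here, whereas estimating it from a causal influence measure is the core
+of the actual method; array shapes and names are assumptions):
+
+import numpy as np
+
+def counterfactual_augment(states, actions, influence_mask, rng):
+    # Keep the action-affected state dimensions, swap the action-unaffected
+    # ones with those of an independently drawn partner transition.
+    partner = rng.permutation(len(states))
+    new_states = states.copy()
+    new_states[:, ~influence_mask] = states[partner][:, ~influence_mask]
+    return new_states, actions
+
+rng = np.random.default_rng(0)
+states = rng.normal(size=(128, 6))      # e.g. 6-dimensional observations
+actions = rng.normal(size=(128, 2))
+mask = np.array([True, True, False, False, False, False])  # assumed influence
+aug_states, aug_actions = counterfactual_augment(states, actions, mask, rng)
+print(aug_states.shape, aug_actions.shape)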
+
+
+
+ comment: Accepted in 41st International Conference on Machine Learning (ICML
+ 2024)
+
+
+
+
+
+
+
+ Tim Selig, Thomas März, Martin Storath, Andreas Weinmann
+
+
+ Computed tomography from a low radiation dose (LDCT) is challenging due to
+high noise in the projection data. Popular approaches for LDCT image
+reconstruction are two-stage methods, typically consisting of the filtered
+backprojection (FBP) algorithm followed by a neural network for LDCT image
+enhancement. Two-stage methods are attractive for their simplicity and
+potential for computational efficiency, typically requiring only a single FBP
+and a neural network forward pass for inference. However, the best
+reconstruction quality is currently achieved by unrolled iterative methods
+(Learned Primal-Dual and ItNet), which are more complex and thus have a higher
+computational cost for training and inference. We propose a method combining
+the simplicity and efficiency of two-stage methods with state-of-the-art
+reconstruction quality. Our strategy utilizes a neural network pretrained for
+Gaussian noise removal from natural grayscale images, fine-tuned for LDCT image
+enhancement. We call this method FBP-DTSGD (Domain and Task Shifted Gaussian
+Denoisers) as the fine-tuning is a task shift from Gaussian denoising to
+enhancing LDCT images and a domain shift from natural grayscale to LDCT images.
+An ablation study with three different pretrained Gaussian denoisers indicates
+that the performance of FBP-DTSGD does not depend on a specific denoising
+architecture, suggesting future advancements in Gaussian denoising could
+benefit the method. The study also shows that pretraining on natural images
+enhances LDCT reconstruction quality, especially with limited training data.
+Notably, pretraining involves no additional cost, as existing pretrained models
+are used. The proposed method currently holds the top mean position in the
+LoDoPaB-CT challenge.
+
+
+
+ comment: 13 pages, 4 figures
+
+
+
+
+
+
+ ♻ ☆ Vanilla Bayesian Optimization Performs Great in High Dimensions
+
+
+
+
+
+
+
+
+ Carl Hvarfner, Erik Orm Hellsten, Luigi Nardi
+
+
+ High-dimensional problems have long been considered the Achilles' heel of
+Bayesian optimization algorithms. Spurred by the curse of dimensionality, a
+large collection of algorithms aim to make it more performant in this setting,
+commonly by imposing various simplifying assumptions on the objective. In this
+paper, we identify the degeneracies that make vanilla Bayesian optimization
+poorly suited to high-dimensional tasks, and further show how existing
+algorithms address these degeneracies through the lens of lowering the model
+complexity. Moreover, we propose an enhancement to the prior assumptions that
+are typical to vanilla Bayesian optimization algorithms, which reduces the
+complexity to manageable levels without imposing structural restrictions on the
+objective. Our modification - a simple scaling of the Gaussian process
+lengthscale prior with the dimensionality - reveals that standard Bayesian
+optimization works drastically better than previously thought in high
+dimensions, clearly outperforming existing state-of-the-art algorithms on
+multiple commonly considered real-world high-dimensional tasks.
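+
+ A tiny sketch of the kind of dimension-scaled lengthscale prior described
+above (the lognormal family and the constants are illustrative assumptions,
+not the paper's exact prior): shifting the location by $0.5\log d$ makes
+typical lengthscales grow like $\sqrt{d}$.
+
+import numpy as np
+
+def sample_lengthscales(d, n_samples=5, mu0=0.0, sigma0=1.0, seed=0):
+    rng = np.random.default_rng(seed)
+    # Lognormal prior whose location parameter is shifted by 0.5 * log(d).
+    return np.exp(rng.normal(mu0 + 0.5 * np.log(d), sigma0, size=(n_samples, d)))
+
+for d in (2, 20, 200):
+    print(d, float(sample_lengthscales(d).mean()))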
+
+
+
+
+
+
+
+ ♻ ☆ A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts
+
+
+ Machine learning methods strive to acquire a robust model during the training
+process that can effectively generalize to test samples, even in the presence
+of distribution shifts. However, these methods often suffer from performance
+degradation due to unknown test distributions. Test-time adaptation (TTA), an
+emerging paradigm, has the potential to adapt a pre-trained model to unlabeled
+data during testing, before making predictions. Recent progress in this
+paradigm has highlighted the significant benefits of using unlabeled data to
+train self-adapted models prior to inference. In this survey, we categorize TTA
+into several distinct groups based on the form of test data, namely, test-time
+domain adaptation, test-time batch adaptation, and online test-time adaptation.
+For each category, we provide a comprehensive taxonomy of advanced algorithms
+and discuss various learning scenarios. Furthermore, we analyze relevant
+applications of TTA and discuss open challenges and promising areas for future
+research. For a comprehensive list of TTA methods, kindly refer to
+\url{https://github.com/tim-learn/awesome-test-time-adaptation}.
+
+
+
+ comment: Discussions, comments, and questions are all welcomed in
+ \url{https://github.com/tim-learn/awesome-test-time-adaptation}
+
+
+
+
+
+
+ ♻ ☆ AdaStop: adaptive statistical testing for sound comparisons of Deep RL
+ agents
+
+
+ Recently, the scientific community has questioned the statistical
+reproducibility of many empirical results, especially in the field of machine
+learning. To contribute to the resolution of this reproducibility crisis, we
+propose a theoretically sound methodology for comparing the performance of a
+set of algorithms. We exemplify our methodology in Deep Reinforcement Learning
+(Deep RL). The performance of one execution of a Deep RL algorithm is a random
+variable. Therefore, several independent executions are needed to evaluate its
+performance. When comparing algorithms with random performance, a major
+question concerns the number of executions to perform to ensure that the result
+of the comparison is theoretically sound. Researchers in Deep RL often use
+fewer than 5 independent executions to compare algorithms: we claim that this is not
+enough in general. Moreover, when comparing more than 2 algorithms at once, we
+have to use a multiple tests procedure to preserve low error guarantees. We
+introduce AdaStop, a new statistical test based on multiple group sequential
+tests. When used to compare algorithms, AdaStop adapts the number of executions
+to stop as early as possible while ensuring that enough information has been
+collected to distinguish algorithms that have different score distributions. We
+prove theoretically that AdaStop has a low probability of making a
+(family-wise) error. We illustrate the effectiveness of AdaStop in various
+use-cases, including toy examples and Deep RL algorithms on challenging Mujoco
+environments. AdaStop is the first statistical test fitted to this sort of
+comparison: it is both a significant contribution to statistics and an
+important contribution to computational studies performed in reinforcement
+learning and in other domains.
+
+
+
+
+
+
+
+
+ Cheng Tan, Zhangyang Gao, Siyuan Li, Stan Z. Li
+
+
+ Recent years have witnessed remarkable advances in spatiotemporal predictive
+learning, with methods incorporating auxiliary inputs, complex neural
+architectures, and sophisticated training strategies. While SimVP has
+introduced a simpler, CNN-based baseline for this task, it still relies on
+heavy Unet-like architectures for spatial and temporal modeling, which suffer
+from high complexity and computational overhead. In this paper, we
+propose SimVPv2, a streamlined model that eliminates the need for Unet
+architectures and demonstrates that plain stacks of convolutional layers,
+enhanced with an efficient Gated Spatiotemporal Attention mechanism, can
+deliver state-of-the-art performance. SimVPv2 not only simplifies the model
+architecture but also improves both performance and computational efficiency.
+On the standard Moving MNIST benchmark, SimVPv2 achieves superior performance
+compared to SimVP, with fewer FLOPs, about half the training time, and 60%
+faster inference efficiency. Extensive experiments across eight diverse
+datasets, including real-world tasks such as traffic forecasting and climate
+prediction, further demonstrate that SimVPv2 offers a powerful yet
+straightforward solution, achieving robust generalization across various
+spatiotemporal learning scenarios. We believe the proposed SimVPv2 can serve as
+a solid baseline to benefit the spatiotemporal predictive learning community.
+
+
+
+ comment: Accepted by TMM
+
+
+
+
+
+
+ ♻ ☆ A simple thinking about the application of the attention mechanism in
+ medical ultrasound image segmentation task
+
+
+ AI-based assisted diagnosis programs have been widely investigated for
+medical ultrasound images. The complex scenario of ultrasound images, in which
+the coupled interference of internal and external factors is severe, poses a
+unique challenge for localizing the object region automatically and precisely
+in ultrasound images. In this study, we seek to propose a more general and
+robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors in
+segmenting or diagnosing lesions and tissues in ultrasound images more quickly
+and accurately.
+Different from existing attention schemes, the BAAF consists of a parallel
+hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM).
+Specifically, BAAF first coarsely calibrates the input features from the
+channel and spatial dimensions, and then adaptively selects more robust lesion
+or tissue characterizations from the coarse-calibrated feature maps. The design
+of BAAF further optimizes the "what" and "where" focus and selection problems
+in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in
+medical ultrasound images. The method is evaluated on four medical ultrasound
+segmentation tasks, and the experimental results demonstrate a remarkable
+performance improvement over existing state-of-the-art methods. In
+addition, the comparison with existing attention mechanisms also demonstrates
+the superiority of BAAF. This work provides the possibility for automated
+medical ultrasound assisted diagnosis and reduces reliance on human accuracy
+and precision.
+
+
+
+ comment: 10 pages, 11 figures
+
+
+
+
+
+
+ ♻ ☆ Transfer Learning with Partially Observable Offline Data via Causal
+ Bounds
+
+
+ Transfer learning has emerged as an effective approach to accelerate learning
+by integrating knowledge from related source agents. However, challenges arise
+due to data heterogeneity, such as differences in feature sets or incomplete
+datasets, which often results in the nonidentifiability of causal effects. In
+this paper, we investigate transfer learning in partially observable contextual
+bandits, where agents operate with incomplete information and limited access to
+hidden confounders. To address the challenges posed by unobserved confounders,
+we formulate optimization problems to derive tight bounds on the
+nonidentifiable causal effects. We then propose an efficient method that
+discretizes the functional constraints of unknown distributions into linear
+constraints, allowing us to sample compatible causal models through a
+sequential process of solving linear programs. This method takes into account
+estimation errors and exhibits strong convergence properties, ensuring robust
+and reliable causal bounds. Leveraging these causal bounds, we improve
+classical bandit algorithms, achieving tighter regret upper and lower bounds
+relative to the sizes of action sets and function spaces. In tasks involving
+function approximation, which are crucial for handling complex context spaces,
+our method significantly improves the dependence on function space size
+compared to previous work. We formally prove that our causally enhanced
+algorithms outperform classical bandit algorithms, achieving notably faster
+convergence rates. The applicability of our approach is further illustrated
+through an example of offline pricing policy learning with censored
+demand. Simulations confirm the superiority of our approach over
+state-of-the-art methods, demonstrating its potential to enhance contextual
+bandit agents in real-world applications, especially when data is scarce,
+costly, or restricted due to privacy concerns.
+
+
+
+ comment: 57 pages
+
+
+
+
+
+
+ ♻ ☆ GARLIC: GPT-Augmented Reinforcement Learning with Intelligent Control
+ for Vehicle Dispatching AAAI 2025
+
+
+ As urban residents demand higher travel quality, vehicle dispatch has become
+a critical component of online ride-hailing services. However, current vehicle
+dispatch systems struggle to navigate the complexities of urban traffic
+dynamics, including unpredictable traffic conditions, diverse driver behaviors,
+and fluctuating supply and demand patterns. These challenges have resulted in
+travel difficulties for passengers in certain areas, while many drivers in
+other areas are unable to secure orders, leading to a decline in the overall
+quality of urban transportation services. To address these issues, this paper
+introduces GARLIC: a framework of GPT-Augmented Reinforcement Learning with
+Intelligent Control for vehicle dispatching. GARLIC utilizes multiview graphs
+to capture hierarchical traffic states, and learns a dynamic reward function
+that accounts for individual driving behaviors. The framework further
+integrates a GPT model trained with a custom loss function to enable
+high-precision predictions and optimize dispatching policies in real-world
+scenarios. Experiments conducted on two real-world datasets demonstrate that
+GARLIC effectively aligns with driver behaviors while reducing the empty load
+rate of vehicles.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ Application of Neural Ordinary Differential Equations for ITER Burning
+ Plasma Dynamics
+
+
+ The dynamics of burning plasmas in tokamaks are crucial for advancing
+controlled thermonuclear fusion. This study applies the NeuralPlasmaODE, a
+multi-region multi-timescale transport model, to simulate the complex energy
+transfer processes in ITER deuterium-tritium (D-T) plasmas. Our model captures
+the interactions between energetic alpha particles, electrons, and ions, which
+are vital for understanding phenomena such as thermal runaway instability. We
+employ neural ordinary differential equations (Neural ODEs) for the numerical
+derivation of diffusivity parameters, enabling precise modeling of energy
+interactions between different plasma regions. By leveraging transfer learning,
+we utilize model parameters derived from DIII-D experimental data, enhancing
+the efficiency and accuracy of our simulations without training from scratch.
+Applying this model to ITER's inductive and non-inductive operational
+scenarios, our results demonstrate that radiation and transport processes
+effectively remove excess heat from the core plasma, preventing thermal runaway
+instability. This study underscores the potential of machine learning in
+advancing our understanding and control of burning plasma dynamics in fusion
+reactors.
+
+
+
+
+
+
+
+ ♻ ☆ Training on the Test Task Confounds Evaluation and Emergence
+
+
+
+
+
+
+
+
+ Ricardo Dominguez-Olmedo, Florian E. Dorner, Moritz Hardt
+
+
+ We study a fundamental problem in the evaluation of large language models
+that we call training on the test task. Unlike wrongful practices like training
+on the test data, leakage, or data contamination, training on the test task is
+not a malpractice. Rather, the term describes a growing set of practices that
+utilize knowledge about evaluation tasks at training time. We demonstrate that
+training on the test task confounds both relative model evaluations and claims
+about emergent capabilities. We argue that the seeming superiority of one model
+family over another may be explained by a different degree of training on the
+test task. To this end, we propose an effective method to adjust for the effect
+of training on the test task on benchmark evaluations. Put simply, we fine-tune
+each model under comparison on the same task-relevant data before evaluation.
+We then show that instances of emergent behavior disappear gradually as models
+train on the test task. Our work promotes a new perspective on the evaluation
+of large language models with broad implications for benchmarking and the study
+of emergent capabilities.
+
+
+
+
+
+
+
+ ♻ ☆ Deep Learning and Machine Learning, Advancing Big Data Analytics and
+ Management: Unveiling AI's Potential Through Tools, Techniques, and
+ Applications
+
+
+
+
+
+
+
+
+ Pohsun Feng, Ziqian Bi, Yizhu Wen, Xuanhe Pan, Benji Peng, Ming Liu, Jiawei Xu, Keyu Chen, Junyu Liu, Caitlyn Heqi Yin, Sen Zhang, Jinlang Wang, Qian Niu, Ming Li, Tianyang Wang
+
+
+ Artificial intelligence (AI), machine learning, and deep learning have become
+transformative forces in big data analytics and management, enabling
+groundbreaking advancements across diverse industries. This article delves into
+the foundational concepts and cutting-edge developments in these fields, with a
+particular focus on large language models (LLMs) and their role in natural
+language processing, multimodal reasoning, and autonomous decision-making.
+Highlighting tools such as ChatGPT, Claude, and Gemini, the discussion explores
+their applications in data analysis, model design, and optimization.
+ The integration of advanced algorithms like neural networks, reinforcement
+learning, and generative models has enhanced the capabilities of AI systems to
+process, visualize, and interpret complex datasets. Additionally, the emergence
+of technologies like edge computing and automated machine learning (AutoML)
+democratizes access to AI, empowering users across skill levels to engage with
+intelligent systems. This work also underscores the importance of ethical
+considerations, transparency, and fairness in the deployment of AI
+technologies, paving the way for responsible innovation.
+ Through practical insights into hardware configurations, software
+environments, and real-world applications, this article serves as a
+comprehensive resource for researchers and practitioners. By bridging
+theoretical underpinnings with actionable strategies, it showcases the
+potential of AI and LLMs to revolutionize big data management and drive
+meaningful advancements across domains such as healthcare, finance, and
+autonomous systems.
+
+
+
+ comment: This book contains 155 pages and 9 figures
+
+
+
+
+
+
+ ♻ ☆ Accurate Link Prediction for Edge-Incomplete Graphs via PU Learning AAAI'25
+
+
+
+
+
+
+
+
+ Junghun Kim, Ka Hyun Park, Hoyoung Yoon, U Kang
+
+
+ Given an edge-incomplete graph, how can we accurately find the missing links?
+The link prediction in edge-incomplete graphs aims to discover the missing
+relations between entities when their relationships are represented as a graph.
+Edge-incomplete graphs are prevalent in the real world due to practical
+limitations, such as not checking all users when adding friends in a social
+network. Addressing the problem is crucial for various tasks, including
+recommending friends in social networks and finding references in citation
+networks. However, previous approaches rely heavily on the given
+edge-incomplete (observed) graph, making it challenging to consider the missing
+(unobserved) links during training. In this paper, we propose PULL
+(PU-Learning-based Link predictor), an accurate link prediction method based on
+the positive-unlabeled (PU) learning. PULL treats the observed edges in the
+training graph as positive examples, and the unconnected node pairs as
+unlabeled ones. PULL effectively prevents the link predictor from overfitting
+to the observed graph by proposing latent variables for every edge, and
+leveraging the expected graph structure with respect to the variables.
+Extensive experiments on five real-world datasets show that PULL consistently
+outperforms the baselines for predicting links in edge-incomplete graphs.
+
+
+
+ comment: AAAI'25
+
+
+
+
+
+
+ ♻ ☆ Unlearning or Concealment? A Critical Analysis and Evaluation Metrics
+ for Unlearning in Diffusion Models
+
+
+
+
+
+
+
+
+ Aakash Sen Sharma, Niladri Sarkar, Vikram Chundawat, Ankur A Mali, Murari Mandal
+
+
+ Recent research has seen significant interest in methods for concept removal
+and targeted forgetting in text-to-image diffusion models. In this paper, we
+conduct a comprehensive white-box analysis showing the vulnerabilities in
+existing diffusion model unlearning methods. We show that existing unlearning
+methods lead to decoupling of the targeted concepts (meant to be forgotten) from
+the corresponding prompts. This is concealment and not actual forgetting, which
+was the original goal. This paper presents a rigorous theoretical and empirical
+examination of five commonly used techniques for unlearning in diffusion
+models, while showing their potential weaknesses. We introduce two new
+evaluation metrics: Concept Retrieval Score (\textbf{CRS}) and Concept
+Confidence Score (\textbf{CCS}). These metrics are based on a successful
+adversarial attack setup that can recover \textit{forgotten} concepts from
+unlearned diffusion models. \textbf{CRS} measures the similarity between the
+latent representations of the unlearned and fully trained models after
+unlearning. It reports the extent of retrieval of the \textit{forgotten}
+concepts with increasing amount of guidance. CCS quantifies the confidence of
+the model in assigning the target concept to the manipulated data. It reports
+the probability of the \textit{unlearned} model's generations to be aligned
+with the original domain knowledge with increasing amount of guidance. The
+\textbf{CCS} and \textbf{CRS} enable a more robust evaluation of concept
+erasure methods. Evaluating five existing state-of-the-art methods with our
+metrics reveals significant shortcomings in their ability to truly
+\textit{unlearn}. Source Code:
+\color{blue}{https://respailab.github.io/unlearning-or-concealment}
+
+
+ Heterogeneous data from multiple populations, sub-groups, or sources is often
+represented as a ``mixture model'' with a single latent class influencing all
+of the observed covariates. Heterogeneity can be resolved at multiple levels by
+grouping populations according to different notions of similarity. This paper
+proposes grouping with respect to the causal response of an intervention or
+perturbation on the system. This definition is distinct from previous notions,
+such as similar covariate values (e.g. clustering) or similar correlations
+between covariates (e.g. Gaussian mixture models). To solve the problem, we
+``synthetically sample'' from a counterfactual distribution using higher-order
+multi-linear moments of the observable data. To understand how these ``causal
+mixtures'' fit in with more classical notions, we develop a hierarchy of
+mixture identifiability.
+
+
+
+
+
+
+
+ ♻ ☆ Annotation-guided Protein Design with Multi-Level Domain Alignment KDD 2025
+
+
+ The core challenge of de novo protein design lies in creating proteins with
+specific functions or properties, guided by certain conditions. Current models
+explore generating proteins using structural and evolutionary guidance, which
+provides only indirect conditions concerning functions and properties. However,
+textual annotations of proteins, especially the annotations for protein
+domains, which directly describe the protein's high-level functionalities,
+properties, and their correlation with target amino acid sequences, remain
+unexplored in the context of protein design tasks. In this paper, we propose
+Protein-Annotation Alignment Generation, PAAG, a multi-modality protein design
+framework that integrates the textual annotations extracted from protein
+database for controllable generation in sequence space. Specifically, within a
+multi-level alignment module, PAAG can explicitly generate proteins containing
+specific domains conditioned on the corresponding domain annotations, and can
+even design novel proteins with flexible combinations of different kinds of
+annotations. Our experimental results underscore the superiority of the aligned
+protein representations from PAAG over 7 prediction tasks. Furthermore, PAAG
+demonstrates a significant increase in generation success rate (24.7% vs 4.7%
+in zinc finger, and 54.3% vs 22.0% in the immunoglobulin domain) in comparison
+to the existing model. We anticipate that PAAG will broaden the horizons of
+protein design by leveraging the knowledge shared between textual annotations
+and proteins.
+
+
+
+ comment: Accepted by KDD 2025
+
+
+
+
+
+
+ ♻ ☆ CGGM: A conditional graph generation model with adaptive sparsity for
+ node anomaly detection in IoT networks
+
+
+
+
+
+
+
+
+ Munan Li, Xianshi Su, Runze Ma, Tongbang Jiang, Zijian Li, Tony Q. S. Quek
+
+
+ Dynamic graphs are extensively employed for detecting anomalous behavior in
+nodes within the Internet of Things (IoT). Graph generative models are often
+used to address the issue of imbalanced node categories in dynamic graphs.
+Nevertheless, the constraints it faces include the monotonicity of adjacency
+relationships, the difficulty in constructing multi-dimensional features for
+nodes, and the lack of a method for end-to-end generation of multiple
+categories of nodes. In this paper, we propose a novel graph generation model,
+called CGGM, specifically for generating samples belonging to the minority
+class. The framework consists of two core modules: a conditional graph generation
+module and a graph-based anomaly detection module. The generative module adapts
+to the sparsity of the matrix by downsampling a noise adjacency matrix, and
+incorporates a multi-dimensional feature encoder based on multi-head
+self-attention to capture latent dependencies among features. Additionally, a
+latent space constraint is combined with the distribution distance to
+approximate the latent distribution of real data. The graph-based anomaly
+detection module utilizes the generated balanced dataset to predict the node
+behaviors. Extensive experiments have shown that CGGM outperforms the
+state-of-the-art methods in terms of accuracy and divergence. The results also
+demonstrate that CGGM can generate diverse data categories, enhancing the
+performance of the multi-category classification task.
+
+
+
+ comment: 10 pages, 19 figures
+
+
+
+
+
+
+ ♻ ☆ Guiding Vision-Language Model Selection for Visual Question-Answering
+ Across Tasks, Domains, and Knowledge Types COLING
+
+
+ Visual Question-Answering (VQA) has become key to user experience,
+particularly after improved generalization capabilities of Vision-Language
+Models (VLMs). But evaluating VLMs for an application requirement using a
+standardized framework in practical settings is still challenging. This paper
+aims to solve that using an end-to-end framework. We present VQA360 - a novel
+dataset derived from established VQA benchmarks, annotated with task types,
+application domains, and knowledge types, for a comprehensive evaluation. We
+also introduce GoEval, a multimodal evaluation metric developed using GPT-4o,
+achieving a correlation factor of 56.71% with human judgments. Our experiments
+with state-of-the-art VLMs reveal that no single model excels universally,
+thus making the right choice a key design decision. Proprietary models such as
+Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source
+models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive
+strengths, while providing additional advantages. Our framework can also be
+extended to other tasks.
+
+
+
+ comment: Accepted at The First Workshop of Evaluation of Multi-Modal
+ Generation (EvalMG) in 31st International Conference on Computational
+ Linguistics (COLING), 2025. 8 pages + references + 6 pages of Appendix
+
+
+
+
+
+
+ ♻ ☆ VickreyFeedback: Cost-efficient Data Construction for Reinforcement
+ Learning from Human Feedback
+
+
+ This paper addresses the cost-efficiency aspect of Reinforcement Learning
+from Human Feedback (RLHF). RLHF leverages datasets of human preferences over
+outputs of large language models (LLM)s to instill human expectations into
+LLMs. Although preference annotation comes with a monetized cost, the economic
+utility of a preference dataset has not been considered thus far. What
+exacerbates this situation is that, given complex intransitive or cyclic
+relationships in preference datasets, existing algorithms for fine-tuning LLMs
+are still far from capturing comprehensive preferences. This raises severe
+cost-efficiency concerns in production environments, where preference data
+accumulate over time. In this paper, we discuss the fine-tuning of LLMs as a
+monetized economy and introduce an auction mechanism to improve the efficiency
+of preference data collection in dollar terms. We show that introducing an
+auction mechanism can play an essential role in enhancing the cost-efficiency
+of RLHF, while maintaining satisfactory model performance. Experimental results
+demonstrate that our proposed auction-based protocol is cost-effective for
+fine-tuning LLMs concentrating on high-quality feedback.
+
+
+
+ comment: 16 pages, 5 figures
+
+
+
+
+
+
+ ♻ ☆ Learn To be Efficient: Build Structured Sparsity in Large Language
+ Models
+
+
+ Large Language Models (LLMs) have achieved remarkable success with their
+billion-level parameters, yet they incur high inference overheads. The
+emergence of activation sparsity in LLMs provides a natural approach to reduce
+this cost by involving only parts of the parameters for inference. However,
+existing methods only focus on utilizing this naturally formed activation
+sparsity in a post-training setting, overlooking the potential for further
+amplifying this inherent sparsity. In this paper, we hypothesize that LLMs can
+learn to be efficient by achieving more structured activation sparsity. To
+achieve this, we introduce a novel training algorithm, Learn-To-be-Efficient
+(LTE), designed to train efficiency-aware LLMs to learn to activate fewer
+neurons and achieve a better trade-off between sparsity and performance.
+Furthermore, unlike SOTA MoEfication methods, which mainly focus on ReLU-based
+models, LTE can also be applied to LLMs like LLaMA using non-ReLU activations.
+Extensive evaluation on language understanding, language generation, and
+instruction tuning tasks show that LTE consistently outperforms SOTA baselines.
+Along with our hardware-aware custom kernel implementation, LTE reduces
+LLaMA2-7B inference latency by 25% at 50% sparsity.
+
+
+
+
+
+
+
+ ♻ ☆ TorchCP: A Python Library for Conformal Prediction
+
+
+ Conformal Prediction (CP) has attracted great attention from the research
+community due to its strict theoretical guarantees. However, researchers and
+developers still face challenges of applicability and efficiency when applying
+CP algorithms to deep learning models. In this paper, we introduce \torchcp, a
+comprehensive PyTorch-based toolkit to strengthen the usability of CP for deep
+learning models. \torchcp implements a wide range of post-hoc and training
+methods of conformal prediction for various machine learning tasks, including
+classification, regression, GNN, and LLM. Moreover, we provide user-friendly
+interfaces and extensive evaluations to easily integrate CP algorithms into
+specific tasks. Our \torchcp toolkit, built entirely with PyTorch, enables
+high-performance GPU acceleration for deep learning models and mini-batch
+computation on large-scale datasets. With the LGPL license, the code is
+open-sourced at \url{https://github.com/ml-stat-Sustech/TorchCP} and will be
+continuously updated.
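+
+ For readers unfamiliar with the underlying idea, a generic split conformal
+classification sketch follows (this is plain NumPy for illustration and does
+not use TorchCP's actual interface; the 1 - p_true score and the choice of
+alpha are assumptions):
+
+import numpy as np
+
+def split_conformal(cal_probs, cal_labels, alpha=0.1):
+    # Nonconformity score: one minus the probability of the true label.
+    n = len(cal_labels)
+    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
+    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)  # finite-sample corr.
+    qhat = np.quantile(scores, q_level, method="higher")
+    def predict_set(test_probs):
+        # Include every label whose score does not exceed the threshold.
+        return [np.where(1.0 - p <= qhat)[0] for p in test_probs]
+    return predict_set
+
+rng = np.random.default_rng(0)
+cal_probs = rng.dirichlet(np.ones(5), size=500)
+cal_labels = rng.integers(0, 5, size=500)
+predict_set = split_conformal(cal_probs, cal_labels, alpha=0.1)
+print([s.tolist() for s in predict_set(rng.dirichlet(np.ones(5), size=3))])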
+
+
+
+
+
+
+
+
+
+
+ Multimedia 16
+
+
+
+
+
+ ☆ Representing Long Volumetric Video with Temporal Gaussian Hierarchy SIGGRAPH
+
+
+ This paper aims to address the challenge of reconstructing long volumetric
+videos from multi-view RGB videos. Recent dynamic view synthesis methods
+leverage powerful 4D representations, like feature grids or point cloud
+sequences, to achieve high-quality rendering results. However, they are
+typically limited to short (1~2s) video clips and often suffer from large
+memory footprints when dealing with longer videos. To solve this issue, we
+propose a novel 4D representation, named Temporal Gaussian Hierarchy, to
+compactly model long volumetric videos. Our key observation is that there are
+generally various degrees of temporal redundancy in dynamic scenes, which
+consist of areas changing at different speeds. Motivated by this, our approach
+builds a multi-level hierarchy of 4D Gaussian primitives, where each level
+separately describes scene regions with different degrees of content change,
+and adaptively shares Gaussian primitives to represent unchanged scene content
+over different temporal segments, thus effectively reducing the number of
+Gaussian primitives. In addition, the tree-like structure of the Gaussian
+hierarchy allows us to efficiently represent the scene at a particular moment
+with a subset of Gaussian primitives, leading to nearly constant GPU memory
+usage during the training or rendering regardless of the video length.
+Extensive experimental results demonstrate the superiority of our method over
+alternative methods in terms of training cost, rendering speed, and storage
+usage. To our knowledge, this work is the first approach capable of efficiently
+handling minutes of volumetric video data while maintaining state-of-the-art
+rendering quality. Our project page is available at:
+https://zju3dv.github.io/longvolcap.
+
+
+ As Multi-modal Large Language Models (MLLMs) evolve, expanding beyond
+single-domain capabilities is essential to meet the demands for more versatile
+and efficient AI. However, previous omni-models have insufficiently explored
+speech, neglecting its integration with multi-modality. We introduce Lyra, an
+efficient MLLM that enhances multimodal abilities, including advanced
+long-speech comprehension, sound understanding, cross-modality efficiency, and
+seamless speech interaction. To achieve efficiency and speech-centric
+capabilities, Lyra employs three strategies: (1) leveraging existing
+open-source large models and a proposed multi-modality LoRA to reduce training
+costs and data requirements; (2) using a latent multi-modality regularizer and
+extractor to strengthen the relationship between speech and other modalities,
+thereby enhancing model performance; and (3) constructing a high-quality,
+extensive dataset that includes 1.5M multi-modal (language, vision, audio) data
+samples and 12K long speech samples, enabling Lyra to handle complex long
+speech inputs and achieve more robust omni-cognition. Compared to other
+omni-methods, Lyra achieves state-of-the-art performance on various
+vision-language, vision-speech, and speech-language benchmarks, while also
+using fewer computational resources and less training data.
+
+
+
+ comment: Tech report
+
+
+
+
+
+
+ ☆ Video Seal: Open and Efficient Video Watermarking
+
+
+
+
+
+
+
+
+ Pierre Fernandez, Hady Elsahar, I. Zeki Yalniz, Alexandre Mourachko
+
+
+ The proliferation of AI-generated content and sophisticated video editing
+tools has made it both important and challenging to moderate digital platforms.
+Video watermarking addresses these challenges by embedding imperceptible
+signals into videos, allowing for identification. However, the rare open tools
+and methods often fall short on efficiency, robustness, and flexibility. To
+reduce these gaps, this paper introduces Video Seal, a comprehensive framework
+for neural video watermarking and a competitive open-sourced model. Our
+approach jointly trains an embedder and an extractor, while ensuring the
+watermark robustness by applying transformations in-between, e.g., video
+codecs. This training is multistage and includes image pre-training, hybrid
+post-training and extractor fine-tuning. We also introduce temporal watermark
+propagation, a technique to convert any image watermarking model to an
+efficient video watermarking model without the need to watermark every
+high-resolution frame. We present experimental results demonstrating the
+effectiveness of the approach in terms of speed, imperceptibility, and
+robustness. Video Seal achieves higher robustness compared to strong baselines
+especially under challenging distortions combining geometric transformations
+and video compression. Additionally, we provide new insights such as the impact
+of video compression during training, and how to compare methods operating on
+different payloads. Contributions in this work - including the codebase,
+models, and a public demo - are open-sourced under permissive licenses to
+foster further research and development in the field.
+
+
+
+ comment: Code available at https://github.com/facebookresearch/videoseal
+
+
+
+
+
+
+ ☆ Multimodal Music Generation with Explicit Bridges and Retrieval
+ Augmentation
+
+
+
+
+
+
+
+
+ Baisen Wang, Le Zhuo, Zhaokai Wang, Chenxi Bao, Wu Chengjing, Xuecheng Nie, Jiao Dai, Jizhong Han, Yue Liao, Si Liu
+
+
+ Multimodal music generation aims to produce music from diverse input
+modalities, including text, videos, and images. Existing methods use a common
+embedding space for multimodal fusion. Despite their effectiveness in other
+modalities, their application in multimodal music generation faces challenges
+of data scarcity, weak cross-modal alignment, and limited controllability. This
+paper addresses these issues by using explicit bridges of text and music for
+multimodal alignment. We introduce a novel method named Visuals Music Bridge
+(VMB). Specifically, a Multimodal Music Description Model converts visual
+inputs into detailed textual descriptions to provide the text bridge; a
+Dual-track Music Retrieval module combines broad and targeted retrieval
+strategies to provide the music bridge and enable user control. Finally, we
+design an Explicitly Conditioned Music Generation framework to generate music
+based on the two bridges. We conduct experiments on video-to-music,
+image-to-music, text-to-music, and controllable music generation tasks, along
+with experiments on controllability. The results demonstrate that VMB
+significantly enhances music quality, modality, and customization alignment
+compared to previous methods. VMB sets a new standard for interpretable and
+expressive multimodal music generation with applications in various multimedia
+fields. Demos and code are available at https://github.com/wbs2788/VMB.
+
+
+
+
+
+
+
+
+ Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara
+
+
+ Recent work has empirically shown that Vision-Language Models (VLMs) struggle
+to fully understand the compositional properties of the human language, usually
+modeling an image caption as a "bag of words". As a result, they perform poorly
+on compositional tasks, which require a deeper understanding of the different
+entities of a sentence (subject, verb, etc.) jointly with their mutual
+relationships in order to be solved. In this paper, we model the dependency
+relations among textual and visual tokens using a Causal Graphical Model (CGM),
+built using a dependency parser, and we train a decoder conditioned by the VLM
+visual encoder. Differently from standard autoregressive or parallel
+predictions, our decoder's generative process is partially-ordered following
+the CGM structure. This structure encourages the decoder to learn only the main
+causal dependencies in a sentence discarding spurious correlations. Using
+extensive experiments on five compositional benchmarks, we show that our method
+significantly outperforms all the state-of-the-art compositional approaches by
+a large margin, and it also improves over methods trained using much larger
+datasets.
+
+
+
+
+
+
+
+ ☆ Towards Open-Vocabulary Video Semantic Segmentation
+
+
+
+
+
+
+
+
+ Xinhao Li, Yun Liu, Guolei Sun, Min Wu, Le Zhang, Ce Zhu
+
+
+ Semantic segmentation in videos has been a focal point of recent research.
+However, existing models encounter challenges when faced with unfamiliar
+categories. To address this, we introduce the Open Vocabulary Video Semantic
+Segmentation (OV-VSS) task, designed to accurately segment every pixel across a
+wide range of open-vocabulary categories, including those that are novel or
+previously unexplored. To enhance OV-VSS performance, we propose a robust
+baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing
+the model to utilize temporal relationships across consecutive frames.
+Additionally, we incorporate a random frame enhancement module, broadening the
+model's understanding of semantic context throughout the entire video sequence.
+Our approach also includes video text encoding, which strengthens the model's
+capability to interpret textual information within the video context.
+Comprehensive evaluations on benchmark datasets such as VSPW and Cityscapes
+highlight OV-VSS's zero-shot generalization capabilities, especially in
+handling novel categories. The results validate OV2VSS's effectiveness,
+demonstrating improved performance in semantic segmentation tasks across
+diverse video datasets.
+
+
+
+ comment: 13 pages, 7 figures
+
+
+
+
+
+
+ ☆ Multimodal Sentiment Analysis based on Video and Audio Inputs SP
+
+
+ Despite the abundance of research on sentiment analysis from videos and
+audio, finding the model that yields the highest accuracy remains a challenge
+for researchers in this field. The main objective of this paper is to
+demonstrate the usability of emotion recognition models that take video and
+audio inputs. The datasets used to train the models are the CREMA-D dataset for
+audio and the RAVDESS dataset for video. The fine-tuned models used are
+Facebook/wav2vec2-large for audio and Google/vivit-b-16x2-kinetics400 for
+video. The average of the per-emotion probabilities produced by the two models
+is used in the decision-making framework. Given the disparity in the results,
+where one of the models achieves much higher accuracy than the other, an
+additional test framework is created. The methods used are the Weighted Average
+method, the Confidence Level Threshold method, the Dynamic Weighting Based on
+Confidence method, and the Rule-Based Logic method. This limited approach gives
+encouraging results that make future research into these methods viable.
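+
+The decision-level fusion strategies named above can be illustrated with a
+short sketch. The probability vectors, weights, and thresholds below are
+placeholder values, not results reported in the paper.
+
+import numpy as np
+
+audio_probs = np.array([0.10, 0.70, 0.20])   # per-emotion probabilities, audio model
+video_probs = np.array([0.30, 0.40, 0.30])   # per-emotion probabilities, video model
+
+# Weighted Average: fixed weights, e.g. favouring the more accurate model.
+w_audio, w_video = 0.6, 0.4
+weighted = w_audio * audio_probs + w_video * video_probs
+
+# Confidence Level Threshold: trust one model alone only if it is confident enough.
+threshold = 0.65
+if audio_probs.max() >= threshold:
+    fused = audio_probs
+elif video_probs.max() >= threshold:
+    fused = video_probs
+else:
+    fused = weighted
+
+# Dynamic Weighting Based on Confidence: weights proportional to each model's
+# maximum probability.
+conf = np.array([audio_probs.max(), video_probs.max()])
+dyn_w = conf / conf.sum()
+dynamic = dyn_w[0] * audio_probs + dyn_w[1] * video_probs
+
+predicted_emotion = int(np.argmax(fused))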
+
+
+
+ comment: Presented as a full paper in the 15th International Conference on
+ Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2024) October
+ 28-30, 2024, Leuven, Belgium
+
+ Generating sound effects for product-level videos, where only a small amount
+of labeled data is available for diverse scenes, requires the production of
+high-quality sounds in few-shot settings. To tackle the challenge of limited
+labeled data in real-world scenes, we introduce YingSound, a foundation model
+designed for video-guided sound generation that supports high-quality audio
+generation in few-shot settings. Specifically, YingSound consists of two major
+modules. The first module uses a conditional flow matching transformer to
+achieve effective semantic alignment in sound generation across audio and
+visual modalities. This module aims to build a learnable audio-visual
+aggregator (AVA) that integrates high-resolution visual features with
+corresponding audio features at multiple stages. The second module is developed
+with a proposed multi-modal visual-audio chain-of-thought (CoT) approach to
+generate finer sound effects in few-shot settings. Finally, an
+industry-standard video-to-audio (V2A) dataset that encompasses various
+real-world scenarios is presented. We show that YingSound effectively generates
+high-quality synchronized sounds across diverse conditional inputs through
+automated evaluations and human studies. Project Page:
+\url{https://giantailab.github.io/yingsound/}
+
+
+
+ comment: 16 pages, 4 figures
+
+
+
+
+
+
+ ☆ Enhancing Modality Representation and Alignment for Multimodal
+ Cold-start Active Learning
+
+
+
+
+
+
+
+
+ Meng Shen, Yake Wei, Jianxiong Yin, Deepu Rajan, Di Hu, Simon See
+
+
+ Training multimodal models requires a large amount of labeled data. Active
+learning (AL) aims to reduce labeling costs. Most AL methods employ warm-start
+approaches, which rely on sufficient labeled data to train a well-calibrated
+model that can assess the uncertainty and diversity of unlabeled data. However,
+when assembling a dataset, labeled data are often scarce initially, leading to
+a cold-start problem. Additionally, most AL methods seldom address multimodal
+data, highlighting a research gap in this field. Our research addresses these
+issues by developing a two-stage method for Multi-Modal Cold-Start Active
+Learning (MMCSAL).
+ Firstly, we observe the modality gap, a significant distance between the
+centroids of representations from different modalities, when only using
+cross-modal pairing information as self-supervision signals. This modality gap
+affects the data selection process, as we calculate both uni-modal and cross-modal
+distances. To address this, we introduce uni-modal prototypes to bridge the
+modality gap. Secondly, conventional AL methods often falter in multimodal
+scenarios where alignment between modalities is overlooked. Therefore, we
+propose enhancing cross-modal alignment through regularization, thereby
+improving the quality of selected multimodal data pairs in AL. Finally, our
+experiments demonstrate MMCSAL's efficacy in selecting multimodal data pairs
+across three multimodal datasets.
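+
+A minimal sketch of the modality gap discussed above, measured as the distance
+between the centroids of the two modalities' representations, together with a
+simple uni-modal prototype construction. The random embeddings and the crude
+clustering are illustrative assumptions, not the MMCSAL implementation.
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def l2_normalize(x):
+    return x / np.linalg.norm(x, axis=-1, keepdims=True)
+
+# Placeholder self-supervised embeddings for paired audio/visual samples.
+audio_emb = l2_normalize(rng.standard_normal((1000, 128)))
+visual_emb = l2_normalize(rng.standard_normal((1000, 128)) + 0.5)  # shifted on purpose
+
+# Modality gap: distance between the two modality centroids.
+modality_gap = np.linalg.norm(audio_emb.mean(axis=0) - visual_emb.mean(axis=0))
+
+def prototypes(emb, k=10):
+    # Uni-modal prototypes: means of points assigned to k randomly seeded centers.
+    seeds = emb[rng.choice(len(emb), size=k, replace=False)]
+    assign = np.argmax(emb @ seeds.T, axis=1)
+    return np.stack([emb[assign == c].mean(axis=0) if np.any(assign == c) else seeds[c]
+                     for c in range(k)])
+
+audio_protos = prototypes(audio_emb)   # used for uni-modal distances during selection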
+
+
+
+ comment: 11 pages, ACMMM Asia 2024, Oral Presentation
+
+ We present MS2Mesh-XR, a novel multi-modal sketch-to-mesh generation pipeline
+that enables users to create realistic 3D objects in extended reality (XR)
+environments using hand-drawn sketches assisted by voice inputs. In specific,
+users can intuitively sketch objects using natural hand movements in mid-air
+within a virtual environment. By integrating voice inputs, we devise ControlNet
+to infer realistic images based on the drawn sketches and interpreted text
+prompts. Users can then review and select their preferred image, which is
+subsequently reconstructed into a detailed 3D mesh using the Convolutional
+Reconstruction Model. In particular, our proposed pipeline can generate a
+high-quality 3D mesh in less than 20 seconds, allowing for immersive
+visualization and manipulation in run-time XR scenes. We demonstrate the
+practicability of our pipeline through two use cases in XR settings. By
+leveraging natural user inputs and cutting-edge generative AI capabilities, our
+approach can significantly facilitate XR-based creative production and enhance
+user experiences. Our code and demo will be available at:
+https://yueqiu0911.github.io/MS2Mesh-XR/
+
+
+
+ comment: IEEE AIxVR 2025
+
+
+
+
+
+
+ ☆ EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing
+
+
+
+
+
+
+
+
+ Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, Qingming Huang
+
+
+ Given a piece of text, a video clip, and a reference audio, the movie dubbing
+task aims to generate speech that aligns with the video while cloning the
+desired voice. The existing methods have two primary deficiencies: (1) They
+struggle to simultaneously hold audio-visual sync and achieve clear
+pronunciation; (2) They lack the capacity to express user-defined emotions. To
+address these problems, we propose EmoDubber, an emotion-controllable dubbing
+architecture that allows users to specify emotion type and emotional intensity
+while satisfying high-quality lip sync and pronunciation. Specifically, we
+first design Lip-related Prosody Aligning (LPA), which focuses on learning the
+inherent consistency between lip motion and prosody variation by duration level
+contrastive learning to incorporate reasonable alignment. Then, we design a
+Pronunciation Enhancing (PE) strategy that fuses the video-level phoneme
+sequences with an efficient conformer to improve speech intelligibility. Next,
+the speaker identity adapting module decodes the acoustic prior and injects the
+speaker style embedding. After that, the proposed Flow-based User Emotion
+Controlling (FUEC) synthesizes the waveform with a flow-matching prediction
+network conditioned on the acoustic prior. In this process, FUEC determines the
+gradient direction and guidance scale based on the user's emotion instructions
+by the positive and negative guidance mechanism, which focuses on amplifying
+the desired emotion while suppressing others. Extensive experimental results on
+three benchmark datasets demonstrate favorable performance compared to several
+state-of-the-art methods.
+
+
+
+ comment: Under review
+
+
+
+
+
+
+ ☆ Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K
+ Video Restoration under Codec Compression
+
+
+
+
+
+
+
+
+ Ali Mollaahmadi Dehaghi, Reza Razavi, Mohammad Moshirpour
+
+
+ In this paper, we introduce DiQP, a novel Transformer-Diffusion model for
+restoring 8K video quality degraded by codec compression. To the best of our
+knowledge, our model is the first to consider restoring the artifacts
+introduced by various codecs (AV1, HEVC) by Denoising Diffusion without
+considering additional noise. This approach allows us to model the complex,
+non-Gaussian nature of compression artifacts, effectively learning to reverse
+the degradation. Our architecture combines the power of Transformers to capture
+long-range dependencies with an enhanced windowed mechanism that preserves
+spatiotemporal context within groups of pixels across frames. To further
+enhance restoration, the model incorporates auxiliary "Look Ahead" and "Look
+Around" modules, providing both future and surrounding frame information to aid
+in reconstructing fine details and enhancing overall visual quality. Extensive
+experiments on different datasets demonstrate that our model outperforms
+state-of-the-art methods, particularly for high-resolution videos such as 4K
+and 8K, showcasing its effectiveness in restoring perceptually pleasing videos
+from highly compressed sources.
+
+
+ Due to the challenges in acquiring paired Text-3D data and the inherent
+irregularity of 3D data structures, combined representation learning of 3D
+point clouds and text remains unexplored. In this paper, we propose a novel
+Riemann-based Multi-scale Attention Reasoning Network (RMARN) for text-3D
+retrieval. Specifically, the extracted text and point cloud features are
+refined by their respective Adaptive Feature Refiner (AFR). Furthermore, we
+introduce the innovative Riemann Local Similarity (RLS) module and the Global
+Pooling Similarity (GPS) module. Because 3D point cloud data and text data
+often possess complex geometric structures in high-dimensional space, the
+proposed RLS employs a novel Riemann Attention Mechanism to reflect the
+intrinsic geometric relationships of the data. Without explicitly defining the
+manifold, RMARN learns the manifold parameters to better represent the
+distances between text-point cloud samples. To address the challenges of
+lacking paired text-3D data, we have created the large-scale Text-3D Retrieval
+dataset T3DR-HIT, which comprises over 3,380 pairs of text and point cloud
+data. T3DR-HIT contains coarse-grained indoor 3D scenes and fine-grained
+Chinese artifact scenes, consisting of 1,380 and over 2,000 text-3D pairs,
+respectively. Experiments on our custom datasets demonstrate the superior
+performance of the proposed method. Our code and proposed datasets are
+available at \url{https://github.com/liwrui/RMARN}.
+
+
+
+ comment: Accepted by AAAI25
+
+
+
+
+
+
+ ♻ ☆ OneAdapt: Fast Configuration Adaptation for Video Analytics Applications
+ via Backpropagation SoCC' 23
+
+
+ Deep learning inference on streaming media data, such as object detection in
+video or LiDAR feeds and text extraction from audio waves, is now ubiquitous.
+To achieve high inference accuracy, these applications typically require
+significant network bandwidth to gather high-fidelity data and extensive GPU
+resources to run deep neural networks (DNNs). While the high demand for network
+bandwidth and GPU resources could be substantially reduced by optimally
+adapting the configuration knobs, such as video resolution and frame rate,
+current adaptation techniques fail to meet three requirements simultaneously:
+adapt configurations (i) with minimum extra GPU or bandwidth overhead; (ii) to
+reach near-optimal decisions based on how the data affects the final DNN's
+accuracy, and (iii) do so for a range of configuration knobs. This paper
+presents OneAdapt, which meets these requirements by leveraging a
+gradient-ascent strategy to adapt configuration knobs. The key idea is to
+embrace DNNs' differentiability to quickly estimate the accuracy's gradient to
+each configuration knob, called AccGrad. Specifically, OneAdapt estimates
+AccGrad by multiplying two gradients: InputGrad (i.e. how each configuration
+knob affects the input to the DNN) and DNNGrad (i.e. how the DNN input affects
+the DNN inference output). We evaluate OneAdapt across five types of
+configurations, four analytic tasks, and five types of input data. Compared to
+state-of-the-art adaptation schemes, OneAdapt cuts bandwidth usage and GPU
+usage by 15-59% while maintaining comparable accuracy or improves accuracy by
+1-5% while using equal or fewer resources.
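+
+A toy sketch of the AccGrad idea described above: autograd composes how a knob
+changes the DNN input (InputGrad) with how the input changes the output
+(DNNGrad), so a single backward pass yields a gradient for a gradient-ascent
+update of the knob. The linear "DNN", the multiplicative knob, and the step
+size are illustrative assumptions, not OneAdapt's actual components.
+
+import torch
+
+dnn = torch.nn.Linear(16, 1)                  # stand-in for the analytics DNN
+raw = torch.randn(8, 16)                      # stand-in for raw streaming data
+knob = torch.tensor(0.5, requires_grad=True)  # e.g. a resolution/quality knob in [0, 1]
+
+def apply_knob(data, k):
+    # Hypothetical differentiable effect of the knob on input fidelity.
+    return data * k
+
+proxy_accuracy = dnn(apply_knob(raw, knob)).mean()  # proxy for inference accuracy
+proxy_accuracy.backward()                           # chain rule: AccGrad = DNNGrad x InputGrad
+
+with torch.no_grad():
+    knob += 0.1 * knob.grad                         # one gradient-ascent step on the knob
+    knob.clamp_(0.0, 1.0)
+knob.grad = None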
+
+
+ Navigating unseen environments based on natural language instructions remains
+difficult for egocentric agents in Vision-and-Language Navigation (VLN). While
+recent advancements have yielded promising outcomes, they primarily rely on RGB
+images for environmental representation, often overlooking the underlying
+semantic knowledge and spatial cues. Intuitively, humans inherently ground
+textual semantics within the spatial layout during indoor navigation. Inspired
+by this, we propose a versatile Semantic Understanding and Spatial Awareness
+(SUSA) architecture to facilitate navigation. SUSA includes a Textual Semantic
+Understanding (TSU) module, which narrows the modality gap between instructions
+and environments by generating and associating the descriptions of
+environmental landmarks in the agent's immediate surroundings. Additionally, a
+Depth-based Spatial Perception (DSP) module incrementally constructs a depth
+exploration map, enabling a more nuanced comprehension of environmental
+layouts. Experimental results demonstrate that SUSA's hybrid semantic-spatial
+representations effectively enhance navigation performance, setting new
+state-of-the-art performance across three VLN benchmarks (REVERIE, R2R, and
+SOON). The source code will be publicly available.
+
+
+
+ comment: A technical report consisting of 16 pages, 12 figures, 10 tables
+
+
+
+
+
+
+ ♻ ☆ DriveMM: All-in-One Large Multimodal Model for Autonomous Driving
+
+
+ Large Multimodal Models (LMMs) have demonstrated exceptional comprehension
+and interpretation capabilities in Autonomous Driving (AD) by incorporating
+large language models. Despite the advancements, current data-driven AD
+approaches tend to concentrate on a single dataset and specific tasks,
+neglecting their overall capabilities and ability to generalize. To bridge
+these gaps, we propose DriveMM, a general large multimodal model designed to
+process diverse data inputs, such as images and multi-view videos, while
+performing a broad spectrum of AD tasks, including perception, prediction, and
+planning. Initially, the model undergoes curriculum pre-training to process
+varied visual signals and perform basic visual comprehension and perception
+tasks. Subsequently, we augment and standardize various AD-related datasets to
+fine-tune the model, resulting in an all-in-one LMM for autonomous driving. To
+assess the general capabilities and generalization ability, we conduct
+evaluations on six public benchmarks and undertake zero-shot transfer on an
+unseen dataset, where DriveMM achieves state-of-the-art performance across all
+tasks. We hope DriveMM will serve as a promising solution for future end-to-end autonomous
+driving applications in the real world. Project page with code:
+https://github.com/zhijian11/DriveMM.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Computation and Language 46
+
+
+
+
+
+ ☆ Large Concept Models: Language Modeling in a Sentence Representation
+ Space
+
+
+
+
+
+
+
+
+ The LCM team, Loïc Barrault, Paul-Ambroise Duquenne, Maha Elbayad, Artyom Kozhevnikov, Belen Alastruey, Pierre Andrews, Mariano Coria, Guillaume Couairon, Marta R. Costa-jussà, David Dale, Hady Elsahar, Kevin Heffernan, João Maria Janeiro, Tuan Tran, Christophe Ropers, Eduardo Sánchez, Robin San Roman, Alexandre Mourachko, Safiyyah Saleem, Holger Schwenk
+
+
+ LLMs have revolutionized the field of artificial intelligence and have
+emerged as the de-facto tool for many tasks. The current established technology
+of LLMs is to process input and generate output at the token level. This is in
+sharp contrast to humans who operate at multiple levels of abstraction, well
+beyond single words, to analyze information and to generate creative content.
+In this paper, we present an attempt at an architecture which operates on an
+explicit higher-level semantic representation, which we name a concept.
+Concepts are language- and modality-agnostic and represent a higher level idea
+or action in a flow. Hence, we build a "Large Concept Model". In this study, as
+proof of feasibility, we assume that a concept corresponds to a sentence, and
+use an existing sentence embedding space, SONAR, which supports up to 200
+languages in both text and speech modalities.
+ The Large Concept Model is trained to perform autoregressive sentence
+prediction in an embedding space. We explore multiple approaches, namely MSE
+regression, variants of diffusion-based generation, and models operating in a
+quantized SONAR space. These explorations are performed using 1.6B parameter
+models and training data in the order of 1.3T tokens. We then scale one
+architecture to a model size of 7B parameters and training data of about 2.7T
+tokens. We perform an experimental evaluation on several generative tasks,
+namely summarization and a new task of summary expansion. Finally, we show that
+our model exhibits impressive zero-shot generalization performance to many
+languages, outperforming existing LLMs of the same size. The training code of
+our models is freely available.
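+
+As a hedged sketch of the MSE-regression variant mentioned above, the snippet
+below trains a small causal model to predict the next sentence embedding from
+the previous ones. The GRU backbone, embedding size, and random SONAR-like
+vectors are placeholders, not the 1.6B or 7B models described in the paper.
+
+import torch
+import torch.nn as nn
+
+dim = 256                                   # placeholder sentence-embedding size
+model = nn.GRU(dim, dim, batch_first=True)  # small causal backbone standing in for the LCM
+opt = torch.optim.Adam(model.parameters(), lr=1e-3)
+
+docs = torch.randn(32, 12, dim)             # fake documents: sequences of sentence embeddings
+
+for _ in range(100):
+    inputs, targets = docs[:, :-1], docs[:, 1:]    # next-concept prediction
+    preds, _ = model(inputs)
+    loss = nn.functional.mse_loss(preds, targets)  # MSE regression in embedding space
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+
+context = docs[:1, :3]                      # autoregressive use: predict the next concept
+next_concept = model(context)[0][:, -1:]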
+
+
+
+ comment: 49 pages
+
+
+
+
+
+
+ ☆ jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
+
+
+
+
+
+
+
+
+ Andreas Koukounas, Georgios Mastrapas, Bo Wang, Mohammad Kalim Akram, Sedigheh Eslami, Michael Günther, Isabelle Mohr, Saba Sturua, Scott Martens, Nan Wang, Han Xiao
+
+
+ Contrastive Language-Image Pretraining (CLIP) is a highly effective method
+for aligning images and texts in a shared embedding space. These models are
+widely used for tasks such as cross-modal information retrieval and multi-modal
+understanding. However, CLIP models often struggle with text-only tasks,
+underperforming compared to specialized text models. This performance disparity
+forces retrieval systems to rely on separate models for text-only and
+multi-modal tasks. In this work, we build upon our previous model,
+jina-clip-v1, by introducing a refined framework that utilizes multi-task,
+multi-stage contrastive learning across multiple languages, coupled with an
+improved training recipe to enhance text-only retrieval. The resulting model,
+jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks,
+while adding multilingual support, better understanding of complex visual
+documents and efficiency gains thanks to Matryoshka Representation Learning and
+vector truncation. The model performs comparably to the state-of-the-art in
+both multilingual-multimodal and multilingual text retrieval benchmarks,
+addressing the challenge of unifying text-only and multi-modal retrieval
+systems.
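+
+The Matryoshka-style vector truncation mentioned above can be sketched as
+keeping only the leading dimensions of an embedding and re-normalizing before
+computing similarities. The dimensionality and the random vectors below are
+illustrative, not the model's actual outputs.
+
+import numpy as np
+
+def truncate(emb, dims):
+    # Keep the first `dims` dimensions and re-normalize (Matryoshka-style).
+    cut = emb[..., :dims]
+    return cut / np.linalg.norm(cut, axis=-1, keepdims=True)
+
+text_emb = np.random.randn(1024)    # placeholder text embedding
+image_emb = np.random.randn(1024)   # placeholder image embedding
+
+for dims in (1024, 512, 256, 64):
+    t, i = truncate(text_emb, dims), truncate(image_emb, dims)
+    print(dims, float(t @ i))       # cosine similarity at each truncation level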
+
+
+ Fairness in multi-document summarization (MDS) measures whether a system can
+generate a summary fairly representing information from documents with
+different social attribute values. Fairness in MDS is crucial since a fair
+summary can offer readers a comprehensive view. Previous works focus on
+quantifying summary-level fairness using Proportional Representation, a
+fairness measure based on Statistical Parity. However, Proportional
+Representation does not consider redundancy in input documents and overlooks
+corpus-level unfairness. In this work, we propose a new summary-level fairness
+measure, Equal Coverage, which is based on coverage of documents with different
+social attribute values and considers the redundancy within documents. To
+detect the corpus-level unfairness, we propose a new corpus-level measure,
+Coverage Parity. Our human evaluations show that our measures align more with
+our definition of fairness. Using our measures, we evaluate the fairness of
+thirteen different LLMs. We find that Claude3-sonnet is the fairest among all
+evaluated LLMs. We also find that almost all LLMs overrepresent different
+social attribute values.
+
+
+
+
+
+
+
+ ☆ BDA: Bangla Text Data Augmentation Framework
+
+
+ Data augmentation involves generating synthetic samples that resemble those
+in a given dataset. In resource-limited fields where high-quality data is
+scarce, augmentation plays a crucial role in increasing the volume of training
+data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework
+that uses both pre-trained models and rule-based methods to create new variants
+of the text. A filtering process is included to ensure that the new text keeps
+the same meaning as the original while also adding variety in the words used.
+We conduct a comprehensive evaluation of the framework's effectiveness in
+Bangla text classification tasks. Our framework achieved significant
+improvement in F1 scores across five distinct datasets, delivering performance
+equivalent to models trained on 100\% of the data while utilizing only 50\% of
+the training dataset. Additionally, we explore the impact of data scarcity by
+progressively reducing the training data and augmenting it through BDA,
+resulting in notable F1 score enhancements. The study offers a thorough
+examination of BDA's performance, identifying key factors for optimal results
+and addressing its limitations through detailed analysis.
+
+
+
+
+
+
+
+ ☆ In-Context Learning with Topological Information for Knowledge Graph
+ Completion
+
+
+ Knowledge graphs (KGs) are crucial for representing and reasoning over
+structured information, supporting a wide range of applications such as
+information retrieval, question answering, and decision-making. However, their
+effectiveness is often hindered by incompleteness, limiting their potential for
+real-world impact. While knowledge graph completion (KGC) has been extensively
+studied in the literature, recent advances in generative AI models,
+particularly large language models (LLMs), have introduced new opportunities
+for innovation. In-context learning has recently emerged as a promising
+approach for leveraging pretrained knowledge of LLMs across a range of natural
+language processing tasks and has been widely adopted in both academia and
+industry. However, how to utilize in-context learning for effective KGC remains
+relatively underexplored. We develop a novel method that incorporates
+topological information through in-context learning to enhance KGC performance.
+By integrating ontological knowledge and graph structure into the context of
+LLMs, our approach achieves strong performance in the transductive setting,
+i.e., nodes in the test graph dataset are present in the training graph
+dataset. Furthermore, we apply our approach to KGC in the more challenging
+inductive setting, i.e., nodes in the training graph dataset and test graph
+dataset are disjoint, leveraging the ontology to infer useful information about
+missing nodes which serve as contextual cues for the LLM during inference. Our
+method demonstrates superior performance compared to baselines on the
+ILPC-small and ILPC-large datasets.
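+
+A minimal sketch of how topological context might be serialized into an
+in-context prompt for KGC. The triples, ontology entries, template wording,
+and neighborhood selection are illustrative assumptions, not the paper's exact
+prompt format.
+
+def build_kgc_prompt(query_head, query_relation, triples, ontology, k=5):
+    # Assemble an in-context KGC prompt from the graph neighborhood and ontology.
+    neighborhood = [t for t in triples if query_head in (t[0], t[2])][:k]
+    context = "\n".join(f"({h}, {r}, {t})" for h, r, t in neighborhood)
+    types = "\n".join(f"{e} is a {c}" for e, c in ontology.items())
+    return (
+        "Known triples:\n" + context + "\n"
+        "Entity types:\n" + types + "\n"
+        f"Complete the triple: ({query_head}, {query_relation}, ?)"
+    )
+
+triples = [("Paris", "capital_of", "France"), ("France", "part_of", "Europe")]
+ontology = {"Paris": "City", "France": "Country"}
+prompt = build_kgc_prompt("France", "has_capital", triples, ontology)
+# `prompt` would then be sent to an LLM of choice.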
+
+
+ Multimodal large language models (MLLMs) have made rapid progress in recent
+years, yet continue to struggle with low-level visual perception (LLVP) --
+particularly the ability to accurately describe the geometric details of an
+image. This capability is crucial for applications in areas such as robotics,
+medical image analysis, and manufacturing. In this paper, we first introduce
+Geoperception, a benchmark designed to evaluate an MLLM's ability to accurately
+transcribe 2D geometric information from an image. Using this benchmark, we
+demonstrate the limitations of leading MLLMs, and then conduct a comprehensive
+empirical study to explore strategies for improving their performance on
+geometric tasks. Our findings highlight the benefits of certain model
+architectures, training techniques, and data strategies, including the use of
+high-fidelity synthetic data and multi-stage training with a data curriculum.
+Notably, we find that a data curriculum enables models to learn challenging
+geometry understanding tasks which they fail to learn from scratch. Leveraging
+these insights, we develop Euclid, a family of models specifically optimized
+for strong low-level geometric perception. Although purely trained on synthetic
+multimodal data, Euclid shows strong generalization ability to novel geometry
+shapes. For instance, Euclid outperforms the best closed-source model,
+Gemini-1.5-Pro, by up to 58.56% on certain Geoperception benchmark tasks and
+10.65% on average across all tasks.
+
+
+
+
+
+
+
+ ☆ LatentQA: Teaching LLMs to Decode Activations Into Natural Language
+
+
+
+
+
+
+
+
+ Alexander Pan, Lijie Chen, Jacob Steinhardt
+
+
+ Interpretability methods seek to understand language model representations,
+yet the outputs of most such methods -- circuits, vectors, scalars -- are not
+immediately human-interpretable. In response, we introduce LatentQA, the task
+of answering open-ended questions about model activations in natural language.
+Towards solving LatentQA, we propose Latent Interpretation Tuning (LIT), which
+finetunes a decoder LLM on a dataset of activations and associated
+question-answer pairs, similar to how visual instruction tuning trains on
+question-answer pairs associated with images. We use the decoder for diverse
+reading applications, such as extracting relational knowledge from
+representations or uncovering system prompts governing model behavior. Our
+decoder also specifies a differentiable loss that we use to control models,
+such as debiasing models on stereotyped sentences and controlling the sentiment
+of generations. Finally, we extend LatentQA to reveal harmful model
+capabilities, such as generating recipes for bioweapons and code for hacking.
+
+
+
+ comment: Project page is at https://latentqa.github.io
+
+
+
+
+
+
+ ☆ Fast Prompt Alignment for Text-to-Image Generation
+
+
+ Text-to-image generation has advanced rapidly, yet aligning complex textual
+prompts with generated visuals remains challenging, especially with intricate
+object relationships and fine-grained details. This paper introduces Fast
+Prompt Alignment (FPA), a prompt optimization framework that leverages a
+one-pass approach, enhancing text-to-image alignment efficiency without the
+iterative overhead typical of current methods like OPT2I. FPA uses large
+language models (LLMs) for single-iteration prompt paraphrasing, followed by
+fine-tuning or in-context learning with optimized prompts to enable real-time
+inference, reducing computational demands while preserving alignment fidelity.
+Extensive evaluations on the COCO Captions and PartiPrompts datasets
+demonstrate that FPA achieves competitive text-image alignment scores at a
+fraction of the processing time, as validated through both automated metrics
+(TIFA, VQA) and human evaluation. A human study with expert annotators further
+reveals a strong correlation between human alignment judgments and automated
+scores, underscoring the robustness of FPA's improvements. The proposed method
+showcases a scalable, efficient alternative to iterative prompt optimization,
+enabling broader applicability in real-time, high-demand settings. The codebase
+is provided to facilitate further research:
+https://github.com/tiktok/fast_prompt_alignment
+
+
+
+ comment: TikTok Technical Report
+
+
+
+
+
+
+ ☆ Multimodal Latent Language Modeling with Next-Token Diffusion
+
+
+ Multimodal generative models require a unified approach to handle both
+discrete data (e.g., text and code) and continuous data (e.g., image, audio,
+video). In this work, we propose Latent Language Modeling (LatentLM), which
+seamlessly integrates continuous and discrete data using causal Transformers.
+Specifically, we employ a variational autoencoder (VAE) to represent continuous
+data as latent vectors and introduce next-token diffusion for autoregressive
+generation of these vectors. Additionally, we develop $\sigma$-VAE to address
+the challenges of variance collapse, which is crucial for autoregressive
+modeling. Extensive experiments demonstrate the effectiveness of LatentLM
+across various modalities. In image generation, LatentLM surpasses Diffusion
+Transformers in both performance and scalability. When integrated into
+multimodal large language models, LatentLM provides a general-purpose interface
+that unifies multimodal generation and understanding. Experimental results show
+that LatentLM achieves favorable performance compared to Transfusion and vector
+quantized models in the setting of scaling up training tokens. In
+text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2
+model in speaker similarity and robustness, while requiring 10x fewer decoding
+steps. The results establish LatentLM as a highly effective and scalable
+approach to advance large multimodal models.
+
+
+
+
+
+
+
+ ☆ Exploiting the Index Gradients for Optimization-Based Jailbreaking on
+ Large Language Models
+
+
+ Despite the advancements in training Large Language Models (LLMs) with
+alignment techniques to enhance the safety of generated content, these models
+remain susceptible to jailbreak, an adversarial attack method that exposes
+security vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG)
+method has demonstrated the ability to automatically generate adversarial
+suffixes that jailbreak state-of-the-art LLMs. However, the optimization
+process involved in GCG is highly time-consuming, rendering the jailbreaking
+pipeline inefficient. In this paper, we investigate the process of GCG and
+identify an issue of Indirect Effect, the key bottleneck of the GCG
+optimization. To this end, we propose the Model Attack Gradient Index GCG
+(MAGIC), which addresses the Indirect Effect by exploiting the gradient
+information of the suffix tokens, thereby accelerating the procedure with
+less computation and fewer iterations. Our experiments on AdvBench show that
+MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates
+(ASR) on par or even higher than other baselines. Our MAGIC achieved an ASR of
+74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on
+GPT-3.5. Code is available at https://github.com/jiah-li/magic.
+
+
+
+ comment: 13 pages,2 figures, accepted by The 31st International Conference on
+ Computational Linguistics
+
+
+
+
+
+
+ ☆ Der Effizienz- und Intelligenzbegriff in der Lexikographie und
+ kuenstlichen Intelligenz: kann ChatGPT die lexikographische Textsorte
+ nachbilden?
+
+
+
+
+
+
+
+
+ Ivan Arias-Arias, Maria Jose Dominguez Vazquez, Carlos Valcarcel Riveiro
+
+
+ By means of pilot experiments for the language pair German and Galician, this
+paper examines the concepts of efficiency and intelligence in lexicography and
+artificial intelligence (AI). The aim of the experiments is to gain empirically
+and statistically based insights into the lexicographical text type 'dictionary
+article' in the responses of ChatGPT 3.5, as well as into the lexicographical
+data on which this chatbot was trained. Both quantitative and qualitative
+methods are used for this purpose. The analysis is based on the evaluation of
+the outputs of several sessions with the same prompt in ChatGPT 3.5. On the one
+hand, the algorithmic performance of intelligent systems is evaluated in
+comparison with data from lexicographical works. On the other hand, the ChatGPT
+data supplied is analysed using specific text passages of the aforementioned
+lexicographical text type. The results of this study not only help to evaluate
+the efficiency of this chatbot regarding the creation of dictionary articles,
+but also to delve deeper into the concept of intelligence, the thought
+processes and the actions to be carried out in both disciplines.
+
+
+
+ comment: 25 pages, in German language
+
+
+
+
+
+
+ ☆ Advancing Single- and Multi-task Text Classification through Large
+ Language Model Fine-tuning
+
+
+
+
+
+
+
+
+ Hang Zhao, Qile P. Chen, Yijing Barry Zhang, Gang Yang
+
+
+ Both encoder-only models (e.g., BERT, RoBERTa) and large language models
+(LLMs, e.g., Llama3) have been widely used for text classification tasks.
+However, there is a lack of systematic studies comparing the performance of
+encoder-based models and LLMs in text classification, particularly when
+fine-tuning is involved. This study employed a diverse range of models and
+methods, varying in size and architecture, and including both fine-tuned and
+pre-trained approaches. We first assessed the performances of these LLMs on the
+20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only
+RoBERTa models. Additionally, we explored the multi-task capabilities of both
+model types by combining multiple classification tasks, including intent
+detection and slot-filling, into a single model using data from both datasets.
+Our results indicate that fully fine-tuned Llama3-70B models outperform
+RoBERTa-large and other decoder LLMs across various classification tasks and
+datasets. Moreover, the consolidated multi-task fine-tuned LLMs matched the
+performance of dual-model setups in both tasks across both datasets. Overall,
+our study provides a comprehensive benchmark of encoder-only and LLM models on
+text classification tasks and demonstrates a method to combine two or more
+fully fine-tuned decoder LLMs for reduced latency and equivalent performance.
+
+
+
+ comment: 9 pages, 3 tables
+
+
+
+
+
+
+ ☆ Machine Learning Information Retrieval and Summarisation to Support
+ Systematic Review on Outcomes Based Contracting
+
+
+
+
+
+
+
+
+ Iman Munire Bilal, Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Eleanor Carter, Mara Airoldi, Rob Procter
+
+
+ As academic literature proliferates, traditional review methods are
+increasingly challenged by the sheer volume and diversity of available
+research. This article presents a study that aims to address these challenges
+by enhancing the efficiency and scope of systematic reviews in the social
+sciences through advanced machine learning (ML) and natural language processing
+(NLP) tools. In particular, we focus on automating stages within the systematic
+reviewing process that are time-intensive and repetitive for human annotators
+and which lend themselves to immediate scalability through tools such as
+information retrieval and summarisation guided by expert advice. The article
+concludes with a summary of lessons learnt regarding the integrated approach
+towards systematic reviews and future directions for improvement, including
+explainability.
+
+
+
+
+
+
+
+ ☆ Can We Generate Visual Programs Without Prompting LLMs?
+
+
+ Visual programming prompts LLMs (large language models) to generate
+executable code for visual tasks like visual question answering (VQA).
+Prompt-based methods are difficult to improve while also being unreliable and
+costly in both time and money. Our goal is to develop an efficient visual
+programming system without 1) using prompt-based LLMs at inference time and 2)
+a large set of program and answer annotations. We develop a synthetic data
+augmentation approach and alternative program generation method based on
+decoupling programs into higher-level skills called templates and the
+corresponding arguments. Our results show that with data augmentation,
+prompt-free smaller LLMs ($\approx$ 1B parameters) are competitive with
+state-of-the-art models, with the added benefit of much faster inference.
+
+
+
+
+
+
+
+ ☆ Bilevel Joint Unsupervised and Supervised Training for Automatic Speech
+ Recognition
+
+
+
+
+
+
+
+
+ Xiaodong Cui, A F M Saif, Songtao Lu, Lisha Chen, Tianyi Chen, Brian Kingsbury, George Saon
+
+
+ In this paper, we propose a bilevel joint unsupervised and supervised
+training (BL-JUST) framework for automatic speech recognition. Compared to the
+conventional pre-training and fine-tuning strategy, which is a disconnected
+two-stage process, BL-JUST tries to optimize an acoustic model such that it
+simultaneously minimizes both the unsupervised and supervised loss functions.
+Because BL-JUST seeks matched local optima of both loss functions, acoustic
+representations learned by the acoustic model strike a good balance between
+being generic and task-specific. We solve the BL-JUST problem using
+penalty-based bilevel gradient descent and evaluate the trained deep neural
+network acoustic models on various datasets with a variety of architectures and
+loss functions. We show that BL-JUST can outperform the widely-used
+pre-training and fine-tuning strategy and some other popular semi-supervised
+techniques.
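+
+A very small sketch of the penalty-based idea described above: instead of a
+disconnected pre-train-then-fine-tune pipeline, every update minimizes the
+supervised loss plus a penalty-weighted unsupervised loss on the same model.
+The tiny model, the surrogate losses, and the penalty schedule are placeholders
+for illustration, not the BL-JUST algorithm itself.
+
+import torch
+import torch.nn as nn
+
+model = nn.Linear(40, 10)                  # stand-in for an acoustic model
+opt = torch.optim.SGD(model.parameters(), lr=1e-2)
+
+feats = torch.randn(64, 40)                # labeled batch
+labels = torch.randint(0, 10, (64,))
+unlabeled = torch.randn(256, 40)           # unlabeled batch
+
+penalty = 0.1                              # penalty weight on the unsupervised objective
+for step in range(50):
+    sup_loss = nn.functional.cross_entropy(model(feats), labels)
+    # Placeholder unsupervised term (a reconstruction or contrastive loss in practice).
+    unsup_loss = model(unlabeled).pow(2).mean()
+    loss = sup_loss + penalty * unsup_loss  # joint, single-loop objective
+    opt.zero_grad()
+    loss.backward()
+    opt.step()
+    penalty = min(1.0, penalty * 1.05)      # gradually tighten the penalty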
+
+
+
+ comment: Accepted by IEEE/ACM Transactions on Audio, Speech and Language
+ Processing
+
+
+
+
+
+
+
+ Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D'Oro
+
+
+ Describing skills in natural language has the potential to provide an
+accessible way to inject human knowledge about decision-making into an AI
+system. We present MaestroMotif, a method for AI-assisted skill design, which
+yields high-performing and adaptable agents. MaestroMotif leverages the
+capabilities of Large Language Models (LLMs) to effectively create and reuse
+skills. It first uses an LLM's feedback to automatically design rewards
+corresponding to each skill, starting from their natural language description.
+Then, it employs an LLM's code generation abilities, together with
+reinforcement learning, for training the skills and combining them to implement
+complex behaviors specified in language. We evaluate MaestroMotif using a suite
+of complex tasks in the NetHack Learning Environment (NLE), demonstrating that
+it surpasses existing approaches in both performance and usability.
+
+
+
+
+
+
+
+ ☆ TECO: Improving Multimodal Intent Recognition with Text Enhancement
+ through Commonsense Knowledge Extraction ACL
+
+
+ The objective of multimodal intent recognition (MIR) is to leverage various
+modalities, such as text, video, and audio, to detect user intentions, which is
+crucial for understanding human language and context in dialogue systems.
+Despite advances in this field, two main challenges persist: (1) effectively
+extracting and utilizing semantic information from robust textual features; (2)
+aligning and fusing non-verbal modalities with verbal ones effectively. This
+paper proposes a Text Enhancement with CommOnsense Knowledge Extractor (TECO)
+to address these challenges. We begin by extracting relations from both
+generated and retrieved knowledge to enrich the contextual information in the
+text modality. Subsequently, we align and integrate visual and acoustic
+representations with these enhanced text features to form a cohesive multimodal
+representation. Our experimental results show substantial improvements over
+existing baseline methods.
+
+
+
+ comment: Accepted at PACLIC 2024
+
+
+
+
+
+
+ ☆ Continual Learning for Encoder-only Language Models via a Discrete
+ Key-Value Bottleneck
+
+
+ Continual learning remains challenging across various natural language
+understanding tasks. When models are updated with new training data, they risk
+catastrophic forgetting of prior knowledge. In the present work, we introduce a
+discrete key-value bottleneck for encoder-only language models, allowing for
+efficient continual learning by requiring only localized updates. Inspired by
+the success of a discrete key-value bottleneck in vision, we address new and
+NLP-specific challenges. We experiment with different bottleneck architectures
+to find the most suitable variants regarding language, and present a generic
+discrete key initialization technique for NLP that is task independent. We
+evaluate the discrete key-value bottleneck in four continual learning NLP
+scenarios and demonstrate that it alleviates catastrophic forgetting. We
+showcase that it offers competitive performance to other popular continual
+learning methods, with lower computational costs.
+
+
+
+
+
+
+
+ ☆ EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache
+ Compression Based on Global-Local Importance
+
+
+ As large language models (LLMs) continue to advance, the demand for higher
+quality and faster processing of long contexts across various applications is
+growing. KV cache is widely adopted as it stores previously generated key and
+value tokens, effectively reducing redundant computations during inference.
+However, as memory overhead becomes a significant concern, efficient
+compression of KV cache has gained increasing attention. Most existing methods
+perform compression from two perspectives: identifying important tokens and
+designing compression strategies. However, these approaches often produce
+biased distributions of important tokens due to the influence of accumulated
+attention scores or positional encoding. Furthermore, they overlook the
+sparsity and redundancy across different heads, which leads to difficulties in
+preserving the most effective information at the head level. To this end, we
+propose EMS to overcome these limitations, while achieving better KV cache
+compression under extreme compression ratios. Specifically, we introduce a
+Global-Local score that combines accumulated attention scores from both global
+and local KV tokens to better identify the token importance. For the
+compression strategy, we design an adaptive and unified Evict-then-Merge
+framework that accounts for the sparsity and redundancy of KV tokens across
+different heads. Additionally, we implement the head-wise parallel compression
+through a zero-class mechanism to enhance efficiency. Extensive experiments
+demonstrate our SOTA performance even under extreme compression ratios. EMS
+consistently achieves the lowest perplexity, improves scores by over 1.28
+points across four LLMs on LongBench under a 256 cache budget, and preserves
+95% retrieval accuracy with a cache budget less than 2% of the context length
+in the Needle-in-a-Haystack task.
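+
+A rough sketch of the evict-then-merge idea for a single attention head: tokens
+are scored by a mix of global and local accumulated attention, the
+lowest-scoring KV pairs are evicted, and each evicted pair is merged into its
+most similar kept pair. The scoring mix, budget, and averaging merge rule are
+illustrative assumptions, not the exact EMS algorithm.
+
+import torch
+
+def evict_then_merge(keys, values, attn_history, budget, alpha=0.5, local_window=64):
+    # keys/values: [seq, dim]; attn_history: [num_queries, seq] accumulated attention.
+    global_score = attn_history.sum(dim=0)
+    local_score = attn_history[-local_window:].sum(dim=0)
+    score = alpha * global_score + (1 - alpha) * local_score   # Global-Local style score
+
+    keep_idx = score.topk(budget).indices
+    evict_mask = torch.ones(keys.size(0), dtype=torch.bool)
+    evict_mask[keep_idx] = False
+
+    kept_k, kept_v = keys[keep_idx].clone(), values[keep_idx].clone()
+    for k, v in zip(keys[evict_mask], values[evict_mask]):
+        j = torch.argmax(kept_k @ k)          # most similar kept key
+        kept_k[j] = 0.5 * (kept_k[j] + k)     # merge by averaging
+        kept_v[j] = 0.5 * (kept_v[j] + v)
+    return kept_k, kept_v
+
+k, v = torch.randn(512, 64), torch.randn(512, 64)
+attn = torch.rand(512, 512)                   # toy accumulated attention matrix
+cache_k, cache_v = evict_then_merge(k, v, attn, budget=128)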
+
+
+
+
+
+
+
+ ☆ GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek COLING 2025
+
+
+
+
+
+
+
+
+ Lefteris Loukas, Nikolaos Smyrnioudis, Chrysa Dikonomaki, Spyros Barbakos, Anastasios Toumazatos, John Koutsikakis, Manolis Kyriakakis, Mary Georgiou, Stavros Vassos, John Pavlopoulos, Ion Androutsopoulos
+
+
+ We present GR-NLP-TOOLKIT, an open-source natural language processing (NLP)
+toolkit developed specifically for modern Greek. The toolkit provides
+state-of-the-art performance in five core NLP tasks, namely part-of-speech
+tagging, morphological tagging, dependency parsing, named entity recognition,
+and Greeklish-to-Greek transliteration. The toolkit is based on pre-trained
+Transformers, it is freely available, and can be easily installed in Python
+(pip install gr-nlp-toolkit). It is also accessible through a demonstration
+platform on HuggingFace, along with a publicly available API for non-commercial
+use. We discuss the functionality provided for each task, the underlying
+methods, experiments against comparable open-source toolkits, and future
+possible enhancements. The toolkit is available at:
+https://github.com/nlpaueb/gr-nlp-toolkit
+
+
+ The reranker and generator are two critical components in the
+Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking
+relevant documents and generating responses. However, due to differences in
+pre-training data and objectives, there is an inevitable gap between the
+documents ranked as relevant by the reranker and those required by the
+generator to support answering the query. To address this gap, we propose
+RADIO, a novel and practical preference alignment framework with RAtionale
+DIstillatiOn. Specifically, we first propose a rationale extraction method that
+leverages the reasoning capabilities of Large Language Models (LLMs) to extract
+the rationales necessary for answering the query. Subsequently, a
+rationale-based alignment process is designed to rerank the documents based on
+the extracted rationales, and fine-tune the reranker to align the preferences.
+We conduct extensive experiments on two tasks across three datasets to
+demonstrate the effectiveness of our approach compared to baseline methods. Our
+code is released online to ease reproduction.
+
+
+ Comparative reviews are pivotal in understanding consumer preferences and
+influencing purchasing decisions. Comparative Quintuple Extraction (COQE) aims
+to identify five key components in text: the target entity, compared entities,
+compared aspects, opinions on these aspects, and polarity. Extracting precise
+comparative information from product reviews is challenging due to nuanced
+language and sequential task errors in traditional methods. To mitigate these
+problems, we propose MTP-COQE, an end-to-end model designed for COQE.
+Leveraging multi-perspective prompt-based learning, MTP-COQE effectively guides
+the generative model in comparative opinion mining tasks. Evaluation on the
+Camera-COQE (English) and VCOM (Vietnamese) datasets demonstrates MTP-COQE's
+efficacy in automating COQE, achieving superior performance with a 1.41% higher
+F1 score than the previous baseline models on the English dataset.
+Additionally, we designed a strategy to limit the generative model's creativity
+to ensure the output meets expectations. We also performed data augmentation to
+address data imbalance and to prevent the model from becoming biased towards
+dominant samples.
+
+
+
+
+
+
+
+ ☆ Multi-perspective Alignment for Increasing Naturalness in Neural Machine
+ Translation
+
+
+
+
+
+
+
+
+ Huiyuan Lai, Esther Ploeger, Rik van Noord, Antonio Toral
+
+
+ Neural machine translation (NMT) systems amplify lexical biases present in
+their training data, leading to artificially impoverished language in output
+translations. These language-level characteristics render automatic
+translations different from text originally written in a language and from
+human translations, which hinders their usefulness in, for example, creating
+evaluation
+datasets. Attempts to increase naturalness in NMT can fall short in terms of
+content preservation, where increased lexical diversity comes at the cost of
+translation accuracy. Inspired by the reinforcement learning from human
+feedback framework, we introduce a novel method that rewards both naturalness
+and content preservation. We experiment with multiple perspectives to produce
+more natural translations, aiming at reducing machine and human translationese.
+We evaluate our method on English-to-Dutch literary translation, and find that
+our best model produces translations that are lexically richer and exhibit more
+properties of human-written language, without loss in translation accuracy.
+
+
+
+
+
+
+
+ ☆ Bootstrapping Language-Guided Navigation Learning with Self-Refining
+ Data Flywheel
+
+
+ Creating high-quality data for training robust language-instructed agents is
+a long-lasting challenge in embodied AI. In this paper, we introduce a
+Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale
+navigational instruction-trajectory pairs by iteratively refining the data pool
+through the collaboration between two models, the instruction generator and the
+navigator, without any human-in-the-loop annotation. Specifically, SRDF starts
+with using a base generator to create an initial data pool for training a base
+navigator, followed by applying the trained navigator to filter the data pool.
+This leads to higher-fidelity data to train a better generator, which can, in
+turn, produce higher-quality data for training the next-round navigator. Such a
+flywheel establishes a data self-refining process, yielding a continuously
+improved and highly effective dataset for large-scale language-guided
+navigation learning. Our experiments demonstrate that after several flywheel
+rounds, the navigator elevates the performance boundary from 70% to 78% SPL on
+the classic R2R test set, surpassing human performance (76%) for the first
+time. Meanwhile, this process results in a superior generator, evidenced by a
+SPICE increase from 23.5 to 26.2, better than all previous VLN instruction
+generation methods. Finally, we demonstrate the scalability of our method
+through increasing environment and instruction diversity, and the
+generalization ability of our pre-trained navigator across various downstream
+navigation tasks, surpassing state-of-the-art methods by a large margin in all
+cases.
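+
+The flywheel described above can be summarized as a short loop alternating
+between generation, navigator-based filtering, and retraining. The
+train_generator, train_navigator, and score callables are hypothetical
+placeholders standing in for the paper's models, not the released code.
+
+def data_flywheel(envs, seed_pairs, train_generator, train_navigator, score,
+                  rounds=3, keep_threshold=0.8):
+    # Iteratively refine instruction-trajectory pairs without human annotation.
+    pool = list(seed_pairs)
+    generator = train_generator(pool)            # round-0 generator from seed data
+    navigator = train_navigator(pool)
+    for _ in range(rounds):
+        candidates = [generator(env) for env in envs]          # generate new pairs
+        # Keep only pairs the current navigator can follow well (self-refining filter).
+        pool = [p for p in candidates if score(navigator, p) >= keep_threshold]
+        generator = train_generator(pool)        # better data -> better generator
+        navigator = train_navigator(pool)        # ... and a better navigator
+    return pool, generator, navigator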
+
+
+
+ comment: 28 pages, Code and data are available at
+ https://github.com/wz0919/VLN-SRDF
+
+
+
+
+
+
+ ☆ Mitigating Out-of-Entity Errors in Named Entity Recognition: A
+ Sentence-Level Strategy COLING 2025
+
+
+ Many previous models of named entity recognition (NER) suffer from the
+problem of Out-of-Entity (OOE), i.e., the tokens in the entity mentions of the
+test samples have not appeared in the training samples, which hinders the
+achievement of satisfactory performance. To improve OOE-NER performance, in
+this paper, we propose a new framework, namely S+NER, which fully leverages
+sentence-level information. Our S+NER achieves better OOE-NER performance
+mainly due to the following two particular designs. 1) It first exploits the
+pre-trained language model's capability of understanding the target entity's
+sentence-level context with a template set. 2) Then, it refines the
+sentence-level representation based on the positive and negative templates,
+through a contrastive learning strategy and template pooling method, to obtain
+better NER results. Our extensive experiments on five benchmark datasets have
+demonstrated that, our S+NER outperforms some state-of-the-art OOE-NER models.
+
+
+
+ comment: Accepted by COLING 2025
+
+
+
+
+
+
+ ☆ Assessing Personalized AI Mentoring with Large Language Models in the
+ Computing Field
+
+
+ This paper provides an in-depth evaluation of three state-of-the-art Large
+Language Models (LLMs) for personalized career mentoring in the computing
+field, using three distinct student profiles that consider gender, race, and
+professional levels. We evaluated the performance of GPT-4, LLaMA 3, and PaLM 2
+using a zero-shot learning approach without human intervention. A quantitative
+evaluation was conducted through a custom natural language processing analytics
+pipeline to highlight the uniqueness of the responses and to identify words
+reflecting each student's profile, including race, gender, or professional
+level. The analysis of frequently used words in the responses indicates that
+GPT-4 offers more personalized mentoring compared to the other two LLMs.
+Additionally, a qualitative evaluation was performed to see if human experts
+reached similar conclusions. The analysis of survey responses shows that GPT-4
+outperformed the other two LLMs in delivering more accurate and useful
+mentoring while addressing specific challenges with encouraging language.
+Our work establishes a foundation for developing personalized mentoring tools
+based on LLMs, incorporating human mentors in the process to deliver a more
+impactful and tailored mentoring experience.
+
+
+ Mental manipulation severely undermines mental wellness by covertly and
+negatively distorting decision-making. While there is an increasing interest in
+mental health care within the natural language processing community, progress
+in tackling manipulation remains limited due to the complexity of detecting
+subtle, covert tactics in conversations. In this paper, we propose Intent-Aware
+Prompting (IAP), a novel approach for detecting mental manipulations using
+large language models (LLMs), providing a deeper understanding of manipulative
+tactics by capturing the underlying intents of participants. Experimental
+results on the MentalManip dataset demonstrate superior effectiveness of IAP
+against other advanced prompting strategies. Notably, our approach
+substantially reduces false negatives, helping detect more instances of mental
+manipulation with minimal misjudgment of positive cases. The code of this paper
+is available at https://github.com/Anton-Jiayuan-MA/Manip-IAP.
+
+
+
+
+
+
+
+ ♻ ☆ Evaluating Dialect Robustness of Language Models via Conversation
+ Understanding COLING'25
+
+
+ With an ever-growing number of LLMs reporting superlative performance for
+English, their ability to perform equitably for different dialects of English
+($\textit{i.e.}$, dialect robustness) needs to be ascertained. Specifically, we
+use English language (US English or Indian English) conversations between
+humans who play the word-guessing game of 'taboo'. We formulate two evaluative
+tasks: target word prediction (TWP) ($\textit{i.e.}$, predict the masked target
+word in a conversation) and target word selection (TWS) ($\textit{i.e.}$,
+select the most likely masked target word in a conversation, from among a set
+of candidate words). Extending MD3, an existing dialectal dataset of
+taboo-playing conversations, we introduce M-MD3, a target-word-masked version
+of MD3 with the en-US and en-IN subsets. We create two subsets: en-MV (where
+en-US is transformed to include dialectal information) and en-TR (where
+dialectal information is removed from en-IN). We evaluate one open-source
+(Llama3) and two closed-source (GPT-4/3.5) LLMs. LLMs perform significantly
+better for US English than Indian English for both TWP and TWS tasks, for all
+settings, exhibiting marginalisation against the Indian dialect of English.
+While GPT-based models perform the best, the comparatively smaller models work
+more equitably after fine-tuning. Our error analysis shows that the LLMs can
+understand the dialect better after fine-tuning using dialectal data. Our
+evaluation methodology exhibits a novel way to examine attributes of language
+models using pre-existing dialogue datasets.
+
+
+
+ comment: SUMEval@COLING'25
+
+
+
+
+
+
+ ♻ ☆ From Jack of All Trades to Master of One: Specializing LLM-based
+ Autoraters to a Test Set
+
+
+
+
+
+
+
+
+ Mara Finkelstein, Dan Deutsch, Parker Riley, Juraj Juraska, Geza Kovacs, Markus Freitag
+
+
+ As LLMs continue to become more powerful and versatile, human evaluation has
+quickly become intractable at scale and reliance on automatic metrics has
+become the norm. Recently, it has been shown that LLMs are themselves
+state-of-the-art evaluators for many tasks. These Autoraters are typically
+designed so that they generalize to new systems and test sets. In practice,
+however, evaluation is performed on a small set of fixed, canonical test sets,
+which are carefully curated to measure certain capabilities of interest and are
+not changed frequently. In this work, we design a method which specializes a
+prompted Autorater to a given test set, by leveraging historical ratings on the
+test set to construct in-context learning (ICL) examples. We evaluate our
+Specialist method on the task of fine-grained machine translation evaluation,
+and show that it dramatically outperforms the state-of-the-art XCOMET metric by
+54% and 119% on the WMT'23 and WMT'24 test sets, respectively. We perform
+extensive analyses to understand the representations learned by our Specialist
+metrics, and how variability in rater behavior affects their performance. We
+also verify the generalizability and robustness of our Specialist method for
+designing automatic metrics across different numbers of ICL examples, LLM
+backbones, systems to evaluate, and evaluation tasks.
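+ As a rough illustration of the Specialist idea, historical ratings on the
+fixed test set can be serialized into in-context examples for the prompted
+Autorater; the field names and prompt format below are assumptions, not the
+paper's template.
+
+    from dataclasses import dataclass
+
+    @dataclass
+    class RatedExample:
+        source: str
+        translation: str
+        rating: str  # a historical judgment on this test-set item
+
+    def build_specialist_prompt(history: list[RatedExample],
+                                source: str, translation: str) -> str:
+        # Serialize past ratings as few-shot demonstrations, then append the
+        # new segment to be rated.
+        shots = "\n\n".join(
+            f"Source: {ex.source}\nTranslation: {ex.translation}\n"
+            f"Assessment: {ex.rating}" for ex in history
+        )
+        return ("You are a machine translation quality rater.\n\n" + shots +
+                f"\n\nSource: {source}\nTranslation: {translation}\nAssessment:")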
+
+
+
+
+
+
+
+ ♻ ☆ Improve Mathematical Reasoning in Language Models by Automated Process
+ Supervision
+
+
+ Complex multi-step reasoning tasks, such as solving mathematical problems or
+generating code, remain a significant hurdle for even the most advanced large
+language models (LLMs). Verifying LLM outputs with an Outcome Reward Model
+(ORM) is a standard inference-time technique aimed at enhancing the reasoning
+performance of LLMs. However, this still proves insufficient for reasoning
+tasks with a lengthy or multi-hop reasoning chain, where the intermediate
+outcomes are neither properly rewarded nor penalized. Process supervision
+addresses this limitation by assigning intermediate rewards during the
+reasoning process. To date, the methods used to collect process supervision
+data have relied on either human annotation or per-step Monte Carlo estimation,
+both prohibitively expensive to scale, thus hindering the broad application of
+this technique. In response to this challenge, we propose a novel
+divide-and-conquer style Monte Carlo Tree Search (MCTS) algorithm named
+\textit{OmegaPRM} for the efficient collection of high-quality process
+supervision data. This algorithm swiftly identifies the first error in the
+Chain of Thought (CoT) with binary search and balances the positive and
+negative examples, thereby ensuring both efficiency and quality. As a result,
+we are able to collect over 1.5 million process supervision annotations to
+train Process Reward Models (PRMs). This fully automated process supervision
+alongside the weighted self-consistency algorithm is able to enhance LLMs' math
+reasoning performances. We improved the success rates of the instruction-tuned
+Gemini Pro model from 51\% to 69.4\% on MATH500 and from 86.4\% to 93.6\% on
+GSM8K. Similarly, we boosted the success rates of Gemma2 27B from 42.3\% to
+58.2\% on MATH500 and from 74.0\% to 92.2\% on GSM8K. The entire process
+operates without any human intervention or supervision, making our method both
+financially and ...
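+ The binary-search component can be sketched as follows, assuming a Monte
+Carlo oracle prefix_can_succeed that samples completions from a solution prefix
+and checks whether any reach the correct answer, and assuming success is
+monotone in the prefix length; this illustrates the idea with O(log n) oracle
+calls and is not the authors' implementation.
+
+    from typing import Callable, Sequence
+
+    def first_error_step(steps: Sequence[str],
+                         prefix_can_succeed: Callable[[Sequence[str]], bool]) -> int:
+        # Returns the index of the first step after which rollouts can no
+        # longer recover the correct answer, or len(steps) if none is found.
+        if prefix_can_succeed(steps):
+            return len(steps)              # whole chain is still recoverable
+        lo, hi = 0, len(steps)             # prefix[:lo] succeeds, prefix[:hi] fails
+        while hi - lo > 1:
+            mid = (lo + hi) // 2
+            if prefix_can_succeed(steps[:mid]):
+                lo = mid
+            else:
+                hi = mid
+        return hi - 1                      # index of the first erroneous step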
+
+
+
+ comment: 17 pages, 5 figures, 2 tables
+
+
+
+
+
+
+ ♻ ☆ ELBA: Learning by Asking for Embodied Visual Navigation and Task
+ Completion WACV 2025
+
+
+
+
+
+
+
+
+ Ying Shen, Daniel Bis, Cynthia Lu, Ismini Lourentzou
+
+
+ The research community has shown increasing interest in designing intelligent
+embodied agents that can assist humans in accomplishing tasks. Although there
+have been significant advancements in related vision-language benchmarks, most
+prior work has focused on building agents that follow instructions rather than
+endowing agents with the ability to ask questions to actively resolve ambiguities
+arising naturally in embodied environments. To address this gap, we propose an
+Embodied Learning-By-Asking (ELBA) model that learns when and what questions to
+ask to dynamically acquire additional information for completing the task. We
+evaluate ELBA on the TEACh vision-dialog navigation and task completion
+dataset. Experimental results show that the proposed method achieves improved
+task performance compared to baseline models without question-answering
+capabilities.
+
+
+
+ comment: 14 pages, 10 figures, WACV 2025
+
+
+
+
+
+
+ ♻ ☆ Can Large Language Models Understand Symbolic Graphics Programs?
+
+
+
+
+
+
+
+
+ Zeju Qiu, Weiyang Liu, Haiwen Feng, Zhen Liu, Tim Z. Xiao, Katherine M. Collins, Joshua B. Tenenbaum, Adrian Weller, Michael J. Black, Bernhard Schölkopf
+
+
+ Against the backdrop of enthusiasm for large language models (LLMs), there is
+an urgent need to scientifically assess their capabilities and shortcomings.
+This is nontrivial in part because it is difficult to find tasks which the
+models have not encountered during training. Utilizing symbolic graphics
+programs, we propose a domain well-suited to test multiple spatial-semantic
+reasoning skills of LLMs. Popular in computer graphics, these programs
+procedurally generate visual data. While LLMs exhibit impressive skills in
+general program synthesis and analysis, symbolic graphics programs offer a new
+layer of evaluation: they allow us to test an LLM's ability to answer
+different-grained semantic-level questions of the images or 3D geometries
+without a vision encoder. To semantically understand the symbolic programs,
+LLMs would need to possess the ability to "imagine" and reason how the
+corresponding graphics content would look with only the symbolic description.
+We use this task to evaluate LLMs by creating a large benchmark for the
+semantic visual understanding of symbolic graphics programs, built procedurally
+with minimal human effort. Particular emphasis is placed on transformations of
+images that leave the image level semantics invariant while introducing
+significant changes to the underlying program. We evaluate commercial and
+open-source LLMs on our benchmark to assess their ability to reason about
+visual output of programs, finding that LLMs considered stronger at reasoning
+generally perform better. Lastly, we introduce a novel method to improve this
+ability -- Symbolic Instruction Tuning (SIT), in which the LLM is finetuned
+with pre-collected instruction data on symbolic graphics programs.
+Interestingly, we find that SIT not only improves the LLM's understanding of
+symbolic programs, but it also improves general reasoning ability on various
+other benchmarks.
+
+
+ In-group language is an important signifier of group dynamics. This paper
+proposes a novel method for inducing lexicons of in-group language, which
+incorporates its socio-temporal context. Existing methods for lexicon induction
+do not capture the evolving nature of in-group language, nor the social
+structure of the community. Using dynamic word and user embeddings trained on
+conversations from online anti-women communities, our approach outperforms
+prior methods for lexicon induction. We develop a test set for the task of
+lexicon induction and a new lexicon of manosphere language, validated by human
+experts, which quantifies the relevance of each term to a specific
+sub-community at a given point in time. Finally, we present novel insights on
+in-group language which illustrate the utility of this approach.
+
+
+
+
+
+
+
+ ♻ ☆ Categorical Syllogisms Revisited: A Review of the Logical Reasoning
+ Abilities of LLMs for Analyzing Categorical Syllogism
+
+
+ There have been a huge number of benchmarks proposed to evaluate how large
+language models (LLMs) behave for logic inference tasks. However, it remains an
+open question how to properly evaluate this ability. In this paper, we provide
+a systematic overview of prior works on the logical reasoning ability of LLMs
+for analyzing categorical syllogisms. We first investigate all the possible
+variations for the categorical syllogisms from a purely logical perspective and
+then examine the underlying configurations (i.e., mood and figure) tested by
+the existing datasets. Our results indicate that compared to template-based
+synthetic datasets, crowdsourcing approaches normally sacrifice the coverage of
+configurations (i.e., mood and figure) of categorical syllogisms for more
+language variations, thus bringing challenges to fully testing LLMs under
+different situations. We then proceed to summarize the findings and
+observations for the performances of LLMs to infer the validity of syllogisms
+from the current literature. The error rate breakdown analyses suggest that the
+interpretation of the quantifiers seems to be the current bottleneck that
+limits the performances of the LLMs and is thus worth more attention. Finally,
+we discuss several points that might be worth considering when researchers plan
+on the future release of categorical syllogism datasets. We hope our work will
+not only provide a timely review of the current literature regarding
+categorical syllogisms, but also motivate more interdisciplinary research
+between communities, specifically computational linguists and logicians.
+
+
+
+
+
+
+
+
+ Guy Barel, Oren Tsur, Dan Vilenchik
+
+
+ Stance detection plays a pivotal role in enabling an extensive range of
+downstream applications, from discourse parsing to tracing the spread of fake
+news and the denial of scientific facts. While most stance classification
+models rely on textual representation of the utterance in question, prior work
+has demonstrated the importance of the conversational context in stance
+detection. In this work we introduce TASTE -- a multimodal architecture for
+stance detection that harmoniously fuses Transformer-based content embedding
+with unsupervised structural embedding. Through the fine-tuning of a pretrained
+transformer and the amalgamation with social embedding via a Gated Residual
+Network (GRN) layer, our model adeptly captures the complex interplay between
+content and conversational structure in determining stance. TASTE achieves
+state-of-the-art results on common benchmarks, significantly outperforming an
+array of strong baselines. Comparative evaluations underscore the benefits of
+social grounding -- emphasizing the criticality of concurrently harnessing both
+content and structure for enhanced stance detection.
+
+
+
+ comment: COLING 2025
+
+
+
+
+
+
+ ♻ ☆ Evaluating Deduplication Techniques for Economic Research Paper Titles
+ with a Focus on Semantic Similarity using NLP and LLMs
+
+
+ This study investigates efficient deduplication techniques for a large NLP
+dataset of economic research paper titles. We explore various pairing methods
+alongside established distance measures (Levenshtein distance, cosine
+similarity) and a sBERT model for semantic evaluation. Our findings suggest a
+potentially low prevalence of duplicates based on the observed semantic
+similarity across different methods. Further exploration with a human-annotated
+ground-truth set is carried out for a more conclusive assessment. The results
+support the findings from the NLP- and LLM-based distance metrics.
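+ For reference, the two classical measures mentioned above can be computed as
+in the sketch below for a pair of titles (the sBERT variant would replace the
+bag-of-words vectors with sentence embeddings); this is a generic illustration,
+not the study's exact pipeline.
+
+    import math
+    from collections import Counter
+
+    def levenshtein(a: str, b: str) -> int:
+        # Classic dynamic-programming edit distance.
+        prev = list(range(len(b) + 1))
+        for i, ca in enumerate(a, 1):
+            curr = [i]
+            for j, cb in enumerate(b, 1):
+                curr.append(min(prev[j] + 1,                # deletion
+                                curr[j - 1] + 1,            # insertion
+                                prev[j - 1] + (ca != cb)))  # substitution
+            prev = curr
+        return prev[-1]
+
+    def cosine_similarity(a: str, b: str) -> float:
+        # Bag-of-words cosine similarity between two titles.
+        va, vb = Counter(a.lower().split()), Counter(b.lower().split())
+        dot = sum(va[t] * vb[t] for t in va)
+        norm = (math.sqrt(sum(v * v for v in va.values())) *
+                math.sqrt(sum(v * v for v in vb.values())))
+        return dot / norm if norm else 0.0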
+
+
+
+ comment: 6 pages, 1 figure
+
+
+
+
+
+
+ ♻ ☆ Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study
+ on Two Materials Datasets
+
+
+
+
+
+
+
+
+ Satanu Ghosh, Neal R. Brodnik, Carolina Frey, Collin Holgate, Tresa M. Pollock, Samantha Daly, Samuel Carton
+
+
+ We explore the ability of GPT-4 to perform ad-hoc schema based information
+extraction from scientific literature. We assess specifically whether it can,
+with a basic prompting approach, replicate two existing material science
+datasets, given the manuscripts from which they were originally manually
+extracted. We employ materials scientists to perform a detailed manual error
+analysis to assess where the model struggles to faithfully extract the desired
+information, and draw on their insights to suggest research directions to
+address this broadly important task.
+
+
+
+ comment: Update on 12/11/2024: Added some relevant literature that we missed
+ in the previous version of the paper
+
+
+
+
+
+
+
+ Pu Zhao, Xuan Shen, Zhenglun Kong, Yixin Shen, Sung-En Chang, Timothy Rupprecht, Lei Lu, Enfu Nan, Changdi Yang, Yumei He, Xingchen Xu, Yu Huang, Wei Wang, Yue Chen, Yong He, Yanzhi Wang
+
+
+ Recently, Large Language Models (LLMs) have undergone a significant
+transformation, marked by a rapid rise in both their popularity and
+capabilities. Leading this evolution are proprietary LLMs like GPT-4 and
+GPT-o1, which have captured widespread attention in the AI community due to
+their remarkable performance and versatility. Simultaneously, open-source LLMs,
+such as LLaMA and Mistral, have made great contributions to the ever-increasing
+popularity of LLMs due to the ease of customizing and deploying the models across
+diverse applications. Although open-source LLMs present unprecedented
+opportunities for innovation and research, the commercialization of LLMs has
+raised concerns about transparency, reproducibility, and safety. Many
+open-source LLMs fail to meet fundamental transparency requirements by
+withholding essential components like training code and data, and some use
+restrictive licenses whilst claiming to be "open-source," which may hinder
+further innovations on LLMs. To mitigate this issue, we introduce Moxin 7B, a
+fully open-source LLM developed in accordance with the Model Openness Framework
+(MOF), a ranked classification system that evaluates AI models based on model
+completeness and openness, adhering to principles of open science, open source,
+open data, and open access. Our model achieves the highest MOF classification
+level of "open science" through the comprehensive release of pre-training code
+and configurations, training and fine-tuning datasets, and intermediate and
+final checkpoints. Experiments show that our model achieves superior
+performance in zero-shot evaluation compared with popular 7B models and
+performs competitively in few-shot evaluation.
+
+
+
+
+
+
+
+ ♻ ☆ Fusing Domain-Specific Content from Large Language Models into Knowledge
+ Graphs for Enhanced Zero Shot Object State Classification AAAI
+
+
+ Domain-specific knowledge can significantly contribute to addressing a wide
+variety of vision tasks. However, the generation of such knowledge entails
+considerable human labor and time costs. This study investigates the potential
+of Large Language Models (LLMs) in generating and providing domain-specific
+information through semantic embeddings. To achieve this, an LLM is integrated
+into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors
+in the context of the Vision-based Zero-shot Object State Classification task.
+We thoroughly examine the behavior of the LLM through an extensive ablation
+study. Our findings reveal that the integration of LLM-based embeddings, in
+combination with general-purpose pre-trained embeddings, leads to substantial
+performance improvements. Drawing insights from this ablation study, we conduct
+a comparative analysis against competing models, thereby highlighting the
+state-of-the-art performance achieved by the proposed approach.
+
+
+
+ comment: Accepted at the AAAI-MAKE 2024
+
+
+
+
+
+
+ ♻ ☆ Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
+
+
+
+
+
+
+
+
+ Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Foerster, Tim Rocktäschel, Roberta Raileanu
+
+
+ As large language models (LLMs) become increasingly prevalent across many
+real-world applications, understanding and enhancing their robustness to
+adversarial attacks is of paramount importance. Existing methods for
+identifying adversarial prompts tend to focus on specific domains, lack
+diversity, or require extensive human annotations. To address these
+limitations, we present Rainbow Teaming, a novel black-box approach for
+producing a diverse collection of adversarial prompts. Rainbow Teaming casts
+adversarial prompt generation as a quality-diversity problem and uses
+open-ended search to generate prompts that are both effective and diverse.
+Focusing on the safety domain, we use Rainbow Teaming to target various
+state-of-the-art LLMs, including the Llama 2 and Llama 3 models. Our approach
+reveals hundreds of effective adversarial prompts, with an attack success rate
+exceeding 90% across all tested models. Furthermore, we demonstrate that
+prompts generated by Rainbow Teaming are highly transferable and that
+fine-tuning models with synthetic data generated by our method significantly
+enhances their safety without sacrificing general performance or helpfulness.
+We additionally explore the versatility of Rainbow Teaming by applying it to
+question answering and cybersecurity, showcasing its potential to drive robust
+open-ended self-improvement in a wide range of applications.
+
+
+
+
+
+
+
+ ♻ ☆ Comparative Analysis of Pooling Mechanisms in LLMs: A Sentiment Analysis
+ Perspective
+
+
+
+
+
+
+
+
+ Jinming Xing, Ruilin Xing, Yan Sun
+
+
+ Large Language Models (LLMs) have revolutionized natural language processing
+(NLP) by delivering state-of-the-art performance across a variety of tasks.
+Among these, Transformer-based models like BERT and GPT rely on pooling layers
+to aggregate token-level embeddings into sentence-level representations. Common
+pooling mechanisms such as Mean, Max, and Weighted Sum play a pivotal role in
+this aggregation process. Despite their widespread use, the comparative
+performance of these strategies on different LLM architectures remains
+underexplored. To address this gap, this paper investigates the effects of
+these pooling mechanisms on two prominent LLM families -- BERT and GPT, in the
+context of sentence-level sentiment analysis. Comprehensive experiments reveal
+that each pooling mechanism exhibits unique strengths and weaknesses depending
+on the task's specific requirements. Our findings underline the importance of
+selecting pooling methods tailored to the demands of particular applications,
+prompting a re-evaluation of common assumptions regarding pooling operations.
+By offering actionable insights, this study contributes to the optimization of
+LLM-based models for downstream tasks.
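+ The three pooling mechanisms compared above are easy to state concretely; the
+sketch below operates on a (num_tokens x hidden_dim) matrix of token embeddings
+with a 0/1 attention mask, and treats the weighted-sum weights as an arbitrary
+non-negative vector (e.g. learned attention scores), which is an assumption for
+illustration.
+
+    import numpy as np
+
+    def mean_pool(tok: np.ndarray, mask: np.ndarray) -> np.ndarray:
+        m = mask[:, None]
+        return (tok * m).sum(axis=0) / np.maximum(m.sum(), 1)
+
+    def max_pool(tok: np.ndarray, mask: np.ndarray) -> np.ndarray:
+        return np.where(mask[:, None] == 1, tok, -np.inf).max(axis=0)
+
+    def weighted_sum_pool(tok: np.ndarray, mask: np.ndarray,
+                          weights: np.ndarray) -> np.ndarray:
+        w = weights * mask
+        w = w / np.maximum(w.sum(), 1e-9)   # normalize over unmasked tokens
+        return (tok * w[:, None]).sum(axis=0)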
+
+
+ With the burgeoning amount of image-text pair data and the diversity of
+Vision-and-Language (V\&L) tasks, scholars have introduced an abundance of deep
+learning models in this research domain. Furthermore, in recent years, transfer
+learning has also shown tremendous success in Computer Vision for tasks such as
+Image Classification, Object Detection, etc., and in Natural Language
+Processing for Question Answering, Machine Translation, etc. Inheriting the
+spirit of Transfer Learning, researchers in V\&L have devised multiple
+pretraining techniques on large-scale datasets in order to enhance the
+performance of downstream tasks. The aim of this article is to provide a
+comprehensive review of contemporary V\&L pretraining models. In particular,
+we categorize and delineate pretraining approaches, along with the summary of
+state-of-the-art vision-and-language pretrained models. Moreover, a list of
+training datasets and downstream tasks is supplied to further polish the
+perspective into V\&L pretraining. Lastly, we take a further step to
+discuss numerous directions for future research.
+
+
+
+ comment: The content of the paper has become outdated. I would like to rewrite a
+ new version with completely new information.
+
+ Building on the success of large language models (LLMs), recent advancements
+such as GPT-4o have enabled real-time speech interactions through LLM-based
+voice assistants, offering a significantly improved user experience compared to
+traditional text-based interactions. However, the absence of benchmarks
+designed to evaluate these speech interaction capabilities has hindered
+progress in the development of LLM-based voice assistants. Current evaluations focus
+primarily on automatic speech recognition (ASR) or general knowledge evaluation
+with clean speeches, neglecting the more intricate, real-world scenarios that
+involve diverse speaker characteristics, environmental and content factors. To
+address this, we introduce VoiceBench, the first benchmark designed to provide
+a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also
+includes both real and synthetic spoken instructions that incorporate the above
+three key real-world variations. Extensive experiments reveal the limitations
+of current LLM-based voice assistant models and offer valuable insights for
+future research and development in this field.
+
+
+
+ comment: Work in progress. Data is available at
+ https://github.com/MatthewCYM/VoiceBench
+
+
+
+
+
+
+ ♻ ☆ Topic Classification of Case Law Using a Large Language Model and a New
+ Taxonomy for UK Law: AI Insights into Summary Judgment
+
+
+
+
+
+
+
+
+ Holli Sargeant, Ahmed Izzidien, Felix Steffek
+
+
+ This paper addresses a critical gap in legal analytics by developing and
+applying a novel taxonomy for topic classification of summary judgment cases in
+the United Kingdom. Using a curated dataset of summary judgment cases, we use
+the Large Language Model Claude 3 Opus to explore functional topics and trends.
+We find that Claude 3 Opus correctly classified the topic with an accuracy of
+87.13% and an F1 score of 0.87. The analysis reveals distinct patterns in the
+application of summary judgments across various legal domains. As case law in
+the United Kingdom is not originally labelled with keywords or a topic
+filtering option, the findings not only refine our understanding of the
+thematic underpinnings of summary judgments but also illustrate the potential
+of combining traditional and AI-driven approaches in legal classification.
+Therefore, this paper provides a new and general taxonomy for UK law. The
+implications of this work serve as a foundation for further research and policy
+discussions in the field of judicial administration and computational legal
+research methodologies.
+
+
+
+
+
+
+
+
+ Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, Aleksander Mądry
+
+
+ We introduce MLE-bench, a benchmark for measuring how well AI agents perform
+at machine learning engineering. To this end, we curate 75 ML
+engineering-related competitions from Kaggle, creating a diverse set of
+challenging tasks that test real-world ML engineering skills such as training
+models, preparing datasets, and running experiments. We establish human
+baselines for each competition using Kaggle's publicly available leaderboards.
+We use open-source agent scaffolds to evaluate several frontier language models
+on our benchmark, finding that the best-performing setup--OpenAI's o1-preview
+with AIDE scaffolding--achieves at least the level of a Kaggle bronze medal in
+16.9% of competitions. In addition to our main results, we investigate various
+forms of resource scaling for AI agents and the impact of contamination from
+pre-training. We open-source our benchmark code (github.com/openai/mle-bench/)
+to facilitate future research in understanding the ML engineering capabilities
+of AI agents.
+
+
+
+ comment: 10 pages, 17 pages appendix. Equal contribution by first seven
+ authors, authors randomized. Corrected footnote 4
+
+
+
+
+
+
+ ♻ ☆ Filipino Benchmarks for Measuring Sexist and Homophobic Bias in
+ Multilingual Language Models from Southeast Asia COLING 2025
+
+
+ Bias studies on multilingual models confirm the presence of gender-related
+stereotypes in masked models processing languages with high NLP resources. We
+expand on this line of research by introducing Filipino CrowS-Pairs and
+Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in
+pretrained language models (PLMs) handling texts in Filipino, a low-resource
+language from the Philippines. The benchmarks consist of 7,074 new challenge
+pairs resulting from our cultural adaptation of English bias evaluation
+datasets, a process that we document in detail to guide similar forthcoming
+efforts. We apply the Filipino benchmarks to masked and causal multilingual
+models, including those pretrained on Southeast Asian data, and find that they
+contain considerable amounts of bias. We also find that for multilingual
+models, the extent of bias learned for a particular language is influenced by
+how much pretraining data in that language a model was exposed to. Our
+benchmarks and insights can serve as a foundation for future work analyzing and
+mitigating bias in multilingual models.
+
+
+
+ comment: Accepted for presentation at The First Workshop on Language Models
+ for Low-Resource Languages (LoResLM) at The 31st International Conference on
+ Computational Linguistics (COLING 2025)
+
+
+
+
+
+
+
+
+
+ Information Retrieval 18
+
+
+
+
+
+ ☆ jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
+
+
+
+
+
+
+
+
+ Andreas Koukounas, Georgios Mastrapas, Bo Wang, Mohammad Kalim Akram, Sedigheh Eslami, Michael Günther, Isabelle Mohr, Saba Sturua, Scott Martens, Nan Wang, Han Xiao
+
+
+ Contrastive Language-Image Pretraining (CLIP) is a highly effective method
+for aligning images and texts in a shared embedding space. These models are
+widely used for tasks such as cross-modal information retrieval and multi-modal
+understanding. However, CLIP models often struggle with text-only tasks,
+underperforming compared to specialized text models. This performance disparity
+forces retrieval systems to rely on separate models for text-only and
+multi-modal tasks. In this work, we build upon our previous model,
+jina-clip-v1, by introducing a refined framework that utilizes multi-task,
+multi-stage contrastive learning across multiple languages, coupled with an
+improved training recipe to enhance text-only retrieval. The resulting model,
+jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks,
+while adding multilingual support, better understanding of complex visual
+documents and efficiency gains thanks to Matryoshka Representation Learning and
+vector truncation. The model performs comparably to the state-of-the-art in
+both multilingual-multimodal and multilingual text retrieval benchmarks,
+addressing the challenge of unifying text-only and multi-modal retrieval
+systems.
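+ Matryoshka-style truncation is simple to illustrate: keep only the first k
+dimensions of an embedding and re-normalize, trading a little accuracy for a
+much smaller index; the dimensions below are arbitrary examples, not the
+model's actual configuration.
+
+    import numpy as np
+
+    def truncate_embedding(vec: np.ndarray, k: int) -> np.ndarray:
+        # Keep the leading k dimensions and re-normalize for cosine search.
+        head = vec[:k]
+        norm = np.linalg.norm(head)
+        return head / norm if norm > 0 else head
+
+    full = np.random.randn(1024).astype(np.float32)  # hypothetical full vector
+    short = truncate_embedding(full, 256)            # 4x smaller footprint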
+
+
+
+
+
+
+
+ ☆ Reducing Popularity Influence by Addressing Position Bias
+
+
+
+
+
+
+
+
+ Andrii Dzhoha, Alexey Kurennoy, Vladimir Vlasov, Marjan Celikik
+
+
+ Position bias poses a persistent challenge in recommender systems, with much
+of the existing research focusing on refining ranking relevance and driving
+user engagement. However, in practical applications, the mitigation of position
+bias does not always result in detectable short-term improvements in ranking
+relevance. This paper provides an alternative, practically useful view of what
+position bias reduction methods can achieve. It demonstrates that position
+debiasing can spread visibility and interactions more evenly across the
+assortment, effectively reducing a skew in the popularity of items induced by
+the position bias through a feedback loop. We offer an explanation of how
+position bias affects item popularity. This includes an illustrative model of
+the item popularity histogram and the effect of the position bias on its
+skewness. Through offline and online experiments on our large-scale e-commerce
+platform, we show that position debiasing can significantly improve assortment
+utilization, without any degradation in user engagement or financial metrics.
+This makes the ranking fairer and helps attract more partners or content
+providers, benefiting the customers and the business in the long term.
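+ As a toy illustration of the feedback loop described above (not the paper's
+model), items with identical intrinsic relevance accumulate highly skewed click
+counts once ranking by past clicks is combined with a position-dependent
+examination probability:
+
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+    n_items, n_rounds, relevance = 50, 2000, 0.2   # all items equally relevant
+    clicks = np.zeros(n_items)
+    examine = 1.0 / (1.0 + np.arange(n_items))     # assumed position-bias curve
+
+    for _ in range(n_rounds):
+        order = np.argsort(-clicks)                # rank items by clicks so far
+        p_examine = np.empty(n_items)
+        p_examine[order] = examine                 # rank r examined w.p. examine[r]
+        clicks += rng.random(n_items) < p_examine * relevance
+
+    top_share = np.sort(clicks)[::-1][:5].sum() / clicks.sum()
+    print(f"click share of the top 10% of items: {top_share:.2f}")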
+
+
+
+
+
+
+
+
+ Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky Chen, Zhang Gabriel Li, Xialo Gao, Wei Shao, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Hamid Eghbalzadeh
+
+
+ Sequential recommendation systems aim to provide personalized recommendations
+for users based on their interaction history. To achieve this, they often
+incorporate auxiliary information, such as textual descriptions of items and
+auxiliary tasks, like predicting user preferences and intent. Despite numerous
+efforts to enhance these models, they still suffer from limited
+personalization. To address this issue, we propose a new paradigm, which we
+term preference discerning. In preference discerning, we explicitly condition a
+generative sequential recommendation system on user preferences within its
+context. To this end, we generate user preferences using Large Language Models
+(LLMs) based on user reviews and item-specific data. To evaluate preference
+discerning capabilities of sequential recommendation systems, we introduce a
+novel benchmark that provides a holistic evaluation across various scenarios,
+including preference steering and sentiment following. We assess current
+state-of-the-art methods using our benchmark and show that they struggle to
+accurately discern user preferences. Therefore, we propose a new method named
+Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce
+$\textbf{d}$iscern$\textbf{er}$), which improves upon existing methods and
+achieves state-of-the-art performance on our benchmark. Our results show that
+Mender can be effectively guided by human preferences even though they have not
+been observed during training, paving the way toward more personalized
+sequential recommendation systems. We will open-source the code and benchmarks
+upon publication.
+
+
+
+ comment: 11 pages + references and appendix
+
+
+
+
+
+
+ ☆ Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based
+ Automated Requirement Traceability and Compliance Checks
+
+
+
+
+
+
+
+
+ Arsalan Masoudifard, Mohammad Mowlavi Sorond, Moein Madadi, Mohammad Sabokrou, Elahe Habibi
+
+
+ Ensuring that Software Requirements Specifications (SRS) align with
+higher-level organizational or national requirements is vital, particularly in
+regulated environments such as finance and aerospace. In these domains,
+maintaining consistency, adhering to regulatory frameworks, minimizing errors,
+and meeting critical expectations are essential for the reliable functioning of
+systems. The widespread adoption of large language models (LLMs) highlights
+their immense potential, yet there remains considerable scope for improvement
+in retrieving relevant information and enhancing reasoning capabilities. This
+study demonstrates that integrating a robust Graph-RAG framework with advanced
+prompt engineering techniques, such as Chain of Thought and Tree of Thought,
+can significantly enhance performance. Compared to baseline RAG methods and
+simple prompting strategies, this approach delivers more accurate and
+context-aware results. While this method demonstrates significant improvements
+in performance, it comes with challenges. It is both costly and more complex to
+implement across diverse contexts, requiring careful adaptation to specific
+scenarios. Additionally, its effectiveness heavily relies on having complete
+and accurate input data, which may not always be readily available, posing
+further limitations to its scalability and practicality.
+
+
+
+
+
+
+
+ ☆ AltFS: Agency-light Feature Selection with Large Language Models in Deep
+ Recommender Systems
+
+
+ Feature selection is crucial in recommender systems for improving model
+efficiency and predictive performance. Traditional methods rely on agency
+models, such as decision trees or neural networks, to estimate feature
+importance. However, this approach is inherently limited, as the agency models
+may fail to learn effectively in all scenarios due to suboptimal training
+conditions (e.g., feature collinearity, high-dimensional sparsity, and data
+insufficiency). In this paper, we propose AltFS, an Agency-light Feature
+Selection method for deep recommender systems. AltFS integrates semantic
+reasoning from Large Language Models (LLMs) with task-specific learning from
+agency models. Initially, LLMs will generate a semantic ranking of feature
+importance, which is then refined by an agency model, combining world knowledge
+with task-specific insights. Extensive experiments on three public datasets
+from real-world recommender platforms demonstrate the effectiveness of AltFS.
+Our code is publicly available for reproducibility.
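+ Purely for illustration, one simple way to combine an LLM-produced semantic
+ranking of features with agency-model importance scores is a weighted fusion
+such as the sketch below; the reciprocal-rank rule and the weight alpha are
+assumptions, not the refinement procedure used by AltFS.
+
+    def fuse_feature_scores(llm_ranking: list[str],
+                            agency_importance: dict[str, float],
+                            alpha: float = 0.5) -> list[str]:
+        # Reciprocal-rank score from the LLM ordering, normalized agency-model
+        # importance, then a convex combination of the two.
+        llm_score = {f: 1.0 / (r + 1) for r, f in enumerate(llm_ranking)}
+        max_imp = max(agency_importance.values(), default=1.0) or 1.0
+        feats = set(llm_ranking) | set(agency_importance)
+        fused = {f: alpha * llm_score.get(f, 0.0) +
+                    (1 - alpha) * agency_importance.get(f, 0.0) / max_imp
+                 for f in feats}
+        return sorted(fused, key=fused.get, reverse=True)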
+
+
+
+ comment: under review
+
+
+
+
+
+
+ ☆ InvDiff: Invariant Guidance for Bias Mitigation in Diffusion Models KDD 2025
+
+
+ As one of the most successful generative models, diffusion models have
+demonstrated remarkable efficacy in synthesizing high-quality images. These
+models learn the underlying high-dimensional data distribution in an
+unsupervised manner. Despite their success, diffusion models are highly
+data-driven and prone to inheriting the imbalances and biases present in
+real-world data. Some studies have attempted to address these issues by
+designing text prompts for known biases or using bias labels to construct
+unbiased data. While these methods have shown improved results, real-world
+scenarios often contain various unknown biases, and obtaining bias labels is
+particularly challenging. In this paper, we emphasize the necessity of
+mitigating bias in pre-trained diffusion models without relying on auxiliary
+bias annotations. To tackle this problem, we propose a framework, InvDiff,
+which aims to learn invariant semantic information for diffusion guidance.
+Specifically, we propose identifying underlying biases in the training data and
+designing a novel debiasing training objective. Then, we employ a lightweight
+trainable module that automatically preserves invariant semantic information
+and uses it to guide the diffusion model's sampling process toward unbiased
+outcomes simultaneously. Notably, we only need to learn a small number of
+parameters in the lightweight learnable module without altering the pre-trained
+diffusion model. Furthermore, we provide a theoretical guarantee that the
+implementation of InvDiff is equivalent to reducing the error upper bound of
+generalization. Extensive experimental results on three publicly available
+benchmarks demonstrate that InvDiff effectively reduces biases while
+maintaining the quality of image generation. Our code is available at
+https://github.com/Hundredl/InvDiff.
+
+
+
+ comment: KDD 2025
+
+
+
+
+
+
+ ☆ NyayaAnumana & INLegalLlama: The Largest Indian Legal Judgment
+ Prediction Dataset and Specialized Language Model for Enhanced Decision
+ Analysis COLING 2025
+
+
+ The integration of artificial intelligence (AI) in legal judgment prediction
+(LJP) has the potential to transform the legal landscape, particularly in
+jurisdictions like India, where a significant backlog of cases burdens the
+legal system. This paper introduces NyayaAnumana, the largest and most diverse
+corpus of Indian legal cases compiled for LJP, encompassing a total of 7,02,945
+preprocessed cases. NyayaAnumana, which combines the words "Nyay" (judgment)
+and "Anuman" (prediction or inference) respectively for most major Indian
+languages, includes a wide range of cases from the Supreme Court, High Courts,
+Tribunal Courts, District Courts, and Daily Orders and, thus, provides
+unparalleled diversity and coverage. Our dataset surpasses existing datasets
+like PredEx and ILDC, offering a comprehensive foundation for advanced AI
+research in the legal domain.
+ In addition to the dataset, we present INLegalLlama, a domain-specific
+generative large language model (LLM) tailored to the intricacies of the Indian
+legal system. It is developed through a two-phase training approach over a base
+LLaMa model. First, Indian legal documents are injected using continual
+pretraining. Second, task-specific supervised finetuning is done. This method
+allows the model to achieve a deeper understanding of legal contexts.
+ Our experiments demonstrate that incorporating diverse court data
+significantly boosts model accuracy, achieving approximately 90% F1-score in
+prediction tasks. INLegalLlama not only improves prediction accuracy but also
+offers comprehensible explanations, addressing the need for explainability in
+AI-assisted legal decisions.
+
+
+
+ comment: Accepted at COLING 2025
+
+
+
+
+
+
+ ☆ Augmenting Sequential Recommendation with Balanced Relevance and
+ Diversity AAAI 2025
+
+
+ By generating new yet effective data, data augmentation has become a
+promising method to mitigate the data sparsity problem in sequential
+recommendation. Existing works focus on augmenting the original data but rarely
+explore the issue of imbalanced relevance and diversity for augmented data,
+leading to semantic drift problems or limited performance improvements. In this
+paper, we propose a novel Balanced data Augmentation Plugin for Sequential
+Recommendation (BASRec) to generate data that balance relevance and diversity.
+BASRec consists of two modules: Single-sequence Augmentation and Cross-sequence
+Augmentation. The former leverages the randomness of the heuristic operators to
+generate diverse sequences for a single user, after which the diverse and the
+original sequences are fused at the representation level to obtain relevance.
+Further, we devise a reweighting strategy to enable the model to learn the
+preferences based on the two properties adaptively. The Cross-sequence
+Augmentation performs nonlinear mixing between different sequence
+representations from two directions. It produces virtual sequence
+representations that are diverse enough but retain the vital semantics of the
+original sequences. These two modules enable the model to discover
+fine-grained preference knowledge from single-user and cross-user
+perspectives. Extensive experiments verify the effectiveness of BASRec. The
+average improvement is up to 72.0% on GRU4Rec, 33.8% on SASRec, and 68.5% on
+FMLP-Rec. We demonstrate that BASRec generates data with a better balance
+between relevance and diversity than existing methods. The source code is
+available at https://github.com/KingGugu/BASRec.
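+ As a simplified stand-in for the cross-sequence mixing described above, two
+users' sequence representations can be interpolated with a skewed coefficient
+so that the virtual representation stays close to one of the originals; the
+Beta-sampled coefficient and this linear mixing form are illustrative
+assumptions rather than BASRec's exact operator.
+
+    import numpy as np
+
+    def mix_sequence_reprs(h_a: np.ndarray, h_b: np.ndarray,
+                           rng: np.random.Generator) -> np.ndarray:
+        # A skewed coefficient keeps the virtual representation near one source,
+        # retaining its vital semantics while injecting diversity.
+        lam = rng.beta(0.2, 0.2)
+        return lam * h_a + (1.0 - lam) * h_b
+
+    rng = np.random.default_rng(0)
+    virtual = mix_sequence_reprs(rng.standard_normal(64),
+                                 rng.standard_normal(64), rng)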
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ Large Language Models for Scholarly Ontology Generation: An Extensive
+ Analysis in the Engineering Field
+
+
+ Ontologies of research topics are crucial for structuring scientific
+knowledge, enabling scientists to navigate vast amounts of research, and
+forming the backbone of intelligent systems such as search engines and
+recommendation systems. However, manual creation of these ontologies is
+expensive, slow, and often results in outdated and overly general
+representations. As a solution, researchers have been investigating ways to
+automate or semi-automate the process of generating these ontologies. This
+paper offers a comprehensive analysis of the ability of large language models
+(LLMs) to identify semantic relationships between different research topics,
+which is a critical step in the development of such ontologies. To this end, we
+developed a gold standard based on the IEEE Thesaurus to evaluate the task of
+identifying four types of relationships between pairs of topics: broader,
+narrower, same-as, and other. Our study evaluates the performance of seventeen
+LLMs, which differ in scale, accessibility (open vs. proprietary), and model
+type (full vs. quantised), while also assessing four zero-shot reasoning
+strategies. Several models have achieved outstanding results, including
+Mixtral-8x7B, Dolphin-Mistral-7B, and Claude 3 Sonnet, with F1-scores of 0.847,
+0.920, and 0.967, respectively. Furthermore, our findings demonstrate that
+smaller, quantised models, when optimised through prompt engineering, can
+deliver performance comparable to much larger proprietary models, while
+requiring significantly fewer computational resources.
+
+
+
+ comment: submitted to Information Processing & Management
+
+
+
+
+
+
+ ☆ Exploring Multidimensional Checkworthiness: Designing AI-assisted Claim
+ Prioritization for Human Fact-checkers
+
+
+
+
+
+
+
+
+ Houjiang Liu, Jacek Gwizdka, Matthew Lease
+
+
+ Given the massive volume of potentially false claims circulating online,
+claim prioritization is essential in allocating limited human resources
+available for fact-checking. In this study, we perceive claim prioritization as
+an information retrieval (IR) task: just as multidimensional IR relevance, with
+many factors influencing which search results a user deems relevant,
+checkworthiness is also multi-faceted, subjective, and even personal, with many
+factors influencing how fact-checkers triage and select which claims to check.
+Our study investigates both the multidimensional nature of checkworthiness and
+effective tool support to assist fact-checkers in claim prioritization.
+Methodologically, we pursue Research through Design combined with mixed-method
+evaluation. We develop an AI-assisted claim prioritization prototype as a probe
+to explore how fact-checkers use multidimensional checkworthiness factors in
+claim prioritization, simultaneously probing fact-checker needs while also
+exploring the design space to meet those needs.
+ Our study with 16 professional fact-checkers investigates: 1) how
+participants assessed the relative importance of different checkworthy
+dimensions and applied different priorities in claim selection; 2) how they
+created customized GPT-based search filters and the corresponding benefits and
+limitations; and 3) their overall user experiences with our prototype. Our work
+makes a conceptual contribution between multidimensional IR relevance and
+fact-checking checkworthiness, with findings demonstrating the value of
+corresponding tooling support. Specifically, we uncovered a hierarchical
+prioritization strategy fact-checkers implicitly use, revealing an
+underexplored aspect of their workflow, with actionable design recommendations
+for improving claim triage across multi-dimensional checkworthiness and
+tailoring this process with LLM integration.
+
+
+ Sequential recommendations have drawn significant attention in modeling the
+user's historical behaviors to predict the next item. With the booming
+development of multimodal data (e.g., image, text) on internet platforms,
+sequential recommendation also benefits from the incorporation of multimodal
+data. Most methods introduce modal features of items as side information and
+simply concatenate them to learn unified user interests. Nevertheless, these
+methods encounter the limitation in modeling multimodal differences. We argue
+that user interests and item relationships vary across different modalities. To
+address this problem, we propose a novel Multimodal Difference Learning
+framework for Sequential Recommendation, MDSRec for brevity. Specifically, we
+first explore the differences in item relationships by constructing modal-aware
+item relation graphs with behavior signal to enhance item representations.
+Then, to capture the differences in user interests across modalities, we design
+an interest-centralized attention mechanism to independently model user sequence
+representations in different modalities. Finally, we fuse the user embeddings
+from multiple modalities to achieve accurate item recommendation. Experimental
+results on five real-world datasets demonstrate the superiority of MDSRec over
+state-of-the-art baselines and the efficacy of multimodal difference learning.
+
+
+
+
+
+
+
+ ☆ A Tutorial of Personalized Federated Recommender Systems: Recent
+ Advances and Future Directions
+
+
+
+
+
+
+
+
+ Jing Jiang, Chunxu Zhang, Honglei Zhang, Zhiwei Li, Yidong Li, Bo Yang
+
+
+ Personalization stands as the cornerstone of recommender systems (RecSys),
+striving to sift out redundant information and offer tailor-made services for
+users. However, the conventional cloud-based RecSys necessitates centralized
+data collection, posing significant risks of user privacy breaches. In response
+to this challenge, federated recommender systems (FedRecSys) have emerged,
+garnering considerable attention. FedRecSys enable users to retain personal
+data locally and solely share model parameters with low privacy sensitivity for
+global model training, significantly bolstering the system's privacy protection
+capabilities. Within the distributed learning framework, the pronounced non-iid
+nature of user behavior data introduces fresh hurdles to federated
+optimization. Meanwhile, the ability of federated learning to concurrently
+learn multiple models presents an opportunity for personalized user modeling.
+Consequently, the development of personalized FedRecSys (PFedRecSys) is crucial
+and holds substantial significance. This tutorial seeks to provide an
+introduction to PFedRecSys, encompassing (1) an overview of existing studies on
+PFedRecSys, (2) a comprehensive taxonomy of PFedRecSys spanning four pivotal
+research directions: client-side adaptation, server-side aggregation,
+communication efficiency, and privacy protection; and (3) exploration of open
+challenges and promising future directions in PFedRecSys. This tutorial aims to
+establish a robust foundation and spark new perspectives for subsequent
+exploration and practical implementations in the evolving realm of RecSys.
+
+
+
+ comment: A technical tutorial will appear at The Web Conference 2025
+
+ Personal interaction data can be effectively modeled as individual graphs for
+each user in recommender systems. Graph Neural Network (GNN)-based
+recommendation techniques have become extremely popular since they can capture
+high-order collaborative signals between users and items by aggregating the
+individual graphs into a global interactive graph. However, this centralized
+approach inherently poses a threat to user privacy and security. Recently,
+federated GNN-based recommendation techniques have emerged as a promising
+solution to mitigate privacy concerns. Nevertheless, current implementations
+either limit on-device training to isolated individual graphs or rely on an
+additional third-party server to access other individual graphs, which also
+increases the risk of privacy leakage. To address this
+challenge, we propose a Cluster-enhanced Federated Graph Neural Network
+framework for Recommendation, named CFedGR, which introduces high-order
+collaborative signals to augment individual graphs in a privacy preserving
+manner. Specifically, the server clusters the pretrained user representations
+to identify high-order collaborative signals. In addition, two efficient
+strategies are devised to reduce communication between devices and the server.
+Extensive experiments on three benchmark datasets validate the effectiveness of
+our proposed methods.
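+ The server-side step can be sketched as clustering the pretrained user
+representations so that cluster-level signals (e.g. centroids) can be shared
+back to clients; the use of k-means and the cluster count below are assumptions
+for illustration, not necessarily the paper's configuration.
+
+    import numpy as np
+    from sklearn.cluster import KMeans
+
+    def cluster_user_representations(user_emb: np.ndarray, n_clusters: int = 32):
+        # user_emb: (num_users, dim) matrix of pretrained user representations.
+        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(user_emb)
+        # Labels assign each user to a cluster; centroids summarize high-order
+        # collaborative signals that can augment each client's local graph.
+        return km.labels_, km.cluster_centers_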
+
+
+
+
+
+
+
+
+ Yuchen Hui, Fengran Mo, Milan Mao, Jian-Yun Nie
+
+
+ The Recherche Appliquee en Linguistique Informatique (RALI) team participated
+in the 2024 TREC Interactive Knowledge Assistance (iKAT) Track. In personalized
+conversational search, effectively capturing a user's complex search intent
+requires incorporating both contextual information and key elements from the
+user profile into query reformulation. The user profile often contains many
+relevant pieces, and each could potentially complement the user's information
+needs. It is difficult to disregard any of them, whereas introducing an
+excessive number of these pieces risks drifting from the original query and
+hinders search performance. This is a challenge we denote as
+over-personalization. To address this, we propose different strategies by
+fusing ranking lists generated from the queries with different levels of
+personalization.
+
+
+
+ comment: Work presented at NIST Text Retrieval Conference 2024.
+ https://www.nist.gov/news-events/events/2024/11/trec2024
+
+
+
+
+
+
+ ♻ ☆ Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study
+ on Two Materials Datasets
+
+
+
+
+
+
+
+
+ Satanu Ghosh, Neal R. Brodnik, Carolina Frey, Collin Holgate, Tresa M. Pollock, Samantha Daly, Samuel Carton
+
+
+ We explore the ability of GPT-4 to perform ad-hoc schema based information
+extraction from scientific literature. We assess specifically whether it can,
+with a basic prompting approach, replicate two existing material science
+datasets, given the manuscripts from which they were originally manually
+extracted. We employ materials scientists to perform a detailed manual error
+analysis to assess where the model struggles to faithfully extract the desired
+information, and draw on their insights to suggest research directions to
+address this broadly important task.
+
+
+
+ comment: Update on 12/11/2024: Added some relevant literature that we missed
+ in the previous version of the paper
+
+
+
+
+
+
+ ♻ ☆ CURE: A dataset for Clinical Understanding & Retrieval Evaluation
+
+
+
+
+
+
+
+
+ Nadia Sheikh, Anne-Laure Jousse, Daniel Buades Marcos, Akintunde Oladipo, Olivier Rousseau, Jimmy Lin
+
+
+ Given the dominance of dense retrievers that do not generalize well beyond
+their training dataset distributions, domain-specific test sets are essential
+in evaluating retrieval. There are few test datasets for retrieval systems
+intended for use by healthcare providers in a point-of-care setting. To fill
+this gap we have collaborated with medical professionals to create CURE, an
+ad-hoc retrieval test dataset for passage ranking with 2000 queries spanning 10
+medical domains, with one monolingual (English) and two cross-lingual
+(French/Spanish -> English) conditions. In this paper, we describe how CURE was
+constructed and provide baseline results to showcase its effectiveness as an
+evaluation tool. CURE is published with a Creative Commons Attribution Non
+Commercial 4.0 license and can be accessed on Hugging Face.
+
+
+
+
+
+
+
+ ♻ ☆ Spatial-Temporal Federated Learning for Lifelong Person
+ Re-identification on Distributed Edges
+
+
+ Data drift is a thorny challenge when deploying person re-identification
+(ReID) models into real-world devices, where the data distribution is
+significantly different from that of the training environment and keeps
+changing. To tackle this issue, we propose a federated spatial-temporal
+incremental learning approach, named FedSTIL, which leverages both lifelong
+learning and federated learning to continuously optimize models deployed on
+many distributed edge clients. Unlike previous efforts, FedSTIL aims to mine
+spatial-temporal correlations among the knowledge learnt from different edge
+clients. Specifically, the edge clients first periodically extract general
+representations of drifted data to optimize their local models. Then, the
+learnt knowledge from edge clients will be aggregated by centralized parameter
+server, where the knowledge will be selectively and attentively distilled from
+the spatial and temporal dimensions with carefully designed mechanisms. Finally,
+the distilled informative spatial-temporal knowledge will be sent back to
+correlated edge clients to further improve the recognition accuracy of each
+edge client with a lifelong learning method. Extensive experiments on a mixture
+of five real-world datasets demonstrate that our method outperforms others by
+nearly 4% in Rank-1 accuracy, while reducing communication cost by 62%. All
+implementation codes are publicly available on
+https://github.com/MSNLAB/Federated-Lifelong-Person-ReID
+
+
+
+
+
+
+
+ ♻ ☆ Representation Learning with Large Language Models for Recommendation WWW'24
+
+
+ Recommender systems have seen significant advancements with the influence of
+deep learning and graph neural networks, particularly in capturing complex
+user-item relationships. However, these graph-based recommenders heavily depend
+on ID-based data, potentially disregarding valuable textual information
+associated with users and items, resulting in less informative learned
+representations. Moreover, the utilization of implicit feedback data introduces
+potential noise and bias, posing challenges for the effectiveness of user
+preference learning. While the integration of large language models (LLMs) into
+traditional ID-based recommenders has gained attention, challenges such as
+scalability issues, limitations in text-only reliance, and prompt input
+constraints need to be addressed for effective implementation in practical
+recommender systems. To address these challenges, we propose a model-agnostic
+framework RLMRec that aims to enhance existing recommenders with LLM-empowered
+representation learning. It proposes a recommendation paradigm that integrates
+representation learning with LLMs to capture intricate semantic aspects of user
+behaviors and preferences. RLMRec incorporates auxiliary textual signals,
+develops a user/item profiling paradigm empowered by LLMs, and aligns the
+semantic space of LLMs with the representation space of collaborative
+relational signals through a cross-view alignment framework. This work further
+establishes a theoretical foundation demonstrating that incorporating textual
+signals through mutual information maximization enhances the quality of
+representations. In our evaluation, we integrate RLMRec with state-of-the-art
+recommender models, while also analyzing its efficiency and robustness to noisy
+data. Our implementation codes are available at
+https://github.com/HKUDS/RLMRec.
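+ A generic InfoNCE-style formulation of the cross-view alignment idea, i.e.
+maximizing agreement between LLM-derived semantic embeddings and collaborative
+embeddings of the same users/items, looks as follows; this is a standard
+contrastive sketch, not RLMRec's exact objective.
+
+    import numpy as np
+
+    def info_nce_alignment_loss(sem: np.ndarray, cf: np.ndarray,
+                                tau: float = 0.1) -> float:
+        # Rows of sem and cf are paired views of the same user/item.
+        sem = sem / np.linalg.norm(sem, axis=1, keepdims=True)
+        cf = cf / np.linalg.norm(cf, axis=1, keepdims=True)
+        logits = sem @ cf.T / tau                      # (N, N) similarities
+        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
+        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
+        return float(-np.mean(np.diag(log_prob)))      # positives on the diagonal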
+
+
+
+ comment: Published as a WWW'24 full paper
+
+
+
+
+
+
+
+
+
+ Multimedia 14
+
+
+
+
+
+ ☆ Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio
+ Generation
+
+
+ Text-to-audio (TTA) models are capable of generating diverse audio from textual
+prompts. However, most mainstream TTA models, which predominantly rely on
+Mel-spectrograms, still face challenges in producing audio with rich content.
+The intricate details and texture required in Mel-spectrograms for such audio
+often surpass the models' capacity, leading to outputs that are blurred or lack
+coherence. In this paper, we begin by investigating the critical role of U-Net
+in Mel-spectrogram generation. Our analysis shows that in U-Net structure,
+high-frequency components in skip-connections and the backbone influence
+texture and detail, while low-frequency components in the backbone are critical
+for the diffusion denoising process. We further propose ``Mel-Refine'', a
+plug-and-play approach that enhances Mel-spectrogram texture and detail by
+adjusting different component weights during inference. Our method requires no
+additional training or fine-tuning and is fully compatible with any
+diffusion-based TTA architecture. Experimental results show that our approach
+boosts performance metrics of the latest TTA model Tango2 by 25\%,
+demonstrating its effectiveness.
+
+
+
+
+
+
+
+ ☆ PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based
+ Talking Head Synthesis AAAI 2025
+
+
+ Talking head synthesis with arbitrary speech audio is a crucial challenge in
+the field of digital humans. Recently, methods based on radiance fields have
+received increasing attention due to their ability to synthesize high-fidelity
+and identity-consistent talking heads from just a few minutes of training
+video. However, due to the limited scale of the training data, these methods
+often exhibit poor performance in audio-lip synchronization and visual quality.
+In this paper, we propose a novel 3D Gaussian-based method called PointTalk,
+which constructs a static 3D Gaussian field of the head and deforms it in sync
+with the audio. It also incorporates an audio-driven dynamic lip point cloud as
+a critical component of the conditional information, thereby facilitating the
+effective synthesis of talking heads. Specifically, the initial step involves
+generating the corresponding lip point cloud from the audio signal and
+capturing its topological structure. The design of the dynamic difference
+encoder aims to capture the subtle nuances inherent in dynamic lip movements
+more effectively. Furthermore, we integrate the audio-point enhancement module,
+which not only ensures the synchronization of the audio signal with the
+corresponding lip point cloud within the feature space, but also facilitates a
+deeper understanding of the interrelations among cross-modal conditional
+features. Extensive experiments demonstrate that our method achieves superior
+fidelity and audio-lip synchronization in talking head synthesis compared
+to previous methods.
+
+
+
+ comment: 9 pages, accepted by AAAI 2025
+
+
+
+
+
+
+ ☆ A Dual-Module Denoising Approach with Curriculum Learning for Enhancing
+ Multimodal Aspect-Based Sentiment Analysis ACL
+
+
+
+
+
+
+
+
+ Nguyen Van Doan, Dat Tran Nguyen, Cam-Van Thi Nguyen
+
+
+ Multimodal Aspect-Based Sentiment Analysis (MABSA) combines text and images
+to perform sentiment analysis but often struggles with irrelevant or misleading
+visual information. Existing methodologies typically address either
+sentence-image denoising or aspect-image denoising but fail to comprehensively
+tackle both types of noise. To address these limitations, we propose DualDe, a
+novel approach comprising two distinct components: the Hybrid Curriculum
+Denoising Module (HCD) and the Aspect-Enhance Denoising Module (AED). The HCD
+module enhances sentence-image denoising by incorporating a flexible curriculum
+learning strategy that prioritizes training on clean data. Concurrently, the
+AED module mitigates aspect-image noise through an aspect-guided attention
+mechanism that filters out noisy visual regions unrelated to the specific
+aspects of interest. Our approach demonstrates effectiveness in addressing both
+sentence-image and aspect-image noise, as evidenced by experimental evaluations
+on benchmark datasets.
+
+
+
+ comment: Accepted at PACLIC 2024
+
+
+
+
+
+
+ ☆ POINTS1.5: Building a Vision-Language Model towards Real World
+ Applications
+
+
+
+
+
+
+
+
+ Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou
+
+
+ Vision-language models have made significant strides recently, demonstrating
+superior performance across a range of tasks, e.g. optical character
+recognition and complex diagram analysis. Building on this trend, we introduce
+a new vision-language model, POINTS1.5, designed to excel in various real-world
+applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several
+key innovations: i) We replace the original CLIP vision encoder, which had a
+fixed image resolution, with a NaViT-style vision encoder that supports native
+dynamic high resolution. This allows POINTS1.5 to process images of any
+resolution without needing to split them into tiles. ii) We add bilingual
+support to POINTS1.5, significantly enhancing its capability in Chinese. Due to
+the scarcity of open-source Chinese datasets for vision-language models, we
+collect numerous images from the Internet and annotate them using a combination
+of manual and automatic methods. iii) We propose a set of rigorous filtering
+methods for visual instruction tuning datasets. We comprehensively evaluate all
+these filtering methods, and choose the most effective ones to obtain the final
+visual instruction tuning set. Thanks to these innovations, POINTS1.5
+significantly outperforms POINTS1.0 and demonstrates strong performance across
+a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer
+than 4 billion tokens and ranks first on the OpenCompass leaderboard among
+models with fewer than 10 billion parameters.
+
+
+
+
+
+
+
+ ☆ A Unified Model For Voice and Accent Conversion In Speech and Singing
+ using Self-Supervised Learning and Feature Extraction
+
+
+ This paper presents a new voice conversion model capable of transforming both
+speaking and singing voices. It addresses key challenges in current systems,
+such as conveying emotions, managing pronunciation and accent changes, and
+reproducing non-verbal sounds. One of the model's standout features is its
+ability to perform accent conversion on hybrid voice samples that encompass
+both speech and singing, allowing it to change the speaker's accent while
+preserving the original content and prosody. The proposed model uses an
+encoder-decoder architecture: the encoder is based on HuBERT to process the
+speech's acoustic and linguistic content, while the HiFi-GAN decoder generates
+audio that matches the target speaker's voice. The model incorporates fundamental
+frequency (f0) features and singer embeddings to enhance performance while
+ensuring that pitch and tone accuracy and vocal identity are preserved during
+transformation. This approach improves how naturally and flexibly voice style
+can be transformed, showing strong potential for applications in voice dubbing,
+content creation, and technologies like Text-to-Speech (TTS) and Interactive
+Voice Response (IVR) systems.
+
+
+
+
+
+
+
+
+ Junjie Li, Ke Zhang, Shuai Wang, Kong Aik Lee, Haizhou Li
+
+
+ Audio-visual Target Speaker Extraction (AV-TSE) aims to isolate the speech of
+a specific target speaker from an audio mixture using time-synchronized visual
+cues. In real-world scenarios, visual cues are not always available due to
+various impairments, which undermines the stability of AV-TSE. Despite this
+challenge, humans can maintain attentional momentum over time, even when the
+target speaker is not visible. In this paper, we introduce the Momentum
+Multi-modal target Speaker Extraction (MoMuSE), which retains a speaker
+identity momentum in memory, enabling the model to continuously track the
+target speaker. Designed for real-time inference, MoMuSE extracts the current
+speech window with guidance from both visual cues and dynamically updated
+speaker momentum. Experimental results demonstrate that MoMuSE exhibits
+significant improvement, particularly in scenarios with severe impairment of
+visual cues.
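+
+The speaker identity momentum described above can be pictured as an exponential
+moving average of the target-speaker embedding that keeps guiding extraction
+when visual cues drop out. The update rule and decay factor below are an
+illustrative guess, not the paper's actual module.
+
+```python
+import numpy as np
+
+class SpeakerMomentum:
+    """Hypothetical EMA memory of the target speaker's identity embedding."""
+    def __init__(self, dim, beta=0.9):
+        self.state = np.zeros(dim)
+        self.beta = beta
+
+    def update(self, visual_emb=None):
+        # Fold in visual evidence when available; otherwise fall back on the
+        # stored momentum so extraction can continue without visual cues.
+        if visual_emb is not None:
+            self.state = self.beta * self.state + (1 - self.beta) * visual_emb
+        return self.state
+```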
+
+
+
+
+
+
+
+ ☆ SAFIRE: Segment Any Forged Image Region AAAI 2025
+
+
+ Most techniques approach the problem of image forgery localization as a
+binary segmentation task, training neural networks to label original areas as 0
+and forged areas as 1. In contrast, we tackle this issue from a more
+fundamental perspective by partitioning images according to their originating
+sources. To this end, we propose Segment Any Forged Image Region (SAFIRE),
+which solves forgery localization using point prompting. Each point on an image
+is used to segment the source region containing itself. This allows us to
+partition images into multiple source regions, a capability achieved for the
+first time. Additionally, rather than memorizing certain forgery traces, SAFIRE
+naturally focuses on uniform characteristics within each source region. This
+approach leads to more stable and effective learning, achieving superior
+performance in both the new task and the traditional binary forgery
+localization.
+
+
+
+ comment: Accepted at AAAI 2025. Code is available at:
+ https://github.com/mjkwon2021/SAFIRE
+
+
+
+
+
+
+
+ Jingjing Xie, Yuxin Zhang, Jun Peng, Zhaohong Huang, Liujuan Cao
+
+
+ Despite the efficiency of prompt learning in transferring vision-language
+models (VLMs) to downstream tasks, existing methods mainly learn the prompts in
+a coarse-grained manner where the learned prompt vectors are shared across all
+categories. Consequently, the tailored prompts often fail to discern
+class-specific visual concepts, thereby hindering the transferred performance
+for classes that share similar or complex visual attributes. Recent advances
+mitigate this challenge by leveraging external knowledge from Large Language
+Models (LLMs) to furnish class descriptions, yet incurring notable inference
+costs. In this paper, we introduce TextRefiner, a plug-and-play method to
+refine the text prompts of existing methods by leveraging the internal
+knowledge of VLMs. Particularly, TextRefiner builds a novel local cache module
+to encapsulate fine-grained visual concepts derived from local tokens within the
+image branch. By aggregating and aligning the cached visual descriptions with
+the original output of the text branch, TextRefiner can efficiently refine and
+enrich the learned prompts from existing methods without relying on any
+external expertise. For example, it improves the performance of CoOp from
+71.66% to 76.94% on 11 benchmarks, surpassing CoCoOp, which introduces instance-wise
+features for text prompts. Equipped with TextRefiner, PromptKD achieves
+state-of-the-art performance and is efficient in inference. Our code is released
+at https://github.com/xjjxmu/TextRefiner.
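+
+A hypothetical sketch of the local-cache idea: pool fine-grained local visual
+tokens from the image branch into a cache vector and use it to enrich the
+learned prompt feature from the text branch. The mean pooling, the mixing
+weight, and all names are assumptions, not TextRefiner's actual design.
+
+```python
+import numpy as np
+
+def refine_prompt(text_feat, local_tokens, alpha=0.1):
+    """text_feat: (d,) learned prompt feature; local_tokens: (n, d) visual tokens."""
+    cache = local_tokens.mean(axis=0)                # build the local cache
+    cache = cache / np.linalg.norm(cache)            # align scales
+    refined = text_feat + alpha * cache              # inject fine-grained detail
+    return refined / np.linalg.norm(refined)
+```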
+
+
+
+ comment: Accepted by AAAI2025
+
+
+
+
+
+
+ ☆ Collaborative Hybrid Propagator for Temporal Misalignment in
+ Audio-Visual Segmentation
+
+
+
+
+
+
+
+
+ Kexin Li, Zongxin Yang, Yi Yang, Jun Xiao
+
+
+ Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of
+sound-producing objects that accurately align with the corresponding audio.
+However, existing methods often face temporal misalignment, where audio cues
+and segmentation results are not temporally coordinated. Audio provides two
+critical pieces of information: i) target object-level details and ii) the
+timing of when objects start and stop producing sounds. Current methods focus
+more on object-level information but neglect the boundaries of audio semantic
+changes, leading to temporal misalignment. To address this issue, we propose a
+Collaborative Hybrid Propagator Framework (Co-Prop). This framework includes
+two main steps: Preliminary Audio Boundary Anchoring and Frame-by-Frame
+Audio-Insert Propagation. To anchor the audio boundary, we employ
+retrieval-assist prompts with Qwen large language models to identify control
+points of audio semantic changes. These control points split the audio into
+semantically consistent audio portions. After obtaining the control point
+lists, we propose the Audio Insertion Propagator to process each audio portion
+using a frame-by-frame audio insertion propagation and matching approach. We
+curated a compact dataset comprising diverse source conversion cases and
+devised a metric to assess alignment rates. Compared to traditional
+simultaneous processing methods, our approach reduces memory requirements and
+facilitates frame alignment. Experimental results demonstrate the effectiveness
+of our approach across three datasets and two backbones. Furthermore, our
+method can be integrated with existing AVVS approaches, offering plug-and-play
+functionality to enhance their performance.
+
+
+
+
+
+
+
+
+ Haowei Lou, Helen Paik, Pari Delir Haghighi, Wen Hu, Lina Yao
+
+
+ Diffusion-based Generative AI has gained significant attention for its superior
+performance over other generative techniques like Generative Adversarial
+Networks and Variational Autoencoders. While it has achieved notable
+advancements in fields such as computer vision and natural language processing,
+its application in speech generation remains under-explored. Mainstream
+Text-to-Speech systems primarily map outputs to Mel-Spectrograms in the
+spectral space, leading to high computational loads due to the sparsity of
+MelSpecs. To address these limitations, we propose LatentSpeech, a novel TTS
+generation approach utilizing latent diffusion models. By using latent
+embeddings as the intermediate representation, LatentSpeech reduces the target
+dimension to 5% of what is required for MelSpecs, simplifying the processing
+for the TTS encoder and vocoder and enabling efficient high-quality speech
+generation. This study marks the first integration of latent diffusion models
+in TTS, enhancing the accuracy and naturalness of generated speech.
+Experimental results on benchmark datasets demonstrate that LatentSpeech
+achieves a 25% improvement in Word Error Rate and a 24% improvement in Mel
+Cepstral Distortion compared to existing models, with further improvements
+rising to 49.5% and 26%, respectively, with additional training data. These
+findings highlight the potential of LatentSpeech to advance the
+state-of-the-art in TTS technology.
+
+
+
+
+
+
+
+ ☆ NeRF-NQA: No-Reference Quality Assessment for Scenes Generated by NeRF
+ and Neural View Synthesis Methods
+
+
+ Neural View Synthesis (NVS) has demonstrated efficacy in generating
+high-fidelity dense-viewpoint videos using an image set with sparse views.
+However, existing quality assessment methods like PSNR, SSIM, and LPIPS are not
+tailored for scenes with dense viewpoints synthesized by NVS and NeRF
+variants; thus, they often fall short in capturing the perceptual quality,
+including spatial and angular aspects of NVS-synthesized scenes. Furthermore,
+the lack of dense ground truth views makes the full reference quality
+assessment on NVS-synthesized scenes challenging. For instance, datasets such
+as LLFF provide only sparse images, insufficient for complete full-reference
+assessments. To address the issues above, we propose NeRF-NQA, the first
+no-reference quality assessment method for densely-observed scenes synthesized
+from the NVS and NeRF variants. NeRF-NQA employs a joint quality assessment
+strategy, integrating both viewwise and pointwise approaches, to evaluate the
+quality of NVS-generated scenes. The viewwise approach assesses the spatial
+quality of each individual synthesized view and the overall inter-views
+consistency, while the pointwise approach focuses on the angular qualities of
+scene surface points and their compound inter-point quality. Extensive
+evaluations are conducted to compare NeRF-NQA with 23 mainstream visual quality
+assessment methods (from fields of image, video, and light-field assessment).
+The results demonstrate NeRF-NQA outperforms the existing assessment methods
+significantly and it shows substantial superiority on assessing NVS-synthesized
+scenes without references. An implementation of this paper is available at
+https://github.com/VincentQQu/NeRF-NQA.
+
+
+
+
+
+
+
+ ♻ ☆ Compression of Higher Order Ambisonics with Multichannel RVQGAN
+
+
+ A multichannel extension to the RVQGAN neural coding method is proposed, and
+realized for data-driven compression of third-order Ambisonics audio. The
+input- and output layers of the generator and discriminator models are modified
+to accept multiple (16) channels without increasing the model bitrate. We also
+propose a loss function that accounts for spatial perception in immersive
+reproduction, as well as transfer learning from single-channel models. Listening test
+results with 7.1.4 immersive playback show that the proposed extension is
+suitable for coding scene-based, 16-channel Ambisonics content with good
+quality at 16 kbps when trained and tested on the EigenScape database. The
+model has potential applications for learning other types of content and
+multichannel formats.
+
+
+
+
+
+
+
+ ♻ ☆ LinVT: Empower Your Image-level Large Language Model to Understand
+ Videos
+
+
+ Large Language Models (LLMs) have been widely used in various tasks,
+motivating us to develop an LLM-based assistant for videos. Instead of training
+from scratch, we propose a module to transform arbitrary well-trained
+image-based LLMs into video-LLMs (after being trained on video data). To better
+adapt image-LLMs for processing videos, we introduce two design principles:
+linear transformation to preserve the original visual-language alignment and
+representative information condensation from redundant video content. Guided by
+these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT),
+which enables existing image-LLMs to understand videos. We benchmark LinVT with
+six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL,
+showcasing the high compatibility of LinVT. LinVT-based LLMs achieve
+state-of-the-art performance across various video benchmarks, illustrating the
+effectiveness of LinVT in multi-modal video understanding.
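+
+A very rough sketch of what a linear video tokenizer could look like: apply a
+single linear map to the per-frame visual tokens (identity-initialised so the
+original visual-language alignment is preserved) and condense the result to a
+small set of tokens. The saliency proxy and token budget are illustrative
+assumptions, not LinVT's actual design.
+
+```python
+import numpy as np
+
+def linear_video_tokenize(frame_tokens, W, n_out=32):
+    """frame_tokens: (T, N, d) image-LLM tokens for T frames; W: (d, d) linear map."""
+    tokens = frame_tokens.reshape(-1, frame_tokens.shape[-1]) @ W
+    scores = np.linalg.norm(tokens, axis=1)          # crude saliency proxy
+    keep = np.sort(np.argsort(scores)[-n_out:])      # keep most salient tokens, in order
+    return tokens[keep]
+```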
+
+
+
+
+
+
+
+ ♻ ☆ Preserving Speaker Information in Direct Speech-to-Speech Translation
+ with Non-Autoregressive Generation and Pretraining
+
+
+ Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one
+language into semantically equivalent speech in another language, facilitating
+communication between speakers of different languages. Speech-to-Discrete Unit
+Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses
+challenges such as error propagation across modules and slow inference speed
+often encountered in traditional cascade systems. However, as discrete units
+primarily capture content information, conventional S2UT methods fail to retain
+speaker-specific characteristics from the source. Our previous work, SC-S2UT,
+introduced a speaker adapter and a unit-to-mel structure, enabling the
+preservation of speaker information and non-autoregressive speech generation.
+Building on this foundation, this study proposes a self-supervised pretraining
+method to enrich the information extracted by both the speaker adapter and the
+unit-to-mel structure. Additionally, we investigate different feature fusion
+strategies to further improve the integration of speaker and content features.
+Experiments conducted on the CVSS-T dataset for ES-EN and FR-EN tasks
+demonstrate that our proposed method achieves a BLEU score improvement of 1.14
+compared to SC-S2UT, along with significant enhancements in MOS and speaker
+similarity. Furthermore, our approach achieves translation quality comparable
+to traditional S2UT, with only a minimal increase of 0.04s per utterance in
+inference time, while maintaining high speaker similarity. These results
+validate the effectiveness of the proposed method.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 20
+
+
+
+
+
+ ☆ Benchmark for Evaluation and Analysis of Citation Recommendation Models
+
+
+ Citation recommendation systems have attracted much academic interest,
+resulting in many studies and implementations. These systems help authors
+automatically generate proper citations by suggesting relevant references based
+on the text they have written. However, the methods used in citation
+recommendation differ across various studies and implementations. Some
+approaches focus on the overall content of papers, while others consider the
+context of the citation text. Additionally, the datasets used in these studies
+include different aspects of papers, such as metadata, citation context, or
+even the full text of the paper in various formats and structures. The
+diversity in models, datasets, and evaluation metrics makes it challenging to
+assess and compare citation recommendation methods effectively. To address this
+issue, a standardized dataset and evaluation metrics are needed to evaluate
+these models consistently. Therefore, we propose developing a benchmark
+specifically designed to analyze and compare citation recommendation models.
+This benchmark will evaluate the performance of models on different features of
+the citation context and provide a comprehensive evaluation of the models
+across all these tasks, presenting the results in a standardized way. By
+creating a benchmark with standardized evaluation metrics, researchers and
+practitioners in the field of citation recommendation will have a common
+platform to assess and compare different models. This will enable meaningful
+comparisons and help identify promising approaches for further research and
+development in the field.
+
+
+
+ comment: 10 pages
+
+
+
+
+
+
+ ☆ OmniDocBench: Benchmarking Diverse PDF Document Parsing with
+ Comprehensive Annotations
+
+
+
+
+
+
+
+
+ Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
+
+
+ Document content extraction is crucial in computer vision, especially for
+meeting the high-quality data needs of large language models (LLMs) and
+retrieval-augmented generation (RAG) technologies. However, current document
+parsing methods suffer from significant limitations in terms of diversity and
+comprehensive evaluation. To address these challenges, we introduce
+OmniDocBench, a novel multi-source benchmark designed to advance automated
+document content extraction. OmniDocBench includes a meticulously curated and
+annotated high-quality evaluation dataset comprising nine diverse document
+types, such as academic papers, textbooks, and slides, among others. Our benchmark
+provides a flexible and comprehensive evaluation framework with 19 layout
+category labels and 14 attribute labels, enabling multi-level assessments
+across entire datasets, individual modules, or specific data types. Using
+OmniDocBench, we perform an exhaustive comparative analysis of existing modular
+pipelines and multimodal end-to-end methods, highlighting their limitations in
+handling document diversity and ensuring fair evaluation. OmniDocBench
+establishes a robust, diverse, and fair evaluation standard for the document
+content extraction field, offering crucial insights for future advancements and
+fostering the development of document parsing technologies. The code and
+dataset are available at https://github.com/opendatalab/OmniDocBench.
+
+
+ Long-form document matching aims to judge the relevance between two documents
+and has been applied to various scenarios. Most existing works utilize
+hierarchical or long context models to process documents, which achieve coarse
+understanding but may ignore details. Some researchers construct a document
+view with similar sentences about aligned document subtopics to focus on
+detailed matching signals. However, a long document generally contains multiple
+subtopics. The matching signals are heterogeneous from multiple topics.
+Considering only the homologous aligned subtopics may not be representative
+enough and may cause biased modeling. In this paper, we introduce a new
+framework to model representative matching signals. First, we propose to
+capture various matching signals through subtopics of document pairs. Next, we
+construct multiple document views based on subtopics to cover heterogeneous and
+valuable details. However, existing spatial aggregation methods like attention,
+which integrate all these views simultaneously, struggle to integrate
+heterogeneous information. Instead, we propose temporal aggregation, which
+effectively integrates different views gradually as the training progresses.
+Experimental results show that our learning framework is effective on several
+document-matching tasks, including news duplication and legal case retrieval.
+
+
+
+
+
+
+
+
+ Ehsan Lotfi, Nikolay Banar, Nerses Yuzbashyan, Walter Daelemans
+
+
+ Statutory article retrieval plays a crucial role in making legal information
+more accessible to both laypeople and legal professionals. Multilingual
+countries like Belgium present unique challenges for retrieval models due to
+the need for handling legal issues in multiple languages. Building on the
+Belgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the
+bilingual version of this dataset, bBSARD. The dataset contains parallel
+Belgian statutory articles in both French and Dutch, along with legal questions
+from BSARD and their Dutch translation. Using bBSARD, we conduct extensive
+benchmarking of retrieval models available for Dutch and French. Our
+benchmarking setup includes lexical models, zero-shot dense models, and
+fine-tuned small foundation models. Our experiments show that BM25 remains a
+competitive baseline compared to many zero-shot dense models in both languages.
+We also observe that while proprietary models outperform open alternatives in
+the zero-shot setting, they can be matched or surpassed by fine-tuning small
+language-specific models. Our dataset and evaluation code are publicly
+available.
+
+
+
+ comment: To be presented at RegNLP-2025 (COLING)
+
+
+
+
+
+
+ ☆ RAG-based Question Answering over Heterogeneous Data and Text
+
+
+
+
+
+
+
+
+ Philipp Christmann, Gerhard Weikum
+
+
+ This article presents the QUASAR system for question answering over
+unstructured text, structured tables, and knowledge graphs, with unified
+treatment of all sources. The system adopts a RAG-based architecture, with a
+pipeline of evidence retrieval followed by answer generation, with the latter
+powered by a moderate-sized language model. Additionally and uniquely, QUASAR
+has components for question understanding, to derive crisper input for evidence
+retrieval, and for re-ranking and filtering the retrieved evidence before
+feeding the most informative pieces into the answer generation. Experiments
+with three different benchmarks demonstrate the high answering quality of our
+approach, being on par with or better than large GPT models, while keeping the
+computational cost and energy consumption orders of magnitude lower.
+
+
+
+ comment: IEEE Data Engineering Bulletin -- December 2024 Edition on RAG
+
+
+
+
+
+
+ ☆ RLT4Rec: Reinforcement Learning Transformer for User Cold Start and Item
+ Recommendation
+
+
+
+
+
+
+
+
+ Dilina Chandika Rajapakse, Douglas Leith
+
+
+ We introduce a new sequential transformer reinforcement learning architecture
+RLT4Rec and demonstrate that it achieves excellent performance in a range of
+item recommendation tasks. RLT4Rec uses a relatively simple transformer
+architecture that takes as input the user's (item,rating) history and outputs
+the next item to present to the user. Unlike existing RL approaches, there is
+no need to input a state observation or estimate. RLT4Rec handles new users and
+established users within the same consistent framework and automatically
+balances the "exploration" needed to discover the preferences of a new user
+with the "exploitation" that is more appropriate for established users.
+Training of RLT4Rec is robust and fast and is insensitive to the choice of
+training data, learning to generate "good" personalised sequences that the user
+tends to rate highly even when trained on "bad" data.
+
+
+
+
+
+
+
+ ☆ Temporal Linear Item-Item Model for Sequential Recommendation WSDM 2025
+
+
+
+
+
+
+
+
+ Seongmin Park, Mincheol Yoon, Minjin Choi, Jongwuk Lee
+
+
+ In sequential recommendation (SR), neural models have been actively explored
+due to their remarkable performance, but they suffer from inefficiency inherent
+to their complexity. On the other hand, linear SR models exhibit high
+efficiency and achieve competitive or superior accuracy compared to neural
+models. However, they solely deal with the sequential order of items (i.e.,
+sequential information) and overlook the actual timestamp (i.e., temporal
+information). This limits their ability to effectively capture various user
+preference drifts over time. To address this issue, we propose a novel linear SR model,
+named TemporAl LinEar item-item model (TALE), incorporating temporal
+information while preserving training/inference efficiency, with three key
+components. (i) Single-target augmentation concentrates on a single target
+item, enabling us to learn the temporal correlation for the target item. (ii)
+Time interval-aware weighting utilizes the actual timestamp to discern the item
+correlation depending on time intervals. (iii) Trend-aware normalization
+reflects the dynamic shift of item popularity over time. Our empirical studies
+show that TALE outperforms ten competing SR models by up to 18.71% gains on
+five benchmark datasets. It also exhibits remarkable effectiveness on
+long-tail items, with gains of up to 30.45%. The source code is available
+at https://github.com/psm1206/TALE.
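+
+The time interval-aware weighting described above can be illustrated with a toy
+item-item co-occurrence matrix whose entries decay with the gap between
+interaction timestamps. The exponential decay and time constant are assumptions
+for illustration, not TALE's actual closed-form solution.
+
+```python
+import numpy as np
+
+def time_decayed_cooccurrence(histories, n_items, tau=7 * 24 * 3600):
+    """histories: list of per-user [(item_id, unix_timestamp), ...] sequences."""
+    M = np.zeros((n_items, n_items))
+    for events in histories:
+        for i, (a, ta) in enumerate(events):
+            for b, tb in events[i + 1:]:
+                w = np.exp(-abs(tb - ta) / tau)      # recent pairs count more
+                M[a, b] += w
+                M[b, a] += w
+    return M
+```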
+
+
+
+ comment: Accepted by WSDM 2025
+
+
+
+
+
+
+ ☆ IntellectSeeker: A Personalized Literature Management System with the
+ Probabilistic Model and Large Language Model
+
+
+ Faced with the burgeoning volume of academic literature, researchers often
+struggle with uncertain article quality and mismatched search terms when using
+traditional academic search engines. We introduce IntellectSeeker, an innovative and
+personalized intelligent academic literature management platform to address
+these challenges. This platform integrates a Large Language Model (LLM)--based
+semantic enhancement bot with a sophisticated probability model to personalize
+and streamline literature searches. We adopted the GPT-3.5-turbo model to
+transform everyday language into professional academic terms across various
+scenarios using multiple rounds of few-shot learning. This adaptation mainly
+benefits academic newcomers, effectively bridging the gap between general
+inquiries and academic terminology. The probabilistic model intelligently
+filters academic articles to align closely with the specific interests of
+users, which are derived from explicit needs and behavioral patterns. Moreover,
+IntellectSeeker incorporates an advanced recommendation system and text
+compression tools. These features enable intelligent article recommendations
+based on user interactions and present search results through concise one-line
+summaries and innovative word cloud visualizations, significantly enhancing
+research efficiency and user experience. IntellectSeeker offers academic
+researchers a highly customizable literature management solution with
+exceptional search precision and matching capabilities. The code can be found
+here: https://github.com/LuckyBian/ISY5001
+
+
+
+
+
+
+
+
+ Julien Monteil, Volodymyr Vaskovych, Wentao Lu, Anirban Majumder, Anton van den Hengel
+
+
+ For many recommender systems, the primary data source is a historical record
+of user clicks. The associated click matrix is often very sparse, as the number
+of users x products can be far larger than the number of clicks. Such sparsity
+is accentuated in cold-start settings, which makes the efficient use of
+metadata information of paramount importance. In this work, we propose a simple
+approach to address cold-start recommendations by leveraging content metadata,
+Metadata Alignment for cold-start Recommendation. We show that this approach
+can readily augment existing matrix factorization and autoencoder approaches,
+enabling a smooth transition to top performing algorithms in warmer set-ups.
+Our experimental results indicate three separate contributions: first, we show
+that our proposed framework largely beats SOTA results on 4 cold-start datasets
+with different sparsity and scale characteristics, with gains ranging from
++8.4% to +53.8% on reported ranking metrics; second, we provide an ablation
+study on the utility of semantic features, and prove that the additional gain
+obtained by leveraging such features ranges between +46.8% and +105.5%; and
+third, our approach is by construction highly competitive in warm set-ups, and
+we propose a closed-form solution outperformed by SOTA results by only 0.8% on
+average.
+
+
+
+
+
+
+
+ ♻ ☆ Beyond Retrieval: Generating Narratives in Conversational Recommender
+ Systems
+
+
+
+
+
+
+
+
+ Krishna Sayana, Raghavendra Vasudeva, Yuri Vasilevski, Kun Su, Liam Hebert, James Pine, Hubert Pham, Ambarish Jash, Sukhdeep Sodhi
+
+
+ The recent advances in Large Language Models' generation and reasoning
+capabilities present an opportunity to develop truly conversational
+recommendation systems. However, effectively integrating recommender system
+knowledge into LLMs for natural language generation which is tailored towards
+recommendation tasks remains a challenge. This paper addresses this challenge
+by making two key contributions.
+ First, we introduce a new dataset (REGEN) for natural language generation
+tasks in conversational recommendations. REGEN (Reviews Enhanced with
+GEnerative Narratives) extends the Amazon Product Reviews dataset with rich
+user narratives, including personalized explanations of product preferences,
+product endorsements for recommended items, and summaries of user purchase
+history. REGEN is made publicly available to facilitate further research.
+Furthermore, we establish benchmarks using well-known generative metrics, and
+perform an automated evaluation of the new dataset using a rater LLM. Second,
+the paper introduces a fusion architecture (CF model with an LLM) which serves
+as a baseline for REGEN. To the best of our knowledge, this represents the first
+attempt to analyze the capabilities of LLMs in understanding recommender
+signals and generating rich narratives. We demonstrate that LLMs can
+effectively learn from simple fusion architectures utilizing interaction-based
+CF embeddings, and this can be further enhanced using the metadata and
+personalization data associated with items. Our experiments show that combining
+CF and content embeddings leads to improvements of 4-12% in key language
+metrics compared to using either type of embedding individually. We also
+provide an analysis to interpret how CF and content embeddings contribute to
+this new generative task.
+
+
+
+
+
+
+
+
+ Sebastian Bruch, Aditya Krishnan, Franco Maria Nardini
+
+
+ Clustering-based nearest neighbor search is an effective method in which
+points are partitioned into geometric shards to form an index, with only a few
+shards searched during query processing to find a set of top-$k$ vectors. Even
+though the search efficacy is heavily influenced by the algorithm that
+identifies the shards to probe, it has received little attention in the
+literature. This work bridges that gap by studying routing in clustering-based
+maximum inner product search. We unpack existing routers and notice the
+surprising contribution of optimism. We then take a page from the sequential
+decision making literature and formalize that insight following the principle
+of ``optimism in the face of uncertainty.'' In particular, we present a
+framework that incorporates the moments of the distribution of inner products
+within each shard to estimate the maximum inner product. We then present an
+instance of our algorithm that uses only the first two moments to reach the
+same accuracy as state-of-the-art routers such as ScaNN by probing up to $50\%$
+fewer points on benchmark datasets. Our algorithm is also space-efficient: we
+design a sketch of the second moment whose size is independent of the number of
+points and requires $\mathcal{O}(1)$ vectors per shard.
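+
+The "optimism in the face of uncertainty" routing described above can be
+illustrated by scoring each shard with an optimistic estimate of its maximum
+inner product built from the first two moments of the query-point inner
+products. The optimism constant and the use of full covariance matrices are
+illustrative assumptions, not the paper's sketch-based implementation.
+
+```python
+import numpy as np
+
+def optimistic_route(query, shard_means, shard_covs, n_probe=4, c=1.0):
+    """query: (d,); shard_means: (K, d); shard_covs: (K, d, d)."""
+    means = shard_means @ query                              # E[q.x] per shard
+    stds = np.sqrt(np.einsum('i,kij,j->k', query, shard_covs, query))
+    scores = means + c * stds                                # optimism bonus
+    return np.argsort(scores)[::-1][:n_probe]                # shards to probe
+```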
+
+
+
+
+
+
+
+ ♻ ☆ CFaiRLLM: Consumer Fairness Evaluation in Large-Language Model
+ Recommender System
+
+
+ This work takes a critical stance on previous studies concerning fairness
+evaluation in Large Language Model (LLM)-based recommender systems, which have
+primarily assessed consumer fairness by comparing recommendation lists
+generated with and without sensitive user attributes. Such approaches
+implicitly treat discrepancies in recommended items as biases, overlooking
+whether these changes might stem from genuine personalization aligned with true
+preferences of users. Moreover, these earlier studies typically address single
+sensitive attributes in isolation, neglecting the complex interplay of
+intersectional identities. In response to these shortcomings, we introduce
+CFaiRLLM, an enhanced evaluation framework that not only incorporates true
+preference alignment but also rigorously examines intersectional fairness by
+considering overlapping sensitive attributes. Additionally, CFaiRLLM introduces
+diverse user profile sampling strategies (random, top-rated, and
+recency-focused) to better understand the impact of profile generation fed to
+LLMs in light of inherent token limitations in these systems. Given that
+fairness depends on accurately understanding users' tastes and preferences,
+these strategies provide a more realistic assessment of fairness within
+RecLLMs.
+ The results demonstrated that true preference alignment offers a more
+personalized and fair assessment compared to similarity-based measures,
+revealing significant disparities when sensitive and intersectional attributes
+are incorporated. Notably, our study finds that intersectional attributes
+amplify fairness gaps more prominently, especially in less structured domains
+such as music recommendations in LastFM.
+
+
+
+
+
+
+
+ ♻ ☆ S+t-SNE -- Bringing Dimensionality Reduction to Data Streams
+
+
+
+
+
+
+
+
+ Pedro C. Vieira, João P. Montrezol, João T. Vieira, João Gama
+
+
+ We present S+t-SNE, an adaptation of the t-SNE algorithm designed to handle
+infinite data streams. The core idea behind S+t-SNE is to update the t-SNE
+embedding incrementally as new data arrives, ensuring scalability and
+adaptability to handle streaming scenarios. By selecting the most important
+points at each step, the algorithm ensures scalability while keeping
+informative visualisations. By employing a blind method for drift management,
+the algorithm adjusts the embedding space, which facilitates the visualisation
+of evolving data dynamics. Our experimental evaluations demonstrate the
+effectiveness and efficiency of S+t-SNE, whilst highlighting its ability to
+capture patterns in a streaming scenario. We hope our approach offers
+researchers and practitioners a real-time tool for understanding and
+interpreting high-dimensional data.
+
+
+
+ comment: This preprint has undergone peer review but does not have any
+ post-submission improvements or corrections. Full version after peer-review
+ and post-acceptance improvements was presented at IDA2024
+ (https://ida2024.org/)
+
+
+
+
+
+
+
+ Yang Li, Kangbo Liu, Yaoxin Wu, Zhaoxuan Wang, Erik Cambria, Xiaoxu Wang
+
+
+ Bundle recommendations strive to offer users a set of items as a package
+named bundle, enhancing convenience and contributing to the seller's revenue.
+While previous approaches have demonstrated notable performance, we argue that
+they may compromise the ternary relationship among users, items, and bundles.
+This compromise can result in information loss, ultimately impacting the
+overall model performance. To address this gap, we develop a unified model for
+bundle recommendation, termed hypergraph-enhanced dual convolutional neural
+network (HED). Our approach is characterized by two key aspects. Firstly, we
+construct a complete hypergraph to capture interaction dynamics among users,
+items, and bundles. Secondly, we incorporate U-B interaction information to
+enhance the information representation derived from users and bundle embedding
+vectors. Extensive experimental results on the Youshu and Netease datasets have
+demonstrated that HED surpasses state-of-the-art baselines, proving its
+effectiveness. In addition, various ablation studies and sensitivity analyses
+revealed the working mechanism and further confirmed its effectiveness. Code and datasets
+are available at https://github.com/AAI-Lab/HED.
+
+
+
+
+
+
+
+ ♻ ☆ Causal Deconfounding via Confounder Disentanglement for Dual-Target
+ Cross-Domain Recommendation
+
+
+
+
+
+
+
+
+ Jiajie Zhu, Yan Wang, Feng Zhu, Zhu Sun
+
+
+ In recent years, dual-target Cross-Domain Recommendation (CDR) has been
+proposed to capture comprehensive user preferences in order to ultimately
+enhance the recommendation accuracy in both data-richer and data-sparser
+domains simultaneously. However, in addition to users' true preferences, the
+user-item interactions might also be affected by confounders (e.g., free
+shipping, sales promotion). As a result, dual-target CDR has to meet two
+challenges: (1) how to effectively decouple observed confounders, including
+single-domain confounders and cross-domain confounders, and (2) how to preserve
+the positive effects of observed confounders on predicted interactions, while
+eliminating their negative effects on capturing comprehensive user preferences.
+To address the above two challenges, we propose a Causal Deconfounding
+framework via Confounder Disentanglement for dual-target Cross-Domain
+Recommendation, called CD2CDR. In CD2CDR, we first propose a confounder
+disentanglement module to effectively decouple observed single-domain and
+cross-domain confounders. We then propose a causal deconfounding module to
+preserve the positive effects of such observed confounders and eliminate their
+negative effects via backdoor adjustment, thereby enhancing the recommendation
+accuracy in each domain. Extensive experiments conducted on five real-world
+datasets demonstrate that CD2CDR significantly outperforms the state-of-the-art
+methods.
+
+
+
+
+
+
+
+
+ Li Shi, Houjiang Liu, Yian Wong, Utkarsh Mujumdar, Dan Zhang, Jacek Gwizdka, Matthew Lease
+
+
+ Large language models (LLMs) are enabling designers to give life to exciting
+new user experiences for information access. In this work, we present a system
+that generates LLM personas to debate a topic of interest from different
+perspectives. How might information seekers use and benefit from such a system?
+Can centering information access around diverse viewpoints help to mitigate
+thorny challenges like confirmation bias in which information seekers
+over-trust search results matching existing beliefs? How do potential biases
+and hallucinations in LLMs play out alongside human users who are also fallible
+and possibly biased?
+ Our study exposes participants to multiple viewpoints on controversial issues
+via a mixed-methods, within-subjects study. We use eye-tracking metrics to
+quantitatively assess cognitive engagement alongside qualitative feedback.
+Compared to a baseline search system, we see more creative interactions and
+diverse information-seeking with our multi-persona debate system, which more
+effectively reduces user confirmation bias and conviction toward their initial
+beliefs. Overall, our study contributes to the emerging design space of
+LLM-based information access systems, specifically investigating the potential
+of simulated personas to promote greater exposure to information diversity,
+emulate collective intelligence, and mitigate bias in information seeking.
+
+
+
+
+
+
+
+ ♻ ☆ LEARN: Knowledge Adaptation from Large Language Model to Recommendation
+ for Practical Industrial Application AAAI 2025
+
+
+
+
+
+
+
+
+ Jian Jia, Yipei Wang, Yan Li, Honggang Chen, Xuehan Bai, Zhaocheng Liu, Jian Liang, Quan Chen, Han Li, Peng Jiang, Kun Gai
+
+
+ Contemporary recommendation systems predominantly rely on ID embedding to
+capture latent associations among users and items. However, this approach
+overlooks the wealth of semantic information embedded within textual
+descriptions of items, leading to suboptimal performance and poor
+generalization. Leveraging the capability of large language models to
+comprehend and reason about textual content presents a promising avenue for
+advancing recommendation systems. To achieve this, we propose an Llm-driven
+knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world
+knowledge with collaborative knowledge. We address computational complexity
+concerns by utilizing pretrained LLMs as item encoders and freezing LLM
+parameters to avoid catastrophic forgetting and preserve open-world knowledge.
+To bridge the gap between the open-world and collaborative domains, we design a
+twin-tower structure supervised by the recommendation task and tailored for
+practical industrial application. Through experiments on the real large-scale
+industrial dataset and online A/B tests, we demonstrate the efficacy of our
+approach in industrial applications. We also achieve state-of-the-art performance
+on six Amazon Review datasets to verify the superiority of our method.
+
+
+
+ comment: Accepted by AAAI 2025
+
+
+
+
+
+
+ ♻ ☆ LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
+
+
+
+
+
+
+
+
+ Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, Yiqun Liu
+
+
+ The rapid advancement of Large Language Models (LLMs) has driven their
+expanding application across various fields. One of the most promising
+applications is their role as evaluators based on natural language responses,
+referred to as ''LLMs-as-judges''. This framework has attracted growing
+attention from both academia and industry due to their excellent effectiveness,
+ability to generalize across tasks, and interpretability in the form of natural
+language. This paper presents a comprehensive survey of the LLMs-as-judges
+paradigm from five key perspectives: Functionality, Methodology, Applications,
+Meta-evaluation, and Limitations. We begin by providing a systematic definition
+of LLMs-as-Judges and introduce their functionality (Why use LLM judges?). Then
+we address methodology to construct an evaluation system with LLMs (How to use
+LLM judges?). Additionally, we investigate the potential domains for their
+application (Where to use LLM judges?) and discuss methods for evaluating them
+in various contexts (How to evaluate LLM judges?). Finally, we provide a
+detailed analysis of the limitations of LLM judges and discuss potential future
+directions. Through a structured and comprehensive analysis, we aim to
+provide insights on the development and application of LLMs-as-judges in both
+research and practice. We will continue to maintain the relevant resource list
+at https://github.com/CSHaitao/Awesome-LLMs-as-Judges.
+
+
+
+ comment: 60 pages, comprehensive and continuously updated
+
+ Federated Collaborative Filtering (FedCF) is an emerging field focused on
+developing a new recommendation framework with preserving privacy in a
+federated setting. Existing FedCF methods typically combine distributed
+Collaborative Filtering (CF) algorithms with privacy-preserving mechanisms, and
+then preserve personalized information into a user embedding vector. However,
+the user embedding is usually insufficient to preserve the rich information of
+the fine-grained personalization across heterogeneous clients. This paper
+proposes a novel personalized FedCF method by preserving users' personalized
+information into a latent variable and a neural model simultaneously.
+Specifically, we decompose the modeling of user knowledge into two encoders,
+each designed to capture shared knowledge and personalized knowledge
+separately. A personalized gating network is then applied to balance
+personalization and generalization between the global and local encoders.
+Moreover, to effectively train the proposed framework, we model the CF problem
+as a specialized Variational AutoEncoder (VAE) task by integrating user
+interaction vector reconstruction with missing value prediction. The decoder is
+trained to reconstruct the implicit feedback from items the user has interacted
+with, while also predicting items the user might be interested in but has not
+yet interacted with. Experimental results on benchmark datasets demonstrate
+that the proposed method outperforms baseline methods. Our code is available at
+https://github.com/mtics/FedDAE.
+
+
+ Information Retrieval (IR) systems used in search and recommendation
+platforms frequently employ Learning-to-Rank (LTR) models to rank items in
+response to user queries. These models heavily rely on features derived from
+user interactions, such as clicks and engagement data. This dependence
+introduces cold start issues for items lacking user engagement and poses
+challenges in adapting to non-stationary shifts in user behavior over time. We
+address both challenges holistically as an online learning problem and propose
+BayesCNS, a Bayesian approach designed to handle cold start and non-stationary
+distribution shifts in search systems at scale. BayesCNS achieves this by
+estimating prior distributions for user-item interactions, which are
+continuously updated with new user interactions gathered online. This online
+learning procedure is guided by a ranker model, enabling efficient exploration
+of relevant items using contextual information provided by the ranker. We
+successfully deployed BayesCNS in a large-scale search system and demonstrated
+its efficacy through comprehensive offline and online experiments. Notably, an
+online A/B experiment showed a 10.60% increase in new item interactions and a
+1.05% improvement in overall success metrics over the existing production
+baseline.
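+
+A simple way to picture the prior-updating loop described above is a
+Beta-Bernoulli prior on a new item's interaction rate that is updated as online
+feedback arrives and sampled for optimistic exploration alongside the ranker's
+score. The prior values and sampling scheme are assumptions for illustration,
+not BayesCNS's actual model.
+
+```python
+import numpy as np
+
+class ColdStartPrior:
+    """Hypothetical Beta-Bernoulli prior on an item's interaction rate."""
+    def __init__(self, alpha0=1.0, beta0=20.0):
+        self.alpha, self.beta = alpha0, beta0
+
+    def update(self, impressions, clicks):
+        self.alpha += clicks
+        self.beta += impressions - clicks
+
+    def sample(self, rng):
+        return rng.beta(self.alpha, self.beta)       # stochastic exploration score
+
+rng = np.random.default_rng(0)
+prior = ColdStartPrior()
+prior.update(impressions=100, clicks=3)
+print(prior.sample(rng))                             # blend with the ranker's score
+```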
+
+
+
+
+
+
+
+
+
+
+ Multimedia 13
+
+
+
+
+
+ ☆ Frechet Music Distance: A Metric For Generative Symbolic Music
+ Evaluation
+
+
+
+
+
+
+
+
+ Jan Retkowski, Jakub Stępniak, Mateusz Modrzejewski
+
+
+ In this paper we introduce the Frechet Music Distance (FMD), a novel
+evaluation metric for generative symbolic music models, inspired by the Frechet
+Inception Distance (FID) in computer vision and Frechet Audio Distance (FAD) in
+generative audio. FMD calculates the distance between distributions of
+reference and generated symbolic music embeddings, capturing abstract musical
+features. We validate FMD across several datasets and models. Results indicate
+that FMD effectively differentiates model quality, providing a domain-specific
+metric for evaluating symbolic music generation, and establishing a
+reproducible standard for future research in symbolic music modeling.
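+
+Since FMD is inspired by FID and FAD, the distance between the two fitted
+Gaussians can be computed with the same closed form once reference and
+generated embeddings are available; the choice of symbolic-music embedding
+model is left out and is an assumption of this sketch.
+
+```python
+import numpy as np
+from scipy.linalg import sqrtm
+
+def frechet_distance(ref_emb, gen_emb):
+    """ref_emb, gen_emb: (n, d) embedding matrices of reference/generated music."""
+    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
+    cov_r = np.cov(ref_emb, rowvar=False)
+    cov_g = np.cov(gen_emb, rowvar=False)
+    covmean = sqrtm(cov_r @ cov_g)
+    if np.iscomplexobj(covmean):
+        covmean = covmean.real                       # drop numerical noise
+    diff = mu_r - mu_g
+    return float(diff @ diff + np.trace(cov_r + cov_g - 2 * covmean))
+```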
+
+
+
+
+
+
+
+
+ Andrew Hamara, Benjamin Kilpatrick, Alex Baratta, Brendon Kofink, Andrew C. Freeman
+
+
+ Recently, we have witnessed the rise of novel ``event-based'' camera sensors
+for high-speed, low-power video capture. Rather than recording discrete image
+frames, these sensors output asynchronous ``event'' tuples with microsecond
+precision, only when the brightness change of a given pixel exceeds a certain
+threshold. Although these sensors have enabled compelling new computer vision
+applications, these applications often require expensive, power-hungry GPU
+systems, rendering them incompatible for deployment on the low-power devices
+for which event cameras are optimized. Whereas receiver-driven rate adaptation
+is a crucial feature of modern video streaming solutions, this topic is
+underexplored in the realm of event-based vision systems. On a real-world event
+camera dataset, we first demonstrate that a state-of-the-art object detection
+application is resilient to dramatic data loss, and that this loss may be
+weighted towards the end of each temporal window. We then propose a scalable
+streaming method for event-based data based on Media Over QUIC, prioritizing
+object detection performance and low latency. The application server can
+receive complementary event data across several streams simultaneously, and
+drop streams as needed to maintain a certain latency. With a latency target of
+5 ms for end-to-end transmission across a small network, we observe an average
+reduction in detection mAP as low as 0.36. With a more relaxed latency target
+of 50 ms, we observe an average mAP reduction as low as 0.19.
+
+
+
+
+
+
+
+ ☆ STIV: Scalable Text and Image Conditioned Video Generation
+
+
+ The field of video generation has made remarkable advancements, yet there
+remains a pressing need for a clear, systematic recipe that can guide the
+development of robust and scalable models. In this work, we present a
+comprehensive study that systematically explores the interplay of model
+architectures, training recipes, and data curation strategies, culminating in a
+simple and scalable text-image-conditioned video generation method, named STIV.
+Our framework integrates image condition into a Diffusion Transformer (DiT)
+through frame replacement, while incorporating text conditioning via a joint
+image-text conditional classifier-free guidance. This design enables STIV to
+perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks
+simultaneously. Additionally, STIV can be easily extended to various
+applications, such as video prediction, frame interpolation, multi-view
+generation, and long video generation, etc. With comprehensive ablation studies
+on T2I, T2V, and TI2V, STIV demonstrates strong performance, despite its simple
+design. An 8.7B model with 512 resolution achieves 83.1 on VBench T2V,
+surpassing both leading open and closed-source models like CogVideoX-5B, Pika,
+Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result
+of 90.1 on VBench I2V task at 512 resolution. By providing a transparent and
+extensible recipe for building cutting-edge video generation models, we aim to
+empower future research and accelerate progress toward more versatile and
+reliable video generation solutions.
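+
+The frame-replacement conditioning mentioned above can be pictured as
+overwriting the first latent frame of the noisy video latents with the clean
+image latent before each denoising step, so the DiT always sees the image
+condition. The tensor layout and the choice to leave the condition un-noised
+are assumptions of this sketch, not STIV's exact recipe.
+
+```python
+import numpy as np
+
+def condition_by_frame_replacement(noisy_latents, image_latent):
+    """noisy_latents: (T, C, H, W) video latents; image_latent: (C, H, W)."""
+    conditioned = noisy_latents.copy()
+    conditioned[0] = image_latent                    # first frame carries the condition
+    return conditioned
+```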
+
+
+ We propose a novel self-supervised approach for learning audio and visual
+representations from unlabeled videos, based on their correspondence. The
+approach uses an attention mechanism to learn the relative importance of
+convolutional features extracted at different resolutions from the audio and
+visual streams and uses the attention features to encode the audio and visual
+input based on their correspondence. We evaluated the representations learned
+by the model to classify audio-visual correlation as well as to recommend sound
+effects for visual scenes. Our results show that the representations generated
+by the attention model improves the correlation accuracy compared to the
+baseline, by 18% and the recommendation accuracy by 10% for VGG-Sound, which is
+a public video dataset. Additionally, audio-visual representations learned by
+training the attention model with cross-modal contrastive learning further
+improves the recommendation performance, based on our evaluation using
+VGG-Sound and a more challenging dataset consisting of gameplay video
+recordings.
+
+
+
+ comment: Published in the Proceedings of the International Symposium on Visual
+ Computing, 2021 https://dl.acm.org/doi/10.1007/978-3-030-90436-4_10
+
+
+
+
+
+
+ ☆ Multimodal Sentiment Analysis Based on Causal Reasoning
+
+
+
+
+
+
+
+
+ Fuhai Chen, Pengpeng Huang, Xuri Ge, Jie Huang, Zishuo Bao
+
+
+ With the rapid development of multimedia, the shift from unimodal textual
+sentiment analysis to multimodal image-text sentiment analysis has obtained
+academic and industrial attention in recent years. However, multimodal
+sentiment analysis is affected by unimodal data bias, e.g., text sentiment is
+misleading due to explicit sentiment semantics, leading to low accuracy in the
+final sentiment classification. In this paper, we propose a novel
+CounterFactual Multimodal Sentiment Analysis framework (CF-MSA) using causal
+counterfactual inference to construct multimodal sentiment causal inference.
+CF-MSA mitigates the direct effect from unimodal bias and ensures heterogeneity
+across modalities by differentiating the treatment variables between
+modalities. In addition, considering the information complementarity and bias
+differences between modalities, we propose a new optimisation objective to
+effectively integrate different modalities and reduce the inherent bias from
+each modality. Experimental results on two public datasets, MVSA-Single and
+MVSA-Multiple, demonstrate that the proposed CF-MSA has superior debiasing
+capability and achieves new state-of-the-art performances. We will release the
+code and datasets to facilitate future research.
+
+
+
+
+
+
+
+ ☆ Reducing Traffic Wastage in Video Streaming via Bandwidth-Efficient
+ Bitrate Adaptation
+
+
+ Bitrate adaptation (also known as ABR) is a crucial technique to improve the
+quality of experience (QoE) for video streaming applications. However, existing
+ABR algorithms suffer from severe traffic wastage, which refers to the traffic
+cost of downloading the video segments that users do not finally consume, for
+example, due to early departure or video skipping. In this paper, we carefully
+formulate the dynamics of buffered data volume (BDV), a strongly correlated
+indicator of traffic wastage, which, to the best of our knowledge, is the first
+time to rigorously clarify the effect of downloading plans on potential
+wastage. To reduce wastage while keeping a high QoE, we present a
+bandwidth-efficient bitrate adaptation algorithm (named BE-ABR), achieving
+consistently low BDV without distinct QoE losses. Specifically, we design a
+precise, time-aware transmission delay prediction model over the Transformer
+architecture, and develop a fine-grained buffer control scheme. Through
+extensive experiments conducted on emulated and real network environments
+including WiFi, 4G, and 5G, we demonstrate that BE-ABR performs well in both
+QoE and bandwidth savings, enabling a 60.87% wastage reduction and a
+comparable, or even better, QoE compared to the state-of-the-art methods.
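+
+The buffered data volume (BDV) dynamics described above boil down to a simple
+balance: downloads add bytes, playback drains them, and capping the buffer
+bounds how much prefetched data can be wasted if the user leaves early. The cap
+rule below is an illustrative simplification, not BE-ABR's controller.
+
+```python
+def step_bdv(bdv_bytes, download_rate, consume_rate, dt, cap_bytes):
+    """One step of buffered-data-volume dynamics (rates in bytes/s, dt in s)."""
+    if bdv_bytes >= cap_bytes:
+        download_rate = 0.0                          # stop prefetching past the cap
+    bdv_bytes += (download_rate - consume_rate) * dt
+    return max(bdv_bytes, 0.0)
+```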
+
+
+
+
+
+
+
+ ☆ PTSBench: A Comprehensive Post-Training Sparsity Benchmark Towards
+ Algorithms and Models
+
+
+ With the increased attention to model efficiency, post-training sparsity
+(PTS) has become more and more prevalent because of its effectiveness and
+efficiency. However, there remain questions on better practice of PTS
+algorithms and the sparsification ability of models, which hinders the further
+development of this area. Therefore, a benchmark to comprehensively investigate
+the issues above is urgently needed. In this paper, we propose the first
+comprehensive post-training sparsity benchmark called PTSBench towards
+algorithms and models. We benchmark 10+ PTS general-pluggable fine-grained
+techniques on 3 typical tasks using over 40 off-the-shelf model architectures.
+Through extensive experiments and analyses, we obtain valuable conclusions and
+provide several insights from both algorithms and model aspects. Our PTSBench
+can provide (1) new observations for a better understanding of the PTS
+algorithms, (2) in-depth and comprehensive evaluations for the sparsification
+ability of models, and (3) a well-structured and easy-to-integrate open-source
+framework. We hope this work will provide illuminating conclusions and advice
+for future studies of post-training sparsity methods and
+sparsification-friendly model design. The code for our PTSBench is released at
+\href{https://github.com/ModelTC/msbench}{https://github.com/ModelTC/msbench}.
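+
+ For context, the snippet below sketches one representative post-training
+sparsity technique, global magnitude pruning with no retraining; it is an
+illustrative baseline only and is not part of the PTSBench codebase. The model
+and sparsity level are arbitrary.
+
+    import torch
+    import torch.nn as nn
+
+    def global_magnitude_prune(model: nn.Module, sparsity: float = 0.5):
+        """Zero out the `sparsity` fraction of smallest-magnitude weights globally."""
+        weights = [m.weight for m in model.modules()
+                   if isinstance(m, (nn.Linear, nn.Conv2d))]
+        all_vals = torch.cat([w.detach().abs().flatten() for w in weights])
+        k = int(sparsity * all_vals.numel())
+        if k == 0:
+            return
+        threshold = torch.kthvalue(all_vals, k).values
+        with torch.no_grad():
+            for w in weights:                  # keep only weights above the threshold
+                w.mul_((w.abs() > threshold).float())
+
+    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
+    global_magnitude_prune(model, sparsity=0.7)
+    zeros = sum((m.weight == 0).sum().item() for m in model if isinstance(m, nn.Linear))
+    total = sum(m.weight.numel() for m in model if isinstance(m, nn.Linear))
+    print(f"achieved sparsity: {zeros / total:.2f}")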
+
+
+
+
+
+
+
+ ☆ RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation
+
+
+ In recent years, robotics has advanced significantly through the integration
+of larger models and large-scale datasets. However, challenges remain in
+applying these models to 3D spatial interactions and managing data collection
+costs. To address these issues, we propose the multimodal robotic manipulation
+model, RoboMM, along with the comprehensive dataset, RoboData. RoboMM enhances
+3D perception through camera parameters and occupancy supervision. Building on
+OpenFlamingo, it incorporates Modality-Isolation-Mask and multimodal decoder
+blocks, improving modality fusion and fine-grained perception. RoboData offers
+a complete evaluation system by integrating several well-known datasets,
+achieving the first fusion of multi-view images, camera parameters, depth maps,
+and actions; its space alignment facilitates comprehensive learning from
+diverse robotic datasets. Equipped with RoboData and the unified physical
+space, RoboMM is a generalist policy that enables simultaneous evaluation
+across all tasks within multiple datasets, rather than focusing on a limited
+selection of data or tasks. Its design significantly enhances robotic
+manipulation performance, increasing the average sequence length on the CALVIN
+benchmark from 1.7 to 3.3 and ensuring cross-embodiment capabilities, achieving
+state-of-the-art results across multiple datasets.
+
+
+
+
+
+
+
+ ☆ Annotation Techniques for Judo Combat Phase Classification from
+ Tournament Footage
+
+
+ This paper presents a semi-supervised approach to extracting and analyzing
+combat phases in judo tournaments using live-streamed footage. The objective is
+to automate the annotation and summarization of live streamed judo matches. We
+train models that extract relevant entities and classify combat phases from
+fixed-perspective judo recordings. We employ semi-supervised methods to address
+limited labeled data in the domain. We build a model of combat phases via
+transfer learning from a fine-tuned object detector to classify the presence,
+activity, and standing state of the match. We evaluate our approach on a
+dataset of 19 thirty-second judo clips, achieving an F1 score on a $20\%$ test
+hold-out of 0.66, 0.78, and 0.87 for the three classes, respectively. Our
+results show initial promise for automating more complex information retrieval
+tasks using rigorous methods with limited labeled data.
+
+
+
+
+
+
+
+ ☆ EvRepSL: Event-Stream Representation via Self-Supervised Learning for
+ Event-Based Vision
+
+
+ Event-stream representation is the first step for many computer vision tasks
+using event cameras. It converts the asynchronous event-streams into a
+formatted structure so that conventional machine learning models can be applied
+easily. However, most of the state-of-the-art event-stream representations are
+manually designed and the quality of these representations cannot be guaranteed
+due to the noisy nature of event-streams. In this paper, we introduce a
+data-driven approach aiming at enhancing the quality of event-stream
+representations. Our approach commences with the introduction of a new
+event-stream representation based on spatial-temporal statistics, denoted as
+EvRep. Subsequently, we theoretically derive the intrinsic relationship between
+asynchronous event-streams and synchronous video frames. Building upon this
+theoretical relationship, we train a representation generator, RepGen, in a
+self-supervised learning manner accepting EvRep as input. Finally, the
+event-streams are converted to high-quality representations, termed EvRepSL,
+by passing them through the learned RepGen (without the need for fine-tuning or
+retraining). Our methodology is rigorously validated through extensive
+evaluations on a variety of mainstream event-based classification and optical
+flow datasets (captured with various types of event cameras). The experimental
+results highlight not only our approach's superior performance over existing
+event-stream representations but also its versatility, being agnostic to
+different event cameras and tasks.
+
+
+
+ comment: Published in IEEE Transactions on Image Processing
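+
+ To make the notion of a statistics-based event-stream representation concrete,
+the sketch below bins an asynchronous (x, y, t, polarity) stream into per-pixel
+event counts and mean timestamps. It mirrors the general family of hand-crafted
+representations discussed above and is not the paper's exact EvRep definition.
+
+    import numpy as np
+
+    def events_to_grid(x, y, t, p, height, width):
+        """Return a (4, H, W) grid: [count_neg, count_pos, mean_t_neg, mean_t_pos]."""
+        grid = np.zeros((4, height, width), dtype=np.float32)
+        for pol in (0, 1):                      # 0 = negative, 1 = positive polarity
+            mask = (p == pol)
+            np.add.at(grid[pol], (y[mask], x[mask]), 1.0)          # event counts
+            np.add.at(grid[2 + pol], (y[mask], x[mask]), t[mask])  # timestamp sums
+        counts = np.clip(grid[:2], 1.0, None)   # avoid division by zero
+        grid[2:] /= counts                      # turn sums into means
+        return grid
+
+    # Tiny synthetic stream on a 4x4 sensor.
+    x = np.array([0, 1, 1, 3]); y = np.array([0, 2, 2, 3])
+    t = np.array([0.1, 0.2, 0.3, 0.9]); p = np.array([1, 0, 0, 1])
+    rep = events_to_grid(x, y, t, p, height=4, width=4)
+    print(rep.shape, rep[2, 2, 1])              # mean timestamp of negative events at (y=2, x=1)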
+
+
+
+
+
+
+ ♻ ☆ CHORDONOMICON: A Dataset of 666,000 Songs and their Chord Progressions
+
+
+ Chord progressions encapsulate important information about music, pertaining
+to its structure and conveyed emotions. They serve as the backbone of musical
+composition, and in many cases, they are the sole information required for a
+musician to play along and follow the music. Despite their importance, chord
+progressions as a data domain remain underexplored. There is a lack of
+large-scale datasets suitable for deep learning applications, and limited
+research exploring chord progressions as an input modality. In this work, we
+present Chordonomicon, a dataset of over 666,000 songs and their chord
+progressions, annotated with structural parts, genre, and release date -
+created by scraping various sources of user-generated progressions and
+associated metadata. We demonstrate the practical utility of the Chordonomicon
+dataset for classification and generation tasks, and discuss its potential to
+provide valuable insights to the research community. Chord progressions are
+unique in that they can be represented in multiple formats (e.g., text or
+graph) and in the wealth of information chords convey in a given context, such
+as their harmonic function. These characteristics make the Chordonomicon an ideal
+testbed for exploring advanced machine learning techniques, including
+transformers, graph machine learning, and hybrid systems that combine knowledge
+representation and machine learning.
+
+
+
+
+
+
+
+ ♻ ☆ MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion
+
+
+
+
+
+
+
+
+ Sai Shashank Kalakonda, Shubh Maheshwari, Ravi Kiran Sarvadevabhatla
+
+
+ We introduce MoRAG, a novel multi-part fusion based retrieval-augmented
+generation strategy for text-based human motion generation. The method enhances
+motion diffusion models by leveraging additional knowledge obtained through an
+improved motion retrieval process. By effectively prompting large language
+models (LLMs), we address spelling errors and rephrasing issues in motion
+retrieval. Our approach utilizes a multi-part retrieval strategy to improve the
+generalizability of motion retrieval across the language space. We create
+diverse samples through the spatial composition of the retrieved motions.
+Furthermore, by utilizing low-level, part-specific motion information, we can
+construct motion samples for unseen text descriptions. Our experiments
+demonstrate that our framework can serve as a plug-and-play module, improving
+the performance of motion diffusion models. Code, pretrained models and sample
+videos are available at: https://motion-rag.github.io/
+
+
+
+
+
+
+
+ ♻ ☆ SOMONITOR: Combining Explainable AI & Large Language Models for
+ Marketing Analytics
+
+
+
+
+
+
+
+
+ Aleksandr Farseev, Qi Yang, Marlo Ongpin, Ilia Gossoudarev, Yu-Yi Chu-Farseeva, Sergey Nikolenko
+
+
+ Online marketing faces formidable challenges in managing and interpreting
+immense volumes of data necessary for competitor analysis, content research,
+and strategic branding. It is impossible to review hundreds to thousands of
+transient online content items by hand, and partial analysis often leads to
+suboptimal outcomes and poorly performing campaigns. We introduce an
+explainable AI framework SOMONITOR that aims to synergize human intuition with
+AI-based efficiency, helping marketers across all stages of the marketing
+funnel, from strategic planning to content creation and campaign execution.
+SOMONITOR incorporates a CTR prediction and ranking model for advertising
+content and uses large language models (LLMs) to process high-performing
+competitor content, identifying core content pillars such as target audiences,
+customer needs, and product features. These pillars are then organized into
+broader categories, including communication themes and targeted customer
+personas. By integrating these insights with data from the brand's own
+advertising campaigns, SOMONITOR constructs a narrative for addressing new
+customer personas and simultaneously generates detailed content briefs in the
+form of user stories that, as shown in the conducted case study, can be
+directly applied by marketing teams to streamline content production and
+campaign execution. The adoption of SOMONITOR in daily operations allows
+digital marketers to quickly parse through extensive datasets, offering
+actionable insights that significantly enhance campaign effectiveness and
+overall job satisfaction.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 12
+
+
+
+
+
+ ☆ FM2DS: Few-Shot Multimodal Multihop Data Synthesis with Knowledge
+ Distillation for Question Answering CVPR 2025
+
+
+
+
+
+
+
+
+ Amirhossein Abaskohi, Spandana Gella, Giuseppe Carenini, Issam H. Laradji
+
+
+ Multimodal multihop question answering is a complex task that requires
+reasoning over multiple sources of information, such as images and text, to
+answer questions. While there has been significant progress in visual question
+answering, the multihop setting remains unexplored due to the lack of
+high-quality datasets. Current methods focus on single-hop question answering
+or a single modality, which makes them unsuitable for real-world scenarios such
+as analyzing multimodal educational materials, summarizing lengthy academic
+articles, or interpreting scientific studies that combine charts, images, and
+text. To address this gap, we propose a novel methodology, introducing the
+first framework for creating a high-quality dataset that enables training
+models for multimodal multihop question answering. Our approach consists of a
+5-stage pipeline that involves acquiring relevant multimodal documents from
+Wikipedia, synthetically generating high-level questions and answers, and
+validating them through rigorous criteria to ensure quality data. We evaluate
+our methodology by training models on our synthesized dataset and testing on
+two benchmarks; the results demonstrate that, with an equal sample size, models
+trained on our synthesized data outperform those trained on human-collected
+data by 1.9 in exact match (EM) on average. We believe our data synthesis
+method will serve as a strong foundation for training and evaluating multimodal
+multihop question answering models.
+
+
+
+
+
+
+
+ ☆ Bridging Conversational and Collaborative Signals for Conversational
+ Recommendation
+
+
+
+
+
+
+
+
+ Ahmad Bin Rabiah, Nafis Sadeq, Julian McAuley
+
+
+ Conversational recommendation systems (CRS) leverage contextual information
+from conversations to generate recommendations but often struggle due to a lack
+of collaborative filtering (CF) signals, which capture user-item interaction
+patterns essential for accurate recommendations. We introduce Reddit-ML32M, a
+dataset that links Reddit conversations with interactions on MovieLens 32M, to
+enrich item representations by leveraging collaborative knowledge and
+addressing interaction sparsity in conversational datasets. We propose an
+LLM-based framework that uses Reddit-ML32M to align LLM-generated
+recommendations with CF embeddings, refining rankings for better performance.
+We evaluate our framework against three sets of baselines: CF-based
+recommenders using only interactions from CRS tasks, traditional CRS models,
+and LLM-based methods relying on conversational context without item
+representations. Our approach achieves consistent improvements, including a
+12.32% increase in Hit Rate and a 9.9% improvement in NDCG, outperforming the
+best-performing baseline that relies on conversational context but lacks
+collaborative item representations.
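+
+ A minimal sketch of the alignment idea described above: candidates already
+ordered by the LLM are rescored with collaborative-filtering item embeddings
+(similarity to the mean embedding of the user's interacted items) and the two
+signals are blended. The embeddings, weights, and IDs are hypothetical, and
+this is not the authors' framework.
+
+    import numpy as np
+
+    def rerank_with_cf(candidates, item_emb, user_history, alpha=0.5):
+        """`candidates` is assumed to be in the LLM's original rank order."""
+        user_vec = item_emb[user_history].mean(axis=0)
+        user_vec /= np.linalg.norm(user_vec) + 1e-8
+        scores = {}
+        for pos, item in enumerate(candidates):
+            v = item_emb[item]
+            cf_sim = float(v @ user_vec / (np.linalg.norm(v) + 1e-8))
+            llm_score = 1.0 / (pos + 1)          # reciprocal of the LLM rank
+            scores[item] = alpha * cf_sim + (1 - alpha) * llm_score
+        return sorted(candidates, key=lambda i: -scores[i])
+
+    rng = np.random.default_rng(0)
+    item_emb = rng.normal(size=(100, 16))        # hypothetical CF item embeddings
+    print(rerank_with_cf([5, 17, 42], item_emb, user_history=[3, 5, 8]))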
+
+
+
+
+
+
+
+ ☆ Efficient user history modeling with amortized inference for deep
+ learning recommendation models WWW 2025
+
+
+
+
+
+
+
+
+ Lars Hertel, Neil Daftary, Fedor Borisyuk, Aman Gupta, Rahul Mazumder
+
+
+ We study user history modeling via Transformer encoders in deep learning
+recommendation models (DLRM). Such architectures can significantly improve
+recommendation quality, but usually incur high latency cost necessitating
+infrastructure upgrades or very small Transformer models. An important part of
+user history modeling is early fusion of the candidate item and various methods
+have been studied. We revisit early fusion and compare concatenation of the
+candidate to each history item against appending it to the end of the list as a
+separate item. Using the latter method allows us to reformulate the recently
+proposed amortized history inference algorithm M-FALCON \cite{zhai2024actions}
+for the case of DLRM models. We show via experimental results that appending
+with cross-attention performs on par with concatenation and that amortization
+significantly reduces inference costs. We conclude with results from deploying
+this model on the LinkedIn Feed and Ads surfaces, where amortization reduces
+latency by 30\% compared to non-amortized inference.
+
+
+ Information retrieval systems have historically relied on explicit query
+formulation, requiring users to translate their information needs into text.
+This process is particularly disruptive during reading tasks, where users must
+interrupt their natural flow to formulate queries. We present DEEPER (Dense
+Electroencephalography Passage Retrieval), a novel framework that enables
+direct retrieval of relevant passages from users' neural signals during
+naturalistic reading without intermediate text translation. Building on dense
+retrieval architectures, DEEPER employs a dual-encoder approach with
+specialised components for processing neural data, mapping EEG signals and text
+passages into a shared semantic space. Through careful architecture design and
+cross-modal negative sampling strategies, our model learns to align neural
+patterns with their corresponding textual content. Experimental results on the
+ZuCo dataset demonstrate that direct brain-to-passage retrieval significantly
+outperforms current EEG-to-text baselines, achieving a 571% improvement in
+Precision@1. Our ablation studies reveal that the model successfully learns
+aligned representations between EEG and text modalities (0.29 cosine
+similarity), while our hard negative sampling strategy contributes to overall
+performance increases.
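+
+ The dual-encoder objective described above can be sketched as follows: an EEG
+encoder and a text encoder map into a shared space and are trained with an
+in-batch InfoNCE loss, to which hard negatives could be added. The encoder
+architectures, dimensions, and temperature below are placeholders, not DEEPER's
+actual components.
+
+    import torch
+    import torch.nn as nn
+    import torch.nn.functional as F
+
+    class DualEncoder(nn.Module):
+        def __init__(self, eeg_dim=128, text_dim=768, shared_dim=256):
+            super().__init__()
+            self.eeg_enc = nn.Sequential(nn.Linear(eeg_dim, 512), nn.ReLU(),
+                                         nn.Linear(512, shared_dim))
+            self.text_enc = nn.Linear(text_dim, shared_dim)
+
+        def forward(self, eeg, text):
+            return (F.normalize(self.eeg_enc(eeg), dim=-1),
+                    F.normalize(self.text_enc(text), dim=-1))
+
+    def info_nce(eeg_z, text_z, temperature=0.05):
+        logits = eeg_z @ text_z.T / temperature   # (batch, batch) similarities
+        targets = torch.arange(eeg_z.size(0))     # matching passage on the diagonal
+        return F.cross_entropy(logits, targets)
+
+    model = DualEncoder()
+    eeg = torch.randn(8, 128)    # pooled EEG features for 8 reading windows
+    text = torch.randn(8, 768)   # embeddings of the 8 corresponding passages
+    loss = info_nce(*model(eeg, text))
+    loss.backward()
+    print(float(loss))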
+
+
+ This paper introduces a new semantic search algorithm that uses Word2Vec and
+Annoy Index to improve the efficiency of information retrieval from large
+datasets. The proposed approach addresses the limitations of traditional search
+methods by offering enhanced speed, accuracy, and scalability. Testing on
+datasets up to 100GB demonstrates the method's effectiveness in processing vast
+amounts of data while maintaining high precision and performance.
+
+
+
+ comment: 6 pages, 5 Figures
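+
+ The general recipe above (Word2Vec embeddings plus an Annoy index) can be
+sketched as follows; the corpus, vector size, and tree count are toy values and
+this is not the paper's pipeline.
+
+    import numpy as np
+    from gensim.models import Word2Vec
+    from annoy import AnnoyIndex
+
+    docs = ["the cat sat on the mat",
+            "dogs are loyal animals",
+            "stock prices fell sharply today",
+            "the kitten played with the dog"]
+    tokenized = [d.split() for d in docs]
+
+    # Train word vectors, then embed each document as the mean of its word vectors.
+    w2v = Word2Vec(tokenized, vector_size=32, min_count=1, epochs=50, seed=1)
+
+    def embed(tokens):
+        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
+        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.wv.vector_size)
+
+    index = AnnoyIndex(32, "angular")            # approximate NN index (cosine-like)
+    for i, tokens in enumerate(tokenized):
+        index.add_item(i, embed(tokens).tolist())
+    index.build(10)                              # 10 trees
+
+    hits = index.get_nns_by_vector(embed("a cat and a dog".split()).tolist(), 2)
+    print([docs[i] for i in hits])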
+
+
+
+
+
+
+ ☆ PRECISE: Pre-training Sequential Recommenders with Collaborative and
+ Semantic Information
+
+
+ Real-world recommendation systems commonly offer diverse content scenarios
+for users to interact with. Considering the enormous number of users in
+industrial platforms, it is infeasible to utilize a single unified
+recommendation model to meet the requirements of all scenarios. Usually,
+separate recommendation pipelines are established for each distinct scenario.
+This practice leads to challenges in comprehensively grasping users' interests.
+Recent research endeavors have been made to tackle this problem by pre-training
+models to encapsulate the overall interests of users. Traditional pre-trained
+recommendation models mainly capture user interests by leveraging collaborative
+signals. Nevertheless, a prevalent drawback of these systems is their
+incapacity to handle long-tail items and cold-start scenarios. With the recent
+advent of large language models, there has been a significant increase in
+research efforts focused on exploiting LLMs to extract semantic information for
+users and items. However, text-based recommendations rely heavily on elaborate
+feature engineering and frequently fail to capture collaborative similarities.
+To overcome these limitations, we propose a novel pre-training framework for
+sequential recommendation, termed PRECISE. This framework combines
+collaborative signals with semantic information. Moreover, PRECISE employs a
+learning framework that initially models users' comprehensive interests across
+all recommendation scenarios and subsequently concentrates on the specific
+interests of target-scene behaviors. We demonstrate that PRECISE precisely
+captures the entire range of user interests and effectively transfers them to
+the target interests. Empirical findings reveal that the PRECISE framework
+attains outstanding performance on both public and industrial datasets.
+
+
+
+
+
+
+
+ ☆ Methods for Legal Citation Prediction in the Age of LLMs: An Australian
+ Law Case Study
+
+
+ In recent years, Large Language Models (LLMs) have shown great potential
+across a wide range of legal tasks. Despite these advances, mitigating
+hallucination remains a significant challenge, with state-of-the-art LLMs still
+frequently generating incorrect legal references. In this paper, we focus on
+the problem of legal citation prediction within the Australian law context,
+where correctly identifying and citing relevant legislations or precedents is
+critical. We compare several approaches: prompting general purpose and
+law-specialised LLMs, retrieval-only pipelines with both generic and
+domain-specific embeddings, task-specific instruction-tuning of LLMs, and
+hybrid strategies that combine LLMs with retrieval augmentation, query
+expansion, or voting ensembles. Our findings indicate that domain-specific
+pre-training alone, even law-specialised pre-training, is insufficient for
+achieving satisfactory citation accuracy. In contrast, instruction tuning on our
+task-specific dataset dramatically boosts performance, reaching the best results
+across all settings. We also highlight that database granularity, along with the
+type of embeddings, plays a critical role in the performance of retrieval
+systems. Among retrieval-based approaches, hybrid methods consistently
+outperform retrieval-only setups, and among these, ensemble voting delivers the
+best result by combining the predictive quality of instruction-tuned LLMs with
+the retrieval system.
+
+
+
+ comment: For code, data, and models see https://auslawbench.github.io
+
+
+
+
+
+
+ ♻ ☆ Predictive Models in Sequential Recommendations: Bridging Performance
+ Laws with Data Quality Insights
+
+
+ Sequential Recommendation (SR) plays a critical role in predicting users'
+sequential preferences. Despite its growing prominence in various industries,
+the increasing scale of SR models incurs substantial computational costs and
+unpredictability, challenging developers to manage resources efficiently. Under
+this predicament, Scaling Laws have achieved significant success by examining
+the loss as models scale up. However, there remains a disparity between loss
+and model performance, which is of greater concern in practical applications.
+Moreover, as datasets continue to expand, they increasingly include repetitive
+and low-value data. In response, we introduce the Performance Law for SR models,
+which aims to theoretically investigate and model the relationship between
+model performance and data quality. Specifically, we first fit the HR and NDCG
+metrics to transformer-based SR models. Subsequently, we propose Approximate
+Entropy (ApEn) to assess data quality, presenting a more nuanced approach
+compared to traditional data quantity metrics. Our method enables accurate
+predictions across various dataset scales and model sizes, demonstrating a
+strong correlation in large SR models and offering insights into achieving
+optimal performance for any given model configuration.
+
+
+
+ comment: 12 pages, 5 figures
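+
+ As a concrete reference for the data-quality metric mentioned above, the
+snippet below computes standard Approximate Entropy (ApEn) for a numeric
+sequence; the paper's exact formulation and preprocessing of interaction data
+may differ, and the tolerance choice here is just a common default.
+
+    import numpy as np
+
+    def approximate_entropy(series, m=2, r=None):
+        x = np.asarray(series, dtype=float)
+        n = len(x)
+        if r is None:
+            r = 0.2 * x.std()                   # common default tolerance
+
+        def phi(m):
+            templates = np.array([x[i:i + m] for i in range(n - m + 1)])
+            # Chebyshev distance between every pair of length-m templates.
+            dist = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
+            counts = (dist <= r).mean(axis=1)   # fraction of similar templates
+            return np.mean(np.log(counts))
+
+        return phi(m) - phi(m + 1)
+
+    regular = [0, 1] * 50                                        # highly repetitive
+    noisy = np.random.default_rng(0).integers(0, 10, size=100)   # near-random
+    print(approximate_entropy(regular), approximate_entropy(noisy))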
+
+
+
+
+
+
+ ♻ ☆ Croissant: A Metadata Format for ML-Ready Datasets NeurIPS 2024
+
+
+
+
+
+
+
+
+ Mubashara Akhtar, Omar Benjelloun, Costanza Conforti, Luca Foschini, Joan Giner-Miguelez, Pieter Gijsbers, Sujata Goswami, Nitisha Jain, Michalis Karamousadakis, Michael Kuchnik, Satyapriya Krishna, Sylvain Lesage, Quentin Lhoest, Pierre Marcenac, Manil Maskey, Peter Mattson, Luis Oala, Hamidah Oderinwale, Pierre Ruyssen, Tim Santos, Rajat Shinde, Elena Simperl, Arjun Suresh, Goeffry Thomas, Slava Tykhonov, Joaquin Vanschoren, Susheel Varma, Jos van der Velde, Steffen Vogler, Carole-Jean Wu, Luyao Zhang
+
+
+ Data is a critical resource for machine learning (ML), yet working with data
+remains a key friction point. This paper introduces Croissant, a metadata
+format for datasets that creates a shared representation across ML tools,
+frameworks, and platforms. Croissant makes datasets more discoverable,
+portable, and interoperable, thereby addressing significant challenges in ML
+data management. Croissant is already supported by several popular dataset
+repositories, spanning hundreds of thousands of datasets, enabling easy loading
+into the most commonly-used ML frameworks, regardless of where the data is
+stored. Our initial evaluation by human raters shows that Croissant metadata is
+readable, understandable, complete, yet concise.
+
+
+
+ comment: Published at the NeurIPS 2024 Datasets and Benchmark Track. A shorter
+ version appeared earlier in Proceedings of ACM SIGMOD/PODS'24 Data Management
+ for End-to-End Machine Learning (DEEM) Workshop
+ https://dl.acm.org/doi/10.1145/3650203.3663326
+
+
+
+
+
+
+ ♻ ☆ Enhancing Graph Contrastive Learning with Reliable and Informative
+ Augmentation for Recommendation KDD 2025
+
+
+ Graph neural networks (GNNs) have been a powerful approach in collaborative
+filtering (CF) due to their ability to model high-order user-item relationships.
+Recently, to alleviate data sparsity and enhance representation learning,
+many efforts have been made to integrate contrastive learning (CL) with
+GNNs. Despite the promising improvements, the contrastive view generation based
+on structure and representation perturbations in existing methods potentially
+disrupts the collaborative information in contrastive views, resulting in
+limited effectiveness of positive alignment. To overcome this issue, we propose
+CoGCL, a novel framework that aims to enhance graph contrastive learning by
+constructing contrastive views with stronger collaborative information via
+discrete codes. The core idea is to map users and items into discrete codes
+rich in collaborative information for reliable and informative contrastive view
+generation. To this end, we initially introduce a multi-level vector quantizer
+in an end-to-end manner to quantize user and item representations into discrete
+codes. Based on these discrete codes, we enhance the collaborative information
+of contrastive views by considering neighborhood structure and semantic
+relevance respectively. For neighborhood structure, we propose virtual neighbor
+augmentation by treating discrete codes as virtual neighbors, which expands an
+observed user-item interaction into multiple edges involving discrete codes.
+Regarding semantic relevance, we identify similar users/items based on shared
+discrete codes and interaction targets to generate the semantically relevant
+view. Through these strategies, we construct contrastive views with stronger
+collaborative information and develop a triple-view graph contrastive learning
+approach. Extensive experiments on four public datasets demonstrate the
+effectiveness of our proposed approach.
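+
+ The discrete codes at the heart of the approach above can be illustrated with
+a simple residual (multi-level) k-means quantizer over fixed embeddings; the
+paper instead trains its multi-level vector quantizer end-to-end with the
+recommender, which this sketch does not reproduce. Sizes and data are toy.
+
+    import numpy as np
+    from sklearn.cluster import KMeans
+
+    def fit_residual_quantizer(embeddings, n_levels=2, codebook_size=8, seed=0):
+        residual = embeddings.copy()
+        codebooks, codes = [], []
+        for _ in range(n_levels):
+            km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
+            level_codes = km.fit_predict(residual)
+            codebooks.append(km.cluster_centers_)
+            codes.append(level_codes)
+            residual = residual - km.cluster_centers_[level_codes]  # quantize the residual next
+        return codebooks, np.stack(codes, axis=1)   # codes: (n_items, n_levels)
+
+    rng = np.random.default_rng(0)
+    item_emb = rng.normal(size=(200, 32))           # hypothetical item embeddings
+    codebooks, codes = fit_residual_quantizer(item_emb)
+    print(codes[:5])                                # each item -> a tuple of discrete codes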
+
+
+ This paper presents a novel approach to compute food composition data for
+Indian recipes using a knowledge graph for Indian food (FKG.in) and LLMs. The
+primary focus is to provide a broad overview of an automated food composition
+analysis workflow and describe its core functionalities: nutrition data
+aggregation, food composition analysis, and LLM-augmented information
+resolution. This workflow aims to complement FKG.in and iteratively supplement
+food composition data from verified knowledge bases. Additionally, this paper
+highlights the challenges of representing Indian food and accessing food
+composition data digitally. It also reviews three key sources of food
+composition data: the Indian Food Composition Tables, the Indian Nutrient
+Databank, and the Nutritionix API. Furthermore, it briefly outlines how users
+can interact with the workflow to obtain diet-based health recommendations and
+detailed food composition information for numerous recipes. We then explore the
+complex challenges of analyzing Indian recipe information across dimensions
+such as structure, multilingualism, and uncertainty as well as present our
+ongoing work on LLM-based solutions to address these issues. The methods
+proposed in this workshop paper for AI-driven knowledge curation and
+information resolution are application-agnostic, generalizable, and replicable
+for any domain.
+
+
+ This paper introduces xRAG, an innovative context compression method tailored
+for retrieval-augmented generation. xRAG reinterprets document embeddings in
+dense retrieval--traditionally used solely for retrieval--as features from the
+retrieval modality. By employing a modality fusion methodology, xRAG seamlessly
+integrates these embeddings into the language model representation space,
+effectively eliminating the need for their textual counterparts and achieving
+an extreme compression rate. In xRAG, the only trainable component is the
+modality bridge, while both the retriever and the language model remain frozen.
+This design choice allows for the reuse of offline-constructed document
+embeddings and preserves the plug-and-play nature of retrieval augmentation.
+Experimental results demonstrate that xRAG achieves an average improvement of
+over 10% across six knowledge-intensive tasks, adaptable to various language
+model backbones, ranging from a dense 7B model to an 8x7B Mixture of Experts
+configuration. xRAG not only significantly outperforms previous context
+compression methods but also matches the performance of uncompressed models on
+several datasets, while reducing overall FLOPs by a factor of 3.53. Our work
+pioneers new directions in retrieval-augmented generation from the perspective
+of multimodality fusion, and we hope it lays the foundation for future
+efficient and scalable retrieval-augmented systems.
+
+
+
+ comment: Neurips 2024
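+
+ The core mechanism described above can be sketched as a small trainable
+projection that maps a frozen retriever's document embedding into the language
+model's token-embedding space, so the document enters the prompt as one soft
+token instead of its full text. Dimensions and architecture below are
+hypothetical; this is not the released xRAG implementation.
+
+    import torch
+    import torch.nn as nn
+
+    class ModalityBridge(nn.Module):
+        """Only trainable part: retriever embedding -> LM embedding space."""
+        def __init__(self, retriever_dim=768, lm_hidden=4096):
+            super().__init__()
+            self.proj = nn.Sequential(nn.Linear(retriever_dim, lm_hidden),
+                                      nn.GELU(),
+                                      nn.Linear(lm_hidden, lm_hidden))
+
+        def forward(self, doc_embedding):                 # (batch, retriever_dim)
+            return self.proj(doc_embedding).unsqueeze(1)  # (batch, 1, lm_hidden)
+
+    bridge = ModalityBridge()
+    doc_emb = torch.randn(2, 768)          # offline-computed retrieval embeddings
+    token_embs = torch.randn(2, 10, 4096)  # embedded prompt tokens (frozen LM side)
+    lm_inputs = torch.cat([bridge(doc_emb), token_embs], dim=1)
+    print(lm_inputs.shape)                 # (2, 11, 4096) -> fed to the frozen LM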
+
+
+
+
+
+
+
+
+
+ Multimedia 11
+
+
+
+
+
+ ☆ OmniEvalKit: A Modular, Lightweight Toolbox for Evaluating Large
+ Language Model and its Omni-Extensions
+
+
+ The rapid advancements in Large Language Models (LLMs) have significantly
+expanded their applications, ranging from multilingual support to
+domain-specific tasks and multimodal integration. In this paper, we present
+OmniEvalKit, a novel benchmarking toolbox designed to evaluate LLMs and their
+omni-extensions across multilingual, multidomain, and multimodal capabilities.
+Unlike existing benchmarks that often focus on a single aspect, OmniEvalKit
+provides a modular, lightweight, and automated evaluation system. It is
+structured with a modular architecture comprising a Static Builder and Dynamic
+Data Flow, promoting the seamless integration of new models and datasets.
+OmniEvalKit supports over 100 LLMs and 50 evaluation datasets, covering
+comprehensive evaluations across thousands of model-dataset combinations.
+OmniEvalKit is dedicated to creating an ultra-lightweight and fast-deployable
+evaluation framework, making downstream applications more convenient and
+versatile for the AI community.
+
+
+
+
+
+
+
+ ☆ MuMu-LLaMA: Multi-modal Music Understanding and Generation via Large
+ Language Models
+
+
+ Research on large language models has advanced significantly across text,
+speech, images, and videos. However, multi-modal music understanding and
+generation remain underexplored due to the lack of well-annotated datasets. To
+address this, we introduce a dataset with 167.69 hours of multi-modal data,
+including text, images, videos, and music annotations. Based on this dataset,
+we propose MuMu-LLaMA, a model that leverages pre-trained encoders for music,
+images, and videos. For music generation, we integrate AudioLDM 2 and MusicGen.
+Our evaluation across four tasks--music understanding, text-to-music
+generation, prompt-based music editing, and multi-modal music
+generation--demonstrates that MuMu-LLaMA outperforms state-of-the-art models,
+showing its potential for multi-modal music applications.
+
+
+
+
+
+
+
+ ☆ AI TrackMate: Finally, Someone Who Will Give Your Music More Than Just
+ "Sounds Great!" NeurIPS 2024
+
+
+ The rise of "bedroom producers" has democratized music creation, while
+challenging producers to objectively evaluate their work. To address this, we
+present AI TrackMate, an LLM-based music chatbot designed to provide
+constructive feedback on music productions. By combining LLMs' inherent musical
+knowledge with direct audio track analysis, AI TrackMate offers
+production-specific insights, distinguishing it from text-only approaches. Our
+framework integrates a Music Analysis Module, an LLM-Readable Music Report, and
+Music Production-Oriented Feedback Instruction, creating a plug-and-play,
+training-free system compatible with various LLMs and adaptable to future
+advancements. We demonstrate AI TrackMate's capabilities through an interactive
+web interface and present findings from a pilot study with a music producer. By
+bridging AI capabilities with the needs of independent producers, AI TrackMate
+offers on-demand analytical feedback, potentially supporting the creative
+process and skill development in music production. This system addresses the
+growing demand for objective self-assessment tools in the evolving landscape of
+independent music production.
+
+
+
+ comment: Accepted for the NeurIPS 2024 Creative AI Track
+
+
+
+
+
+
+ ☆ Towards Controllable Speech Synthesis in the Era of Large Language
+ Models: A Survey
+
+
+
+
+
+
+
+
+ Tianxin Xie, Yan Rong, Pengfei Zhang, Li Liu
+
+
+ Text-to-speech (TTS), also known as speech synthesis, is a prominent research
+area that aims to generate natural-sounding human speech from text. Recently,
+with the increasing industrial demand, TTS technologies have evolved beyond
+synthesizing human-like speech to enabling controllable speech generation. This
+includes fine-grained control over various attributes of synthesized speech
+such as emotion, prosody, timbre, and duration. Besides, advancements in deep
+learning, such as diffusion and large language models, have significantly
+enhanced controllable TTS over the past several years. In this paper, we
+conduct a comprehensive survey of controllable TTS, covering approaches ranging
+from basic control techniques to methods utilizing natural language prompts,
+aiming to provide a clear understanding of the current state of research. We
+examine the general controllable TTS pipeline, challenges, model architectures,
+and control strategies, offering a comprehensive and clear taxonomy of existing
+methods. Additionally, we provide a detailed summary of datasets and evaluation
+metrics and shed some light on the applications and future directions of
+controllable TTS. To the best of our knowledge, this survey paper provides the
+first comprehensive review of emerging controllable TTS methods, which can
+serve as a beneficial resource for both academic researchers and industry
+practitioners.
+
+
+
+ comment: A comprehensive survey on controllable TTS, 23 pages, 6 tables, 4
+ figures, 280 references
+
+
+
+
+
+
+ ☆ 4D Gaussian Splatting with Scale-aware Residual Field and Adaptive
+ Optimization for Real-time Rendering of Temporally Complex Dynamic Scenes
+
+
+ Reconstructing dynamic scenes from video sequences is a highly promising task
+in the multimedia domain. While previous methods have made progress, they often
+struggle with slow rendering and managing temporal complexities such as
+significant motion and object appearance/disappearance. In this paper, we
+propose SaRO-GS as a novel dynamic scene representation capable of achieving
+real-time rendering while effectively handling temporal complexities in dynamic
+scenes. To address the issue of slow rendering speed, we adopt a Gaussian
+primitive-based representation and optimize the Gaussians in 4D space, which
+facilitates real-time rendering with the assistance of 3D Gaussian Splatting.
+Additionally, to handle temporally complex dynamic scenes, we introduce a
+Scale-aware Residual Field. This field considers the size information of each
+Gaussian primitive while encoding its residual feature and aligns with the
+self-splitting behavior of Gaussian primitives. Furthermore, we propose an
+Adaptive Optimization Schedule, which assigns different optimization strategies
+to Gaussian primitives based on their distinct temporal properties, thereby
+expediting the reconstruction of dynamic regions. Through evaluations on
+monocular and multi-view datasets, our method has demonstrated state-of-the-art
+performance. Please see our project page at
+https://yjb6.github.io/SaRO-GS.github.io.
+
+
+ Crack detection is a critical task in structural health monitoring, aimed at
+assessing the structural integrity of bridges, buildings, and roads to prevent
+potential failures. Vision-based crack detection has become the mainstream
+approach due to its ease of implementation and effectiveness. Fusing infrared
+(IR) channels with red, green and blue (RGB) channels can enhance feature
+representation and thus improve crack detection. However, IR and RGB channels
+often differ in resolution. To align them, higher-resolution RGB images
+typically need to be downsampled to match the IR image resolution, which leads
+to the loss of fine details. Moreover, crack detection performance is
+restricted by the limited receptive fields and high computational complexity of
+traditional image segmentation networks. Inspired by the recently proposed
+Mamba neural architecture, this study introduces a two-stage paradigm called
+MSCrackMamba, which leverages Vision Mamba along with a super-resolution
+network to address these challenges. Specifically, to align IR and RGB
+channels, we first apply super-resolution to IR channels to match the
+resolution of RGB channels for data fusion. Vision Mamba is then adopted as the
+backbone network, while UperNet is employed as the decoder for crack detection.
+Our approach is validated on the large-scale Crack Detection dataset Crack900,
+demonstrating an improvement of 3.55% in mIoU compared to the best-performing
+baseline methods.
+
+
+
+
+
+
+
+ ☆ Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal
+ Latent Alignment
+
+
+
+
+
+
+
+
+ Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh
+
+
+ How does audio describe the world around us? In this work, we propose a
+method for generating images of visual scenes from diverse in-the-wild sounds.
+This cross-modal generation task is challenging due to the significant
+information gap between auditory and visual signals. We address this challenge
+by designing a model that aligns audio-visual modalities by enriching audio
+features with visual information and translating them into the visual latent
+space. These features are then fed into the pre-trained image generator to
+produce images. To enhance image quality, we use sound source localization to
+select audio-visual pairs with strong cross-modal correlations. Our method
+achieves substantially better results on the VEGAS and VGGSound datasets
+compared to previous work and demonstrates control over the generation process
+through simple manipulations to the input waveform or latent space.
+Furthermore, we analyze the geometric properties of the learned embedding space
+and demonstrate that our learning approach effectively aligns audio-visual
+signals for cross-modal generation. Based on this analysis, we show that our
+method is agnostic to specific design choices, demonstrating its generalizability by
+integrating various model architectures and different types of audio-visual
+data.
+
+
+
+ comment: Under-review
+
+
+
+
+
+
+ ☆ Pilot-guided Multimodal Semantic Communication for Audio-Visual Event
+ Localization
+
+
+
+
+
+
+
+
+ Fei Yu, Zhe Xiang, Nan Che, Zhuoran Zhang, Yuandi Li, Junxiao Xue, Zhiguo Wan
+
+
+ Multimodal semantic communication, which integrates various data modalities
+such as text, images, and audio, significantly enhances communication
+efficiency and reliability. It has broad application prospects in fields such
+as artificial intelligence, autonomous driving, and smart homes. However,
+current research primarily relies on analog channels and assumes constant
+channel states (perfect CSI), which is inadequate for addressing dynamic
+physical channels and noise in real-world scenarios. Existing methods often
+focus on single modality tasks and fail to handle multimodal stream data, such
+as video and audio, and their corresponding tasks. Furthermore, current
+semantic encoding and decoding modules mainly transmit single modality
+features, neglecting the need for multimodal semantic enhancement and
+recognition tasks.
+ To address these challenges, this paper proposes a pilot-guided framework for
+multimodal semantic communication specifically tailored for audio-visual event
+localization tasks. This framework utilizes digital pilot codes and channel
+modules to guide the state of analog channels in real-world scenarios and
+designs Euler-based multimodal semantic encoding and decoding that consider
+time-frequency characteristics based on dynamic channel state. This approach
+effectively handles multimodal stream source data, especially for audio-visual
+event localization tasks. Extensive numerical experiments demonstrate the
+robustness of the proposed framework in channel changes and its support for
+various communication scenarios. The experimental results show that the
+framework outperforms existing benchmark methods in terms of Signal-to-Noise
+Ratio (SNR), highlighting its advantage in semantic communication quality.
+
+
+
+
+
+
+
+ ♻ ☆ M$^{2}$UGen: Multi-modal Music Understanding and Generation with the
+ Power of Large Language Models
+
+
+ The current landscape of research leveraging large language models (LLMs) is
+experiencing a surge. Many works harness the powerful reasoning capabilities of
+these models to comprehend various modalities, such as text, speech, images,
+videos, etc. They also utilize LLMs to understand human intention and generate
+desired outputs like images, videos, and music. However, research that combines
+both understanding and generation using LLMs is still limited and in its
+nascent stage. To address this gap, we introduce a Multi-modal Music
+Understanding and Generation (M$^{2}$UGen) framework that integrates LLM's
+abilities to comprehend and generate music for different modalities. The
+M$^{2}$UGen framework is purpose-built to unlock creative potential from
+diverse sources of inspiration, encompassing music, image, and video through
+the use of pretrained MERT, ViT, and ViViT models, respectively. To enable
+music generation, we explore the use of AudioLDM 2 and MusicGen. Bridging
+multi-modal understanding and music generation is accomplished through the
+integration of the LLaMA 2 model. Furthermore, we make use of the MU-LLaMA
+model to generate extensive datasets that support text/image/video-to-music
+generation, facilitating the training of our M$^{2}$UGen framework. We conduct
+a thorough evaluation of our proposed framework. The experimental results
+demonstrate that our model achieves or surpasses the performance of the current
+state-of-the-art models.
+
+
+
+
+
+
+
+ ♻ ☆ StableMoFusion: Towards Robust and Efficient Diffusion-based Motion
+ Generation Framework
+
+
+ Thanks to the powerful generative capacity of diffusion models, recent years
+have witnessed rapid progress in human motion generation. Existing
+diffusion-based methods employ disparate network architectures and training
+strategies. The effect of the design of each component is still unclear. In
+addition, the iterative denoising process consumes considerable computational
+overhead, which is prohibitive for real-time scenarios such as virtual
+characters and humanoid robots. For this reason, we first conduct a
+comprehensive investigation into network architectures, training strategies,
+and inference processes. Based on this in-depth analysis, we tailor each
+component for efficient, high-quality human motion generation. Despite the
+promising performance, the tailored model still suffers from foot skating, which
+is a ubiquitous issue in diffusion-based solutions. To eliminate foot skating, we
+identify foot-ground contact and correct foot motions along the denoising
+process. By organically combining these well-designed components together, we
+present StableMoFusion, a robust and efficient framework for human motion
+generation. Extensive experimental results show that our StableMoFusion
+performs favorably against current state-of-the-art methods. Project page:
+https://h-y1heng.github.io/StableMoFusion-page/
+
+
+
+
+
+
+
+ ♻ ☆ Towards Emotion Analysis in Short-form Videos: A Large-Scale Dataset and
+ Baseline
+
+
+
+
+
+
+
+
+ Xuecheng Wu, Heli Sun, Junxiao Xue, Jiayu Nie, Xiangyan Kong, Ruofan Zhai, Liang He
+
+
+ Nowadays, short-form videos (SVs) are essential to web information
+acquisition and sharing in our daily life. The prevailing use of SVs to spread
+emotions leads to the necessity of conducting video emotion analysis (VEA)
+towards SVs. Considering the lack of SVs emotion data, we introduce a
+large-scale dataset named eMotions, comprising 27,996 videos. Meanwhile, we
+alleviate the impact of subjectivities on labeling quality by emphasizing
+better personnel allocations and multi-stage annotations. In addition, we
+provide the category-balanced and test-oriented variants through targeted data
+sampling. Some commonly studied video types, such as facial-expression videos,
+are well understood, but analyzing emotions in SVs remains challenging: their
+broader content diversity brings larger semantic gaps and greater difficulty in
+learning emotion-related features, and local biases and collective information
+gaps arise from emotion inconsistency in the prevalent audio-visual
+co-expressions. To tackle these challenges,
+we present an end-to-end audio-visual baseline AV-CANet which employs the video
+transformer to better learn semantically relevant representations. We further
+design the Local-Global Fusion Module to progressively capture the correlations
+of audio-visual features. The EP-CE Loss is then introduced to guide model
+optimization. Extensive experimental results on seven datasets demonstrate the
+effectiveness of AV-CANet, while providing broad insights for future works.
+Besides, we investigate the key components of AV-CANet by ablation studies.
+Datasets and code will be fully open soon.
+
+
+ Recent advances have extended the context window of frontier LLMs
+dramatically, from a few thousand tokens up to millions, enabling entire books
+and codebases to fit into context. However, the compute costs of inferencing
+long-context LLMs are massive and often prohibitive in practice. RAG offers an
+efficient and effective alternative: retrieve and process only the subset of
+the context most important for the current task. Although promising, recent
+work applying RAG to long-context tasks has two core limitations: 1) there has
+been little focus on making the RAG pipeline compute efficient, and 2) such
+works only test on simple QA tasks, and their performance on more challenging
+tasks is unclear. To address this, we develop an algorithm based on PageRank, a
+graph-based retrieval algorithm, which we call mixture-of-PageRanks (MixPR).
+MixPR uses a mixture of PageRank-based graph-retrieval algorithms implemented
+using sparse matrices for efficient, cheap retrieval that can deal with a
+variety of complex tasks. Our MixPR retriever achieves state-of-the-art results
+across a wide range of long-context benchmark tasks, outperforming existing RAG
+methods, specialized retrieval architectures, and long-context LLMs despite
+being far more compute efficient. Thanks to its sparse embeddings, our
+retriever is extremely compute efficient, capable of embedding and retrieving
+millions of tokens within a few seconds, and runs entirely on CPU.
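+
+ To ground the description above, the sketch below runs personalized PageRank
+over a sparse k-NN similarity graph of context chunks and returns the
+top-scoring chunks for a query. It shows only the general mechanism, not
+MixPR's mixture of PageRank variants or its embedding choices; all data and
+parameters are toy.
+
+    import numpy as np
+    import scipy.sparse as sp
+
+    def pagerank_retrieve(chunk_emb, query_emb, k_neighbors=5, top_k=3,
+                          damping=0.85, iters=50):
+        n = chunk_emb.shape[0]
+        emb = chunk_emb / (np.linalg.norm(chunk_emb, axis=1, keepdims=True) + 1e-8)
+        sims = emb @ emb.T
+        np.fill_diagonal(sims, -np.inf)
+        rows, cols, vals = [], [], []
+        for i in range(n):                         # keep k nearest neighbours per chunk
+            nbrs = np.argpartition(-sims[i], k_neighbors)[:k_neighbors]
+            rows += [i] * k_neighbors
+            cols += nbrs.tolist()
+            vals += np.clip(sims[i, nbrs], 0, None).tolist()
+        adj = sp.csr_matrix((vals, (rows, cols)), shape=(n, n))
+        out_deg = np.asarray(adj.sum(axis=1)).ravel() + 1e-8
+        trans = sp.diags(1.0 / out_deg) @ adj      # row-stochastic transition matrix
+        q = emb @ (query_emb / (np.linalg.norm(query_emb) + 1e-8))
+        p = np.clip(q, 0, None); p /= p.sum() + 1e-8   # personalization from query similarity
+        scores = p.copy()
+        for _ in range(iters):                     # power iteration
+            scores = (1 - damping) * p + damping * (trans.T @ scores)
+        return np.argsort(-scores)[:top_k]
+
+    rng = np.random.default_rng(0)
+    chunks = rng.normal(size=(50, 64))             # hypothetical chunk embeddings
+    print(pagerank_retrieve(chunks, query_emb=chunks[7] + 0.1 * rng.normal(size=64)))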
+
+
+
+
+
+
+
+ ☆ Fuzzy Norm-Explicit Product Quantization for Recommender Systems
+
+
+
+
+
+
+
+
+ Mohammadreza Jamalifard, Javier Andreu-Perez, Hani Hagras, Luis Martínez López
+
+
+ As data resources grow, providing recommendations that best meet users'
+demands has become a vital requirement in business and daily life to overcome the
+information overload problem. However, building a system suggesting relevant
+recommendations has always been a point of debate. One of the most
+cost-efficient techniques in terms of producing relevant recommendations at a
+low complexity is Product Quantization (PQ). PQ approaches have continued
+developing in recent years. This system's crucial challenge is improving
+product quantization performance in terms of recall measures without
+compromising its complexity. This makes the algorithm suitable for problems
+that require a greater number of potentially relevant items without
+disregarding others, at high speed and low cost to keep up with traffic. This
+is the case for online shops, where recommendations for a specific purpose are
+important, although customers may also be open to exploring other products.
+This research proposes a fuzzy approach to perform norm-based product
+quantization. Type-2 Fuzzy Sets (T2FSs) define the codebook, allowing
+sub-vectors to be associated with more than one element of the codebook, and
+the norm calculation is then resolved by means of integration. Our method
+improves the recall measure, making the algorithm suitable for problems that
+require retrieving as many potentially relevant items as possible without
+disregarding others. The proposed method outperforms PQ approaches such as
+NEQ, PQ, and RQ by up to +6%, +5%, and +8%, achieving recalls of 94%, 69%, and
+59% on the Netflix, Audio, and Cifar60k datasets, respectively. Moreover, its
+computing time and complexity nearly equal those of the most computationally
+efficient existing PQ method in the state of the art.
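+
+ For reference, the crisp product-quantization baseline that the fuzzy method
+above extends looks roughly as follows: split each vector into sub-vectors,
+learn a k-means codebook per subspace, and answer queries with asymmetric
+distance lookup tables. The type-2 fuzzy codebook and norm integration are not
+reproduced; data and sizes are toy.
+
+    import numpy as np
+    from sklearn.cluster import KMeans
+
+    def train_pq(data, n_subspaces=4, codebook_size=16, seed=0):
+        d = data.shape[1] // n_subspaces
+        codebooks, codes = [], []
+        for s in range(n_subspaces):
+            sub = data[:, s * d:(s + 1) * d]
+            km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed).fit(sub)
+            codebooks.append(km.cluster_centers_)
+            codes.append(km.predict(sub))
+        return codebooks, np.stack(codes, axis=1)   # (n_items, n_subspaces)
+
+    def pq_search(query, codebooks, codes, top_k=5):
+        d = len(query) // len(codebooks)
+        # Squared distance from each query sub-vector to every codeword (lookup tables).
+        tables = [np.linalg.norm(cb - query[s * d:(s + 1) * d], axis=1) ** 2
+                  for s, cb in enumerate(codebooks)]
+        dists = sum(tables[s][codes[:, s]] for s in range(len(codebooks)))
+        return np.argsort(dists)[:top_k]
+
+    rng = np.random.default_rng(0)
+    items = rng.normal(size=(1000, 32)).astype(np.float32)
+    codebooks, codes = train_pq(items)
+    print(pq_search(items[42], codebooks, codes))    # item 42 should rank at or near the top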
+
+
+
+
+
+
+
+ ☆ 1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval
+ (LeSeR) for Regulatory Question Answering COLING 2025
+
+
+ This paper presents the system description of our entry for the COLING 2025
+RegNLP RIRAG (Regulatory Information Retrieval and Answer Generation)
+challenge, focusing on leveraging advanced information retrieval and answer
+generation techniques in regulatory domains. We experimented with a combination
+of embedding models, including Stella, BGE, CDE, and Mpnet, and leveraged
+fine-tuning and reranking for retrieving relevant documents in top ranks. We
+utilized a novel approach, LeSeR, which achieved competitive results with a
+recall@10 of 0.8201 and map@10 of 0.6655 for retrievals. This work highlights
+the transformative potential of natural language processing techniques in
+regulatory applications, offering insights into their capabilities for
+implementing a retrieval augmented generation system while identifying areas
+for future improvement in robustness and domain adaptation.
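+
+ The two-step recipe above (semantic retrieval followed by lexical reranking)
+can be sketched as follows. The sentence-transformers model and the toy corpus
+are stand-ins for the embedding models and regulatory documents named above,
+and the exact LeSeR scoring is not reproduced.
+
+    import numpy as np
+    from rank_bm25 import BM25Okapi
+    from sentence_transformers import SentenceTransformer
+
+    passages = ["Licensed firms must report suspicious transactions promptly.",
+                "Client assets shall be segregated from the firm's own assets.",
+                "Annual audits are required for all regulated entities."]
+    query = "When must a firm report a suspicious transaction?"
+
+    # Step 1: dense semantic retrieval over the whole corpus.
+    encoder = SentenceTransformer("all-MiniLM-L6-v2")
+    p_emb = encoder.encode(passages, normalize_embeddings=True)
+    q_emb = encoder.encode([query], normalize_embeddings=True)[0]
+    candidate_ids = np.argsort(-(p_emb @ q_emb))[:2]     # top candidates by cosine
+
+    # Step 2: lexical (BM25) reranking of the dense candidates.
+    bm25 = BM25Okapi([passages[i].lower().split() for i in candidate_ids])
+    lexical = bm25.get_scores(query.lower().split())
+    reranked = [candidate_ids[j] for j in np.argsort(-lexical)]
+    print([passages[i] for i in reranked])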
+
+
+
+
+
+
+
+ ☆ Accelerating Manufacturing Scale-Up from Material Discovery Using
+ Agentic Web Navigation and Retrieval-Augmented AI for Process Engineering
+ Schematics Design
+
+
+ Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (PIDs)
+are critical tools for industrial process design, control, and safety. However,
+the generation of precise and regulation-compliant diagrams remains a
+significant challenge, particularly in scaling breakthroughs from material
+discovery to industrial production in an era of automation and digitalization.
+This paper introduces an autonomous agentic framework to address these
+challenges through a two-stage approach involving knowledge acquisition and
+generation. The framework integrates specialized sub-agents for retrieving and
+synthesizing multimodal data from publicly available online sources and
+constructs ontological knowledge graphs using a Graph Retrieval-Augmented
+Generation (Graph RAG) paradigm. These capabilities enable the automation of
+diagram generation and open-domain question answering (ODQA) tasks with high
+contextual accuracy. Extensive empirical experiments demonstrate the framework's
+ability to deliver regulation-compliant diagrams with minimal expert
+intervention, highlighting its practical utility for industrial applications.
+
+
+ Developing increasingly efficient and accurate algorithms for approximate
+nearest neighbor search is a paramount goal in modern information retrieval. A
+primary approach to addressing this question is clustering, which involves
+partitioning the dataset into distinct groups, with each group characterized by
+a representative data point. By this method, retrieving the top-k data points
+for a query requires identifying the most relevant clusters based on their
+representatives -- a routing step -- and then conducting a nearest neighbor
+search within these clusters only, drastically reducing the search space.
+ The objective of this thesis is not only to provide a comprehensive
+explanation of clustering-based approximate nearest neighbor search but also to
+introduce and delve into every aspect of our novel state-of-the-art method,
+which originated from a natural observation: The routing function solves a
+ranking problem, making the function amenable to learning-to-rank. The
+development of this intuition and applying it to maximum inner product search
+has led us to demonstrate that learning cluster representatives using a simple
+linear function significantly boosts the accuracy of clustering-based
+approximate nearest neighbor search.
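+
+ The routing-then-search procedure described above can be sketched as follows,
+with plain k-means centroids standing in for the thesis's learned
+(learning-to-rank) cluster representatives; data and parameters are toy.
+
+    import numpy as np
+    from sklearn.cluster import KMeans
+
+    rng = np.random.default_rng(0)
+    data = rng.normal(size=(5000, 64)).astype(np.float32)
+
+    n_clusters, n_probe = 50, 5
+    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(data)
+    assignments = km.labels_
+
+    def ann_search(query, top_k=10):
+        # Routing step: rank clusters by inner product with their representatives.
+        routed = np.argsort(-(km.cluster_centers_ @ query))[:n_probe]
+        candidate_ids = np.where(np.isin(assignments, routed))[0]
+        # Exact search restricted to the routed clusters only.
+        scores = data[candidate_ids] @ query
+        return candidate_ids[np.argsort(-scores)[:top_k]]
+
+    query = data[123] + 0.05 * rng.normal(size=64).astype(np.float32)
+    print(ann_search(query))         # the true neighbour (id 123) should appear near the top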
+
+
+
+
+
+
+
+ ☆ Automated Extraction and Creation of FBS Design Reasoning Knowledge
+ Graphs from Structured Data in Product Catalogues Lacking Contextual
+ Information
+
+
+ Ontology-based knowledge graphs (KG) are desirable for effective knowledge
+management and reuse in various decision making scenarios, including design.
+Creating and populating extensive KG based on specific ontological models can
+be highly labour and time-intensive unless automated processes are developed
+for knowledge extraction and graph creation. Most research and development on
+automated extraction and creation of KG is based on extensive unstructured data
+sets that provide contextual information. However, some of the most useful
+information about the products and services of a company has traditionally been
+recorded as structured data. Such structured data sets rarely follow a standard
+ontology, do not capture explicit mapping of relationships between the
+entities, and provide no contextual information. Therefore, this research
+reports a method and digital workflow developed to address this gap. The
+developed method and workflow employ rule-based techniques to extract and
+create a Function-Behaviour-Structure (FBS) ontology-based KG from legacy
+structured data, especially specification sheets and product catalogues. The
+solution approach consists of two main components: a process for deriving
+context and context-based classification rules for FBS ontology concepts and a
+workflow for populating and retrieving the FBS ontology-based KG. KG and
+Natural Language Processing (NLP) are used to automate knowledge extraction,
+representation, and retrieval. The workflow's effectiveness is demonstrated via
+pilot implementation in an industrial context. Insights gained from the pilot
+study are reported regarding the challenges and opportunities, including
+discussing the FBS ontology and concepts.
+
+
+
+ comment: 31 pages, with 17 figures and 10 tables
+
+
+
+
+
+
+ ♻ ☆ Language Model Powered Digital Biology with BRAD
+
+
+
+
+
+
+
+
+ Joshua Pickard, Ram Prakash, Marc Andrew Choi, Natalie Oliven, Cooper Stansbury, Jillian Cwycyshyn, Alex Gorodetsky, Alvaro Velasquez, Indika Rajapakse
+
+
+ Recent advancements in Large Language Models (LLMs) are transforming biology,
+computer science, engineering, and everyday life. However, integrating the
+wide array of computational tools, databases, and scientific literature
+continues to pose a challenge to biological research. LLMs are well-suited for
+unstructured integration, efficient information retrieval, and automating
+standard workflows and actions from these diverse resources. To harness these
+capabilities in bioinformatics, we present a prototype Bioinformatics Retrieval
+Augmented Digital assistant (BRAD). BRAD is a chatbot and agentic system that
+integrates a variety of bioinformatics tools. The Python package implements an
+AI \texttt{Agent} that is powered by LLMs and connects to a local file system,
+online databases, and a user's software. The \texttt{Agent} is highly
+configurable, enabling tasks such as Retrieval-Augmented Generation, searches
+across bioinformatics databases, and the execution of software pipelines.
+BRAD's coordinated integration of bioinformatics tools delivers a context-aware
+and semi-autonomous system that extends beyond the capabilities of conventional
+LLM-based chatbots. A graphical user interface (GUI) provides an intuitive
+interface to the system.
+
+
+
+
+
+
+
+ ♻ ☆ CPRM: A LLM-based Continual Pre-training Framework for Relevance
+ Modeling in Commercial Search
+
+
+
+
+
+
+
+
+ Kaixin Wu, Yixin Ji, Zeyuan Chen, Qiang Wang, Cunxiang Wang, Hong Liu, Baijun Ji, Jia Xu, Zhongyi Liu, Jinjie Gu, Yuan Zhou, Linjian Mo
+
+
+ Relevance modeling between queries and items stands as a pivotal component in
+commercial search engines, directly affecting the user experience. Given the
+remarkable achievements of large language models (LLMs) in various natural
+language processing (NLP) tasks, LLM-based relevance modeling is gradually
+being adopted within industrial search systems. Nevertheless, foundational LLMs
+lack domain-specific knowledge and do not fully exploit the potential of
+in-context learning. Furthermore, structured item text remains underutilized,
+and there is a shortage in the supply of corresponding queries and background
+knowledge. We thereby propose CPRM (Continual Pre-training for Relevance
+Modeling), a framework designed for the continual pre-training of LLMs to
+address these issues. Our CPRM framework includes three modules: 1) employing
+both queries and multi-field items to jointly pre-train for enhancing domain
+knowledge, 2) applying in-context pre-training, a novel approach where LLMs are
+pre-trained on a sequence of related queries or items, and 3) conducting
+reading comprehension on items to produce associated domain knowledge and
+background information (e.g., generating summaries and corresponding queries)
+to further strengthen LLMs. Results on offline experiments and online A/B
+testing demonstrate that our model achieves convincing performance compared to
+strong baselines.
+
+
+
+
+
+
+
+ ♻ ☆ FairSort: Learning to Fair Rank for Personalized Recommendations in
+ Two-Sided Platforms
+
+
+ Traditional recommendation systems focus on maximizing user satisfaction by
+suggesting their favourite items. This user-centric approach may lead to unfair
+exposure distribution among the providers. On the contrary, a provider-centric
+design might become unfair to the users. Therefore, this paper proposes a
+re-ranking model FairSort to find a trade-off solution among user-side
+fairness, provider-side fairness, and personalized recommendations utility.
+Previous works habitually treat this issue as a knapsack problem, incorporating
+both-side fairness as constraints.
+ In this paper, we adopt a novel perspective, treating each recommendation
+list as a runway rather than a knapsack. In this perspective, each item on the
+runway gains a velocity and runs within a specific time, achieving re-ranking
+for both-side fairness. Meanwhile, we ensure the Minimum Utility Guarantee for
+personalized recommendations by designing a Binary Search approach. This can
+provide more reliable recommendations compared to the conventional greedy
+strategy based on the knapsack problem. We further broaden the applicability of
+FairSort, designing two versions for online and offline recommendation
+scenarios. Theoretical analysis and extensive experiments on real-world
+datasets indicate that FairSort can ensure more reliable personalized
+recommendations while considering fairness for both the provider and user.
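+
+ A small sketch of how a Binary Search can enforce a Minimum Utility Guarantee:
+it finds the strongest fairness re-ranking whose utility still clears a floor.
+The monotone utility model and the fairness knob below are stand-in
+assumptions, not FairSort's actual objective:
+
+def recommendation_utility(fairness_weight: float) -> float:
+    # Stand-in model: utility decreases monotonically as fairness re-ranking gets stronger.
+    return 1.0 - 0.6 * fairness_weight
+
+def max_fairness_weight(min_utility: float, tol: float = 1e-6) -> float:
+    """Largest fairness weight whose re-ranked list still meets the utility floor."""
+    lo, hi = 0.0, 1.0
+    while hi - lo > tol:
+        mid = (lo + hi) / 2.0
+        if recommendation_utility(mid) >= min_utility:
+            lo = mid   # constraint satisfied: push fairness further
+        else:
+            hi = mid   # utility floor violated: back off
+    return lo
+
+print(round(max_fairness_weight(min_utility=0.7), 4))  # ~0.5 under the toy model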
+
+
+
+
+
+
+
+
+
+
+ Multimedia 6
+
+
+
+
+
+ ☆ M6: Multi-generator, Multi-domain, Multi-lingual and cultural,
+ Multi-genres, Multi-instrument Machine-Generated Music Detection Databases
+
+
+
+
+
+
+
+
+ Yupei Li, Hanqian Li, Lucia Specia, Björn W. Schuller
+
+
+ Machine-generated music (MGM) has emerged as a powerful tool with
+applications in music therapy, personalised editing, and creative inspiration
+for the music community. However, its unregulated use threatens the
+entertainment, education, and arts sectors by diminishing the value of
+high-quality human compositions. Machine-generated music detection (MGMD) is,
+therefore, critical to safeguarding these domains, yet the field lacks
+comprehensive datasets to support meaningful progress. To address this gap, we
+introduce \textbf{M6}, a large-scale benchmark dataset tailored for MGMD
+research. M6 is distinguished by its diversity, encompassing multiple
+generators, domains, languages, cultural contexts, genres, and instruments. We
+outline our methodology for data selection and collection, accompanied by
+detailed data analysis, and provide all music in WAV format. Additionally, we
+provide baseline performance scores using foundational binary classification
+models, illustrating the complexity of MGMD and the significant room for
+improvement. By offering a robust and multifaceted resource, we aim to empower
+future research to develop more effective detection methods for MGM. We believe
+M6 will serve as a critical step toward addressing this societal challenge. The
+dataset and code will be freely available to support open collaboration and
+innovation in this field.
+
+
+ Content creators often use music to enhance their videos, from soundtracks in
+movies to background music in video blogs and social media content. However,
+identifying the best music for a video can be a difficult and time-consuming
+task. To address this challenge, we propose a novel framework for automatically
+retrieving a matching music clip for a given video, and vice versa. Our
+approach leverages annotated music labels, as well as the inherent artistic
+correspondence between visual and music elements. Distinct from previous
+cross-modal music retrieval works, our method combines both self-supervised and
+supervised training objectives. We use self-supervised and label-supervised
+contrastive learning to train a joint embedding space between music and video.
+We show the effectiveness of our approach by using music genre labels for the
+supervised training component, and our framework can be generalized to other
+music annotations (e.g., emotion, instrument, etc.). Furthermore, our method
+enables fine-grained control over how much the retrieval process focuses on
+self-supervised vs. label information at inference time. We evaluate the
+learned embeddings through a variety of video-to-music and music-to-video
+retrieval tasks. Our experiments show that the proposed approach successfully
+combines self-supervised and supervised objectives and is effective for
+controllable music-video retrieval.
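+
+ A possible PyTorch sketch of combining a self-supervised pairing loss with a
+label-supervised contrastive loss in one joint music-video embedding space; the
+temperature, loss weighting, and embedding sizes are assumptions:
+
+import torch
+import torch.nn.functional as F
+
+def info_nce(video_emb, music_emb, temperature=0.07):
+    """Self-supervised: the i-th video and i-th music clip form the positive pair."""
+    v = F.normalize(video_emb, dim=-1)
+    m = F.normalize(music_emb, dim=-1)
+    logits = v @ m.t() / temperature
+    targets = torch.arange(v.size(0), device=v.device)
+    return F.cross_entropy(logits, targets)
+
+def label_supcon(video_emb, music_emb, genres, temperature=0.07):
+    """Label-supervised: any cross-modal pair sharing a genre label counts as positive."""
+    v = F.normalize(video_emb, dim=-1)
+    m = F.normalize(music_emb, dim=-1)
+    logits = v @ m.t() / temperature
+    pos = (genres.unsqueeze(1) == genres.unsqueeze(0)).float()
+    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
+    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()
+
+# Toy batch; alpha balances the two objectives and is a tunable assumption.
+video, music = torch.randn(8, 128), torch.randn(8, 128)
+genres = torch.randint(0, 3, (8,))
+alpha = 0.5
+loss = alpha * info_nce(video, music) + (1 - alpha) * label_supcon(video, music, genres)
+print(loss.item())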
+
+
+ Large Multimodal Models (LMMs) have demonstrated impressive capabilities in
+multimodal understanding and generation, pushing forward advancements in
+text-to-image generation. However, achieving accurate text-image alignment for
+LMMs, particularly in compositional scenarios, remains challenging. Existing
+approaches, such as layout planning for multi-step generation and learning from
+human feedback or AI feedback, depend heavily on prompt engineering, costly
+human annotations, and continual upgrading, limiting flexibility and
+scalability. In this work, we introduce a model-agnostic iterative
+self-improvement framework (SILMM) that can enable LMMs to provide helpful and
+scalable self-feedback and optimize text-image alignment via Direct Preference
+Optimization (DPO). DPO can be readily applied to LMMs that use discrete visual
+tokens as intermediate image representations; while it is less suitable for
+LMMs with continuous visual features, as obtaining generation probabilities is
+challenging. To adapt SILMM to LMMs with continuous features, we propose a
+diversity mechanism to obtain diverse representations and a kernel-based
+continuous DPO for alignment. Extensive experiments on three compositional
+text-to-image generation benchmarks validate the effectiveness and superiority
+of SILMM, showing improvements exceeding 30% on T2I-CompBench++ and around 20%
+on DPG-Bench.
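+
+ A brief sketch of a standard DPO loss over summed sequence log-probabilities,
+the objective this line of work builds on for discrete visual tokens; the beta
+value and the toy numbers are assumptions, and the kernel-based continuous
+variant is not shown:
+
+import torch
+import torch.nn.functional as F
+
+def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
+    """
+    Each argument is the summed log-probability of a full token sequence:
+    *_w for the preferred (better-aligned) image, *_l for the rejected one.
+    """
+    policy_margin = policy_logps_w - policy_logps_l
+    ref_margin = ref_logps_w - ref_logps_l
+    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
+
+# Toy usage with made-up sequence log-probabilities for a batch of 4 preference pairs.
+pw, pl = torch.tensor([-10.0, -12.0, -9.5, -11.0]), torch.tensor([-13.0, -12.5, -14.0, -11.2])
+rw, rl = torch.tensor([-11.0, -12.2, -10.0, -11.1]), torch.tensor([-12.0, -12.4, -13.0, -11.0])
+print(dpo_loss(pw, pl, rw, rl).item())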
+
+
+ Effective compression technology is crucial for 3DGS to adapt to varying
+storage and transmission conditions. However, existing methods fail to address
+size constraints while maintaining optimal quality. In this paper, we introduce
+SizeGS, a framework that compresses 3DGS within a specified size budget while
+optimizing visual quality. We start with a size estimator to establish a clear
+relationship between file size and hyperparameters. Leveraging this estimator,
+we incorporate mixed precision quantization (MPQ) into 3DGS attributes,
+structuring MPQ in two hierarchical levels -- inter-attribute and
+intra-attribute -- to optimize visual quality under the size constraint. At the
+inter-attribute level, we assign bit-widths to each attribute channel by
+formulating the combinatorial optimization as a 0-1 integer linear program,
+which can be efficiently solved. At the intra-attribute level, we divide each
+attribute channel into blocks of vectors, quantizing each vector based on the
+optimal bit-width derived at the inter-attribute level. Dynamic programming
+determines block lengths. Using the size estimator and MPQ, we develop a
+calibrated algorithm to identify optimal hyperparameters in just 10 minutes,
+achieving a 1.69$\times$ efficiency increase with quality comparable to
+state-of-the-art methods.
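+
+ An illustrative sketch of the inter-attribute step: pick one bit-width per
+attribute so the estimated file size stays under the budget while a quality
+proxy is maximized. Exhaustive search replaces the paper's 0-1 integer linear
+program here, and all numbers are made up:
+
+from itertools import product
+
+attributes = {            # attribute -> number of values to quantize (illustrative)
+    "position": 1_000_000,
+    "color_sh": 3_000_000,
+    "opacity": 1_000_000,
+}
+bit_options = [4, 8, 16]
+quality_gain = {4: 0.70, 8: 0.90, 16: 1.00}   # assumed quality proxy per bit-width
+
+def size_mb(assignment):
+    return sum(attributes[a] * b for a, b in assignment.items()) / 8 / 1e6
+
+def best_assignment(budget_mb):
+    best, best_quality = None, -1.0
+    for combo in product(bit_options, repeat=len(attributes)):
+        assignment = dict(zip(attributes, combo))
+        if size_mb(assignment) > budget_mb:
+            continue
+        quality = sum(quality_gain[b] for b in assignment.values())
+        if quality > best_quality:
+            best, best_quality = assignment, quality
+    return best
+
+print(best_assignment(budget_mb=6.0))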
+
+
+
+ comment: Automatically compressing 3DGS into the desired file size while
+ maximizing the visual quality
+
+
+
+
+
+
+ ♻ ☆ Emotion-Aligned Contrastive Learning Between Images and Music ICASSP 2024
+
+
+ Traditional music search engines rely on retrieval methods that match natural
+language queries with music metadata. There have been increasing efforts to
+expand retrieval methods to consider the audio characteristics of music itself,
+using queries of various modalities including text, video, and speech. While
+most approaches aim to match general music semantics to the input queries, only
+a few focus on affective qualities. In this work, we address the task of
+retrieving emotionally-relevant music from image queries by learning an
+affective alignment between images and music audio. Our approach focuses on
+learning an emotion-aligned joint embedding space between images and music.
+This embedding space is learned via emotion-supervised contrastive learning,
+using an adapted cross-modal version of the SupCon loss. We evaluate the joint
+embeddings through cross-modal retrieval tasks (image-to-music and
+music-to-image) based on emotion labels. Furthermore, we investigate the
+generalizability of the learned music embeddings via automatic music tagging.
+Our experiments show that the proposed approach successfully aligns images and
+music, and that the learned embedding space is effective for cross-modal
+retrieval applications.
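+
+ A small sketch of the kind of emotion-based cross-modal retrieval evaluation
+described above: rank music by cosine similarity to an image embedding and
+check how often the top results share the image's emotion label. Embeddings
+and labels below are random stand-ins, not the learned Emo-CLIM space:
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+image_emb = rng.normal(size=(16, 64))          # image queries (stand-in embeddings)
+music_emb = rng.normal(size=(200, 64))         # music collection (stand-in embeddings)
+image_labels = rng.integers(0, 4, size=16)     # emotion labels (4 classes assumed)
+music_labels = rng.integers(0, 4, size=200)
+
+def precision_at_k(query, query_label, items, item_labels, k=10):
+    # Cosine-similarity ranking, then fraction of top-k items sharing the emotion label.
+    q = query / np.linalg.norm(query)
+    m = items / np.linalg.norm(items, axis=1, keepdims=True)
+    top = np.argsort(-(m @ q))[:k]
+    return float(np.mean(item_labels[top] == query_label))
+
+scores = [precision_at_k(image_emb[i], image_labels[i], music_emb, music_labels)
+          for i in range(len(image_emb))]
+print(f"image-to-music Precision@10: {np.mean(scores):.3f}")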
+
+
+
+ comment: Published at ICASSP 2024. Code:
+ https://github.com/shantistewart/Emo-CLIM
+
+
+
+
+
+
+ ♻ ☆ AI-Driven Virtual Teacher for Enhanced Educational Efficiency:
+ Leveraging Large Pretrain Models for Autonomous Error Analysis and Correction AAAI
+
+
+ Students frequently make mistakes while solving mathematical problems, and
+traditional error correction methods are both time-consuming and
+labor-intensive. This paper introduces an innovative \textbf{V}irtual
+\textbf{A}I \textbf{T}eacher system designed to autonomously analyze and
+correct student \textbf{E}rrors (VATE). Leveraging advanced large language
+models (LLMs), the system uses student drafts as a primary source for error
+analysis, which enhances understanding of the student's learning process. It
+incorporates sophisticated prompt engineering and maintains an error pool to
+reduce computational overhead. The AI-driven system also features a real-time
+dialogue component for efficient student interaction. Our approach demonstrates
+significant advantages over traditional and machine learning-based error
+correction methods, including reduced educational costs, high scalability, and
+superior generalizability. The system has been deployed on the Squirrel AI
+learning platform for elementary mathematics education, where it achieves
+78.3\% accuracy in error analysis and shows a marked improvement in student
+learning efficiency. Satisfaction surveys indicate a strong positive reception,
+highlighting the system's potential to transform educational practices.
+
+
+
+ comment: AAAI/IAAI 2025 Innovative Application Award
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Information Retrieval 6
+
+
+
+
+
+ ☆ PromptRefine: Enhancing Few-Shot Performance on Low-Resource Indic
+ Languages with Example Selection from Related Example Banks
+
+
+ Large Language Models (LLMs) have recently demonstrated impressive few-shot
+learning capabilities through in-context learning (ICL). However, ICL
+performance is highly dependent on the choice of few-shot demonstrations,
+making the selection of the most optimal examples a persistent research
+challenge. This issue is further amplified in low-resource Indic languages,
+where the scarcity of ground-truth data complicates the selection process. In
+this work, we propose PromptRefine, a novel Alternating Minimization approach
+for example selection that improves ICL performance on low-resource Indic
+languages. PromptRefine leverages auxiliary example banks from related
+high-resource Indic languages and employs multi-task learning techniques to
+align language-specific retrievers, enabling effective cross-language
+retrieval. Additionally, we incorporate diversity in the selected examples to
+enhance generalization and reduce bias. Through comprehensive evaluations on
+four text generation tasks -- Cross-Lingual Question Answering, Multilingual
+Question Answering, Machine Translation, and Cross-Lingual Summarization using
+state-of-the-art LLMs such as LLAMA-3.1-8B, LLAMA-2-7B, Qwen-2-7B, and
+Qwen-2.5-7B, we demonstrate that PromptRefine significantly outperforms
+existing frameworks for retrieving examples.
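+
+ A sketch of diversity-aware example selection in the spirit described above,
+using a simple maximal-marginal-relevance rule over a pooled example bank; the
+scoring rule and trade-off weight are assumptions, not PromptRefine's
+alternating-minimization procedure:
+
+import numpy as np
+
+def select_examples(query_vec, bank_vecs, k=4, lam=0.7):
+    """Greedy MMR: balance relevance to the query against similarity to chosen examples."""
+    def cos(a, b):
+        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
+    chosen = []
+    candidates = list(range(len(bank_vecs)))
+    while candidates and len(chosen) < k:
+        def mmr(i):
+            relevance = cos(query_vec, bank_vecs[i])
+            redundancy = max((cos(bank_vecs[i], bank_vecs[j]) for j in chosen), default=0.0)
+            return lam * relevance - (1 - lam) * redundancy
+        best = max(candidates, key=mmr)
+        chosen.append(best)
+        candidates.remove(best)
+    return chosen
+
+rng = np.random.default_rng(1)
+bank = rng.normal(size=(50, 32))   # pooled example bank from related languages (stand-in)
+query = rng.normal(size=32)
+print(select_examples(query, bank))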
+
+
+
+
+
+
+
+ ☆ On the effective transfer of knowledge from English to Hindi Wikipedia COLING
+
+
+ Although Wikipedia is the largest multilingual encyclopedia, it remains
+inherently incomplete. There is a significant disparity in the quality of
+content between high-resource languages (HRLs, e.g., English) and low-resource
+languages (LRLs, e.g., Hindi), with many LRL articles lacking adequate
+information. To bridge these content gaps, we propose a lightweight framework
+to enhance knowledge equity between English and Hindi. In case the English
+Wikipedia page is not up-to-date, our framework extracts relevant information
+from external resources readily available (such as English books) and adapts it
+to align with Wikipedia's distinctive style, including its \textit{neutral
+point of view} (NPOV) policy, using in-context learning capabilities of large
+language models. The adapted content is then machine-translated into Hindi for
+integration into the corresponding Wikipedia articles. On the other hand, if
+the English version is comprehensive and up-to-date, the framework directly
+transfers knowledge from English to Hindi. Our framework effectively generates
+new content for Hindi Wikipedia sections, enhancing Hindi Wikipedia articles
+by 65% and 62% according to automatic and human judgment-based evaluations,
+respectively.
+
+
+
+ comment: accepted at COLING Industry Track 2025
+
+
+
+
+
+
+ ☆ KG-Retriever: Efficient Knowledge Indexing for Retrieval-Augmented Large
+ Language Models
+
+
+
+
+
+
+
+
+ Weijie Chen, Ting Bai, Jinbo Su, Jian Luan, Wei Liu, Chuan Shi
+
+
+ Large language models with retrieval-augmented generation encounter a pivotal
+challenge in intricate retrieval tasks, e.g., multi-hop question answering,
+which requires the model to navigate across multiple documents and generate
+comprehensive responses based on fragmented information. To tackle this
+challenge, we introduce a novel Knowledge Graph-based RAG framework with a
+hierarchical knowledge retriever, termed KG-Retriever. The retrieval indexing
+in KG-Retriever is constructed on a hierarchical index graph that consists of a
+knowledge graph layer and a collaborative document layer. The associative
+nature of graph structures is fully utilized to strengthen intra-document and
+inter-document connectivity, thereby fundamentally alleviating the information
+fragmentation problem and meanwhile improving the retrieval efficiency in
+cross-document retrieval of LLMs. With the coarse-grained collaborative
+information from neighboring documents and concise information from the
+knowledge graph, KG-Retriever achieves marked improvements on five public QA
+datasets, showing the effectiveness and efficiency of our proposed RAG
+framework.
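+
+ A toy sketch of hierarchical retrieval over a knowledge-graph layer linked to
+a document layer: match entities first, expand through the graph, then collect
+the documents attached to the expanded entities. The two-layer index and the
+data are illustrative, not the paper's construction:
+
+# Knowledge-graph layer: entity -> neighbouring entities (toy data).
+kg_layer = {
+    "aspirin": ["cyclooxygenase", "pain"],
+    "cyclooxygenase": ["aspirin", "inflammation"],
+    "pain": ["aspirin"],
+    "inflammation": ["cyclooxygenase"],
+}
+# Collaborative document layer: entity -> documents mentioning it (toy data).
+doc_layer = {
+    "aspirin": ["doc_aspirin_overview"],
+    "cyclooxygenase": ["doc_cox_enzymes"],
+    "inflammation": ["doc_inflammation_review"],
+}
+
+def hierarchical_retrieve(question, hops=1):
+    # 1) Entity matching on the KG layer (string matching stands in for a learned retriever).
+    frontier = {e for e in kg_layer if e in question.lower()}
+    # 2) Expand along graph edges to pull in related entities.
+    for _ in range(hops):
+        frontier |= {n for e in frontier for n in kg_layer.get(e, [])}
+    # 3) Collect documents from the document layer for the expanded entity set.
+    docs = []
+    for entity in sorted(frontier):
+        docs.extend(doc_layer.get(entity, []))
+    return frontier, docs
+
+print(hierarchical_retrieve("How does aspirin reduce inflammation?"))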
+
+
+
+
+
+
+
+ ☆ ULMRec: User-centric Large Language Model for Sequential Recommendation
+
+
+ Recent advances in Large Language Models (LLMs) have demonstrated promising
+performance in sequential recommendation tasks, leveraging their superior
+language understanding capabilities. However, existing LLM-based recommendation
+approaches predominantly focus on modeling item-level co-occurrence patterns
+while failing to adequately capture user-level personalized preferences. This
+is problematic since even users who display similar behavioral patterns (e.g.,
+clicking or purchasing similar items) may have fundamentally different
+underlying interests. To alleviate this problem, in this paper, we propose
+ULMRec, a framework that effectively integrates user personalized preferences
+into LLMs for sequential recommendation. Considering the semantic gap
+between item IDs and LLMs, we replace item IDs with their corresponding titles
+in user historical behaviors, enabling the model to capture the item semantics.
+To integrate user personalized preferences, we design two key components:
+(1) user indexing: a personalized user indexing mechanism that leverages vector
+quantization on user reviews and user IDs to generate meaningful and unique
+user representations, and (2) alignment tuning: an alignment-based tuning stage
+that employs comprehensive preference alignment tasks to enhance the model's
+capability in capturing personalized information. Through this design, ULMRec
+achieves deep integration of language semantics with user personalized
+preferences, facilitating effective adaptation to recommendation. Extensive
+experiments on two public datasets demonstrate that ULMRec significantly
+outperforms existing methods, validating the effectiveness of our approach.
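+
+ A minimal sketch of the user-indexing idea: quantize a user-review embedding
+against a small codebook so each user gets a short sequence of discrete index
+tokens. The residual scheme, codebook sizes, and random codebooks are
+illustrative assumptions:
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+codebooks = [rng.normal(size=(16, 32)) for _ in range(2)]   # two codebooks of 16 codes (assumed)
+
+def user_index(user_review_emb):
+    """Residual vector quantization: each stage encodes what the previous stage missed."""
+    residual = user_review_emb.copy()
+    tokens = []
+    for codebook in codebooks:
+        distances = np.linalg.norm(codebook - residual, axis=1)
+        code = int(np.argmin(distances))
+        tokens.append(code)
+        residual = residual - codebook[code]
+    return tokens
+
+user_emb = rng.normal(size=32)                 # stand-in for an embedding of user reviews
+print("user index tokens:", user_index(user_emb))  # e.g. [7, 3] -> "<u_7><u_3>" in the prompt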
+
+
+
+
+
+
+
+ ♻ ☆ The Impact of User-Level Explanation Properties on Explanation Goals in
+ Recommender Systems
+
+
+
+
+
+
+
+
+ André Levi Zanon, Marcelo Garcia Manzato, Leonardo Rocha
+
+
+ Explanations are crucial for improving users' transparency, persuasiveness,
+engagement, and trust in Recommender Systems (RSs) by connecting interacted
+items to recommended items based on shared attributes. However, evaluating the
+effectiveness of explanation algorithms regarding those goals offline remains
+challenging due to their subjectiveness. This paper investigates the impact of
+user-level explanation properties, such as diversity and popularity of
+attributes, on the user perception of explanation goals. In an offline setting,
+we used metrics adapted from ranking to evaluate the characteristics of
+explanations generated by three state-of-the-art post-hoc explanation
+algorithms, based on the items and properties used to form the explanation
+sentence, across six recommendation systems. We compared the offline metrics
+results with those of an online user study. The findings highlight a trade-off
+between the goals of transparency and trust, which are related to popular
+properties, and the goals of engagement and persuasiveness, which are
+associated with the diversification of properties displayed to users.
+Furthermore, the study contributes to developing more robust evaluation methods
+for explanation algorithms in RSs.
+
+
+
+
+
+
+
+ ♻ ☆ Big data searching using words
+
+
+ Big data analytics is one of the most promising areas of new research and
+development in computer science, enterprises, e-commerce, and defense. For many
+organizations, big data is regarded as one of their most important strategic
+assets. This explosive growth has made it necessary to develop effective
+techniques for examining and analyzing big data from a mathematical
+perspective. Among various methods of analyzing big data, topological data
+analysis (TDA) is now considered one of the useful tools. However, there is no
+fundamental concept related to topological structure in big data. In this
+paper, we introduce some fundamental ideas related to the neighborhood
+structure of words in data searching, which can be extended to form important
+topological structures of big data in the future. Additionally, we introduce
+big data primal in big data searching and discuss the application of
+neighborhood structures in detecting anomalies in data searching using the
+Jaccard similarity coefficient.
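+
+ A short sketch of the Jaccard-based idea: compare the neighborhood
+(co-occurring words) of a search word across two data snapshots and flag a
+possible anomaly when the similarity drops below a threshold. The neighborhood
+definition, window size, and threshold are assumptions:
+
+def neighborhood(word, corpus, window=2):
+    """Words that co-occur with `word` within a +/- window across the corpus."""
+    neighbours = set()
+    for line in corpus:
+        tokens = line.lower().split()
+        for i, tok in enumerate(tokens):
+            if tok == word:
+                neighbours.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
+    return neighbours
+
+def jaccard(a, b):
+    return len(a & b) / len(a | b) if (a or b) else 1.0
+
+snapshot_old = ["the server returned status ok", "server load is normal today"]
+snapshot_new = ["the server returned status error", "server breach detected overnight"]
+
+similarity = jaccard(neighborhood("server", snapshot_old), neighborhood("server", snapshot_new))
+# 0.5 is an arbitrary illustrative threshold for flagging an anomaly.
+print(f"Jaccard similarity: {similarity:.2f}", "-> possible anomaly" if similarity < 0.5 else "")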
+
+
+
+
+
+
+
+
+
+
+ Multimedia 4
+
+
+
+
+
+ ☆ Combining Genre Classification and Harmonic-Percussive Features with
+ Diffusion Models for Music-Video Generation
+
+
+ This study presents a novel method for generating music visualisers using
+diffusion models, combining audio input with user-selected artwork. The process
+involves two main stages: image generation and video creation. First, music
+captioning and genre classification are performed, followed by the retrieval of
+artistic style descriptions. A diffusion model then generates images based on
+the user's input image and the derived artistic style descriptions. The video
+generation stage utilises the same diffusion model to interpolate frames,
+controlled by audio energy vectors derived from key musical features of
+harmonics and percussives. The method demonstrates promising results across
+various genres, and a new metric, Audio-Visual Synchrony (AVS), is introduced
+to quantitatively evaluate the synchronisation between visual and audio
+elements. Comparative analysis shows significantly higher AVS values for videos
+generated using the proposed method with audio energy vectors, compared to
+linear interpolation. This approach has potential applications in diverse
+fields, including independent music video creation, film production, live music
+events, and enhancing audio-visual experiences in public spaces.
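+
+ A plausible sketch of deriving audio energy vectors from harmonic and
+percussive components with librosa, which could then modulate frame
+interpolation; the file path, hop length, and normalization are assumptions:
+
+import librosa
+import numpy as np
+
+# "track.wav" is a placeholder; any audio file readable by librosa would do.
+y, sr = librosa.load("track.wav", sr=22050)
+
+# Split the signal into harmonic and percussive components.
+y_harmonic, y_percussive = librosa.effects.hpss(y)
+
+# Frame-wise RMS energy for each component (shape: [1, n_frames]).
+hop = 512
+harmonic_energy = librosa.feature.rms(y=y_harmonic, hop_length=hop)[0]
+percussive_energy = librosa.feature.rms(y=y_percussive, hop_length=hop)[0]
+
+# Normalize to [0, 1] so the vectors can scale interpolation strength per frame.
+def normalize(v):
+    return (v - v.min()) / (v.max() - v.min() + 1e-9)
+
+energy_vectors = np.stack([normalize(harmonic_energy), normalize(percussive_energy)])
+print(energy_vectors.shape)  # (2, n_frames)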
+
+
+ Speech emotion recognition (SER) remains a challenging yet crucial task due
+to the inherent complexity and diversity of human emotions. To address this
+problem, researchers attempt to fuse information from other modalities via
+multimodal learning. However, existing multimodal fusion techniques often
+overlook the intricacies of cross-modal interactions, resulting in suboptimal
+feature representations. In this paper, we propose WavFusion, a multimodal
+speech emotion recognition framework that addresses critical research problems
+in effective multimodal fusion, heterogeneity among modalities, and
+discriminative representation learning. By leveraging a gated cross-modal
+attention mechanism and multimodal homogeneous feature discrepancy learning,
+WavFusion demonstrates improved performance over existing state-of-the-art
+methods on benchmark datasets. Our work highlights the importance of capturing
+nuanced cross-modal interactions and learning discriminative representations
+for accurate multimodal SER. Experimental results on two benchmark datasets
+(IEMOCAP and MELD) demonstrate that WavFusion succeeds over the
+state-of-the-art strategies on emotion recognition.
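+
+ A rough PyTorch sketch of a gated cross-modal attention block of the kind
+described above: audio features attend to text features and a learned sigmoid
+gate controls how much attended information is mixed back in. Dimensions and
+the gating form are assumptions, not WavFusion's exact architecture:
+
+import torch
+import torch.nn as nn
+
+class GatedCrossModalAttention(nn.Module):
+    def __init__(self, dim=256, heads=4):   # sizes are illustrative assumptions
+        super().__init__()
+        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
+        self.gate = nn.Linear(2 * dim, dim)
+        self.norm = nn.LayerNorm(dim)
+
+    def forward(self, audio, text):
+        # Audio queries attend over text keys/values.
+        attended, _ = self.attn(query=audio, key=text, value=text)
+        # Sigmoid gate decides, per feature, how much cross-modal signal to admit.
+        g = torch.sigmoid(self.gate(torch.cat([audio, attended], dim=-1)))
+        return self.norm(audio + g * attended)
+
+audio = torch.randn(2, 50, 256)   # (batch, audio frames, dim)
+text = torch.randn(2, 20, 256)    # (batch, text tokens, dim)
+print(GatedCrossModalAttention()(audio, text).shape)  # torch.Size([2, 50, 256])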
+
+
+
+ comment: Accepted by 31st International Conference on MultiMedia Modeling
+ (MMM2025)
+
+
+
+
+
+
+ ☆ Securing Social Media Against Deepfakes using Identity, Behavioral, and
+ Geometric Signatures
+
+
+
+
+
+
+
+
+ Muhammad Umar Farooq, Awais Khan, Ijaz Ul Haq, Khalid Mahmood Malik
+
+
+ Trust in social media is a growing concern due to its ability to influence
+significant societal changes. However, this space is increasingly compromised
+by various types of deepfake multimedia, which undermine the authenticity of
+shared content. Although substantial efforts have been made to address the
+challenge of deepfake content, existing detection techniques face a major
+limitation in generalization: they tend to perform well only on specific types
+of deepfakes they were trained on. This dependency on recognizing specific
+deepfake artifacts makes current methods vulnerable when applied to unseen or
+varied deepfakes, thereby compromising their performance in real-world
+applications such as social media platforms. To address the generalizability of
+deepfake detection, there is a need for a holistic approach that can capture a
+broader range of facial attributes and manipulations beyond isolated artifacts.
+To address this, we propose a novel deepfake detection framework featuring an
+effective feature descriptor that integrates Deep identity, Behavioral, and
+Geometric (DBaG) signatures, along with a classifier named DBaGNet.
+Specifically, the DBaGNet classifier utilizes the extracted DBaG signatures and
+applies a triplet loss objective to enhance generalized representation learning
+for improved classification. To
+test the effectiveness and generalizability of our proposed approach, we
+conduct extensive experiments using six benchmark deepfake datasets: WLDR,
+CelebDF, DFDC, FaceForensics++, DFD, and NVFAIR. Specifically, to ensure the
+effectiveness of our approach, we perform cross-dataset evaluations, and the
+results demonstrate significant performance gains over several state-of-the-art
+methods.
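+
+ A compact sketch of training an embedding with a triplet loss objective, the
+component named above; the signature extractor is a stand-in MLP and all
+dimensions are assumptions:
+
+import torch
+import torch.nn as nn
+
+# Stand-in for mapping DBaG features to an embedding; not the paper's network.
+embed = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))
+criterion = nn.TripletMarginLoss(margin=1.0)
+optimizer = torch.optim.Adam(embed.parameters(), lr=1e-4)
+
+# anchor/positive: real videos of one identity; negative: a deepfake (toy random tensors).
+anchor, positive, negative = (torch.randn(16, 512) for _ in range(3))
+loss = criterion(embed(anchor), embed(positive), embed(negative))
+loss.backward()
+optimizer.step()
+print(f"triplet loss: {loss.item():.3f}")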
+
+
+
+
+
+
+
+
+ Konstantinos Kontras, Thomas Strypsteen, Christos Chatzichristos, Paul Pu Liang, Matthew Blaschko, Maarten De Vos
+
+
+ Multimodal learning can complete the picture of information extraction by
+uncovering key dependencies between data sources. However, current systems fail
+to fully leverage multiple modalities for optimal performance. This has been
+attributed to modality competition, where modalities strive for training
+resources, leaving some underoptimized. We show that current balancing methods
+struggle to train multimodal models that surpass even simple baselines, such as
+ensembles. This raises the question: how can we ensure that all modalities in
+multimodal training are sufficiently trained, and that learning from new
+modalities consistently improves performance? This paper proposes the
+Multimodal Competition Regularizer (MCR), a new loss component inspired by
+mutual information (MI) decomposition designed to prevent the adverse effects
+of competition in multimodal training. Our key contributions are: 1)
+Introducing game-theoretic principles in multimodal learning, where each
+modality acts as a player competing to maximize its influence on the final
+outcome, enabling automatic balancing of the MI terms. 2) Refining lower and
+upper bounds for each MI term to enhance the extraction of task-relevant unique
+and shared information across modalities. 3) Suggesting latent space
+permutations for conditional MI estimation, significantly improving
+computational efficiency. MCR outperforms all previously suggested training
+strategies and is the first to consistently improve multimodal learning beyond
+the ensemble baseline, clearly demonstrating that combining modalities leads to
+significant performance gains on both synthetic and large real-world datasets.
+
+
+
+
+
+ ☆ A Graph-Based Approach for Conversational AI-Driven Personal Memory
+ Capture and Retrieval in a Real-world Application
+
+
+
+
+
+
+
+
+ Savini Kashmira, Jayanaka L. Dantanarayana, Joshua Brodsky, Ashish Mahendra, Yiping Kang, Krisztian Flautner, Lingjia Tang, Jason Mars
+
+
+ TOBU is a novel mobile application that captures and retrieves `personal
+memories' (pictures/videos together with stories and context around those
+moments) in a user-engaging AI-guided conversational approach. Our initial
+prototype showed that existing retrieval techniques such as retrieval-augmented
+generation (RAG) systems fall short due to their limitations in understanding
+memory relationships, causing low recall, hallucination, and unsatisfactory
+user experience. We design TOBUGraph, a novel graph-based retrieval approach.
+During capturing, TOBUGraph leverages large language models (LLMs) to
+automatically create a dynamic knowledge graph of memories, establishing
+context and relationships of those memories. During retrieval, TOBUGraph
+combines LLMs with the memory graph to achieve comprehensive recall through
+graph traversal. Our evaluation using real user data demonstrates that
+TOBUGraph outperforms multiple RAG implementations in both precision and
+recall, significantly improving user experience through improved retrieval
+accuracy and reduced hallucination.
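+
+ A toy sketch of the graph-based retrieval idea: memories are nodes linked by
+shared entities, and retrieval traverses the graph from entities matched in
+the question. The graph here is hand-built and keyword-matched, whereas
+TOBUGraph constructs and queries it with LLMs:
+
+from collections import defaultdict
+
+memories = {
+    "m1": "Picnic with Ana at Golden Gate Park, June 2023",
+    "m2": "Ana's birthday dinner at the ramen place",
+    "m3": "Solo hike at Mount Tamalpais",
+}
+entities = {"m1": {"ana", "golden gate park"}, "m2": {"ana", "ramen"}, "m3": {"mount tamalpais"}}
+
+# Build an entity -> memories index; memories sharing an entity are implicitly linked.
+entity_index = defaultdict(set)
+for mem_id, ents in entities.items():
+    for e in ents:
+        entity_index[e].add(mem_id)
+
+def retrieve(question):
+    q = question.lower()
+    matched = {e for e in entity_index if e in q}
+    # Traverse: collect every memory attached to a matched entity.
+    hits = set().union(*(entity_index[e] for e in matched)) if matched else set()
+    return [memories[m] for m in sorted(hits)]
+
+print(retrieve("What did we do with Ana last summer?"))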
+
+
+
+
+
+
+
+
+ Kaustubh D. Dhole, Kai Shu, Eugene Agichtein
+
+
+ Computational argumentation, which involves generating answers or summaries
+for controversial topics like abortion bans and vaccination, has become
+increasingly important in today's polarized environment. Sophisticated LLM
+capabilities offer the potential to provide nuanced, evidence-based answers to
+such questions through Retrieval-Augmented Argumentation (RAArg), leveraging
+real-world evidence for high-quality, grounded arguments. However, evaluating
+RAArg remains challenging, as human evaluation is costly and difficult for
+complex, lengthy answers on complicated topics. At the same time, re-using
+existing argumentation datasets is no longer sufficient, as they lack long,
+complex arguments and realistic evidence from potentially misleading sources,
+limiting holistic evaluation of retrieval effectiveness and argument quality.
+To address these gaps, we investigate automated evaluation methods using
+multiple fine-grained LLM judges, providing better and more interpretable
+assessments than traditional single-score metrics and even previously reported
+human crowdsourcing. To validate the proposed techniques, we introduce ConQRet,
+a new benchmark featuring long and complex human-authored arguments on debated
+topics, grounded in real-world websites, allowing an exhaustive evaluation
+across retrieval effectiveness, argument quality, and groundedness. We validate
+our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed
+LLM Judges and the ConQRet benchmark can enable rapid progress in computational
+argumentation and can be naturally extended to other complex
+retrieval-augmented generation tasks.
+
+
+ Link prediction (LP) is crucial for Knowledge Graphs (KG) completion but
+commonly suffers from interpretability issues. While several methods have been
+proposed to explain embedding-based LP models, they are generally limited to
+local explanations on KG and are deficient in providing human interpretable
+semantics. Based on real-world observations of the characteristics of KGs from
+multiple domains, we propose to explain LP models in KG with path-based
+explanations. An integrated framework, namely eXpath, is introduced which
+incorporates the concept of relation path with ontological closed path rules to
+enhance both the efficiency and effectiveness of LP interpretation. Notably,
+the eXpath explanations can be fused with other single-link explanation
+approaches to achieve a better overall solution. Extensive experiments across
+benchmark datasets and LP models demonstrate that introducing eXpath can boost
+the quality of resulting explanations by about 20% on two key metrics and
+reduce the required explanation time by 61.4%, in comparison to the best
+existing method. Case studies further highlight eXpath's ability to provide
+more semantically meaningful explanations through path-based evidence.
+
+
+
+ comment: 13 pages, 5 figures. Submitted to PVLDB volume 18 on 2024-12-01
+
+
+
+
+
+
+ ☆ PyTerrier-GenRank: The PyTerrier Plugin for Reranking with Large
+ Language Models
+
+
+ Using LLMs as rerankers requires experimenting with various hyperparameters,
+such as prompt formats, model choice, and reformulation strategies. We
+introduce PyTerrier-GenRank, a PyTerrier plugin to facilitate seamless
+reranking experiments with LLMs, supporting popular ranking strategies like
+pointwise and listwise prompting. We validate our plugin through HuggingFace
+and OpenAI hosted endpoints.
+
+
+
+
+
+
+
+ ☆ Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval
+ with Semantic Guidance NeurIPS 2024
+
+
+
+
+
+
+
+
+ Xuchan Bao, Judith Yue Li, Zhong Yi Wan, Kun Su, Timo Denk, Joonseok Lee, Dima Kuzmin, Fei Sha
+
+
+ Modern music retrieval systems often rely on fixed representations of user
+preferences, limiting their ability to capture users' diverse and uncertain
+retrieval needs. To address this limitation, we introduce Diff4Steer, a novel
+generative retrieval framework that employs lightweight diffusion models to
+synthesize diverse seed embeddings from user queries that represent potential
+directions for music exploration. Unlike deterministic methods that map a user
+query to a single point in embedding space, Diff4Steer provides a statistical
+prior on the target modality (audio) for retrieval, effectively capturing the
+uncertainty and multi-faceted nature of user preferences. Furthermore,
+Diff4Steer can be steered by image or text inputs, enabling more flexible and
+controllable music discovery combined with nearest neighbor search. Our
+framework outperforms deterministic regression methods and LLM-based generative
+retrieval baseline in terms of retrieval and ranking metrics, demonstrating its
+effectiveness in capturing user preferences, leading to more diverse and
+relevant recommendations. Listening examples are available at
+tinyurl.com/diff4steer.
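+
+ A simplified sketch of the retrieval step: several seed embeddings are sampled
+around the query (Gaussian noise stands in for the diffusion model's samples)
+and each seed is resolved with nearest-neighbor search over audio embeddings.
+Sizes, noise scale, and data are assumptions:
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+audio_embeddings = rng.normal(size=(1000, 64))   # catalogue of music embeddings (stand-in)
+query_embedding = rng.normal(size=64)            # encoded user query, text or image (stand-in)
+
+# Sample diverse seed embeddings; a trained diffusion model would generate these.
+num_seeds, noise_scale = 8, 0.5
+seeds = query_embedding + noise_scale * rng.normal(size=(num_seeds, 64))
+
+def nearest(seed, catalogue, k=3):
+    sims = catalogue @ seed / (np.linalg.norm(catalogue, axis=1) * np.linalg.norm(seed))
+    return np.argsort(-sims)[:k]
+
+# Union of neighbours across seeds gives a broader, more exploratory result set.
+results = sorted({int(i) for seed in seeds for i in nearest(seed, audio_embeddings)})
+print(results)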
+
+
+
+ comment: NeurIPS 2024 Creative AI Track
+
+
+
+
+
+
+ ♻ ☆ Unifying Generative and Dense Retrieval for Sequential Recommendation
+
+
+
+
+
+
+
+
+ Liu Yang, Fabian Paischer, Kaveh Hassani, Jiacheng Li, Shuai Shao, Zhang Gabriel Li, Yun He, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Robert D Nowak, Xiaoli Gao, Hamid Eghbalzadeh
+
+
+ Sequential dense retrieval models utilize advanced sequence learning
+techniques to compute item and user representations, which are then used to
+rank relevant items for a user through inner product computation between the
+user and all item representations. However, this approach requires storing a
+unique representation for each item, resulting in significant memory
+requirements as the number of items grows. In contrast, the recently proposed
+generative retrieval paradigm offers a promising alternative by directly
+predicting item indices using a generative model trained on semantic IDs that
+encapsulate items' semantic information. Despite its potential for large-scale
+applications, a comprehensive comparison between generative retrieval and
+sequential dense retrieval under fair conditions is still lacking, leaving open
+questions regarding performance, and computation trade-offs. To address this,
+we compare these two approaches under controlled conditions on academic
+benchmarks and propose LIGER (LeveragIng dense retrieval for GEnerative
+Retrieval), a hybrid model that combines the strengths of these two widely used
+methods. LIGER integrates sequential dense retrieval into generative retrieval,
+mitigating performance differences and enhancing cold-start item recommendation
+in the datasets evaluated. This hybrid approach provides insights into the
+trade-offs between these approaches and demonstrates improvements in efficiency
+and effectiveness for recommendation systems in small-scale benchmarks.
+
+
+ Relevance modeling is a critical component for enhancing user experience in
+search engines, with the primary objective of identifying items that align with
+users' queries. Traditional models only rely on the semantic congruence between
+queries and items to ascertain relevance. However, this approach represents
+merely one aspect of the relevance judgement, and is insufficient in isolation.
+Even powerful Large Language Models (LLMs) still cannot accurately judge the
+relevance of a query and an item from a semantic perspective. To augment
+LLMs-driven relevance modeling, this study proposes leveraging user
+interactions recorded in search logs to yield insights into users' implicit
+search intentions. The challenge lies in the effective prompting of LLMs to
+capture dynamic search intentions, which poses several obstacles in real-world
+relevance scenarios, i.e., the absence of domain-specific knowledge, the
+inadequacy of an isolated prompt, and the prohibitive costs associated with
+deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved
+Behavior-augmented Prompting framework for integrating search scenario-oriented
+knowledge with LLMs effectively. Specifically, we perform the user-driven
+behavior neighbors retrieval from the daily search logs to obtain
+domain-specific knowledge in time, retrieving candidates that users consider to
+meet their expectations. Then, we guide LLMs for relevance modeling by
+employing advanced prompting techniques that progressively improve the outputs
+of the LLMs, followed by a progressive aggregation with comprehensive
+consideration of diverse aspects. For online serving, we have developed an
+industrial application framework tailored for the deployment of LLMs in
+relevance modeling. Experiments on real-world industry data and online A/B
+testing demonstrate our proposal achieves promising performance.
+
+
+
+ comment: Accepted By COLING 2025
+
+
+
+
+
+
+ ♻ ☆ TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient
+ and Effective Retrieval
+
+
+
+
+
+
+
+
+ Hang Li, Chuting Yu, Ahmed Mourad, Bevan Koopman, Guido Zuccon
+
+
+ This paper considers Pseudo-Relevance Feedback (PRF) methods for dense
+retrievers in a resource constrained environment such as that of cheap cloud
+instances or embedded systems (e.g., smartphones and smartwatches), where
+memory and CPU are limited and GPUs are not present. For this, we propose a
+transformer-based PRF method (TPRF), which has a much smaller memory footprint
+and faster inference time compared to other deep language models that employ
+PRF mechanisms, with a marginal effectiveness loss. TPRF learns how to
+effectively combine the relevance feedback signals from dense passage
+representations. Specifically, TPRF provides a mechanism for modelling
+relationships and weights between the query and the relevance feedback signals.
+The method is agnostic to the specific dense representation used and thus can
+be generally applied to any dense retriever.
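+
+ A rough PyTorch sketch of a transformer-based PRF module: the query embedding
+and the top-k feedback passage embeddings are fed to a small transformer
+encoder, and the output at the query position becomes the refined query. Layer
+sizes and pooling are assumptions, not TPRF's exact configuration:
+
+import torch
+import torch.nn as nn
+
+class TransformerPRF(nn.Module):
+    def __init__(self, dim=768, heads=8, layers=2):   # sizes are illustrative assumptions
+        super().__init__()
+        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
+        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
+
+    def forward(self, query_emb, feedback_embs):
+        # Sequence = [query, feedback_1, ..., feedback_k]; the refined query is the first output.
+        seq = torch.cat([query_emb.unsqueeze(1), feedback_embs], dim=1)
+        return self.encoder(seq)[:, 0, :]
+
+query = torch.randn(4, 768)          # batch of dense query embeddings
+feedback = torch.randn(4, 3, 768)    # top-3 pseudo-relevant passage embeddings per query
+refined = TransformerPRF()(query, feedback)
+print(refined.shape)                 # torch.Size([4, 768]) -> used for a second retrieval round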
+
+
+ Cold-start rating prediction is a fundamental problem in recommender systems
+that has been extensively studied. Many methods have been proposed that exploit
+explicit relations among existing data, such as collaborative filtering, social
+recommendations and heterogeneous information network, to alleviate the data
+insufficiency issue for cold-start users and items. However, the explicit
+relations constructed based on data between different roles may be unreliable
+and irrelevant, which limits the performance ceiling of the specific
+recommendation task. Motivated by this, in this paper, we propose a flexible
+framework dubbed heterogeneous interaction rating network (HIRE). HIRE does not
+solely rely on the pre-defined interaction pattern or the manually constructed
+heterogeneous information network. Instead, we devise a Heterogeneous
+Interaction Module (HIM) to jointly model the heterogeneous interactions and
+directly infer the important interactions via the observed data. In the
+experiments, we evaluate our model under three cold-start settings on three
+real-world datasets. The experimental results show that HIRE outperforms other
+baselines by a large margin. Furthermore, we visualize the inferred
+interactions of HIRE to confirm the contribution of our model.
+
+
+
+ comment: 14 pages, 9 figures
+
+
+
+
+
+
+
+
+
+ Multimedia 5
+
+
+
+
+
+ ☆ pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data
+ Estimation and Multi-modal Processing
+
+
+
+
+
+
+
+
+ Johanna Devaney, Daniel McKemie, Alex Morgan
+
+
+ pyAMPACT (Python-based Automatic Music Performance Analysis and Comparison
+Toolkit) links symbolic and audio music representations to facilitate
+score-informed estimation of performance data in audio as well as general
+linking of symbolic and audio music representations with a variety of
+annotations. pyAMPACT can read a range of symbolic formats and can output
+note-linked audio descriptors/performance data into MEI-formatted files. The
+audio analysis uses score alignment to calculate time-frequency regions of
+importance for each note in the symbolic representation from which to estimate
+a range of parameters. These include tuning-, dynamics-, and timbre-related
+performance descriptors, with timing-related information available from the
+score alignment. Beyond performance data estimation, pyAMPACT also facilitates
+multi-modal investigations through its infrastructure for linking symbolic
+representations and annotations to audio.
+
+
+
+ comment: International Society for Music Information Retrieval, Late Breaking
+ Demo
+
+
+
+
+
+
+ ☆ SMIC: Semantic Multi-Item Compression based on CLIP dictionary
+
+
+ Semantic compression, a compression scheme where the distortion metric,
+typically MSE, is replaced with semantic fidelity metrics, tends to become more
+and more popular. Most recent semantic compression schemes rely on the
+foundation model CLIP. In this work, we extend such a scheme to image
+collection compression, where inter-item redundancy is taken into account
+during the coding phase. For that purpose, we first show that CLIP's latent
+space allows for easy semantic additions and subtractions. From this property,
+we define a dictionary-based multi-item codec that outperforms state-of-the-art
+generative codecs in terms of compression rate, around $10^{-5}$ BPP per image,
+while not sacrificing semantic fidelity. We also show that the learned
+dictionary is of a semantic nature and works as a semantic projector for the
+semantic content of images.
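+
+ A small sketch of the dictionary idea: given precomputed CLIP image
+embeddings, encode each item as coefficients over a shared dictionary (least
+squares here), so the collection is stored as one dictionary plus a few
+coefficients per image. The dictionary size and random atoms are assumptions;
+the paper learns a semantic dictionary:
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+clip_dim, num_images, num_atoms = 512, 100, 8
+image_embs = rng.normal(size=(num_images, clip_dim))     # stand-ins for CLIP embeddings
+
+# Shared dictionary of semantic "atoms" (learned in the paper; random here).
+dictionary = rng.normal(size=(num_atoms, clip_dim))
+
+# Encode: per-image coefficients over the dictionary (least-squares projection).
+coeffs, *_ = np.linalg.lstsq(dictionary.T, image_embs.T, rcond=None)   # (num_atoms, num_images)
+
+# Decode: reconstruct embeddings from the dictionary plus coefficients only.
+reconstructed = (dictionary.T @ coeffs).T
+
+cos = np.sum(reconstructed * image_embs, axis=1) / (
+    np.linalg.norm(reconstructed, axis=1) * np.linalg.norm(image_embs, axis=1))
+stored_floats = dictionary.size + coeffs.size
+print(f"mean similarity: {cos.mean():.3f}, floats stored: {stored_floats} "
+      f"vs {image_embs.size} uncompressed")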
+
+
+
+
+
+
+
+ ☆ Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval
+ with Semantic Guidance NeurIPS 2024
+
+
+
+
+
+
+
+
+ Xuchan Bao, Judith Yue Li, Zhong Yi Wan, Kun Su, Timo Denk, Joonseok Lee, Dima Kuzmin, Fei Sha
+
+
+ Modern music retrieval systems often rely on fixed representations of user
+preferences, limiting their ability to capture users' diverse and uncertain
+retrieval needs. To address this limitation, we introduce Diff4Steer, a novel
+generative retrieval framework that employs lightweight diffusion models to
+synthesize diverse seed embeddings from user queries that represent potential
+directions for music exploration. Unlike deterministic methods that map a user
+query to a single point in embedding space, Diff4Steer provides a statistical
+prior on the target modality (audio) for retrieval, effectively capturing the
+uncertainty and multi-faceted nature of user preferences. Furthermore,
+Diff4Steer can be steered by image or text inputs, enabling more flexible and
+controllable music discovery combined with nearest neighbor search. Our
+framework outperforms deterministic regression methods and LLM-based generative
+retrieval baseline in terms of retrieval and ranking metrics, demonstrating its
+effectiveness in capturing user preferences, leading to more diverse and
+relevant recommendations. Listening examples are available at
+tinyurl.com/diff4steer.
+
+
+
+ comment: NeurIPS 2024 Creative AI Track
+
+
+
+
+
+
+ ♻ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware
+ Omni-Modal Perception of Long Videos
+
+
+ Despite impressive advancements in video understanding, most efforts remain
+limited to coarse-grained or visual-only video tasks. However, real-world
+videos encompass omni-modal information (vision, audio, and speech) with a
+series of events forming a cohesive storyline. The lack of multi-modal video
+data with fine-grained event annotations and the high cost of manual labeling
+are major obstacles to comprehensive omni-modality video perception. To address
+this gap, we propose an automatic pipeline consisting of high-quality
+multi-modal video filtering, semantically coherent omni-modal event boundary
+detection, and cross-modal correlation-aware event captioning. In this way, we
+present LongVALE, the first-ever Vision-Audio-Language Event understanding
+benchmark comprising 105K omni-modal events with precise temporal boundaries
+and detailed relation-aware captions within 8.4K high-quality long videos.
+Further, we build a baseline that leverages LongVALE to enable video large
+language models (LLMs) for omni-modality fine-grained temporal video
+understanding for the first time. Extensive experiments demonstrate the
+effectiveness and great potential of LongVALE in advancing comprehensive
+multi-modal video understanding.
+
+
+
+ comment: 18 pages, 15 figures
+
+
+
+
+
+
+ ♻ ☆ TopoCode: Topologically Informed Error Detection and Correction in
+ Communication Systems
+
+
+ Traditional error detection and correction codes focus on bit-level fidelity,
+which is insufficient for emerging technologies like eXtended Reality (XR) and
+holographic communications requiring high-data-rate, low-latency systems.
+Bit-level metrics cannot comprehensively evaluate Quality-of-Service (QoS) in
+these scenarios. This letter proposes TopoCode which leverages Topological Data
+Analysis (TDA) and persistent homology to encode topological information for
+message-level error detection and correction. It introduces minimal redundancy
+while enabling effective data reconstruction, especially in low Signal-to-Noise
+Ratio (SNR) conditions. TopoCode offers a promising approach to meet the
+demands of next-generation communication systems prioritizing semantic accuracy
+and message-level integrity.
+
+