diff --git a/.nojekyll b/.nojekyll new file mode 100644 index 00000000..e69de29b diff --git a/cache.json b/cache.json new file mode 100644 index 00000000..7f621454 --- /dev/null +++ b/cache.json @@ -0,0 +1 @@ +{"2024-12-06T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2412.05255v1","updated":"2024-12-06T18:41:16Z","published":"2024-12-06T18:41:16Z","title":"TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft","summary":" Collaboration is a cornerstone of society. In the real world, human teammates\nmake use of multi-sensory data to tackle challenging tasks in ever-changing\nenvironments. It is essential for embodied agents collaborating in\nvisually-rich environments replete with dynamic interactions to understand\nmulti-modal observations and task specifications. To evaluate the performance\nof generalizable multi-modal collaborative agents, we present TeamCraft, a\nmulti-modal multi-agent benchmark built on top of the open-world video game\nMinecraft. The benchmark features 55,000 task variants specified by multi-modal\nprompts, procedurally-generated expert demonstrations for imitation learning,\nand carefully designed protocols to evaluate model generalization capabilities.\nWe also perform extensive analyses to better understand the limitations and\nstrengths of existing approaches. Our results indicate that existing models\ncontinue to face significant challenges in generalizing to novel goals, scenes,\nand unseen numbers of agents. These findings underscore the need for further\nresearch in this area. The TeamCraft platform and dataset are publicly\navailable at https://github.com/teamcraft-bench/teamcraft.\n","authors":["Qian Long","Zhi Li","Ran Gong","Ying Nian Wu","Demetri Terzopoulos","Xiaofeng Gao"],"pdf_url":"https://arxiv.org/pdf/2412.05255v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05251v1","updated":"2024-12-06T18:31:51Z","published":"2024-12-06T18:31:51Z","title":"Uncertainty Quantification for Transformer Models for Dark-Pattern\n Detection","summary":" The opaque nature of transformer-based models, particularly in applications\nsusceptible to unethical practices such as dark-patterns in user interfaces,\nrequires models that integrate uncertainty quantification to enhance trust in\npredictions. This study focuses on dark-pattern detection, deceptive design\nchoices that manipulate user decisions, undermining autonomy and consent. We\npropose a differential fine-tuning approach implemented at the final\nclassification head via uncertainty quantification with transformer-based\npre-trained models. Employing a dense neural network (DNN) head architecture as\na baseline, we examine two methods capable of quantifying uncertainty:\nSpectral-normalized Neural Gaussian Processes (SNGPs) and Bayesian Neural\nNetworks (BNNs). These methods are evaluated on a set of open-source\nfoundational models across multiple dimensions: model performance, variance in\ncertainty of predictions and environmental impact during training and inference\nphases. Results demonstrate that integrating uncertainty quantification\nmaintains performance while providing insights into challenging instances\nwithin the models. Moreover, the study reveals that the environmental impact\ndoes not uniformly increase with the incorporation of uncertainty\nquantification techniques. 
The study's findings demonstrate that uncertainty\nquantification enhances transparency and provides measurable confidence in\npredictions, improving the explainability and clarity of black-box models. This\nfacilitates informed decision-making and mitigates the influence of\ndark-patterns on user interfaces. These results highlight the importance of\nincorporating uncertainty quantification techniques in developing machine\nlearning models, particularly in domains where interpretability and\ntrustworthiness are critical.\n","authors":["Javier Muñoz","Álvaro Huertas-García","Carlos Martí-González","Enrique De Miguel Ambite"],"pdf_url":"https://arxiv.org/pdf/2412.05251v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05248v1","updated":"2024-12-06T18:27:15Z","published":"2024-12-06T18:27:15Z","title":"Enhancing FKG.in: automating Indian food composition analysis","summary":" This paper presents a novel approach to compute food composition data for\nIndian recipes using a knowledge graph for Indian food (FKG.in) and LLMs. The\nprimary focus is to provide a broad overview of an automated food composition\nanalysis workflow and describe its core functionalities: nutrition data\naggregation, food composition analysis, and LLM-augmented information\nresolution. This workflow aims to complement FKG.in and iteratively supplement\nfood composition data from verified knowledge bases. Additionally, this paper\nhighlights the challenges of representing Indian food and accessing food\ncomposition data digitally. It also reviews three key sources of food\ncomposition data: the Indian Food Composition Tables, the Indian Nutrient\nDatabank, and the Nutritionix API. Furthermore, it briefly outlines how users\ncan interact with the workflow to obtain diet-based health recommendations and\ndetailed food composition information for numerous recipes. We then explore the\ncomplex challenges of analyzing Indian recipe information across dimensions\nsuch as structure, multilingualism, and uncertainty as well as present our\nongoing work on LLM-based solutions to address these issues. The methods\nproposed in this workshop paper for AI-driven knowledge curation and\ninformation resolution are application-agnostic, generalizable, and replicable\nfor any domain.\n","authors":["Saransh Kumar Gupta","Lipika Dey","Partha Pratim Das","Geeta Trilok-Kumar","Ramesh Jain"],"pdf_url":"https://arxiv.org/pdf/2412.05248v1.pdf","comment":"15 pages, 3 figures, 30 references, International Conference on\n Pattern Recognition 2024 - Multimedia Assisted Dietary Management Workshop"},{"id":"http://arxiv.org/abs/2408.14471v2","updated":"2024-12-06T18:22:32Z","published":"2024-08-26T17:59:01Z","title":"A Practitioner's Guide to Continual Multimodal Pretraining","summary":" Multimodal foundation models serve numerous applications at the intersection\nof vision and language. Still, despite being pretrained on extensive data, they\nbecome outdated over time. To keep models updated, research into continual\npretraining mainly explores scenarios with either (1) infrequent,\nindiscriminate updates on large-scale new data, or (2) frequent, sample-level\nupdates. However, practical model deployment often operates in the gap between\nthese two limit cases, as real-world applications often demand adaptation to\nspecific subdomains, tasks or concepts -- spread over the entire, varying life\ncycle of a model. 
In this work, we complement current perspectives on continual\npretraining through a research test bed as well as provide comprehensive\nguidance for effective continual model updates in such scenarios. We first\nintroduce FoMo-in-Flux, a continual multimodal pretraining benchmark with\nrealistic compute constraints and practical deployment requirements,\nconstructed over 63 datasets with diverse visual and semantic coverage. Using\nFoMo-in-Flux, we explore the complex landscape of practical continual\npretraining through multiple perspectives: (1) A data-centric investigation of\ndata mixtures and stream orderings that emulate real-world deployment\nsituations, (2) a method-centric investigation ranging from simple fine-tuning\nand traditional continual learning strategies to parameter-efficient updates\nand model merging, (3) meta learning rate schedules and mechanistic design\nchoices, and (4) the influence of model and compute scaling. Together, our\ninsights provide a practitioner's guide to continual multimodal pretraining for\nreal-world deployment. Our benchmark and code is here:\nhttps://github.com/ExplainableML/fomo_in_flux.\n","authors":["Karsten Roth","Vishaal Udandarao","Sebastian Dziadzio","Ameya Prabhu","Mehdi Cherti","Oriol Vinyals","Olivier Hénaff","Samuel Albanie","Matthias Bethge","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2408.14471v2.pdf","comment":"Technical Report. 52 pages. Shorter version published at the NeurIPS\n 2024 Dataset & Benchmarks track"},{"id":"http://arxiv.org/abs/2412.05237v1","updated":"2024-12-06T18:14:24Z","published":"2024-12-06T18:14:24Z","title":"MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at\n Scale","summary":" Open-source multimodal large language models (MLLMs) have shown significant\npotential in a broad range of multimodal tasks. However, their reasoning\ncapabilities remain constrained by existing instruction-tuning datasets, which\nwere predominately repurposed from academic datasets such as VQA, AI2D, and\nChartQA. These datasets target simplistic tasks, and only provide phrase-level\nanswers without any intermediate rationales. To address these challenges, we\nintroduce a scalable and cost-effective method to construct a large-scale\nmultimodal instruction-tuning dataset with rich intermediate rationales\ndesigned to elicit CoT reasoning. Using only open models, we create a dataset\ncontaining 12M instruction-response pairs to cover diverse, reasoning-intensive\ntasks with detailed and faithful rationales. Experiments demonstrate that\ntraining MLLMs on this dataset significantly improves reasoning capabilities,\nachieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%),\nMMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates\nnotable improvements of up to 4% on non-reasoning-based benchmarks. 
Ablation\nstudies further highlight the importance of key components, such as rewriting\nand self-filtering, in the dataset construction process.\n","authors":["Jarvis Guo","Tuney Zheng","Yuelin Bai","Bo Li","Yubo Wang","King Zhu","Yizhi Li","Graham Neubig","Wenhu Chen","Xiang Yue"],"pdf_url":"https://arxiv.org/pdf/2412.05237v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05232v1","updated":"2024-12-06T18:02:59Z","published":"2024-12-06T18:02:59Z","title":"LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds","summary":" Many existing jailbreak techniques rely on solving discrete combinatorial\noptimization, while more recent approaches involve training LLMs to generate\nmultiple adversarial prompts. However, both approaches require significant\ncomputational resources to produce even a single adversarial prompt. We\nhypothesize that the inefficiency of current approaches stems from an\ninadequate characterization of the jailbreak problem. To address this gap, we\nformulate the jailbreak problem in terms of alignment. By starting from an\navailable safety-aligned model, we leverage an unsafe reward to guide the safe\nmodel towards generating unsafe outputs using alignment techniques (e.g.,\nreinforcement learning from human feedback), effectively performing\njailbreaking via alignment. We propose a novel jailbreak method called LIAR\n(LeveragIng Alignment to jailbReak). To demonstrate the simplicity and\neffectiveness of our approach, we employ a best-of-N method to solve the\nalignment problem. LIAR offers significant advantages: lower computational\nrequirements without additional training, fully black-box operation,\ncompetitive attack success rates, and more human-readable prompts. We provide\ntheoretical insights into the possibility of jailbreaking a safety-aligned\nmodel, revealing inherent vulnerabilities in current alignment strategies for\nLLMs. We also provide sub-optimality guarantees for the proposed LIAR method.\nExperimentally, we achieve ASR comparable to the SoTA with a 10x improvement to\nperplexity and a Time-to-Attack measured in seconds rather than tens of hours.\n","authors":["James Beetham","Souradip Chakraborty","Mengdi Wang","Furong Huang","Amrit Singh Bedi","Mubarak Shah"],"pdf_url":"https://arxiv.org/pdf/2412.05232v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05225v1","updated":"2024-12-06T17:58:14Z","published":"2024-12-06T17:58:14Z","title":"BEExformer: A Fast Inferencing Transformer Architecture via Binarization\n with Multiple Early Exits","summary":" Large Language Models (LLMs) based on transformers achieve cutting-edge\nresults on a variety of applications. However, their enormous size and\nprocessing requirements make deployment on devices with constrained resources\nextremely difficult. Among various efficiency considerations, model\nbinarization and Early Exit (EE) are common effective solutions. However,\nbinarization may lead to performance loss due to reduced precision affecting\ngradient estimation and parameter updates. Besides, the present early-exit\nmechanisms are still in the nascent stages of research. To ameliorate these\nissues, we propose Binarized Early Exit Transformer (BEExformer), the\nfirst-ever selective learning transformer architecture to combine early exit\nwith binarization for textual inference. 
It improves the binarization process\nthrough a differentiable second-order approximation to the impulse function.\nThis enables gradient computation concerning both the sign as well as the\nmagnitude of the weights. In contrast to absolute threshold-based EE, the\nproposed EE mechanism hinges on fractional reduction in entropy among\nintermediate transformer blocks with soft-routing loss estimation. While\nbinarization results in 18.44 times reduction in model size, early exit reduces\nthe FLOPs during inference by 54.85% and even improves accuracy by 5.98%\nthrough resolving the \"overthinking\" problem inherent in deep networks.\nMoreover, the proposed BEExformer simplifies training by not requiring\nknowledge distillation from a full-precision LLM. Extensive evaluation on the\nGLUE dataset and comparison with the SOTA works showcase its pareto-optimal\nperformance-efficiency trade-off.\n","authors":["Wazib Ansar","Saptarsi Goswami","Amlan Chakrabarti"],"pdf_url":"https://arxiv.org/pdf/2412.05225v1.pdf","comment":"15 pages, 15 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.05223v1","updated":"2024-12-06T17:54:54Z","published":"2024-12-06T17:54:54Z","title":"100% Hallucination Elimination Using Acurai","summary":" The issue of hallucinations in large language models (LLMs) remains a\ncritical barrier to the adoption of AI in enterprise and other high-stakes\napplications. Despite advancements in retrieval-augmented generation (RAG)\nsystems, current state-of-the-art methods fail to achieve more than 80%\naccuracy in generating faithful and factually correct outputs, even when\nprovided with relevant and accurate context. In this work, we introduce Acurai,\na novel systematic approach that achieves 100% hallucination-free responses in\nLLMs by reformatting queries and context data prior to input. Leveraging a deep\nunderstanding of LLM internal representations, the importance of noun-phrase\ndominance, and the role of discrete functional units (DFUs), Acurai ensures\nalignment between input context and generated output. We validate this method\nusing the RAGTruth corpus, demonstrating its ability to eliminate 100%\nhallucinations for both GPT-4 and GPT-3.5 Turbo. Acurai sets a new standard for\nachieving consistent, accurate, and faithful AI responses, marking a\nsignificant step forward in the development of trustworthy AI systems.\n","authors":["Michael C. Wood","Adam A. Forbes"],"pdf_url":"https://arxiv.org/pdf/2412.05223v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05210v1","updated":"2024-12-06T17:40:38Z","published":"2024-12-06T17:40:38Z","title":"Evaluating and Aligning CodeLLMs on Human Preference","summary":" Code large language models (codeLLMs) have made significant strides in code\ngeneration. Most previous code-related benchmarks, which consist of various\nprogramming exercises along with the corresponding test cases, are used as a\ncommon measure to evaluate the performance and capabilities of code LLMs.\nHowever, the current code LLMs focus on synthesizing the correct code snippet,\nignoring the alignment with human preferences, where the query should be\nsampled from the practical application scenarios and the model-generated\nresponses should satisfy the human preference. 
To bridge the gap between the\nmodel-generated response and human preference, we present a rigorous\nhuman-curated benchmark CodeArena to emulate the complexity and diversity of\nreal-world coding tasks, comprising 397 high-quality samples spanning 40 categories\nand 44 programming languages, carefully curated from user queries. Further, we\npropose a diverse synthetic instruction corpus SynCode-Instruct (nearly 20B\ntokens) by scaling instructions from the website to verify the effectiveness of\nlarge-scale synthetic instruction fine-tuning, where Qwen2.5-SynCoder,\ntrained entirely on synthetic instruction data, achieves top-tier performance\namong open-source code LLMs. The results show performance differences between\nexecution-based benchmarks and CodeArena. Our systematic experiments with\nCodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code\nLLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring\nthe importance of human preference alignment\n(https://codearenaeval.github.io/).\n","authors":["Jian Yang","Jiaxi Yang","Ke Jin","Yibo Miao","Lei Zhang","Liqun Yang","Zeyu Cui","Yichang Zhang","Binyuan Hui","Junyang Lin"],"pdf_url":"https://arxiv.org/pdf/2412.05210v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05206v1","updated":"2024-12-06T17:35:52Z","published":"2024-12-06T17:35:52Z","title":"ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented\n Argumentation with LLM Judges","summary":" Computational argumentation, which involves generating answers or summaries\nfor controversial topics like abortion bans and vaccination, has become\nincreasingly important in today's polarized environment. Sophisticated LLM\ncapabilities offer the potential to provide nuanced, evidence-based answers to\nsuch questions through Retrieval-Augmented Argumentation (RAArg), leveraging\nreal-world evidence for high-quality, grounded arguments. However, evaluating\nRAArg remains challenging, as human evaluation is costly and difficult for\ncomplex, lengthy answers on complicated topics. At the same time, re-using\nexisting argumentation datasets is no longer sufficient, as they lack long,\ncomplex arguments and realistic evidence from potentially misleading sources,\nlimiting holistic evaluation of retrieval effectiveness and argument quality.\nTo address these gaps, we investigate automated evaluation methods using\nmultiple fine-grained LLM judges, providing better and more interpretable\nassessments than traditional single-score metrics and even previously reported\nhuman crowdsourcing. To validate the proposed techniques, we introduce ConQRet,\na new benchmark featuring long and complex human-authored arguments on debated\ntopics, grounded in real-world websites, allowing an exhaustive evaluation\nacross retrieval effectiveness, argument quality, and groundedness. We validate\nour LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed\nLLM Judges and the ConQRet benchmark can enable rapid progress in computational\nargumentation and can be naturally extended to other complex\nretrieval-augmented generation tasks.\n","authors":["Kaustubh D. Dhole","Kai Shu","Eugene Agichtein"],"pdf_url":"https://arxiv.org/pdf/2412.05206v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.03019v2","updated":"2024-12-06T17:23:53Z","published":"2024-10-03T22:05:06Z","title":"Is Your Paper Being Reviewed by an LLM? 
Investigating AI Text\n Detectability in Peer Review","summary":" Peer review is a critical process for ensuring the integrity of published\nscientific research. Confidence in this process is predicated on the assumption\nthat experts in the relevant domain give careful consideration to the merits of\nmanuscripts which are submitted for publication. With the recent rapid\nadvancements in the linguistic capabilities of large language models (LLMs), a\nnew potential risk to the peer review process is that negligent reviewers will\nrely on LLMs to perform the often time-consuming process of reviewing a paper.\nIn this study, we investigate the ability of existing AI text detection\nalgorithms to distinguish between peer reviews written by humans and different\nstate-of-the-art LLMs. Our analysis shows that existing approaches fail to\nidentify many GPT-4o written reviews without also producing a high number of\nfalse positive classifications. To address this deficiency, we propose a new\ndetection approach which surpasses existing methods in the identification of\nGPT-4o written peer reviews at low levels of false positive classifications.\nOur work reveals the difficulty of accurately identifying AI-generated text at\nthe individual review level, highlighting the urgent need for new tools and\nmethods to detect this type of unethical application of generative AI.\n","authors":["Sungduk Yu","Man Luo","Avinash Madasu","Vasudev Lal","Phillip Howard"],"pdf_url":"https://arxiv.org/pdf/2410.03019v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05184v1","updated":"2024-12-06T17:04:21Z","published":"2024-12-06T17:04:21Z","title":"QueEn: A Large Language Model for Quechua-English Translation","summary":" Recent studies show that large language models (LLMs) are powerful tools for\nworking with natural language, bringing advances in many areas of computational\nlinguistics. However, these models face challenges when applied to low-resource\nlanguages due to limited training data and difficulty in understanding cultural\nnuances. In this paper, we propose QueEn, a novel approach for Quechua-English\ntranslation that combines Retrieval-Augmented Generation (RAG) with\nparameter-efficient fine-tuning techniques. Our method leverages external\nlinguistic resources through RAG and uses Low-Rank Adaptation (LoRA) for\nefficient model adaptation. Experimental results show that our approach\nsubstantially exceeds baseline models, with a BLEU score of 17.6 compared to\n1.5 for standard GPT models. The integration of RAG with fine-tuning allows our\nsystem to address the challenges of low-resource language translation while\nmaintaining computational efficiency. This work contributes to the broader goal\nof preserving endangered languages through advanced language technologies.\n","authors":["Junhao Chen","Peng Shu","Yiwei Li","Huaqin Zhao","Hanqi Jiang","Yi Pan","Yifan Zhou","Zhengliang Liu","Lewis C Howe","Tianming Liu"],"pdf_url":"https://arxiv.org/pdf/2412.05184v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05167v1","updated":"2024-12-06T16:34:15Z","published":"2024-12-06T16:34:15Z","title":"Benchmarking Open-ended Audio Dialogue Understanding for Large\n Audio-Language Models","summary":" Large Audio-Language Models (LALMs) have unlocked audio dialogue\ncapabilities, where audio dialogues are a direct exchange of spoken language\nbetween LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs\nin back-and-forth audio dialogues with humans. 
This progression not only\nunderscores the potential of LALMs but also broadens their applicability across\na wide range of practical scenarios supported by audio dialogues. However,\ngiven these advancements, a comprehensive benchmark to evaluate the performance\nof LALMs in the open-ended audio dialogue understanding remains absent\ncurrently. To address this gap, we propose an Audio Dialogue Understanding\nBenchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the\nopen-ended audio dialogue ability for LALMs in 3 general scenarios, 12 skills,\n9 multilingual languages, and 4 categories of ambiguity handling. Notably, we\nfirstly propose the evaluation of ambiguity handling in audio dialogues that\nexpresses different intentions beyond the same literal meaning of sentences,\ne.g., \"Really!?\" with different intonations. In summary, ADU-Bench includes\nover 20,000 open-ended audio dialogues for the assessment of LALMs. Through\nextensive experiments conducted on 13 LALMs, our analysis reveals that there is\nstill considerable room for improvement in the audio dialogue understanding\nabilities of existing LALMs. In particular, they struggle with mathematical\nsymbols and formulas, understanding human behavior such as roleplay,\ncomprehending multiple languages, and handling audio dialogue ambiguities from\ndifferent phonetic elements, such as intonations, pause positions, and\nhomophones.\n","authors":["Kuofeng Gao","Shu-Tao Xia","Ke Xu","Philip Torr","Jindong Gu"],"pdf_url":"https://arxiv.org/pdf/2412.05167v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.12537v2","updated":"2024-12-06T16:22:21Z","published":"2024-11-19T14:35:38Z","title":"Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues","summary":" Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and\nDeltaNet have emerged as efficient alternatives to Transformers in large\nlanguage modeling, offering linear scaling with sequence length and improved\ntraining efficiency. However, LRNNs struggle to perform state-tracking which\nmay impair performance in tasks such as code evaluation or tracking a chess\ngame. Even parity, the simplest state-tracking task, which non-linear RNNs like\nLSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et\nal. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity\nstems from restricting the value range of their diagonal state-transition\nmatrices to $[0, 1]$ and that incorporating negative values can resolve this\nissue. We extend this result to non-diagonal LRNNs, which have recently shown\npromise in models such as DeltaNet. We prove that finite precision LRNNs with\nstate-transition matrices having only positive eigenvalues cannot solve parity,\nwhile complex eigenvalues are needed to count modulo $3$. Notably, we also\nprove that LRNNs can learn any regular language when their state-transition\nmatrices are products of identity minus vector outer product matrices, each\nwith eigenvalues in the range $[-1, 1]$. Our empirical results confirm that\nextending the eigenvalue range of models like Mamba and DeltaNet to include\nnegative values not only enables them to solve parity but consistently improves\ntheir performance on state-tracking tasks. Furthermore, pre-training LRNNs with\nan extended eigenvalue range for language modeling achieves comparable\nperformance and stability while showing promise on code and math data. 
Our work\nenhances the expressivity of modern LRNNs, broadening their applicability\nwithout changing the cost of training or inference.\n","authors":["Riccardo Grazzi","Julien Siems","Jörg K. H. Franke","Arber Zela","Frank Hutter","Massimiliano Pontil"],"pdf_url":"https://arxiv.org/pdf/2411.12537v2.pdf","comment":"Main changes: Correction to Theorem 1 and 2 (we excluded from the\n only if condition complex eigenvalues with modulus strictly less than one).\n Correction to point 3 of Proposition 3"},{"id":"http://arxiv.org/abs/2412.05155v1","updated":"2024-12-06T16:13:19Z","published":"2024-12-06T16:13:19Z","title":"Multimodal Fact-Checking with Vision Language Models: A Probing\n Classifier based Solution with Embedding Strategies","summary":" This study evaluates the effectiveness of Vision Language Models (VLMs) in\nrepresenting and utilizing multimodal content for fact-checking. To be more\nspecific, we investigate whether incorporating multimodal content improves\nperformance compared to text-only models and how well VLMs utilize text and\nimage information to enhance misinformation detection. Furthermore we propose a\nprobing classifier based solution using VLMs. Our approach extracts embeddings\nfrom the last hidden layer of selected VLMs and inputs them into a neural\nprobing classifier for multi-class veracity classification. Through a series of\nexperiments on two fact-checking datasets, we demonstrate that while\nmultimodality can enhance performance, fusing separate embeddings from text and\nimage encoders yielded superior results compared to using VLM embeddings.\nFurthermore, the proposed neural classifier significantly outperformed KNN and\nSVM baselines in leveraging extracted embeddings, highlighting its\neffectiveness for multimodal fact-checking.\n","authors":["Recep Firat Cekinel","Pinar Karagoz","Cagri Coltekin"],"pdf_url":"https://arxiv.org/pdf/2412.05155v1.pdf","comment":"Accepted to COLING2025"},{"id":"http://arxiv.org/abs/2412.01806v2","updated":"2024-12-06T16:13:01Z","published":"2024-12-02T18:50:27Z","title":"Random Tree Model of Meaningful Memory","summary":" Traditional studies of memory for meaningful narratives focus on specific\nstories and their semantic structures but do not address common quantitative\nfeatures of recall across different narratives. We introduce a statistical\nensemble of random trees to represent narratives as hierarchies of key points,\nwhere each node is a compressed representation of its descendant leaves, which\nare the original narrative segments. Recall is modeled as constrained by\nworking memory capacity from this hierarchical structure. Our analytical\nsolution aligns with observations from large-scale narrative recall\nexperiments. Specifically, our model explains that (1) average recall length\nincreases sublinearly with narrative length, and (2) individuals summarize\nincreasingly longer narrative segments in each recall sentence. 
Additionally,\nthe theory predicts that for sufficiently long narratives, a universal,\nscale-invariant limit emerges, where the fraction of a narrative summarized by\na single recall sentence follows a distribution independent of narrative\nlength.\n","authors":["Weishun Zhong","Tankut Can","Antonis Georgiou","Ilya Shnayderman","Mikhail Katkov","Misha Tsodyks"],"pdf_url":"https://arxiv.org/pdf/2412.01806v2.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.05149v1","updated":"2024-12-06T16:06:08Z","published":"2024-12-06T16:06:08Z","title":"Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on\n Developmentally Plausible Corpora","summary":" The BabyLM Challenge is a community effort to close the data-efficiency gap\nbetween human and computational language learners. Participants compete to\noptimize language model training on a fixed language data budget of 100 million\nwords or less. This year, we released improved text corpora, as well as a\nvision-and-language corpus to facilitate research into cognitively plausible\nvision language models. Submissions were compared on evaluation tasks targeting\ngrammatical ability, (visual) question answering, pragmatic abilities, and\ngrounding, among other abilities. Participants could submit to a 10M-word\ntext-only track, a 100M-word text-only track, and/or a 100M-word and image\nmultimodal track. From 31 submissions employing diverse methods, a hybrid\ncausal-masked language model architecture outperformed other approaches. No\nsubmissions outperformed the baselines in the multimodal track. In follow-up\nanalyses, we found a strong relationship between training FLOPs and average\nperformance across tasks, and that the best-performing submissions proposed\nchanges to the training data, training objective, and model architecture. This\nyear's BabyLM Challenge shows that there is still significant room for\ninnovation in this setting, in particular for image-text modeling, but\ncommunity-driven research can yield actionable insights about effective\nstrategies for small-scale language modeling.\n","authors":["Michael Y. Hu","Aaron Mueller","Candace Ross","Adina Williams","Tal Linzen","Chengxu Zhuang","Ryan Cotterell","Leshem Choshen","Alex Warstadt","Ethan Gotlieb Wilcox"],"pdf_url":"https://arxiv.org/pdf/2412.05149v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05145v1","updated":"2024-12-06T16:01:30Z","published":"2024-12-06T16:01:30Z","title":"Explingo: Explaining AI Predictions using Large Language Models","summary":" Explanations of machine learning (ML) model predictions generated by\nExplainable AI (XAI) techniques such as SHAP are essential for people using ML\noutputs for decision-making. We explore the potential of Large Language Models\n(LLMs) to transform these explanations into human-readable, narrative formats\nthat align with natural communication. We address two key research questions:\n(1) Can LLMs reliably transform traditional explanations into high-quality\nnarratives? and (2) How can we effectively evaluate the quality of narrative\nexplanations? To answer these questions, we introduce Explingo, which consists\nof two LLM-based subsystems, a Narrator and Grader. The Narrator takes in ML\nexplanations and transforms them into natural-language descriptions. 
The Grader\nscores these narratives on a set of metrics including accuracy, completeness,\nfluency, and conciseness.\n Our experiments demonstrate that LLMs can generate high-quality narratives\nthat achieve high scores across all metrics, particularly when guided by a\nsmall number of human-labeled and bootstrapped examples. We also identified\nareas that remain challenging, in particular for effectively scoring narratives\nin complex domains. The findings from this work have been integrated into an\nopen-source tool that makes narrative explanations available for further\napplications.\n","authors":["Alexandra Zytek","Sara Pido","Sarah Alnegheimish","Laure Berti-Equille","Kalyan Veeramachaneni"],"pdf_url":"https://arxiv.org/pdf/2412.05145v1.pdf","comment":"To be presented in the 2024 IEEE International Conference on Big Data\n (IEEE BigData)"},{"id":"http://arxiv.org/abs/2412.05139v1","updated":"2024-12-06T15:56:11Z","published":"2024-12-06T15:56:11Z","title":"A Practical Examination of AI-Generated Text Detectors for Large\n Language Models","summary":" The proliferation of large language models has raised growing concerns about\ntheir misuse, particularly in cases where AI-generated text is falsely\nattributed to human authors. Machine-generated content detectors claim to\neffectively identify such text under various conditions and from any language\nmodel. This paper critically evaluates these claims by assessing several\npopular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, GPTID, LogRank,\nBinoculars) on a range of domains, datasets, and models that these detectors\nhave not previously encountered. We employ various prompting strategies to\nsimulate adversarial attacks, demonstrating that even moderate efforts can\nsignificantly evade detection. We emphasize the importance of the true positive\nrate at a specific false positive rate (TPR@FPR) metric and demonstrate that\nthese detectors perform poorly in certain settings, with TPR@.01 as low as 0\\%.\nOur findings suggest that both trained and zero-shot detectors struggle to\nmaintain high sensitivity while achieving a reasonable true positive rate.\n","authors":["Brian Tufts","Xuandong Zhao","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2412.05139v1.pdf","comment":"8 pages. Submitted to ARR October cycle"},{"id":"http://arxiv.org/abs/2406.07057v2","updated":"2024-12-06T14:21:06Z","published":"2024-06-11T08:38:13Z","title":"MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal\n Large Language Models","summary":" Despite the superior capabilities of Multimodal Large Language Models (MLLMs)\nacross diverse tasks, they still face significant trustworthiness challenges.\nYet, current literature on the assessment of trustworthy MLLMs remains limited,\nlacking a holistic evaluation to offer thorough insights into future\nimprovements. In this work, we establish MultiTrust, the first comprehensive\nand unified benchmark on the trustworthiness of MLLMs across five primary\naspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark\nemploys a rigorous evaluation strategy that addresses both multimodal risks and\ncross-modal impacts, encompassing 32 diverse tasks with self-curated datasets.\nExtensive experiments with 21 modern MLLMs reveal some previously unexplored\ntrustworthiness issues and risks, highlighting the complexities introduced by\nthe multimodality and underscoring the necessity for advanced methodologies to\nenhance their reliability. 
For instance, typical proprietary models still\nstruggle with the perception of visually confusing images and are vulnerable to\nmultimodal jailbreaking and adversarial attacks; MLLMs are more inclined to\ndisclose privacy in text and reveal ideological and cultural biases even when\npaired with irrelevant images in inference, indicating that the multimodality\namplifies the internal risks from base LLMs. Additionally, we release a\nscalable toolbox for standardized trustworthiness research, aiming to\nfacilitate future advancements in this important field. Code and resources are\npublicly available at: https://multi-trust.github.io/.\n","authors":["Yichi Zhang","Yao Huang","Yitong Sun","Chang Liu","Zhe Zhao","Zhengwei Fang","Yifan Wang","Huanran Chen","Xiao Yang","Xingxing Wei","Hang Su","Yinpeng Dong","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2406.07057v2.pdf","comment":"100 pages, 84 figures, 33 tables"},{"id":"http://arxiv.org/abs/2411.19832v2","updated":"2024-12-06T13:41:53Z","published":"2024-11-29T16:44:02Z","title":"Sensitive Content Classification in Social Media: A Holistic Resource\n and Evaluation","summary":" The detection of sensitive content in large datasets is crucial for ensuring\nthat shared and analysed data is free from harmful material. However, current\nmoderation tools, such as external APIs, suffer from limitations in\ncustomisation, accuracy across diverse sensitive categories, and privacy\nconcerns. Additionally, existing datasets and open-source models focus\npredominantly on toxic language, leaving gaps in detecting other sensitive\ncategories such as substance abuse or self-harm. In this paper, we put forward\na unified dataset tailored for social media content moderation across six\nsensitive categories: conflictual language, profanity, sexually explicit\nmaterial, drug-related content, self-harm, and spam. By collecting and\nannotating data with consistent retrieval strategies and guidelines, we address\nthe shortcomings of previous focalised research. Our analysis demonstrates that\nfine-tuning large language models (LLMs) on this novel dataset yields\nsignificant improvements in detection performance compared to open\noff-the-shelf models such as LLaMA, and even proprietary OpenAI models, which\nunderperform by 10-15% overall. This limitation is even more pronounced on\npopular moderation APIs, which cannot be easily tailored to specific sensitive\ncontent categories, among others.\n","authors":["Dimosthenis Antypas","Indira Sen","Carla Perez-Almendros","Jose Camacho-Collados","Francesco Barbieri"],"pdf_url":"https://arxiv.org/pdf/2411.19832v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05028v1","updated":"2024-12-06T13:25:09Z","published":"2024-12-06T13:25:09Z","title":"Unifying Dual-Space Embedding for Entity Alignment via Contrastive\n Learning","summary":" Entity alignment aims to match identical entities across different knowledge\ngraphs (KGs). Graph neural network-based entity alignment methods have achieved\npromising results in Euclidean space. However, KGs often contain complex\nstructures, including both local and hierarchical ones, which make it\nchallenging to efficiently represent them within a single space. In this paper,\nwe proposed a novel method UniEA, which unifies dual-space embedding to\npreserve the intrinsic structure of KGs. Specifically, we learn graph structure\nembedding in both Euclidean and hyperbolic spaces simultaneously to maximize\nthe consistency between the embedding in both spaces. 
Moreover, we employ\ncontrastive learning to mitigate the misalignment issues caused by similar\nentities, where embedding of similar neighboring entities within the KG become\ntoo close in distance. Extensive experiments on benchmark datasets demonstrate\nthat our method achieves state-of-the-art performance in structure-based EA.\nOur code is available at https://github.com/wonderCS1213/UniEA.\n","authors":["Cunda Wang","Weihua Wang","Qiuyu Liang","Feilong Bao","Guanglai Gao"],"pdf_url":"https://arxiv.org/pdf/2412.05028v1.pdf","comment":"Accepted by COLING2025"},{"id":"http://arxiv.org/abs/2410.13166v3","updated":"2024-12-06T13:22:11Z","published":"2024-10-17T02:47:10Z","title":"An Evolved Universal Transformer Memory","summary":" Prior methods propose to offset the escalating costs of modern foundation\nmodels by dropping specific parts of their contexts with hand-designed rules,\nwhile attempting to preserve their original performance. We overcome this\ntrade-off with Neural Attention Memory Models (NAMMs), introducing a learned\nnetwork for memory management that improves both the performance and efficiency\nof transformers. We evolve NAMMs atop pre-trained transformers to provide\ndifferent latent contexts focusing on the most relevant information for\nindividual layers and attention heads. NAMMs are universally applicable to any\nmodel using self-attention as they condition exclusively on the values in the\nproduced attention matrices. Learning NAMMs on a small set of problems, we\nachieve substantial performance improvements across multiple long-context\nbenchmarks while cutting the model's input contexts up to a fraction of the\noriginal sizes. We show the generality of our conditioning enables zero-shot\ntransfer of NAMMs trained only on language to entirely new transformer\narchitectures even across input modalities, with their benefits carrying over\nto vision and reinforcement learning.\n","authors":["Edoardo Cetin","Qi Sun","Tianyu Zhao","Yujin Tang"],"pdf_url":"https://arxiv.org/pdf/2410.13166v3.pdf","comment":"Preprint, under submission. Source code is available at\n https://github.com/SakanaAI/evo-memory"},{"id":"http://arxiv.org/abs/2412.05023v1","updated":"2024-12-06T13:20:57Z","published":"2024-12-06T13:20:57Z","title":"Steps are all you need: Rethinking STEM Education with Prompt\n Engineering","summary":" Few shot and Chain-of-Thought prompting have shown promise when applied to\nPhysics Question Answering Tasks, but are limited by the lack of mathematical\nability inherent to LLMs, and are prone to hallucination. By utilizing a\nMixture of Experts (MoE) Model, along with analogical prompting, we are able to\nshow improved model performance when compared to the baseline on standard LLMs.\nWe also survey the limits of these prompting techniques and the effects they\nhave on model performance. 
Additionally, we propose Analogical CoT prompting, a\nprompting technique designed to allow smaller, open source models to leverage\nAnalogical prompting, something they have struggled with, possibly due to a\nlack of specialist training data.\n","authors":["Krishnasai Addala","Kabir Dev Paul Baghel","Chhavi Kirtani","Avinash Anand","Rajiv Ratn Shah"],"pdf_url":"https://arxiv.org/pdf/2412.05023v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.02976v2","updated":"2024-12-06T12:39:00Z","published":"2024-09-04T13:59:38Z","title":"Hallucination Detection in LLMs: Fast and Memory-Efficient Fine-Tuned\n Models","summary":" Uncertainty estimation is a necessary component when implementing AI in\nhigh-risk settings, such as autonomous cars, medicine, or insurances. Large\nLanguage Models (LLMs) have seen a surge in popularity in recent years, but\nthey are subject to hallucinations, which may cause serious harm in high-risk\nsettings. Despite their success, LLMs are expensive to train and run: they need\na large amount of computations and memory, preventing the use of ensembling\nmethods in practice. In this work, we present a novel method that allows for\nfast and memory-friendly training of LLM ensembles. We show that the resulting\nensembles can detect hallucinations and are a viable approach in practice as\nonly one GPU is needed for training and inference.\n","authors":["Gabriel Y. Arteaga","Thomas B. Schön","Nicolas Pielawski"],"pdf_url":"https://arxiv.org/pdf/2409.02976v2.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2402.01349v3","updated":"2024-12-06T11:54:40Z","published":"2024-02-02T12:07:00Z","title":"LLMs May Perform MCQA by Selecting the Least Incorrect Option","summary":" In the field of NLP, Large Language Models (LLMs) have markedly enhanced\nperformance across a variety of tasks. However, the comprehensive evaluation of\nLLMs remains an inevitable challenge for the community. Recently, the adoption\nof Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs\nhas gained considerable traction. However, concerns regarding the robustness of\nthis evaluative method persist. Building upon previous discussions on the issue\nof \\textit{variability}, we reveal an additional dimension of concern: LLMs may\nperform MCQA by selecting the least incorrect option rather than distinctly\ncorrect. This observation suggests that LLMs might regard multiple options as\ncorrect, which could undermine the reliability of MCQA as a metric for\nevaluating LLMs. 
To address this challenge, we introduce an enhanced dataset\naugmentation method for MCQA, termed MCQA+, to provide a more accurate\nreflection of the model performance, thereby highlighting the necessity for\nmore sophisticated evaluation mechanisms in the assessment of LLM capabilities.\n","authors":["Haochun Wang","Sendong Zhao","Zewen Qiang","Nuwa Xi","Bing Qin","Ting Liu"],"pdf_url":"https://arxiv.org/pdf/2402.01349v3.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2412.04975v1","updated":"2024-12-06T11:49:18Z","published":"2024-12-06T11:49:18Z","title":"PETapter: Leveraging PET-style classification heads for modular few-shot\n parameter-efficient fine-tuning","summary":" Few-shot learning and parameter-efficient fine-tuning (PEFT) are crucial to\novercome the challenges of data scarcity and ever growing language model sizes.\nThis applies in particular to specialized scientific domains, where researchers\nmight lack expertise and resources to fine-tune high-performing language models\nto nuanced tasks. We propose PETapter, a novel method that effectively combines\nPEFT methods with PET-style classification heads to boost few-shot learning\ncapabilities without the significant computational overhead typically\nassociated with full model training. We validate our approach on three\nestablished NLP benchmark datasets and one real-world dataset from\ncommunication research. We show that PETapter not only achieves comparable\nperformance to full few-shot fine-tuning using pattern-exploiting training\n(PET), but also provides greater reliability and higher parameter efficiency\nwhile enabling higher modularity and easy sharing of the trained modules, which\nenables more researchers to utilize high-performing NLP-methods in their\nresearch.\n","authors":["Jonas Rieger","Mattes Ruckdeschel","Gregor Wiedemann"],"pdf_url":"https://arxiv.org/pdf/2412.04975v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04315v2","updated":"2024-12-06T11:39:27Z","published":"2024-12-05T16:31:13Z","title":"Densing Law of LLMs","summary":" Large Language Models (LLMs) have emerged as a milestone in artificial\nintelligence, and their performance can improve as the model size increases.\nHowever, this scaling brings great challenges to training and inference\nefficiency, particularly for deploying LLMs in resource-constrained\nenvironments, and the scaling trend is becoming increasingly unsustainable.\nThis paper introduces the concept of ``\\textit{capacity density}'' as a new\nmetric to evaluate the quality of the LLMs across different scales and\ndescribes the trend of LLMs in terms of both effectiveness and efficiency. To\ncalculate the capacity density of a given target LLM, we first introduce a set\nof reference models and develop a scaling law to predict the downstream\nperformance of these reference models based on their parameter sizes. We then\ndefine the \\textit{effective parameter size} of the target LLM as the parameter\nsize required by a reference model to achieve equivalent performance, and\nformalize the capacity density as the ratio of the effective parameter size to\nthe actual parameter size of the target LLM. Capacity density provides a\nunified framework for assessing both model effectiveness and efficiency. 
Our\nfurther analysis of recent open-source base LLMs reveals an empirical law (the\ndensing law) that the capacity density of LLMs grows exponentially over time.\nMore specifically, using some widely used benchmarks for evaluation, the\ncapacity density of LLMs doubles approximately every three months. The law\nprovides new perspectives to guide future LLM development, emphasizing the\nimportance of improving capacity density to achieve optimal results with\nminimal computational overhead.\n","authors":["Chaojun Xiao","Jie Cai","Weilin Zhao","Guoyang Zeng","Biyuan Lin","Jie Zhou","Zhi Zheng","Xu Han","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2412.04315v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04954v1","updated":"2024-12-06T11:14:03Z","published":"2024-12-06T11:14:03Z","title":"Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for\n Radiology Report Generation","summary":" We introduce a radiology-focused visual language model designed to generate\nradiology reports from chest X-rays. Building on previous findings that large\nlanguage models (LLMs) can acquire multimodal capabilities when aligned with\npretrained vision encoders, we demonstrate similar potential with chest X-ray\nimages. This integration enhances the ability of the model to understand and\ndescribe chest X-ray images. Our model combines an image encoder with a\nfine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate\ndifferent sections of a radiology report with notable accuracy. The training\nprocess involves a two-stage approach: (i) initial alignment of chest X-ray\nfeatures with the LLM, followed by (ii) fine-tuning for radiology report\ngeneration.\n","authors":["Xi Zhang","Zaiqiao Meng","Jake Lever","Edmond S. L. Ho"],"pdf_url":"https://arxiv.org/pdf/2412.04954v1.pdf","comment":"Accepted by BioNLP@ACL 2024"},{"id":"http://arxiv.org/abs/2412.04948v1","updated":"2024-12-06T11:08:24Z","published":"2024-12-06T11:08:24Z","title":"KaLM: Knowledge-aligned Autoregressive Language Modeling via Dual-view\n Knowledge Graph Contrastive Learning","summary":" Autoregressive large language models (LLMs) pre-trained by next token\nprediction are inherently proficient in generative tasks. However, their\nperformance on knowledge-driven tasks such as factual knowledge querying\nremains unsatisfactory. Knowledge graphs (KGs), as high-quality structured\nknowledge bases, can provide reliable knowledge for LLMs, potentially\ncompensating for their knowledge deficiencies. Aligning LLMs with explicit,\nstructured knowledge from KGs has been a challenge; previous attempts either\nfailed to effectively align knowledge representations or compromised the\ngenerative capabilities of LLMs, leading to less-than-optimal outcomes. This\npaper proposes \textbf{KaLM}, a \textit{Knowledge-aligned Language Modeling}\napproach, which fine-tunes autoregressive LLMs to align with KG knowledge via\nthe joint objective of explicit knowledge alignment and implicit knowledge\nalignment. The explicit knowledge alignment objective aims to directly optimize\nthe knowledge representation of LLMs through dual-view knowledge graph\ncontrastive learning. The implicit knowledge alignment objective focuses on\nincorporating textual patterns of knowledge into LLMs through triple completion\nlanguage modeling. 
Notably, our method achieves a significant performance boost\nin evaluations of knowledge-driven tasks, specifically embedding-based\nknowledge graph completion and generation-based knowledge graph question\nanswering.\n","authors":["Peng Yu","Cheng Deng","Beiya Dai","Xinbing Wang","Ying Wen"],"pdf_url":"https://arxiv.org/pdf/2412.04948v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04947v1","updated":"2024-12-06T11:07:44Z","published":"2024-12-06T11:07:44Z","title":"C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model\n Evaluation","summary":" Recent advances in large language models (LLMs) have shown significant\npromise, yet their evaluation raises concerns, particularly regarding data\ncontamination due to the lack of access to proprietary training data. To\naddress this issue, we present C$^2$LEVA, a comprehensive bilingual benchmark\nfeaturing systematic contamination prevention. C$^2$LEVA firstly offers a\nholistic evaluation encompassing 22 tasks, each targeting a specific\napplication or ability of LLMs, and secondly a trustworthy assessment due to\nour contamination-free tasks, ensured by a systematic contamination prevention\nstrategy that fully automates test data renewal and enforces data protection\nduring benchmark data release. Our large-scale evaluation of 15 open-source and\nproprietary models demonstrates the effectiveness of C$^2$LEVA.\n","authors":["Yanyang Li","Tin Long Wong","Cheung To Hung","Jianqiao Zhao","Duo Zheng","Ka Wai Liu","Michael R. Lyu","Liwei Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04947v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04942v1","updated":"2024-12-06T11:00:05Z","published":"2024-12-06T11:00:05Z","title":"A Federated Approach to Few-Shot Hate Speech Detection for Marginalized\n Communities","summary":" Hate speech online remains an understudied issue for marginalized\ncommunities, and has seen rising relevance, especially in the Global South,\nwhich includes developing societies with increasing internet penetration. In\nthis paper, we aim to provide marginalized communities living in societies\nwhere the dominant language is low-resource with a privacy-preserving tool to\nprotect themselves from hate speech on the internet by filtering offensive\ncontent in their native languages. Our contribution in this paper is twofold:\n1) we release REACT (REsponsive hate speech datasets Across ConTexts), a\ncollection of high-quality, culture-specific hate speech detection datasets\ncomprising seven distinct target groups in eight low-resource languages,\ncurated by experienced data collectors; 2) we propose a solution to few-shot\nhate speech detection utilizing federated learning (FL), a privacy-preserving\nand collaborative learning approach, to continuously improve a central model\nthat exhibits robustness when tackling different target groups and languages.\nBy keeping the training local to the users' devices, we ensure the privacy of\nthe users' data while benefitting from the efficiency of federated learning.\nFurthermore, we personalize client models to target-specific training data and\nevaluate their performance. 
Our results indicate the effectiveness of FL across\ndifferent target groups, whereas the benefits of personalization on few-shot\nlearning are not clear.\n","authors":["Haotian Ye","Axel Wisiorek","Antonis Maronikolakis","Özge Alaçam","Hinrich Schütze"],"pdf_url":"https://arxiv.org/pdf/2412.04942v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04937v1","updated":"2024-12-06T10:45:54Z","published":"2024-12-06T10:45:54Z","title":"Who Speaks Next? Multi-party AI Discussion Leveraging the Systematics of\n Turn-taking in Murder Mystery Games","summary":" Multi-agent systems utilizing large language models (LLMs) have shown great\npromise in achieving natural dialogue. However, smooth dialogue control and\nautonomous decision making among agents still remain challenges. In this study,\nwe focus on conversational norms such as adjacency pairs and turn-taking found\nin conversation analysis and propose a new framework called \"Murder Mystery\nAgents\" that applies these norms to AI agents' dialogue control. As an\nevaluation target, we employed the \"Murder Mystery\" game, a reasoning-type\ntable-top role-playing game that requires complex social reasoning and\ninformation manipulation. In this game, players need to unravel the truth of\nthe case based on fragmentary information through cooperation and bargaining.\nThe proposed framework integrates next speaker selection based on adjacency\npairs and a self-selection mechanism that takes agents' internal states into\naccount to achieve more natural and strategic dialogue. To verify the\neffectiveness of this new approach, we analyzed utterances that led to dialogue\nbreakdowns and conducted automatic evaluation using LLMs, as well as human\nevaluation using evaluation criteria developed for the Murder Mystery game.\nExperimental results showed that the implementation of the next speaker\nselection mechanism significantly reduced dialogue breakdowns and improved the\nability of agents to share information and perform logical reasoning. The\nresults of this study demonstrate that the systematics of turn-taking in human\nconversation are also effective in controlling dialogue among AI agents, and\nprovide design guidelines for more advanced multi-agent dialogue systems.\n","authors":["Ryota Nonomura","Hiroki Mori"],"pdf_url":"https://arxiv.org/pdf/2412.04937v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09869v4","updated":"2024-12-06T10:44:56Z","published":"2024-08-19T10:20:06Z","title":"Docling Technical Report","summary":" We introduce Docling, an easy-to-use, self-contained, MIT-licensed,\nopen-source toolkit for document conversion that can parse several types of\npopular document formats into a unified, richly structured representation. It\nis powered by state-of-the-art specialized AI models for layout analysis\n(DocLayNet) and table structure recognition (TableFormer), and runs efficiently\non commodity hardware in a small resource budget. Docling is released as a\nPython package and can be used as a Python API or as a CLI tool. Docling's\nmodular architecture and efficient document representation, known as\nDoclingDocument, make it easy to implement extensions, new features, models,\nand customizations. Docling has already been integrated into other popular\nopen-source frameworks (e.g., LlamaIndex, LangChain, spaCy), making it a\nnatural fit for the processing of documents and the development of high-end\napplications. 
The open-source community has fully engaged in using, promoting,\nand developing for Docling, which gathered 10k stars on GitHub in less than a\nmonth and was reported as the No. 1 trending repository in GitHub worldwide in\nNovember 2024.\n","authors":["Nikolaos Livathinos","Christoph Auer","Maksym Lysak","Ahmed Nassar","Michele Dolfi","Panos Vagenas","Cesar Berrospi Ramis","Matteo Omenetti","Kasper Dinkla","Yusik Kim","Shubham Gupta","Rafael Teixeira de Lima","Valery Weber","Lucas Morin","Ingmar Meijer","Viktor Kuropiatnyk","Peter W. J. Staar"],"pdf_url":"https://arxiv.org/pdf/2408.09869v4.pdf","comment":"Submitted to AAAI 25: Workshop on Open-Source AI for Mainstream Use"},{"id":"http://arxiv.org/abs/2412.04936v1","updated":"2024-12-06T10:44:20Z","published":"2024-12-06T10:44:20Z","title":"Probing the contents of semantic representations from text, behavior,\n and brain data using the psychNorms metabase","summary":" Semantic representations are integral to natural language processing,\npsycholinguistics, and artificial intelligence. Although often derived from\ninternet text, recent years have seen a rise in the popularity of\nbehavior-based (e.g., free associations) and brain-based (e.g., fMRI)\nrepresentations, which promise improvements in our ability to measure and model\nhuman representations. We carry out the first systematic evaluation of the\nsimilarities and differences between semantic representations derived from\ntext, behavior, and brain data. Using representational similarity analysis, we\nshow that word vectors derived from behavior and brain data encode information\nthat differs from their text-derived cousins. Furthermore, drawing on our\npsychNorms metabase, alongside an interpretability method that we call\nrepresentational content analysis, we find that, in particular, behavior\nrepresentations capture unique variance on certain affective, agentic, and\nsocio-moral dimensions. We thus establish behavior as an important complement\nto text for capturing human representations and behavior. These results are\nbroadly relevant to research aimed at learning human-aligned semantic\nrepresentations, including work on evaluating and aligning large language\nmodels.\n","authors":["Zak Hussain","Rui Mata","Ben R. Newell","Dirk U. Wulff"],"pdf_url":"https://arxiv.org/pdf/2412.04936v1.pdf","comment":"13 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2410.01294v2","updated":"2024-12-06T10:31:43Z","published":"2024-10-02T07:40:56Z","title":"Endless Jailbreaks with Bijection Learning","summary":" Despite extensive safety measures, LLMs are vulnerable to adversarial inputs,\nor jailbreaks, which can elicit unsafe behaviors. In this work, we introduce\nbijection learning, a powerful attack algorithm which automatically fuzzes LLMs\nfor safety vulnerabilities using randomly-generated encodings whose complexity\ncan be tightly controlled. We leverage in-context learning to teach models\nbijective encodings, pass encoded queries to the model to bypass built-in\nsafety mechanisms, and finally decode responses back into English. Our attack\nis extremely effective on a wide range of frontier language models. Moreover,\nby controlling complexity parameters such as number of key-value mappings in\nthe encodings, we find a close relationship between the capability level of the\nattacked LLM and the average complexity of the most effective bijection\nattacks. 
Our work highlights that new vulnerabilities in frontier models can\nemerge with scale: more capable models are more severely jailbroken by\nbijection attacks.\n","authors":["Brian R. Y. Huang","Maximilian Li","Leonard Tang"],"pdf_url":"https://arxiv.org/pdf/2410.01294v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00353v2","updated":"2024-12-06T10:24:47Z","published":"2024-11-30T04:22:00Z","title":"Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided\n Strategy Selection","summary":" Chain-of-thought (CoT) prompting has significantly enhanced the capability of\nlarge language models (LLMs) by structuring their reasoning processes. However,\nexisting methods face critical limitations: handcrafted demonstrations require\nextensive human expertise, while trigger phrases are prone to inaccuracies. In\nthis paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method,\na novel approach that improves CoT prompting by utilizing uncertainty estimates\nto select effective demonstrations without needing access to model parameters.\nUnlike traditional methods, ZEUS offers high sensitivity in distinguishing\nbetween helpful and ineffective questions, ensuring more precise and reliable\nselection. Our extensive evaluation shows that ZEUS consistently outperforms\nexisting CoT strategies across four challenging reasoning benchmarks,\ndemonstrating its robustness and scalability.\n","authors":["Shanu Kumar","Saish Mendke","Karody Lubna Abdul Rahman","Santosh Kurasa","Parag Agrawal","Sandipan Dandapat"],"pdf_url":"https://arxiv.org/pdf/2412.00353v2.pdf","comment":"Accepted in COLING 2025"},{"id":"http://arxiv.org/abs/2412.04922v1","updated":"2024-12-06T10:21:25Z","published":"2024-12-06T10:21:25Z","title":"Large Language Models for Ingredient Substitution in Food Recipes using\n Supervised Fine-tuning and Direct Preference Optimization","summary":" In this paper, we address the challenge of recipe personalization through\ningredient substitution. We make use of Large Language Models (LLMs) to build\nan ingredient substitution system designed to predict plausible substitute\ningredients within a given recipe context. Given that the use of LLMs for this\ntask has been barely done, we carry out an extensive set of experiments to\ndetermine the best LLM, prompt, and the fine-tuning setups. We further\nexperiment with methods such as multi-task learning, two-stage fine-tuning, and\nDirect Preference Optimization (DPO). The experiments are conducted using the\npublicly available Recipe1MSub corpus. The best results are produced by the\nMistral7-Base LLM after fine-tuning and DPO. This result outperforms the strong\nbaseline available for the same corpus with a Hit@1 score of 22.04. Thus we\nbelieve that this research represents a significant step towards enabling\npersonalized and creative culinary experiences by utilizing LLM-based\ningredient substitution.\n","authors":["Thevin Senath","Kumuthu Athukorala","Ransika Costa","Surangika Ranathunga","Rishemjit Kaur"],"pdf_url":"https://arxiv.org/pdf/2412.04922v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04905v1","updated":"2024-12-06T10:01:38Z","published":"2024-12-06T10:01:38Z","title":"DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling","summary":" Large language models (LLMs) have made dialogue one of the central modes of\nhuman-machine interaction, leading to the accumulation of vast amounts of\nconversation logs and increasing demand for dialogue generation. 
A\nconversational life-cycle spans from the Prelude through the Interlocution to\nthe Epilogue, encompassing various elements. Despite the existence of numerous\ndialogue-related studies, there is a lack of benchmarks that encompass\ncomprehensive dialogue elements, hindering precise modeling and systematic\nevaluation. To bridge this gap, we introduce an innovative research task\n$\\textbf{D}$ialogue $\\textbf{E}$lement $\\textbf{MO}$deling, including\n$\\textit{Element Awareness}$ and $\\textit{Dialogue Agent Interaction}$, and\npropose a novel benchmark, $\\textbf{DEMO}$, designed for a comprehensive\ndialogue modeling and assessment. Inspired by imitation learning, we further\nbuild the agent which possesses the adept ability to model dialogue elements\nbased on the DEMO benchmark. Extensive experiments indicate that existing LLMs\nstill exhibit considerable potential for enhancement, and our DEMO agent has\nsuperior performance in both in-domain and out-of-domain tasks.\n","authors":["Minzheng Wang","Xinghua Zhang","Kun Chen","Nan Xu","Haiyang Yu","Fei Huang","Wenji Mao","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2412.04905v1.pdf","comment":"We release the code and data at https://github.com/MozerWang/DEMO"},{"id":"http://arxiv.org/abs/2412.04903v1","updated":"2024-12-06T09:59:47Z","published":"2024-12-06T09:59:47Z","title":"EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation","summary":" Multimodal large language models (MLLMs) have achieved remarkable progress on\nvarious visual question answering and reasoning tasks leveraging instruction\nfine-tuning specific datasets. They can also learn from preference data\nannotated by human to enhance their reasoning ability and mitigate\nhallucinations. Most of preference data is generated from the model itself.\nHowever, existing methods require high-quality critical labels, which are\ncostly and rely on human or proprietary models like GPT-4V. In this work, we\npropose Enhancing Alignment in MLLMs via Critical Observation (EACO), which\naligns MLLMs by self-generated preference data using only 5k images\neconomically. Our approach begins with collecting and refining a Scoring\nEvaluation Instruction-tuning dataset to train a critical evaluation model,\ntermed the Critic. This Critic observes model responses across multiple\ndimensions, selecting preferred and non-preferred outputs for refined Direct\nPreference Optimization (DPO) tuning. To further enhance model performance, we\nemploy an additional supervised fine-tuning stage after preference tuning. EACO\nreduces the overall hallucinations by 65.6% on HallusionBench and improves the\nreasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement\nover LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also\nshows the potential critical ability in open-source MLLMs, demonstrating that\nEACO is a viable path to boost the competence of MLLMs.\n","authors":["Yongxin Wang","Meng Cao","Haokun Lin","Mingfei Han","Liang Ma","Jin Jiang","Yuhao Cheng","Xiaodan Liang"],"pdf_url":"https://arxiv.org/pdf/2412.04903v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2412.03681v2","updated":"2024-12-06T09:43:00Z","published":"2024-12-04T19:23:37Z","title":"Acquired TASTE: Multimodal Stance Detection with Textual and Structural\n Embeddings","summary":" Stance detection plays a pivotal role in enabling an extensive range of\ndownstream applications, from discourse parsing to tracing the spread of fake\nnews and the denial of scientific facts. 
While most stance classification\nmodels rely on textual representation of the utterance in question, prior work\nhas demonstrated the importance of the conversational context in stance\ndetection. In this work we introduce TASTE -- a multimodal architecture for\nstance detection that harmoniously fuses Transformer-based content embedding\nwith unsupervised structural embedding. Through the fine-tuning of a pretrained\ntransformer and the amalgamation with social embedding via a Gated Residual\nNetwork (GRN) layer, our model adeptly captures the complex interplay between\ncontent and conversational structure in determining stance. TASTE achieves\nstate-of-the-art results on common benchmarks, significantly outperforming an\narray of strong baselines. Comparative evaluations underscore the benefits of\nsocial grounding -- emphasizing the criticality of concurrently harnessing both\ncontent and structure for enhanced stance detection.\n","authors":["Guy Barel","Oren Tsur","Dan Vilenchik"],"pdf_url":"https://arxiv.org/pdf/2412.03681v2.pdf","comment":"The modified camera ready version will be published in January 2025\n at COLING"},{"id":"http://arxiv.org/abs/2410.02691v3","updated":"2024-12-06T09:27:09Z","published":"2024-10-03T17:18:03Z","title":"On the Proper Treatment of Tokenization in Psycholinguistics","summary":" Language models are widely used in computational psycholinguistics to test\ntheories that relate the negative log probability (the surprisal) of a region\nof interest (a substring of characters) under a language model to its cognitive\ncost experienced by readers, as operationalized, for example, by gaze duration\non the region. However, the application of modern language models to\npsycholinguistic studies is complicated by the practice of using tokenization\nas an intermediate step in training a model. Doing so results in a language\nmodel over token strings rather than one over character strings. Vexingly,\nregions of interest are generally misaligned with these token strings. The\npaper argues that token-level language models should be (approximately)\nmarginalized into character-level language models before they are used in\npsycholinguistic studies to compute the surprisal of a region of interest;\nthen, the marginalized character-level language model can be used to compute\nthe surprisal of an arbitrary character substring, which we term a focal area,\nthat the experimenter may wish to use as a predictor. Our proposal of\nmarginalizing a token-level model into a character-level one solves this\nmisalignment issue independently of the tokenization scheme. Empirically, we\ndiscover various focal areas whose surprisal is a better psychometric predictor\nthan the surprisal of the region of interest itself.\n","authors":["Mario Giulianelli","Luca Malagutti","Juan Luis Gastaldi","Brian DuSell","Tim Vieira","Ryan Cotterell"],"pdf_url":"https://arxiv.org/pdf/2410.02691v3.pdf","comment":"Main conference long paper at EMNLP 2024. New version: copy-editing\n and updated bib"},{"id":"http://arxiv.org/abs/2410.20936v2","updated":"2024-12-06T09:06:20Z","published":"2024-10-28T11:37:39Z","title":"Autoformalize Mathematical Statements by Symbolic Equivalence and\n Semantic Consistency","summary":" Autoformalization, the task of automatically translating natural language\ndescriptions into a formal language, poses a significant challenge across\nvarious domains, especially in mathematics. 
Recent advancements in large\nlanguage models (LLMs) have unveiled their promising capabilities to formalize\neven competition-level math problems. However, we observe a considerable\ndiscrepancy between pass@1 and pass@k accuracies in LLM-generated\nformalizations. To address this gap, we introduce a novel framework that scores\nand selects the best result from k autoformalization candidates based on two\ncomplementary self-consistency methods: symbolic equivalence and semantic\nconsistency. Elaborately, symbolic equivalence identifies the logical\nhomogeneity among autoformalization candidates using automated theorem provers,\nand semantic consistency evaluates the preservation of the original meaning by\ninformalizing the candidates and computing the similarity between the\nembeddings of the original and informalized texts. Our extensive experiments on\nthe MATH and miniF2F datasets demonstrate that our approach significantly\nenhances autoformalization accuracy, achieving up to 0.22-1.35x relative\nimprovements across various LLMs and baseline methods.\n","authors":["Zenan Li","Yifan Wu","Zhaoyu Li","Xinming Wei","Xian Zhang","Fan Yang","Xiaoxing Ma"],"pdf_url":"https://arxiv.org/pdf/2410.20936v2.pdf","comment":"Published as a conference paper at NeurIPS 2024. Code is available at\n https://github.com/Miracle-Messi/Isa-AutoFormal"},{"id":"http://arxiv.org/abs/2405.07623v2","updated":"2024-12-06T09:04:55Z","published":"2024-05-13T10:30:33Z","title":"COBias and Debias: Minimizing Language Model Pairwise Accuracy Bias via\n Nonlinear Integer Programming","summary":" When performing classification tasks with language models, would you prefer\nhaving only one highly accurate class or having every class deliver reliable\nperformance? Obviously, a more balanced accuracy among classes better reflects\nthe expectations of the majority of users. Especially for large language models\n(LLMs), the fact that they achieve a fair overall accuracy by in-context\nlearning (ICL) obscures a large difference in individual class accuracies. In\nthis work, we uncover and tackle language models' imbalance in per-class\nprediction accuracy by reconceptualizing it as the Contextual Oddity Bias\n(COBias), and we are the first to engage nonlinear integer programming (NIP) to\ndebias it. Briefly, the proposed COBias metric measures accuracy differences\namong class pairs, with which we reveal the large per-class accuracy\ndifferences exhibited in LLMs of varied scales and families. Then we propose\nDebiasing as Nonlinear Integer Programming (DNIP) to correct ICL per-class\nprobabilities towards lower COBias and higher overall accuracy. Our\noptimization objective is directly based on the evaluation scores by COBias and\naccuracy metrics, which is non-differentiable and solved by the simulated\nannealing metaheuristic. 
Evaluations on three LLMs across seven NLP\nclassification tasks show that DNIP simultaneously achieves significant COBias\nreduction (-27%) and accuracy improvement (+12%) over the conventional ICL\napproach, suggesting that modeling pairwise class accuracy differences is a\ndirection in pushing forward more accurate, more reliable LLM predictions.\n","authors":["Ruixi Lin","Yang You"],"pdf_url":"https://arxiv.org/pdf/2405.07623v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04871v1","updated":"2024-12-06T09:04:12Z","published":"2024-12-06T09:04:12Z","title":"Building a Family of Data Augmentation Models for Low-cost LLM\n Fine-tuning on the Cloud","summary":" Specializing LLMs in various domain-specific tasks has emerged as a critical\nstep towards achieving high performance. However, the construction and\nannotation of datasets in specific domains are always very costly. Apart from\nusing superior and expensive closed-source LLM APIs to construct datasets, some\nopen-source models have become strong enough to handle dataset construction in\nmany scenarios. Thus, we present a family of data augmentation models designed\nto significantly improve the efficiency for model fine-tuning. These models,\ntrained based on sufficiently small LLMs, support key functionalities with low\ninference costs: instruction expansion, instruction refinement, and\ninstruction-response pair expansion. To fulfill this goal, we first construct\nan automatic data collection system with seed datasets generated from both\npublic repositories and our in-house datasets. This system leverages powerful\nLLMs to expand, refine and re-write the instructions and responses,\nincorporating quality assessment techniques. Following this, we introduce the\ntraining process of our models, which effectively distills task-solving and\ntext synthesis abilities from teacher LLMs. Finally, we demonstrate how we\nintegrate these functionalities into a machine learning platform to support\nlow-cost LLM fine-tuning from both dataset preparation and training\nperspectives for users. Experiments and an application study prove the\neffectiveness of our approach.\n","authors":["Yuanhao Yue","Chengyu Wang","Jun Huang","Peng Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04871v1.pdf","comment":"coling 2025 industry track"},{"id":"http://arxiv.org/abs/2412.04862v1","updated":"2024-12-06T08:53:46Z","published":"2024-12-06T08:53:46Z","title":"EXAONE 3.5: Series of Large Language Models for Real-world Use Cases","summary":" This technical report introduces the EXAONE 3.5 instruction-tuned language\nmodels, developed and released by LG AI Research. The EXAONE 3.5 language\nmodels are offered in three configurations: 32B, 7.8B, and 2.4B. These models\nfeature several standout capabilities: 1) exceptional instruction following\ncapabilities in real-world scenarios, achieving the highest scores across seven\nbenchmarks, 2) outstanding long-context comprehension, attaining the top\nperformance in four benchmarks, and 3) competitive results compared to\nstate-of-the-art open models of similar sizes across nine general benchmarks.\nThe EXAONE 3.5 language models are open to anyone for research purposes and can\nbe downloaded from https://huggingface.co/LGAI-EXAONE. 
For commercial use,\nplease reach out to the official contact point of LG AI Research:\ncontact_us@lgresearch.ai.\n","authors":["LG AI Research","Soyoung An","Kyunghoon Bae","Eunbi Choi","Kibong Choi","Stanley Jungkyu Choi","Seokhee Hong","Junwon Hwang","Hyojin Jeon","Gerrard Jeongwon Jo","Hyunjik Jo","Jiyeon Jung","Yountae Jung","Hyosang Kim","Joonkee Kim","Seonghwan Kim","Soyeon Kim","Sunkyoung Kim","Yireun Kim","Yongil Kim","Youchul Kim","Edward Hwayoung Lee","Haeju Lee","Honglak Lee","Jinsik Lee","Kyungmin Lee","Woohyung Lim","Sangha Park","Sooyoun Park","Yongmin Park","Sihoon Yang","Heuiyeen Yeen","Hyeongu Yun"],"pdf_url":"https://arxiv.org/pdf/2412.04862v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2408.03541"},{"id":"http://arxiv.org/abs/2409.01497v2","updated":"2024-12-06T08:53:43Z","published":"2024-09-02T23:37:20Z","title":"DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using\n Large Language Models","summary":" As large language models (LLMs) gain traction in healthcare, concerns about\ntheir susceptibility to demographic biases are growing. We introduce\n{DiversityMedQA}, a novel benchmark designed to assess LLM responses to medical\nqueries across diverse patient demographics, such as gender and ethnicity. By\nperturbing questions from the MedQA dataset, which comprises medical board exam\nquestions, we created a benchmark that captures the nuanced differences in\nmedical diagnosis across varying patient profiles. Our findings reveal notable\ndiscrepancies in model performance when tested against these demographic\nvariations. Furthermore, to ensure the perturbations were accurate, we also\npropose a filtering strategy that validates each perturbation. By releasing\nDiversityMedQA, we provide a resource for evaluating and mitigating demographic\nbias in LLM medical diagnoses.\n","authors":["Rajat Rawat","Hudson McBride","Dhiyaan Nirmal","Rajarshi Ghosh","Jong Moon","Dhruv Alamuri","Sean O'Brien","Kevin Zhu"],"pdf_url":"https://arxiv.org/pdf/2409.01497v2.pdf","comment":"Published in NLP4PI @ EMNLP 2024, Accepted to AIM-FM @ NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04859v1","updated":"2024-12-06T08:52:30Z","published":"2024-12-06T08:52:30Z","title":"Breaking Event Rumor Detection via Stance-Separated Multi-Agent Debate","summary":" The rapid spread of rumors on social media platforms during breaking events\nseverely hinders the dissemination of the truth. Previous studies reveal that\nthe lack of annotated resources hinders the direct detection of unforeseen\nbreaking events not covered in yesterday's news. Leveraging large language\nmodels (LLMs) for rumor detection holds significant promise. However, it is\nchallenging for LLMs to provide comprehensive responses to complex or\ncontroversial issues due to limited diversity. In this work, we propose the\nStance Separated Multi-Agent Debate (S2MAD) to address this issue.\nSpecifically, we firstly introduce Stance Separation, categorizing comments as\neither supporting or opposing the original claim. Subsequently, claims are\nclassified as subjective or objective, enabling agents to generate reasonable\ninitial viewpoints with different prompt strategies for each type of claim.\nDebaters then follow specific instructions through multiple rounds of debate to\nreach a consensus. If a consensus is not reached, a judge agent evaluates the\nopinions and delivers a final verdict on the claim's veracity. 
Extensive\nexperiments conducted on two real-world datasets demonstrate that our proposed\nmodel outperforms state-of-the-art methods in terms of performance and\neffectively improves the performance of LLMs in breaking event rumor detection.\n","authors":["Mingqing Zhang","Haisong Gong","Qiang Liu","Shu Wu","Liang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04859v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.01084v2","updated":"2024-12-06T08:39:13Z","published":"2024-11-01T23:53:00Z","title":"Plentiful Jailbreaks with String Compositions","summary":" Large language models (LLMs) remain vulnerable to a slew of adversarial\nattacks and jailbreaking methods. One common approach employed by white-hat\nattackers, or red-teamers, is to process model inputs and outputs using\nstring-level obfuscations, which can include leetspeak, rotary ciphers, Base64,\nASCII, and more. Our work extends these encoding-based attacks by unifying them\nin a framework of invertible string transformations. With invertibility, we can\ndevise arbitrary string compositions, defined as sequences of transformations,\nthat we can encode and decode end-to-end programmatically. We devise an\nautomated best-of-n attack that samples from a combinatorially large number of\nstring compositions. Our jailbreaks obtain competitive attack success rates on\nseveral leading frontier models when evaluated on HarmBench, highlighting that\nencoding-based attacks remain a persistent vulnerability even in advanced LLMs.\n","authors":["Brian R. Y. Huang"],"pdf_url":"https://arxiv.org/pdf/2411.01084v2.pdf","comment":"NeurIPS SoLaR Workshop 2024"},{"id":"http://arxiv.org/abs/2412.03205v2","updated":"2024-12-06T08:29:43Z","published":"2024-12-04T10:44:50Z","title":"U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills\n in LLMs","summary":" The current evaluation of mathematical skills in LLMs is limited, as existing\nbenchmarks are either relatively small, primarily focus on elementary and\nhigh-school problems, or lack diversity in topics. Additionally, the inclusion\nof visual elements in tasks remains largely under-explored.\n To address these gaps, we introduce U-MATH, a novel benchmark of 1,100\nunpublished open-ended university-level problems sourced from teaching\nmaterials. It is balanced across six core subjects, with 20% of the problems\nbeing multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to\njudge the correctness of generated solutions. To this end, we release\n$\\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions.\n The evaluation of general domain, math-specific, and multimodal LLMs\nhighlights the challenges presented by U-MATH. Our findings reveal that LLMs\nachieve a maximum accuracy of only 63% on text-based tasks, with even lower 45%\non visual problems. The solution assessment proves challenging for LLMs, with\nthe best LLM judge having an F1-score of 80% on $\\mu$-MATH.\n","authors":["Konstantin Chernyshev","Vitaliy Polshkov","Ekaterina Artemova","Alex Myasnikov","Vlad Stepanov","Alexei Miasnikov","Sergei Tilga"],"pdf_url":"https://arxiv.org/pdf/2412.03205v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04836v1","updated":"2024-12-06T08:05:02Z","published":"2024-12-06T08:05:02Z","title":"Adaptive Dropout for Pruning Conformers","summary":" This paper proposes a method to effectively perform joint\ntraining-and-pruning based on adaptive dropout layers with unit-wise retention\nprobabilities. 
The proposed method is based on the estimation of a unit-wise\nretention probability in a dropout layer. A unit that is estimated to have a\nsmall retention probability can be considered to be prunable. The retention\nprobability of the unit is estimated using back-propagation and the\nGumbel-Softmax technique. This pruning method is applied at several application\npoints in Conformers such that the effective number of parameters can be\nsignificantly reduced. Specifically, adaptive dropout layers are introduced in\nthree locations in each Conformer block: (a) the hidden layer of the\nfeed-forward-net component, (b) the query vectors and the value vectors of the\nself-attention component, and (c) the input vectors of the LConv component. The\nproposed method is evaluated by conducting a speech recognition experiment on\nthe LibriSpeech task. It was shown that this approach could simultaneously\nachieve a parameter reduction and accuracy improvement. The word error rates\nimproved by approx 1% while reducing the number of parameters by 54%.\n","authors":["Yotaro Kubo","Xingyu Cai","Michiel Bacchiani"],"pdf_url":"https://arxiv.org/pdf/2412.04836v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19772v2","updated":"2024-12-06T07:24:10Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v2.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2406.17276v3","updated":"2024-12-06T07:13:53Z","published":"2024-06-25T04:45:53Z","title":"OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure","summary":" Autoregressive language models demonstrate excellent performance in various\nscenarios. However, the inference efficiency is limited by its\none-step-one-word generation mode, which has become a pressing problem recently\nas the models become increasingly larger. Speculative decoding employs a \"draft\nand then verify\" mechanism to allow multiple tokens to be generated in one\nstep, realizing lossless acceleration. 
Existing methods mainly adopt fixed\nheuristic draft structures, which fail to adapt to different situations to\nmaximize the acceptance length during verification. To alleviate this dilemma,\nwe proposed OPT-Tree, an algorithm to construct adaptive and scalable draft\ntrees. It searches the optimal tree structure that maximizes the mathematical\nexpectation of the acceptance length in each decoding step. Experimental\nresults reveal that OPT-Tree outperforms the existing draft structures and\nachieves a speed-up ratio of up to 3.2 compared with autoregressive decoding.\nIf the draft model is powerful enough and the node budget is sufficient, it can\ngenerate more than ten tokens in a single step. Our code is available at\nhttps://github.com/Jikai0Wang/OPT-Tree.\n","authors":["Jikai Wang","Yi Su","Juntao Li","Qingrong Xia","Zi Ye","Xinyu Duan","Zhefeng Wang","Min Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.17276v3.pdf","comment":"Accepted at TACL; pre-MIT Press publication version"},{"id":"http://arxiv.org/abs/2407.05721v3","updated":"2024-12-06T06:51:46Z","published":"2024-07-08T08:25:56Z","title":"PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation","summary":" Mental health has attracted substantial attention in recent years and LLM can\nbe an effective technology for alleviating this problem owing to its capability\nin text understanding and dialogue. However, existing research in this domain\noften suffers from limitations, such as training on datasets lacking crucial\nprior knowledge and evidence, and the absence of comprehensive evaluation\nmethods. In this paper, we propose a specialized psychological large language\nmodel (LLM), named PsycoLLM, trained on a proposed high-quality psychological\ndataset, including single-turn QA, multi-turn dialogues and knowledge-based QA.\nSpecifically, we construct multi-turn dialogues through a three-step pipeline\ncomprising multi-turn QA generation, evidence judgment, and dialogue\nrefinement. We augment this process with real-world psychological case\nbackgrounds extracted from online platforms, enhancing the relevance and\napplicability of the generated data. Additionally, to compare the performance\nof PsycoLLM with other LLMs, we develop a comprehensive psychological benchmark\nbased on authoritative psychological counseling examinations in China, which\nincludes assessments of professional ethics, theoretical proficiency, and case\nanalysis. The experimental results on the benchmark illustrate the\neffectiveness of PsycoLLM, which demonstrates superior performance compared to\nother LLMs.\n","authors":["Jinpeng Hu","Tengteng Dong","Luo Gang","Hui Ma","Peng Zou","Xiao Sun","Dan Guo","Xun Yang","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2407.05721v3.pdf","comment":"Accepted by IEEE Transactions on Computational Social Systems.\n https://github.com/MACLAB-HFUT/PsycoLLM"},{"id":"http://arxiv.org/abs/2412.04806v1","updated":"2024-12-06T06:32:47Z","published":"2024-12-06T06:32:47Z","title":"Rethinking Time Series Forecasting with LLMs via Nearest Neighbor\n Contrastive Learning","summary":" Adapting Large Language Models (LLMs) that are extensively trained on\nabundant text data, and customizing the input prompt to enable time series\nforecasting has received considerable attention. While recent work has shown\ngreat potential for adapting the learned prior of LLMs, the formulation of the\nprompt to finetune LLMs remains challenging as prompt should be aligned with\ntime series data. 
Additionally, current approaches do not effectively leverage\nword token embeddings which embody the rich representation space learned by\nLLMs. This emphasizes the need for a robust approach to formulate the prompt\nwhich utilizes the word token embeddings while effectively representing the\ncharacteristics of the time series. To address these challenges, we propose\nNNCL-TLLM: Nearest Neighbor Contrastive Learning for Time series forecasting\nvia LLMs. First, we generate time series compatible text prototypes such that\neach text prototype represents both word token embeddings in its neighborhood\nand time series characteristics via end-to-end finetuning. Next, we draw\ninspiration from Nearest Neighbor Contrastive Learning to formulate the prompt\nwhile obtaining the top-$k$ nearest neighbor time series compatible text\nprototypes. We then fine-tune the layer normalization and positional embeddings\nof the LLM, keeping the other layers intact, reducing the trainable parameters\nand decreasing the computational cost. Our comprehensive experiments\ndemonstrate that NNCL-TLLM outperforms in few-shot forecasting while achieving\ncompetitive or superior performance over the state-of-the-art methods in\nlong-term and short-term forecasting tasks.\n","authors":["Jayanie Bogahawatte","Sachith Seneviratne","Maneesha Perera","Saman Halgamuge"],"pdf_url":"https://arxiv.org/pdf/2412.04806v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.12287v2","updated":"2024-12-06T05:43:58Z","published":"2024-11-19T07:16:48Z","title":"CUE-M: Contextual Understanding and Enhanced Search with Multimodal\n Large Language Model","summary":" The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large\nLanguage Models (MLLMs) has revolutionized information retrieval and expanded\nthe practical applications of AI. However, current systems struggle in\naccurately interpreting user intent, employing diverse retrieval strategies,\nand effectively filtering unintended or inappropriate responses, limiting their\neffectiveness. This paper introduces Contextual Understanding and Enhanced\nSearch with MLLM (CUE-M), a novel multimodal search framework that addresses\nthese challenges through a multi-stage pipeline comprising image context\nenrichment, intent refinement, contextual query generation, external API\nintegration, and relevance-based filtering. CUE-M incorporates a robust\nfiltering pipeline combining image-based, text-based, and multimodal\nclassifiers, dynamically adapting to instance- and category-specific concern\ndefined by organizational policies. Evaluations on a multimodal Q&A dataset and\na public safety benchmark demonstrate that CUE-M outperforms baselines in\naccuracy, knowledge integration, and safety, advancing the capabilities of\nmultimodal retrieval systems.\n","authors":["Dongyoung Go","Taesun Whang","Chanhee Lee","Hwa-Yeon Kim","Sunghoon Park","Seunghwan Ji","Jinho Kim","Dongchan Kim","Young-Bum Kim"],"pdf_url":"https://arxiv.org/pdf/2411.12287v2.pdf","comment":"Preprint. Under review"},{"id":"http://arxiv.org/abs/2412.04787v1","updated":"2024-12-06T05:41:11Z","published":"2024-12-06T05:41:11Z","title":"Direct Quantized Training of Language Models with Stochastic Rounding","summary":" Although recent quantized Large Language Models (LLMs), such as BitNet, have\npaved the way for significant reduction in memory usage during deployment with\nbinary or ternary weights, training these models still demands substantial\nmemory footprints. 
This is partly because high-precision (i.e., unquantized)\nweight matrices required for straight-through estimation must be maintained\nthroughout the whole training process. To address this, we explore the\npotential of directly updating the quantized low-precision weight matrices\nwithout relying on the straight-through estimator during backpropagation,\nthereby saving memory usage during training. Specifically, we employ a\nstochastic rounding technique to minimize information loss caused by the use of\nlow-bit weights throughout training. Experimental results on our\nLLaMA-structured models indicate that (1) training with only low-precision\nweights is feasible even when they are constrained to ternary values, (2)\nextending the bit width to 8 bits results in only a 5% loss degradation\ncompared to BitNet b1.58 while offering the potential for reduced memory usage\nduring training, and (3) our models can also perform inference using ternary\nweights, showcasing their flexibility in deployment.\n","authors":["Kaiyan Zhao","Tsuguchika Tabaru","Kenichi Kobayashi","Takumi Honda","Masafumi Yamazaki","Yoshimasa Tsuruoka"],"pdf_url":"https://arxiv.org/pdf/2412.04787v1.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2412.04784v1","updated":"2024-12-06T05:30:41Z","published":"2024-12-06T05:30:41Z","title":"NLP-ADBench: NLP Anomaly Detection Benchmark","summary":" Anomaly detection (AD) is a critical machine learning task with diverse\napplications in web systems, including fraud detection, content moderation, and\nuser behavior analysis. Despite its significance, AD in natural language\nprocessing (NLP) remains underexplored, limiting advancements in detecting\nanomalies in text data such as harmful content, phishing attempts, or spam\nreviews. In this paper, we introduce NLP-ADBench, the most comprehensive\nbenchmark for NLP anomaly detection (NLP-AD), comprising eight curated datasets\nand evaluations of nineteen state-of-the-art algorithms. These include three\nend-to-end methods and sixteen two-step algorithms that apply traditional\nanomaly detection techniques to language embeddings generated by\nbert-base-uncased and OpenAI's text-embedding-3-large models.\n Our results reveal critical insights and future directions for NLP-AD.\nNotably, no single model excels across all datasets, highlighting the need for\nautomated model selection. Moreover, two-step methods leveraging\ntransformer-based embeddings consistently outperform specialized end-to-end\napproaches, with OpenAI embeddings demonstrating superior performance over BERT\nembeddings. By releasing NLP-ADBench at\nhttps://github.com/USC-FORTIS/NLP-ADBench, we provide a standardized framework\nfor evaluating NLP-AD methods, fostering the development of innovative\napproaches. 
This work fills a crucial gap in the field and establishes a\nfoundation for advancing NLP anomaly detection, particularly in the context of\nimproving the safety and reliability of web-based systems.\n","authors":["Yuangang Li","Jiaqi Li","Zhuo Xiao","Tiankai Yang","Yi Nian","Xiyang Hu","Yue Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.04784v1.pdf","comment":"The project is available at https://github.com/USC-FORTIS/NLP-ADBench"},{"id":"http://arxiv.org/abs/2412.04774v1","updated":"2024-12-06T04:34:45Z","published":"2024-12-06T04:34:45Z","title":"Foundation Models for Low-Resource Language Education (Vision Paper)","summary":" Recent studies show that large language models (LLMs) are powerful tools for\nworking with natural language, bringing advances in many areas of computational\nlinguistics. However, these models face challenges when applied to low-resource\nlanguages due to limited training data and difficulty in understanding cultural\nnuances. Research is now focusing on multilingual models to improve LLM\nperformance for these languages. Education in these languages also struggles\nwith a lack of resources and qualified teachers, particularly in underdeveloped\nregions. Here, LLMs can be transformative, supporting innovative methods like\ncommunity-driven learning and digital platforms. This paper discusses how LLMs\ncould enhance education for low-resource languages, emphasizing practical\napplications and benefits.\n","authors":["Zhaojun Ding","Zhengliang Liu","Hanqi Jiang","Yizhu Gao","Xiaoming Zhai","Tianming Liu","Ninghao Liu"],"pdf_url":"https://arxiv.org/pdf/2412.04774v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17993v3","updated":"2024-12-06T04:08:34Z","published":"2024-11-27T02:20:44Z","title":"DRS: Deep Question Reformulation With Structured Output","summary":" Question answering represents a core capability of large language models\n(LLMs). However, when individuals encounter unfamiliar knowledge in texts, they\noften formulate questions that the text itself cannot answer due to\ninsufficient understanding of the underlying information. Recent studies reveal\nthat while LLMs can detect unanswerable questions, they struggle to assist\nusers in reformulating these questions. Even advanced models like GPT-3.5\ndemonstrate limited effectiveness in this regard. To address this limitation,\nwe propose DRS: Deep Question Reformulation with Structured Output, a novel\nzero-shot method aimed at enhancing LLMs ability to assist users in\nreformulating questions to extract relevant information from new documents. DRS\ncombines the strengths of LLMs with a DFS-based algorithm to iteratively\nexplore potential entity combinations and constrain outputs using predefined\nentities. This structured approach significantly enhances the reformulation\ncapabilities of LLMs. 
Comprehensive experimental evaluations demonstrate that\nDRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while\nalso enhancing the performance of open-source models, such as Gemma2-9B, from\n26.35% to 56.75%.\n","authors":["Zhecheng Li","Yiwei Wang","Bryan Hooi","Yujun Cai","Nanyun Peng","Kai-Wei Chang"],"pdf_url":"https://arxiv.org/pdf/2411.17993v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04757v1","updated":"2024-12-06T03:46:06Z","published":"2024-12-06T03:46:06Z","title":"Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free\n Dynamic Triangular Attention Pattern","summary":" The quadratic computational complexity of the attention mechanism in current\nLarge Language Models (LLMs) renders inference with long contexts prohibitively\nexpensive. To address this challenge, various approaches aim to retain critical\nportions of the context to optimally approximate Full Attention (FA) through\nKey-Value (KV) compression or Sparse Attention (SA), enabling the processing of\nvirtually unlimited text lengths in a streaming manner. However, these methods\nstruggle to achieve performance levels comparable to FA, particularly in\nretrieval tasks. In this paper, our analysis of attention head patterns reveals\nthat LLMs' attention distributions show strong local correlations, naturally\nreflecting a chunking mechanism for input context. We propose Ltri-LLM\nframework, which divides KVs into spans, stores them in an offline index, and\nretrieves the relevant KVs into memory for various queries. Experimental\nresults on popular long text benchmarks show that Ltri-LLM can achieve\nperformance close to FA while maintaining efficient, streaming-based inference.\n","authors":["Hongyin Tang","Di Xiu","Lanrui Wang","Xiurui Geng","Jingang Wang","Xunliang Cai"],"pdf_url":"https://arxiv.org/pdf/2412.04757v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04756v1","updated":"2024-12-06T03:45:49Z","published":"2024-12-06T03:45:49Z","title":"ChatNVD: Advancing Cybersecurity Vulnerability Assessment with Large\n Language Models","summary":" The increasing frequency and sophistication of cybersecurity vulnerabilities\nin software systems underscore the urgent need for robust and effective methods\nof vulnerability assessment. However, existing approaches often rely on highly\ntechnical and abstract frameworks, which hinders understanding and increases\nthe likelihood of exploitation, resulting in severe cyberattacks. Given the\ngrowing adoption of Large Language Models (LLMs) across diverse domains, this\npaper explores their potential application in cybersecurity, specifically for\nenhancing the assessment of software vulnerabilities. We propose ChatNVD, an\nLLM-based cybersecurity vulnerability assessment tool leveraging the National\nVulnerability Database (NVD) to provide context-rich insights and streamline\nvulnerability analysis for cybersecurity professionals, developers, and\nnon-technical users. We develop three variants of ChatNVD, utilizing three\nprominent LLMs: GPT-4o mini by OpenAI, Llama 3 by Meta, and Gemini 1.5 Pro by\nGoogle. To evaluate their efficacy, we conduct a comparative analysis of these\nmodels using a comprehensive questionnaire comprising common security\nvulnerability questions, assessing their accuracy in identifying and analyzing\nsoftware vulnerabilities. 
This study provides valuable insights into the\npotential of LLMs to address critical challenges in understanding and\nmitigating software vulnerabilities.\n","authors":["Shivansh Chopra","Hussain Ahmad","Diksha Goel","Claudia Szabo"],"pdf_url":"https://arxiv.org/pdf/2412.04756v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04741v1","updated":"2024-12-06T03:02:58Z","published":"2024-12-06T03:02:58Z","title":"Question Answering for Decisionmaking in Green Building Design: A\n Multimodal Data Reasoning Method Driven by Large Language Models","summary":" In recent years, the critical role of green buildings in addressing energy\nconsumption and environmental issues has become widely acknowledged. Research\nindicates that over 40% of potential energy savings can be achieved during the\nearly design stage. Therefore, decision-making in green building design (DGBD),\nwhich is based on modeling and performance simulation, is crucial for reducing\nbuilding energy costs. However, the field of green building encompasses a broad\nrange of specialized knowledge, which involves significant learning costs and\nresults in low decision-making efficiency. Many studies have already applied\nartificial intelligence (AI) methods to this field. Based on previous research,\nthis study innovatively integrates large language models with DGBD, creating\nGreenQA, a question answering framework for multimodal data reasoning.\nUtilizing Retrieval Augmented Generation, Chain of Thought, and Function Call\nmethods, GreenQA enables multimodal question answering, including weather data\nanalysis and visualization, retrieval of green building cases, and knowledge\nquery. Additionally, this study conducted a user survey using the GreenQA web\nplatform. The results showed that 96% of users believed the platform helped\nimprove design efficiency. This study not only effectively supports DGBD but\nalso provides inspiration for AI-assisted design.\n","authors":["Yihui Li","Xiaoyue Yan","Hao Zhou","Borong Lin"],"pdf_url":"https://arxiv.org/pdf/2412.04741v1.pdf","comment":"Published at Association for Computer Aided Design in Architecture\n (ACADIA) 2024"},{"id":"http://arxiv.org/abs/2412.04726v1","updated":"2024-12-06T02:34:40Z","published":"2024-12-06T02:34:40Z","title":"BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for\n Varieties of English","summary":" Despite large language models (LLMs) being known to exhibit bias against\nnon-mainstream varieties, there are no known labeled datasets for sentiment\nanalysis of these varieties of English. To address this gap, we introduce BESSTIE, a benchmark for\nsentiment and sarcasm classification for three varieties of English: Australian\n(en-AU), Indian (en-IN), and British (en-UK). Using web-based content from two\ndomains, namely, Google Place reviews and Reddit comments, we collect datasets\nfor these language varieties using two methods: location-based and topic-based\nfiltering. Native speakers of the language varieties manually annotate the\ndatasets with sentiment and sarcasm labels. Subsequently, we fine-tune nine\nlarge language models (LLMs) (representing a range of encoder/decoder and\nmono/multilingual models) on these datasets, and evaluate their performance on\nthe two tasks. Our results reveal that the models consistently perform better\non inner-circle varieties (i.e., en-AU and en-UK), with significant performance\ndrops for en-IN, particularly in sarcasm detection. 
We also report challenges\nin cross-variety generalisation, highlighting the need for language\nvariety-specific datasets such as ours. BESSTIE promises to be a useful\nevaluative benchmark for future research in equitable LLMs, specifically in\nterms of language varieties. The BESSTIE datasets, code, and models are\ncurrently available on request, while the paper is under review. Please email\naditya.joshi@unsw.edu.au.\n","authors":["Dipankar Srirag","Aditya Joshi","Jordan Painter","Diptesh Kanojia"],"pdf_url":"https://arxiv.org/pdf/2412.04726v1.pdf","comment":"10 pages, 7 figures, under review"},{"id":"http://arxiv.org/abs/2412.03704v2","updated":"2024-12-06T02:21:48Z","published":"2024-12-04T20:35:07Z","title":"Scaling Inference-Time Search with Vision Value Model for Improved\n Visual Comprehension","summary":" Despite significant advancements in vision-language models (VLMs), there\nis a lack of effective approaches to enhance response quality by scaling\ninference-time computation. This capability is known to be a core step towards\nself-improving models in recent large language model studies. In this\npaper, we present Vision Value Model (VisVM) that can guide VLM inference-time\nsearch to generate responses with better visual comprehension. Specifically,\nVisVM not only evaluates the generated sentence quality in the current search\nstep, but also anticipates the quality of subsequent sentences that may result\nfrom the current step, thus providing a long-term value. In this way, VisVM\nsteers VLMs away from generating sentences prone to hallucinations or\ninsufficient detail, thereby producing higher quality responses. Experimental\nresults demonstrate that VisVM-guided search significantly enhances VLMs'\nability to generate descriptive captions with richer visual details and fewer\nhallucinations, compared with greedy decoding and search methods with other\nvisual reward signals. Furthermore, we find that self-training the model with\nthe VisVM-guided captions improves VLM's performance across a wide range of\nmultimodal benchmarks, indicating the potential for developing self-improving\nVLMs. Our value model and code are available at\nhttps://github.com/si0wang/VisVM.\n","authors":["Xiyao Wang","Zhengyuan Yang","Linjie Li","Hongjin Lu","Yuancheng Xu","Chung-Ching Lin","Kevin Lin","Furong Huang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03704v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04717v1","updated":"2024-12-06T02:15:53Z","published":"2024-12-06T02:15:53Z","title":"NoLoR: An ASR-Based Framework for Expedited Endangered Language\n Documentation with Neo-Aramaic as a Case Study","summary":" The documentation of the Neo-Aramaic dialects before their extinction has\nbeen described as the most urgent task in all of Semitology today. The death of\nthis language will be an unfathomable loss to the descendants of the indigenous\nspeakers of Aramaic, now predominantly diasporic after forced displacement due\nto violence. 
This paper develops an ASR model to expedite the documentation of\nthis endangered language and generalizes the strategy in a new framework we\ncall NoLoR.\n","authors":["Matthew Nazari"],"pdf_url":"https://arxiv.org/pdf/2412.04717v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.18130v2","updated":"2024-12-06T01:34:38Z","published":"2024-04-28T10:02:28Z","title":"Logic Agent: Enhancing Validity with Logic Rule Invocation","summary":" Chain-of-Thought (CoT) prompting has emerged as a pivotal technique for\naugmenting the inferential capabilities of language models during reasoning\ntasks. Despite its advancements, CoT often grapples with challenges in\nvalidating reasoning validity and ensuring informativeness. Addressing these\nlimitations, this paper introduces the Logic Agent (LA), an agent-based\nframework aimed at enhancing the validity of reasoning processes in Large\nLanguage Models (LLMs) through strategic logic rule invocation. Unlike\nconventional approaches, LA transforms LLMs into logic agents that dynamically\napply propositional logic rules, initiating the reasoning process by converting\nnatural language inputs into structured logic forms. The logic agent leverages\na comprehensive set of predefined functions to systematically navigate the\nreasoning process. This methodology not only promotes the structured and\ncoherent generation of reasoning constructs but also significantly improves\ntheir interpretability and logical coherence. Through extensive\nexperimentation, we demonstrate LA's capacity to scale effectively across\nvarious model sizes, markedly improving the precision of complex reasoning\nacross diverse tasks.\n","authors":["Hanmeng Liu","Zhiyang Teng","Chaoli Zhang","Yue Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.18130v2.pdf","comment":"The experiment is subject to certain errors"},{"id":"http://arxiv.org/abs/2412.04703v1","updated":"2024-12-06T01:29:24Z","published":"2024-12-06T01:29:24Z","title":"Transformers Struggle to Learn to Search","summary":" Search is an ability foundational in many important tasks, and recent studies\nhave shown that large language models (LLMs) struggle to perform search\nrobustly. It is unknown whether this inability is due to a lack of data,\ninsufficient model parameters, or fundamental limitations of the transformer\narchitecture. In this work, we use the foundational graph connectivity problem\nas a testbed to generate effectively limitless high-coverage data to train\nsmall transformers and test whether they can learn to perform search. We find\nthat, when given the right training distribution, the transformer is able to\nlearn to search.\n We analyze the algorithm that the transformer has learned through a novel\nmechanistic interpretability technique that enables us to extract the\ncomputation graph from the trained model. We find that for each vertex in the\ninput graph, transformers compute the set of vertices reachable from that\nvertex. Each layer then progressively expands these sets, allowing the model to\nsearch over a number of vertices exponential in the number of layers.\n However, we find that as the input graph size increases, the transformer has\ngreater difficulty in learning the task. This difficulty is not resolved even\nas the number of parameters is increased, suggesting that increasing model\nscale will not lead to robust search abilities. 
We also find that performing\nsearch in-context (i.e., chain-of-thought) does not resolve this inability to\nlearn to search on larger graphs.\n","authors":["Abulhair Saparov","Srushti Pawar","Shreyas Pimpalgaonkar","Nitish Joshi","Richard Yuanzhe Pang","Vishakh Padmakumar","Seyed Mehran Kazemi","Najoung Kim","He He"],"pdf_url":"https://arxiv.org/pdf/2412.04703v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04697v1","updated":"2024-12-06T01:20:16Z","published":"2024-12-06T01:20:16Z","title":"Privacy-Preserving Retrieval Augmented Generation with Differential\n Privacy","summary":" With the recent remarkable advancement of large language models (LLMs), there\nhas been a growing interest in utilizing them in the domains with highly\nsensitive data that lies outside their training data. For this purpose,\nretrieval augmented generation (RAG) is particularly effective -- it assists\nLLMs by directly providing relevant information from the external knowledge\nsources. However, without extra privacy safeguards, RAG outputs risk leaking\nsensitive information from the external data source. In this work, we explore\nRAG under differential privacy (DP), a formal guarantee of data privacy. The\nmain challenge with differentially private RAG is how to generate long accurate\nanswers within a moderate privacy budget. We address this by proposing an\nalgorithm that smartly spends privacy budget only for the tokens that require\nthe sensitive information and uses the non-private LLM for other tokens. Our\nextensive empirical evaluations reveal that our algorithm outperforms the\nnon-RAG baseline under a reasonable privacy budget of $\\epsilon\\approx 10$\nacross different models and datasets.\n","authors":["Tatsuki Koga","Ruihan Wu","Kamalika Chaudhuri"],"pdf_url":"https://arxiv.org/pdf/2412.04697v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04690v1","updated":"2024-12-06T01:05:37Z","published":"2024-12-06T01:05:37Z","title":"LLM-Align: Utilizing Large Language Models for Entity Alignment in\n Knowledge Graphs","summary":" Entity Alignment (EA) seeks to identify and match corresponding entities\nacross different Knowledge Graphs (KGs), playing a crucial role in knowledge\nfusion and integration. Embedding-based entity alignment (EA) has recently\ngained considerable attention, resulting in the emergence of many innovative\napproaches. Initially, these approaches concentrated on learning entity\nembeddings based on the structural features of knowledge graphs (KGs) as\ndefined by relation triples. Subsequent methods have integrated entities' names\nand attributes as supplementary information to improve the embeddings used for\nEA. However, existing methods lack a deep semantic understanding of entity\nattributes and relations. In this paper, we propose a Large Language Model\n(LLM) based Entity Alignment method, LLM-Align, which explores the\ninstruction-following and zero-shot capabilities of Large Language Models to\ninfer alignments of entities. LLM-Align uses heuristic methods to select\nimportant attributes and relations of entities, and then feeds the selected\ntriples of entities to an LLM to infer the alignment results. 
To guarantee the\nquality of alignment results, we design a multi-round voting mechanism to\nmitigate the hallucination and positional bias issues that occur with LLMs.\nExperiments on three EA datasets demonstrate that our approach achieves\nstate-of-the-art performance compared to existing EA methods.\n","authors":["Xuan Chen","Tong Lu","Zhichun Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04690v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.15124v2","updated":"2024-12-06T01:01:20Z","published":"2024-11-22T18:44:04Z","title":"Tulu 3: Pushing Frontiers in Open Language Model Post-Training","summary":" Language model post-training is applied to refine behaviors and unlock new\nskills across a wide range of recent language models, but open recipes for\napplying these techniques lag behind proprietary ones. The underlying training\ndata and recipes for post-training are simultaneously the most important pieces\nof the puzzle and the portion with the least transparency. To bridge this gap,\nwe introduce Tulu 3, a family of fully-open state-of-the-art post-trained\nmodels, alongside its data, code, and training recipes, serving as a\ncomprehensive guide for modern post-training techniques. Tulu 3, which builds\non Llama 3.1 base models, achieves results surpassing the instruct versions of\nLlama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and\nClaude 3.5-Haiku. The training algorithms for our models include supervised\nfinetuning (SFT), Direct Preference Optimization (DPO), and a novel method we\ncall Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we\nintroduce a multi-task evaluation scheme for post-training recipes with\ndevelopment and unseen evaluations, standard benchmark implementations, and\nsubstantial decontamination of existing open datasets on said benchmarks. We\nconclude with analysis and discussion of training methods that did not reliably\nimprove performance.\n In addition to the Tulu 3 model weights and demo, we release the complete\nrecipe -- including datasets for diverse core skills, a robust toolkit for data\ncuration and evaluation, the training code and infrastructure, and, most\nimportantly, a detailed report for reproducing and further adapting the Tulu 3\napproach to more domains.\n","authors":["Nathan Lambert","Jacob Morrison","Valentina Pyatkin","Shengyi Huang","Hamish Ivison","Faeze Brahman","Lester James V. Miranda","Alisa Liu","Nouha Dziri","Shane Lyu","Yuling Gu","Saumya Malik","Victoria Graf","Jena D. Hwang","Jiangjiang Yang","Ronan Le Bras","Oyvind Tafjord","Chris Wilhelm","Luca Soldaini","Noah A. Smith","Yizhong Wang","Pradeep Dasigi","Hannaneh Hajishirzi"],"pdf_url":"https://arxiv.org/pdf/2411.15124v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2302.04865v2","updated":"2024-12-06T00:13:04Z","published":"2023-02-09T18:59:41Z","title":"ELBA: Learning by Asking for Embodied Visual Navigation and Task\n Completion","summary":" The research community has shown increasing interest in designing intelligent\nembodied agents that can assist humans in accomplishing tasks. Although there\nhave been significant advancements in related vision-language benchmarks, most\nprior work has focused on building agents that follow instructions rather than\nendowing agents with the ability to ask questions to actively resolve\nambiguities arising naturally in embodied environments. 
To address this gap, we propose an\nEmbodied Learning-By-Asking (ELBA) model that learns when and what questions to\nask to dynamically acquire additional information for completing the task. We\nevaluate ELBA on the TEACh vision-dialog navigation and task completion\ndataset. Experimental results show that the proposed method achieves improved\ntask performance compared to baseline models without question-answering\ncapabilities.\n","authors":["Ying Shen","Ismini Lourentzou"],"pdf_url":"https://arxiv.org/pdf/2302.04865v2.pdf","comment":"14 pages, 10 figures, WACV 2025"}],"Computer Vision and Pattern Recognition":[{"id":"http://arxiv.org/abs/2412.05280v1","updated":"2024-12-06T18:59:56Z","published":"2024-12-06T18:59:56Z","title":"Stag-1: Towards Realistic 4D Driving Simulation with Video Generation\n Model","summary":" 4D driving simulation is essential for developing realistic autonomous\ndriving simulators. Despite advancements in existing methods for generating\ndriving scenes, significant challenges remain in view transformation and\nspatial-temporal dynamic modeling. To address these limitations, we propose a\nSpatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct\nreal-world scenes and design a controllable generative network to achieve 4D\nsimulation. Stag-1 constructs continuous 4D point cloud scenes using\nsurround-view data from autonomous vehicles. It decouples spatial-temporal\nrelationships and produces coherent keyframe videos. Additionally, Stag-1\nleverages video generation models to obtain photo-realistic and controllable 4D\ndriving simulation videos from any perspective. To expand the range of view\ngeneration, we train vehicle motion videos based on decomposed camera poses,\nenhancing modeling capabilities for distant scenes. Furthermore, we reconstruct\nvehicle camera trajectories to integrate 3D points across consecutive views,\nenabling comprehensive scene understanding along the temporal dimension.\nFollowing extensive multi-level scene training, Stag-1 can simulate from any\ndesired viewpoint and achieve a deep understanding of scene evolution under\nstatic spatial-temporal conditions. Compared to existing methods, our approach\nshows promising performance in multi-view scene consistency, background\ncoherence, and accuracy, and contributes to the ongoing advancements in\nrealistic autonomous driving simulation. Code: https://github.com/wzzheng/Stag.\n","authors":["Lening Wang","Wenzhao Zheng","Dalong Du","Yunpeng Zhang","Yilong Ren","Han Jiang","Zhiyong Cui","Haiyang Yu","Jie Zhou","Jiwen Lu","Shanghang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.05280v1.pdf","comment":"Code is available at: https://github.com/wzzheng/Stag"},{"id":"http://arxiv.org/abs/2412.05279v1","updated":"2024-12-06T18:59:53Z","published":"2024-12-06T18:59:53Z","title":"Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories","summary":" The fields of 3D reconstruction and text-based 3D editing have advanced\nsignificantly with the evolution of text-based diffusion models. While existing\n3D editing methods excel at modifying color, texture, and style, they struggle\nwith extensive geometric or appearance changes, thus limiting their\napplications. We propose Perturb-and-Revise, which makes possible a variety of\nNeRF editing. First, we perturb the NeRF parameters with random initializations\nto create a versatile initialization. We automatically determine the\nperturbation magnitude through analysis of the local loss landscape. 
Then, we\nrevise the edited NeRF via generative trajectories. Combined with the\ngenerative process, we impose identity-preserving gradients to refine the\nedited NeRF. Extensive experiments demonstrate that Perturb-and-Revise\nfacilitates flexible, effective, and consistent editing of color, appearance,\nand geometry in 3D. For 360{\\deg} results, please visit our project page:\nhttps://susunghong.github.io/Perturb-and-Revise.\n","authors":["Susung Hong","Johanna Karras","Ricardo Martin-Brualla","Ira Kemelmacher-Shlizerman"],"pdf_url":"https://arxiv.org/pdf/2412.05279v1.pdf","comment":"Project page: https://susunghong.github.io/Perturb-and-Revise"},{"id":"http://arxiv.org/abs/2412.05278v1","updated":"2024-12-06T18:59:52Z","published":"2024-12-06T18:59:52Z","title":"Birth and Death of a Rose","summary":" We study the problem of generating temporal object intrinsics -- temporally\nevolving sequences of object geometry, reflectance, and texture, such as a\nblooming rose -- from pre-trained 2D foundation models. Unlike conventional 3D\nmodeling and animation techniques that require extensive manual effort and\nexpertise, we introduce a method that generates such assets with signals\ndistilled from pre-trained 2D diffusion models. To ensure the temporal\nconsistency of object intrinsics, we propose Neural Templates for\ntemporal-state-guided distillation, derived automatically from image features\nfrom self-supervised learning. Our method can generate high-quality temporal\nobject intrinsics for several natural phenomena and enable the sampling and\ncontrollable rendering of these dynamic objects from any viewpoint, under any\nenvironmental lighting conditions, at any time of their lifespan. Project\nwebsite: https://chen-geng.com/rose4d\n","authors":["Chen Geng","Yunzhi Zhang","Shangzhe Wu","Jiajun Wu"],"pdf_url":"https://arxiv.org/pdf/2412.05278v1.pdf","comment":"Project website: https://chen-geng.com/rose4d"},{"id":"http://arxiv.org/abs/2412.05276v1","updated":"2024-12-06T18:59:51Z","published":"2024-12-06T18:59:51Z","title":"Sparse autoencoders reveal selective remapping of visual concepts during\n adaptation","summary":" Adapting foundation models for specific purposes has become a standard\napproach to build machine learning systems for downstream applications. Yet, it\nis an open question which mechanisms take place during adaptation. Here we\ndevelop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named\nPatchSAE, to extract interpretable concepts at granular levels (e.g. shape,\ncolor, or semantics of an object) and their patch-wise spatial attributions. We\nexplore how these concepts influence the model output in downstream image\nclassification tasks and investigate how recent state-of-the-art prompt-based\nadaptation techniques change the association of model inputs to these concepts.\nWhile activations of concepts slightly change between adapted and non-adapted\nmodels, we find that the majority of gains on common adaptation tasks can be\nexplained with the existing concepts already present in the non-adapted\nfoundation model. 
This work provides a concrete framework to train and use SAEs\nfor Vision Transformers and provides insights into explaining adaptation\nmechanisms.\n","authors":["Hyesu Lim","Jinho Choi","Jaegul Choo","Steffen Schneider"],"pdf_url":"https://arxiv.org/pdf/2412.05276v1.pdf","comment":"A demo is available at github.com/dynamical-inference/patchsae"},{"id":"http://arxiv.org/abs/2412.05277v1","updated":"2024-12-06T18:59:51Z","published":"2024-12-06T18:59:51Z","title":"Text to Blind Motion","summary":" People who are blind perceive the world differently than those who are\nsighted, which can result in distinct motion characteristics. For instance,\nwhen crossing at an intersection, blind individuals may have different patterns\nof movement, such as veering more from a straight path or using touch-based\nexploration around curbs and obstacles. These behaviors may appear less\npredictable to motion models embedded in technologies such as autonomous\nvehicles. Yet, the ability of 3D motion models to capture such behavior has not\nbeen previously studied, as existing datasets for 3D human motion currently\nlack diversity and are biased toward people who are sighted. In this work, we\nintroduce BlindWays, the first multimodal motion benchmark for pedestrians who\nare blind. We collect 3D motion data using wearable sensors with 11 blind\nparticipants navigating eight different routes in a real-world urban setting.\nAdditionally, we provide rich textual descriptions that capture the distinctive\nmovement characteristics of blind pedestrians and their interactions with both\nthe navigation aid (e.g., a white cane or a guide dog) and the environment. We\nbenchmark state-of-the-art 3D human prediction models, finding poor performance\nwith off-the-shelf and pre-training-based methods for our novel task. To\ncontribute toward safer and more reliable systems that can seamlessly reason\nover diverse human movements in their environments, our text-and-motion\nbenchmark is available at https://blindways.github.io.\n","authors":["Hee Jae Kim","Kathakoli Sengupta","Masaki Kuribayashi","Hernisa Kacorri","Eshed Ohn-Bar"],"pdf_url":"https://arxiv.org/pdf/2412.05277v1.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.05275v1","updated":"2024-12-06T18:59:12Z","published":"2024-12-06T18:59:12Z","title":"MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models","summary":" Text-to-video models have demonstrated impressive capabilities in producing\ndiverse and captivating video content, showcasing a notable advancement in\ngenerative AI. However, these models generally lack fine-grained control over\nmotion patterns, limiting their practical applicability. We introduce\nMotionFlow, a novel framework designed for motion transfer in video diffusion\nmodels. Our method utilizes cross-attention maps to accurately capture and\nmanipulate spatial and temporal dynamics, enabling seamless motion transfers\nacross various contexts. Our approach does not require training and works on\ntest-time by leveraging the inherent capabilities of pre-trained video\ndiffusion models. In contrast to traditional approaches, which struggle with\ncomprehensive scene changes while maintaining consistent motion, MotionFlow\nsuccessfully handles such complex transformations through its attention-based\nmechanism. 
Our qualitative and quantitative experiments demonstrate that\nMotionFlow significantly outperforms existing models in both fidelity and\nversatility even during drastic scene alterations.\n","authors":["Tuna Han Salih Meral","Hidir Yesiltepe","Connor Dunlop","Pinar Yanardag"],"pdf_url":"https://arxiv.org/pdf/2412.05275v1.pdf","comment":"Project Page: https://motionflow-diffusion.github.io"},{"id":"http://arxiv.org/abs/2412.05274v1","updated":"2024-12-06T18:59:04Z","published":"2024-12-06T18:59:04Z","title":"SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images","summary":" The 3D contrastive learning paradigm has demonstrated remarkable performance\nin downstream tasks through pretraining on point cloud data. Recent advances\ninvolve additional 2D image priors associated with 3D point clouds for further\nimprovement. Nonetheless, these existing frameworks are constrained by the\nrestricted range of available point cloud datasets, primarily due to the high\ncosts of obtaining point cloud data. To this end, we propose SimC3D, a simple\nbut effective 3D contrastive learning framework, for the first time,\npretraining 3D backbones from pure RGB image data. SimC3D performs contrastive\n3D pretraining with three appealing properties. (1) Pure image data: SimC3D\nsimplifies the dependency of costly 3D point clouds and pretrains 3D backbones\nusing solely RGB images. By employing depth estimation and suitable data\nprocessing, the monocular synthesized point cloud shows great potential for 3D\npretraining. (2) Simple framework: Traditional multi-modal frameworks\nfacilitate 3D pretraining with 2D priors by utilizing an additional 2D\nbackbone, thereby increasing computational expense. In this paper, we\nempirically demonstrate that the primary benefit of the 2D modality stems from\nthe incorporation of locality information. Inspired by this insightful\nobservation, SimC3D directly employs 2D positional embeddings as a stronger\ncontrastive objective, eliminating the necessity for 2D backbones and leading\nto considerable performance improvements. (3) Strong performance: SimC3D\noutperforms previous approaches that leverage ground-truth point cloud data for\npretraining in various downstream tasks. Furthermore, the performance of SimC3D\ncan be further enhanced by combining multiple image datasets, showcasing its\nsignificant potential for scalability. The code will be available at\nhttps://github.com/Dongjiahua/SimC3D.\n","authors":["Jiahua Dong","Tong Wu","Rui Qian","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05274v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05271v1","updated":"2024-12-06T18:57:08Z","published":"2024-12-06T18:57:08Z","title":"Expanding Performance Boundaries of Open-Source Multimodal Models with\n Model, Data, and Test-Time Scaling","summary":" We introduce InternVL 2.5, an advanced multimodal large language model (MLLM)\nseries that builds upon InternVL 2.0, maintaining its core model architecture\nwhile introducing significant enhancements in training and testing strategies\nas well as data quality. In this work, we delve into the relationship between\nmodel scaling and performance, systematically exploring the performance trends\nin vision encoders, language models, dataset sizes, and test-time\nconfigurations. 
Through extensive evaluations on a wide range of benchmarks,\nincluding multi-discipline reasoning, document understanding, multi-image /\nvideo understanding, real-world comprehension, multimodal hallucination\ndetection, visual grounding, multilingual capabilities, and pure language\nprocessing, InternVL 2.5 exhibits competitive performance, rivaling leading\ncommercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is\nthe first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a\n3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing\nstrong potential for test-time scaling. We hope this model contributes to the\nopen-source community by setting new standards for developing and applying\nmultimodal AI systems. A HuggingFace demo is available at\nhttps://huggingface.co/spaces/OpenGVLab/InternVL\n","authors":["Zhe Chen","Weiyun Wang","Yue Cao","Yangzhou Liu","Zhangwei Gao","Erfei Cui","Jinguo Zhu","Shenglong Ye","Hao Tian","Zhaoyang Liu","Lixin Gu","Xuehui Wang","Qingyun Li","Yimin Ren","Zixuan Chen","Jiapeng Luo","Jiahao Wang","Tan Jiang","Bo Wang","Conghui He","Botian Shi","Xingcheng Zhang","Han Lv","Yi Wang","Wenqi Shao","Pei Chu","Zhongying Tu","Tong He","Zhiyong Wu","Huipeng Deng","Jiaye Ge","Kai Chen","Min Dou","Lewei Lu","Xizhou Zhu","Tong Lu","Dahua Lin","Yu Qiao","Jifeng Dai","Wenhai Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05271v1.pdf","comment":"Technical Report"},{"id":"http://arxiv.org/abs/2412.05268v1","updated":"2024-12-06T18:55:09Z","published":"2024-12-06T18:55:09Z","title":"DenseMatcher: Learning 3D Semantic Correspondence for Category-Level\n Manipulation from a Single Demo","summary":" Dense 3D correspondence can enhance robotic manipulation by enabling the\ngeneralization of spatial, functional, and dynamic information from one object\nto an unseen counterpart. Compared to shape correspondence, semantic\ncorrespondence is more effective in generalizing across different object\ncategories. To this end, we present DenseMatcher, a method capable of computing\n3D correspondences between in-the-wild objects that share similar structures.\nDenseMatcher first computes vertex features by projecting multiview 2D features\nonto meshes and refining them with a 3D network, and subsequently finds dense\ncorrespondences with the obtained features using functional map. In addition,\nwe craft the first 3D matching dataset that contains colored object meshes\nacross diverse categories. In our experiments, we show that DenseMatcher\nsignificantly outperforms prior 3D matching baselines by 43.5%. We demonstrate\nthe downstream effectiveness of DenseMatcher in (i) robotic manipulation, where\nit achieves cross-instance and cross-category generalization on long-horizon\ncomplex manipulation tasks from observing only one demo; (ii) zero-shot color\nmapping between digital assets, where appearance can be transferred between\ndifferent objects with relatable geometry.\n","authors":["Junzhe Zhu","Yuanchen Ju","Junyi Zhang","Muhan Wang","Zhecheng Yuan","Kaizhe Hu","Huazhe Xu"],"pdf_url":"https://arxiv.org/pdf/2412.05268v1.pdf","comment":"Project Page: https://tea-lab.github.io/DenseMatcher/"},{"id":"http://arxiv.org/abs/2412.05263v1","updated":"2024-12-06T18:52:20Z","published":"2024-12-06T18:52:20Z","title":"Mind the Time: Temporally-Controlled Multi-Event Video Generation","summary":" Real-world videos consist of sequences of events. 
Generating such sequences\nwith precise temporal control is infeasible with existing video generators that\nrely on a single paragraph of text as input. When tasked with generating\nmultiple events described using a single prompt, such methods often ignore some\nof the events or fail to arrange them in the correct order. To address this\nlimitation, we present MinT, a multi-event video generator with temporal\ncontrol. Our key insight is to bind each event to a specific period in the\ngenerated video, which allows the model to focus on one event at a time. To\nenable time-aware interactions between event captions and video tokens, we\ndesign a time-based positional encoding method, dubbed ReRoPE. This encoding\nhelps to guide the cross-attention operation. By fine-tuning a pre-trained\nvideo diffusion transformer on temporally grounded data, our approach produces\ncoherent videos with smoothly connected events. For the first time in the\nliterature, our model offers control over the timing of events in generated\nvideos. Extensive experiments demonstrate that MinT outperforms existing\nopen-source models by a large margin.\n","authors":["Ziyi Wu","Aliaksandr Siarohin","Willi Menapace","Ivan Skorokhodov","Yuwei Fang","Varnith Chordia","Igor Gilitschenski","Sergey Tulyakov"],"pdf_url":"https://arxiv.org/pdf/2412.05263v1.pdf","comment":"Project Page: https://mint-video.github.io/"},{"id":"http://arxiv.org/abs/2412.05256v1","updated":"2024-12-06T18:41:39Z","published":"2024-12-06T18:41:39Z","title":"Extrapolated Urban View Synthesis Benchmark","summary":" Photorealistic simulators are essential for the training and evaluation of\nvision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis\n(NVS), a crucial capability that generates diverse unseen viewpoints to\naccommodate the broad and continuous pose distribution of AVs. Recent advances\nin radiance fields, such as 3D Gaussian Splatting, achieve photorealistic\nrendering at real-time speeds and have been widely used in modeling large-scale\ndriving scenes. However, their performance is commonly evaluated using an\ninterpolated setup with highly correlated training and test views. In contrast,\nextrapolation, where test views largely deviate from training views, remains\nunderexplored, limiting progress in generalizable simulation technology. To\naddress this gap, we leverage publicly available AV datasets with multiple\ntraversals, multiple vehicles, and multiple cameras to build the first\nExtrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct\nquantitative and qualitative evaluations of state-of-the-art Gaussian Splatting\nmethods across different difficulty levels. Our results show that Gaussian\nSplatting is prone to overfitting to training views. Besides, incorporating\ndiffusion priors and improving geometry cannot fundamentally improve NVS under\nlarge view changes, highlighting the need for more robust approaches and\nlarge-scale training. 
We have released our data to help advance self-driving\nand urban robotics simulation technology.\n","authors":["Xiangyu Han","Zhen Jia","Boyi Li","Yan Wang","Boris Ivanovic","Yurong You","Lingjie Liu","Yue Wang","Marco Pavone","Chen Feng","Yiming Li"],"pdf_url":"https://arxiv.org/pdf/2412.05256v1.pdf","comment":"Project page: https://ai4ce.github.io/EUVS-Benchmark/"},{"id":"http://arxiv.org/abs/2412.05255v1","updated":"2024-12-06T18:41:16Z","published":"2024-12-06T18:41:16Z","title":"TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft","summary":" Collaboration is a cornerstone of society. In the real world, human teammates\nmake use of multi-sensory data to tackle challenging tasks in ever-changing\nenvironments. It is essential for embodied agents collaborating in\nvisually-rich environments replete with dynamic interactions to understand\nmulti-modal observations and task specifications. To evaluate the performance\nof generalizable multi-modal collaborative agents, we present TeamCraft, a\nmulti-modal multi-agent benchmark built on top of the open-world video game\nMinecraft. The benchmark features 55,000 task variants specified by multi-modal\nprompts, procedurally-generated expert demonstrations for imitation learning,\nand carefully designed protocols to evaluate model generalization capabilities.\nWe also perform extensive analyses to better understand the limitations and\nstrengths of existing approaches. Our results indicate that existing models\ncontinue to face significant challenges in generalizing to novel goals, scenes,\nand unseen numbers of agents. These findings underscore the need for further\nresearch in this area. The TeamCraft platform and dataset are publicly\navailable at https://github.com/teamcraft-bench/teamcraft.\n","authors":["Qian Long","Zhi Li","Ran Gong","Ying Nian Wu","Demetri Terzopoulos","Xiaofeng Gao"],"pdf_url":"https://arxiv.org/pdf/2412.05255v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05252v1","updated":"2024-12-06T18:32:54Z","published":"2024-12-06T18:32:54Z","title":"From classical techniques to convolution-based models: A review of\n object detection algorithms","summary":" Object detection is a fundamental task in computer vision and image\nunderstanding, with the goal of identifying and localizing objects of interest\nwithin an image while assigning them corresponding class labels. Traditional\nmethods, which relied on handcrafted features and shallow models, struggled\nwith complex visual data and showed limited performance. These methods combined\nlow-level features with contextual information and lacked the ability to\ncapture high-level semantics. Deep learning, especially Convolutional Neural\nNetworks (CNNs), addressed these limitations by automatically learning rich,\nhierarchical features directly from data. These features include both semantic\nand high-level representations essential for accurate object detection. This\npaper reviews object detection frameworks, starting with classical computer\nvision methods. We categorize object detection approaches into two groups: (1)\nclassical computer vision techniques and (2) CNN-based detectors. We compare\nmajor CNN models, discussing their strengths and limitations. 
In conclusion,\nthis review highlights the significant advancements in object detection through\ndeep learning and identifies key areas for further research to improve\nperformance.\n","authors":["Fnu Neha","Deepshikha Bhati","Deepak Kumar Shukla","Md Amiruzzaman"],"pdf_url":"https://arxiv.org/pdf/2412.05252v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05243v1","updated":"2024-12-06T18:22:47Z","published":"2024-12-06T18:22:47Z","title":"CompCap: Improving Multimodal Large Language Models with Composite\n Captions","summary":" How well can Multimodal Large Language Models (MLLMs) understand composite\nimages? Composite images (CIs) are synthetic visuals created by merging\nmultiple visual elements, such as charts, posters, or screenshots, rather than\nbeing captured directly by a camera. While CIs are prevalent in real-world\napplications, recent MLLM developments have primarily focused on interpreting\nnatural images (NIs). Our research reveals that current MLLMs face significant\nchallenges in accurately understanding CIs, often struggling to extract\ninformation or perform complex reasoning based on these images. We find that\nexisting training data for CIs are mostly formatted for question-answer tasks\n(e.g., in datasets like ChartQA and ScienceQA), while high-quality\nimage-caption datasets, critical for robust vision-language alignment, are only\navailable for NIs. To bridge this gap, we introduce Composite Captions\n(CompCap), a flexible framework that leverages Large Language Models (LLMs) and\nautomation tools to synthesize CIs with accurate and detailed captions. Using\nCompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs\nacross six CI types. We validate the effectiveness of CompCap-118K by\nsupervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and\nLLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K\nsignificantly enhances MLLMs' understanding of CIs, yielding average gains of\n1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.\n","authors":["Xiaohui Chen","Satya Narayan Shukla","Mahmoud Azab","Aashu Singh","Qifan Wang","David Yang","ShengYun Peng","Hanchao Yu","Shen Yan","Xuewen Zhang","Baosheng He"],"pdf_url":"https://arxiv.org/pdf/2412.05243v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14471v2","updated":"2024-12-06T18:22:32Z","published":"2024-08-26T17:59:01Z","title":"A Practitioner's Guide to Continual Multimodal Pretraining","summary":" Multimodal foundation models serve numerous applications at the intersection\nof vision and language. Still, despite being pretrained on extensive data, they\nbecome outdated over time. To keep models updated, research into continual\npretraining mainly explores scenarios with either (1) infrequent,\nindiscriminate updates on large-scale new data, or (2) frequent, sample-level\nupdates. However, practical model deployment often operates in the gap between\nthese two limit cases, as real-world applications often demand adaptation to\nspecific subdomains, tasks or concepts -- spread over the entire, varying life\ncycle of a model. In this work, we complement current perspectives on continual\npretraining through a research test bed as well as provide comprehensive\nguidance for effective continual model updates in such scenarios. We first\nintroduce FoMo-in-Flux, a continual multimodal pretraining benchmark with\nrealistic compute constraints and practical deployment requirements,\nconstructed over 63 datasets with diverse visual and semantic coverage. 
Using\nFoMo-in-Flux, we explore the complex landscape of practical continual\npretraining through multiple perspectives: (1) A data-centric investigation of\ndata mixtures and stream orderings that emulate real-world deployment\nsituations, (2) a method-centric investigation ranging from simple fine-tuning\nand traditional continual learning strategies to parameter-efficient updates\nand model merging, (3) meta learning rate schedules and mechanistic design\nchoices, and (4) the influence of model and compute scaling. Together, our\ninsights provide a practitioner's guide to continual multimodal pretraining for\nreal-world deployment. Our benchmark and code are available here:\nhttps://github.com/ExplainableML/fomo_in_flux.\n","authors":["Karsten Roth","Vishaal Udandarao","Sebastian Dziadzio","Ameya Prabhu","Mehdi Cherti","Oriol Vinyals","Olivier Hénaff","Samuel Albanie","Matthias Bethge","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2408.14471v2.pdf","comment":"Technical Report. 52 pages. Shorter version published at the NeurIPS\n 2024 Dataset & Benchmarks track"},{"id":"http://arxiv.org/abs/2407.00983v3","updated":"2024-12-06T18:16:02Z","published":"2024-07-01T05:47:58Z","title":"FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models","summary":" The advent of foundation models (FMs) in healthcare offers unprecedented\nopportunities to enhance medical diagnostics through automated classification\nand segmentation tasks. However, these models also raise significant concerns\nabout their fairness, especially when applied to diverse and underrepresented\npopulations in healthcare applications. Currently, there is a lack of\ncomprehensive benchmarks, standardized pipelines, and easily adaptable\nlibraries to evaluate and understand the fairness performance of FMs in medical\nimaging, leading to considerable challenges in formulating and implementing\nsolutions that ensure equitable outcomes across diverse patient populations. To\nfill this gap, we introduce FairMedFM, a fairness benchmark for FM research in\nmedical imaging. FairMedFM integrates with 17 popular medical imaging datasets,\nencompassing different modalities, dimensionalities, and sensitive attributes.\nIt explores 20 widely used FMs, with various usages such as zero-shot learning,\nlinear probing, parameter-efficient fine-tuning, and prompting in various\ndownstream tasks -- classification and segmentation. Our exhaustive analysis\nevaluates the fairness performance over different evaluation metrics from\nmultiple perspectives, revealing the existence of bias, varied utility-fairness\ntrade-offs on different FMs, consistent disparities on the same datasets\nregardless of FMs, and limited effectiveness of existing unfairness mitigation\nmethods. Check out FairMedFM's project page and open-sourced codebase, which\nsupport extendible functionalities and applications and remain inclusive for\nstudies on FMs in medical imaging over the long term.\n","authors":["Ruinan Jin","Zikang Xu","Yuan Zhong","Qiongsong Yao","Qi Dou","S. Kevin Zhou","Xiaoxiao Li"],"pdf_url":"https://arxiv.org/pdf/2407.00983v3.pdf","comment":"29 pages, 17 figures"},{"id":"http://arxiv.org/abs/2412.05237v1","updated":"2024-12-06T18:14:24Z","published":"2024-12-06T18:14:24Z","title":"MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at\n Scale","summary":" Open-source multimodal large language models (MLLMs) have shown significant\npotential in a broad range of multimodal tasks. 
However, their reasoning\ncapabilities remain constrained by existing instruction-tuning datasets, which\nwere predominantly repurposed from academic datasets such as VQA, AI2D, and\nChartQA. These datasets target simplistic tasks, and only provide phrase-level\nanswers without any intermediate rationales. To address these challenges, we\nintroduce a scalable and cost-effective method to construct a large-scale\nmultimodal instruction-tuning dataset with rich intermediate rationales\ndesigned to elicit CoT reasoning. Using only open models, we create a dataset\ncontaining 12M instruction-response pairs to cover diverse, reasoning-intensive\ntasks with detailed and faithful rationales. Experiments demonstrate that\ntraining MLLMs on this dataset significantly improves reasoning capabilities,\nachieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%),\nMMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates\nnotable improvements of up to 4% on non-reasoning-based benchmarks. Ablation\nstudies further highlight the importance of key components, such as rewriting\nand self-filtering, in the dataset construction process.\n","authors":["Jarvis Guo","Tuney Zheng","Yuelin Bai","Bo Li","Yubo Wang","King Zhu","Yizhi Li","Graham Neubig","Wenhu Chen","Xiang Yue"],"pdf_url":"https://arxiv.org/pdf/2412.05237v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.04314v2","updated":"2024-12-06T17:59:18Z","published":"2024-06-06T17:57:09Z","title":"Aesthetic Post-Training Diffusion Models from Generic Preferences with\n Step-by-step Preference Optimization","summary":" Generating visually appealing images is fundamental to modern text-to-image\ngeneration models. A potential solution to better aesthetics is direct\npreference optimization (DPO), which has been applied to diffusion models to\nimprove general image quality including prompt alignment and aesthetics.\nPopular DPO methods propagate preference labels from clean image pairs to all\nthe intermediate steps along the two generation trajectories. However,\npreference labels provided in existing datasets are blended with layout and\naesthetic opinions, which would disagree with aesthetic preference. Even if\naesthetic labels were provided (at substantial cost), it would be hard for the\ntwo-trajectory methods to capture nuanced visual differences at different\nsteps. To improve aesthetics economically, this paper uses existing generic\npreference data and introduces step-by-step preference optimization (SPO) that\ndiscards the propagation strategy and allows fine-grained image details to be\nassessed. Specifically, at each denoising step, we 1) sample a pool of\ncandidates by denoising from a shared noise latent, 2) use a step-aware\npreference model to find a suitable win-lose pair to supervise the diffusion\nmodel, and 3) randomly select one from the pool to initialize the next\ndenoising step. This strategy ensures that the diffusion models focus on the\nsubtle, fine-grained visual differences instead of the layout aspect. We find\nthat aesthetics can be significantly enhanced by accumulating these improved\nminor differences. When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields\nsignificant improvements in aesthetics compared with existing DPO methods while\nnot sacrificing image-text alignment compared with vanilla models. Moreover,\nSPO converges much faster than DPO methods due to the step-by-step alignment of\nfine-grained visual details. 
Code and models are available at\nhttps://github.com/RockeyCoss/SPO.\n","authors":["Zhanhao Liang","Yuhui Yuan","Shuyang Gu","Bohan Chen","Tiankai Hang","Mingxi Cheng","Ji Li","Liang Zheng"],"pdf_url":"https://arxiv.org/pdf/2406.04314v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05216v1","updated":"2024-12-06T17:48:06Z","published":"2024-12-06T17:48:06Z","title":"ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And\n Segmentation Of GI Bleeding","summary":" This study presents an integrated deep learning model for automatic detection\nand classification of Gastrointestinal bleeding in the frames extracted from\nWireless Capsule Endoscopy (WCE) videos. The dataset has been released as part\nof Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model\nattained the highest performance among 75 teams that took part in this\ncompetition. It aims to efficiently utilize CNN-based models, i.e., DenseNet\nand UNet, to detect and segment bleeding and non-bleeding areas in the\nreal-world complex dataset. The model achieves an impressive overall accuracy\nof 80%, which would surely help a skilled doctor to carry out further\ndiagnostics.\n","authors":["Ayushman Singh","Sharad Prakash","Aniket Das","Nidhi Kushwaha"],"pdf_url":"https://arxiv.org/pdf/2412.05216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05203v1","updated":"2024-12-06T17:32:53Z","published":"2024-12-06T17:32:53Z","title":"Archaeoscape: Bringing Aerial Laser Scanning Archaeology to the Deep\n Learning Era","summary":" Airborne Laser Scanning (ALS) technology has transformed modern archaeology\nby unveiling hidden landscapes beneath dense vegetation. However, the lack of\nexpert-annotated, open-access resources has hindered the analysis of ALS data\nusing advanced deep learning techniques. We address this limitation with\nArchaeoscape (available at https://archaeoscape.ai), a novel large-scale\narchaeological ALS dataset spanning 888 km$^2$ in Cambodia with 31,141\nannotated archaeological features from the Angkorian period. Archaeoscape is\nover four times larger than comparable datasets, and the first ALS archaeology\nresource with open-access data, annotations, and models.\n We benchmark several recent segmentation models to demonstrate the benefits\nof modern vision techniques for this problem and highlight the unique\nchallenges of discovering subtle human-made structures under dense jungle\ncanopies. By making Archaeoscape available in open access, we hope to bridge\nthe gap between traditional archaeology and modern computer vision methods.\n","authors":["Yohann Perron","Vladyslav Sydorov","Adam P. Wijker","Damian Evans","Christophe Pottier","Loic Landrieu"],"pdf_url":"https://arxiv.org/pdf/2412.05203v1.pdf","comment":"NeurIPS 2023 - Datasets & Benchmarks Track"},{"id":"http://arxiv.org/abs/2406.11384v3","updated":"2024-12-06T17:26:27Z","published":"2024-06-17T10:11:28Z","title":"Understanding Multi-Granularity for Open-Vocabulary Part Segmentation","summary":" Open-vocabulary part segmentation (OVPS) is an emerging research area focused\non segmenting fine-grained entities using diverse and previously unseen\nvocabularies. Our study highlights the inherent complexities of part\nsegmentation due to intricate boundaries and diverse granularity, reflecting\nthe knowledge-based nature of part identification. 
To address these challenges,\nwe propose PartCLIPSeg, a novel framework utilizing generalized parts and\nobject-level contexts to mitigate the lack of generalization in fine-grained\nparts. PartCLIPSeg integrates competitive part relationships and attention\ncontrol, alleviating ambiguous boundaries and underrepresented parts.\nExperimental results demonstrate that PartCLIPSeg outperforms existing\nstate-of-the-art OVPS methods, offering refined segmentation and an advanced\nunderstanding of part relationships within images. Through extensive\nexperiments, our model demonstrated a significant improvement over the\nstate-of-the-art models on the Pascal-Part-116, ADE20K-Part-234, and\nPartImageNet datasets.\n","authors":["Jiho Choi","Seonho Lee","Seungho Lee","Minhyun Lee","Hyunjung Shim"],"pdf_url":"https://arxiv.org/pdf/2406.11384v3.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.00878v2","updated":"2024-12-06T17:14:05Z","published":"2024-12-01T16:36:22Z","title":"Beyond Pixels: Text Enhances Generalization in Real-World Image\n Restoration","summary":" Generalization has long been a central challenge in real-world image\nrestoration. While recent diffusion-based restoration methods, which leverage\ngenerative priors from text-to-image models, have made progress in recovering\nmore realistic details, they still encounter \"generative capability\ndeactivation\" when applied to out-of-distribution real-world data. To address\nthis, we propose using text as an auxiliary invariant representation to\nreactivate the generative capabilities of these models. We begin by identifying\ntwo key properties of text input: richness and relevance, and examine their\nrespective influence on model performance. Building on these insights, we\nintroduce Res-Captioner, a module that generates enhanced textual descriptions\ntailored to image content and degradation levels, effectively mitigating\nresponse failures. Additionally, we present RealIR, a new benchmark designed to\ncapture diverse real-world scenarios. Extensive experiments demonstrate that\nRes-Captioner significantly enhances the generalization abilities of\ndiffusion-based restoration models, while remaining fully plug-and-play.\n","authors":["Haoze Sun","Wenbo Li","Jiayue Liu","Kaiwen Zhou","Yongqiang Chen","Yong Guo","Yanwei Li","Renjing Pei","Long Peng","Yujiu Yang"],"pdf_url":"https://arxiv.org/pdf/2412.00878v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05187v1","updated":"2024-12-06T17:07:27Z","published":"2024-12-06T17:07:27Z","title":"SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot","summary":" Surgical interventions, particularly in neurology, represent complex and\nhigh-stakes scenarios that impose substantial cognitive burdens on surgical\nteams. Although deliberate education and practice can enhance cognitive\ncapabilities, surgical training opportunities remain limited due to patient\nsafety concerns. To address these cognitive challenges in surgical training and\noperation, we propose SurgBox, an agent-driven sandbox framework to\nsystematically enhance the cognitive capabilities of surgeons in immersive\nsurgical simulations. Specifically, our SurgBox leverages large language models\n(LLMs) with tailored Retrieval-Augmented Generation (RAG) to authentically\nreplicate various surgical roles, enabling realistic training environments for\ndeliberate practice. 
In particular, we devise Surgery Copilot, an AI-driven\nassistant to actively coordinate the surgical information stream and support\nclinical decision-making, thereby diminishing the cognitive workload of\nsurgical teams during surgery. By incorporating a novel Long-Short Memory\nmechanism, our Surgery Copilot can effectively balance immediate procedural\nassistance with comprehensive surgical knowledge. Extensive experiments using\nreal neurosurgical procedure records validate our SurgBox framework in both\nenhancing surgical cognitive capabilities and supporting clinical\ndecision-making. By providing an integrated solution for training and\noperational support to address cognitive challenges, our SurgBox framework\nadvances surgical education and practice, potentially transforming surgical\noutcomes and healthcare quality. The code is available at\nhttps://github.com/franciszchen/SurgBox.\n","authors":["Jinlin Wu","Xusheng Liang","Xuexue Bai","Zhen Chen"],"pdf_url":"https://arxiv.org/pdf/2412.05187v1.pdf","comment":"This work is accepted by IEEE Big Data 2024"},{"id":"http://arxiv.org/abs/2412.05186v1","updated":"2024-12-06T17:05:34Z","published":"2024-12-06T17:05:34Z","title":"One-shot Federated Learning via Synthetic Distiller-Distillate\n Communication","summary":" One-shot Federated learning (FL) is a powerful technology facilitating\ncollaborative training of machine learning models in a single round of\ncommunication. While its superiority lies in communication efficiency and\nprivacy preservation compared to iterative FL, one-shot FL often compromises\nmodel performance. Prior research has primarily focused on employing data-free\nknowledge distillation to optimize data generators and ensemble models for\nbetter aggregating local knowledge into the server model. However, these\nmethods typically struggle with data heterogeneity, where inconsistent local\ndata distributions can cause teachers to provide misleading knowledge.\nAdditionally, they may encounter scalability issues with complex datasets due\nto inherent two-step information loss: first, during local training (from data\nto model), and second, when transferring knowledge to the server model (from\nmodel to inversed data). In this paper, we propose FedSD2C, a novel and\npractical one-shot FL framework designed to address these challenges. FedSD2C\nintroduces a distiller to synthesize informative distillates directly from\nlocal data to reduce information loss and proposes sharing synthetic\ndistillates instead of inconsistent local models to tackle data heterogeneity.\nOur empirical results demonstrate that FedSD2C consistently outperforms other\none-shot FL methods with more complex and real datasets, achieving up to 2.6\ntimes the performance of the best baseline. Code: https://github.com/Carkham/FedSD2C\n","authors":["Junyuan Zhang","Songhua Liu","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05186v1.pdf","comment":"Accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.05185v1","updated":"2024-12-06T17:04:42Z","published":"2024-12-06T17:04:42Z","title":"LinVT: Empower Your Image-level Large Language Model to Understand\n Videos","summary":" Large Language Models (LLMs) have been widely used in various tasks,\nmotivating us to develop an LLM-based assistant for videos. Instead of training\nfrom scratch, we propose a module to transform arbitrary well-trained\nimage-based LLMs into video-LLMs (after being trained on video data). 
To better\nadapt image-LLMs for processing videos, we introduce two design principles:\nlinear transformation to preserve the original visual-language alignment and\nrepresentative information condensation from redundant video content. Guided by\nthese principles, we propose a plug-and-play Linear Video Tokenizer(LinVT),\nwhich enables existing image-LLMs to understand videos. We benchmark LinVT with\nsix recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL,\nshowcasing the high compatibility of LinVT. LinVT-based LLMs achieve\nstate-of-the-art performance across various video benchmarks, illustrating the\neffectiveness of LinVT in multi-modal video understanding.\n","authors":["Lishuai Gao","Yujie Zhong","Yingsen Zeng","Haoxian Tan","Dengjie Li","Zheng Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.05185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03603v2","updated":"2024-12-06T17:02:10Z","published":"2024-12-03T23:52:37Z","title":"HunyuanVideo: A Systematic Framework For Large Video Generative Models","summary":" Recent advancements in video generation have significantly impacted daily\nlife for both individuals and industries. However, the leading video generation\nmodels remain closed-source, resulting in a notable performance gap between\nindustry capabilities and those available to the public. In this report, we\nintroduce HunyuanVideo, an innovative open-source video foundation model that\ndemonstrates performance in video generation comparable to, or even surpassing,\nthat of leading closed-source models. HunyuanVideo encompasses a comprehensive\nframework that integrates several key elements, including data curation,\nadvanced architectural design, progressive model scaling and training, and an\nefficient infrastructure tailored for large-scale model training and inference.\nAs a result, we successfully trained a video generative model with over 13\nbillion parameters, making it the largest among all open-source models. We\nconducted extensive experiments and implemented a series of targeted designs to\nensure high visual quality, motion dynamics, text-video alignment, and advanced\nfilming techniques. According to evaluations by professionals, HunyuanVideo\noutperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6,\nand three top-performing Chinese video generative models. By releasing the code\nfor the foundation model and its applications, we aim to bridge the gap between\nclosed-source and open-source communities. This initiative will empower\nindividuals within the community to experiment with their ideas, fostering a\nmore dynamic and vibrant video generation ecosystem. 
The code is publicly\navailable at https://github.com/Tencent/HunyuanVideo.\n","authors":["Weijie Kong","Qi Tian","Zijian Zhang","Rox Min","Zuozhuo Dai","Jin Zhou","Jiangfeng Xiong","Xin Li","Bo Wu","Jianwei Zhang","Kathrina Wu","Qin Lin","Junkun Yuan","Yanxin Long","Aladdin Wang","Andong Wang","Changlin Li","Duojun Huang","Fang Yang","Hao Tan","Hongmei Wang","Jacob Song","Jiawang Bai","Jianbing Wu","Jinbao Xue","Joey Wang","Kai Wang","Mengyang Liu","Pengyu Li","Shuai Li","Weiyan Wang","Wenqing Yu","Xinchi Deng","Yang Li","Yi Chen","Yutao Cui","Yuanbo Peng","Zhentao Yu","Zhiyu He","Zhiyong Xu","Zixiang Zhou","Zunnan Xu","Yangyu Tao","Qinglin Lu","Songtao Liu","Daquan Zhou","Hongfa Wang","Yong Yang","Di Wang","Yuhong Liu","Jie Jiang","Caesar Zhong"],"pdf_url":"https://arxiv.org/pdf/2412.03603v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05180v1","updated":"2024-12-06T16:57:54Z","published":"2024-12-06T16:57:54Z","title":"DreamColour: Controllable Video Colour Editing without Training","summary":" Video colour editing is a crucial task for content creation, yet existing\nsolutions either require painstaking frame-by-frame manipulation or produce\nunrealistic results with temporal artefacts. We present a practical,\ntraining-free framework that makes precise video colour editing accessible\nthrough an intuitive interface while maintaining professional-quality output.\nOur key insight is that by decoupling spatial and temporal aspects of colour\nediting, we can better align with users' natural workflow -- allowing them to\nfocus on precise colour selection in key frames before automatically\npropagating changes across time. We achieve this through a novel technical\nframework that combines: (i) a simple point-and-click interface merging\ngrid-based colour selection with automatic instance segmentation for precise\nspatial control, (ii) bidirectional colour propagation that leverages inherent\nvideo motion patterns, and (iii) motion-aware blending that ensures smooth\ntransitions even with complex object movements. Through extensive evaluation on\ndiverse scenarios, we demonstrate that our approach matches or exceeds\nstate-of-the-art methods while eliminating the need for training or specialized\nhardware, making professional-quality video colour editing accessible to\neveryone.\n","authors":["Chaitat Utintu","Pinaki Nath Chowdhury","Aneeshan Sain","Subhadeep Koley","Ayan Kumar Bhunia","Yi-Zhe Song"],"pdf_url":"https://arxiv.org/pdf/2412.05180v1.pdf","comment":"Project page available at https://chaitron.github.io/DreamColour-demo"},{"id":"http://arxiv.org/abs/2412.05179v1","updated":"2024-12-06T16:54:55Z","published":"2024-12-06T16:54:55Z","title":"Spatially-Adaptive Hash Encodings For Neural Surface Reconstruction","summary":" Positional encodings are a common component of neural scene reconstruction\nmethods, and provide a way to bias the learning of neural fields towards\ncoarser or finer representations. Current neural surface reconstruction methods\nuse a \"one-size-fits-all\" approach to encoding, choosing a fixed set of\nencoding functions, and therefore bias, across all scenes. Current\nstate-of-the-art surface reconstruction approaches leverage grid-based\nmulti-resolution hash encoding in order to recover high-detail geometry. We\npropose a learned approach which allows the network to choose its encoding\nbasis as a function of space, by masking the contribution of features stored at\nseparate grid resolutions. 
The resulting spatially adaptive approach allows the\nnetwork to fit a wider range of frequencies without introducing noise. We test\nour approach on standard benchmark surface reconstruction datasets and achieve\nstate-of-the-art performance on two of them.\n","authors":["Thomas Walker","Octave Mariotti","Amir Vaxman","Hakan Bilen"],"pdf_url":"https://arxiv.org/pdf/2412.05179v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05161v1","updated":"2024-12-06T16:25:57Z","published":"2024-12-06T16:25:57Z","title":"DNF: Unconditional 4D Generation with Dictionary-based Neural Fields","summary":" While remarkable success has been achieved through diffusion-based 3D\ngenerative models for shapes, 4D generative modeling remains challenging due to\nthe complexity of object deformations over time. We propose DNF, a new 4D\nrepresentation for unconditional generative modeling that efficiently models\ndeformable shapes with disentangled shape and motion while capturing\nhigh-fidelity details in the deforming objects. To achieve this, we propose a\ndictionary learning approach to disentangle 4D motion from shape as neural\nfields. Both shape and motion are represented as learned latent spaces, where\neach deformable shape is represented by its shape and motion global latent\ncodes, shape-specific coefficient vectors, and shared dictionary information.\nThis captures both shape-specific detail and global shared information in the\nlearned dictionary. Our dictionary-based representation well balances fidelity,\ncontiguity and compression -- combined with a transformer-based diffusion\nmodel, our method is able to generate effective, high-fidelity 4D animations.\n","authors":["Xinyi Zhang","Naiqi Li","Angela Dai"],"pdf_url":"https://arxiv.org/pdf/2412.05161v1.pdf","comment":"Project page: https://xzhang-t.github.io/project/DNF/"},{"id":"http://arxiv.org/abs/2412.05158v1","updated":"2024-12-06T16:22:00Z","published":"2024-12-06T16:22:00Z","title":"Gaining Explainability from a CNN for Stereotype Detection Based on Mice\n Stopping Behavior","summary":" Understanding the behavior of laboratory animals is key to finding answers\nabout diseases and neurodevelopmental disorders that also affect humans. One\nbehavior of interest is stopping, as it correlates with exploration,\nfeeding and sleeping habits of individuals. To improve comprehension of\nanimal behavior, we focus on identifying traits revealing the age/sex of mice\nthrough the series of stopping spots of each individual. We track 4 mice using\nthe LiveMouseTracker (LMT) system over 3 days. Then, we build a stack of 2D\nhistograms of the stop positions. This stack of histograms passes through a\nshallow CNN architecture to classify mice in terms of age and sex. We observe\nthat female mice show more recognizable behavioral patterns, reaching a\nclassification accuracy of more than 90%, while males, which do not present as\nmany distinguishable patterns, reach an accuracy of 62.5%. To gain\nexplainability from the model, we look at the activation function of the\nconvolutional layers and find that some regions of the cage are preferentially\nexplored by females. 
Males, especially juveniles, present behavior patterns\nthat oscillate between those of juvenile females and adult males.\n","authors":["Raul Alfredo de Sousa Silva","Yasmine Belaidouni","Rabah Iguernaissi","Djamal Merad","Séverine Dubuisson"],"pdf_url":"https://arxiv.org/pdf/2412.05158v1.pdf","comment":"to be published in VAIB - Visual observation and analysis of\n Vertebrate And Insect Behavior (ICPR) 2024"},{"id":"http://arxiv.org/abs/2405.17446v3","updated":"2024-12-06T16:20:05Z","published":"2024-05-20T20:13:03Z","title":"Comparing ImageNet Pre-training with Digital Pathology Foundation Models\n for Whole Slide Image-Based Survival Analysis","summary":" The abundance of information present in Whole Slide Images (WSIs) renders\nthem an essential tool for survival analysis. Several Multiple Instance\nLearning frameworks proposed for this task utilize a ResNet50 backbone\npre-trained on natural images. By leveraging recently released\nhistopathological foundation models such as UNI and Hibou, the predictive\nprowess of existing MIL networks can be enhanced. Furthermore, deploying an\nensemble of digital pathology foundation models yields higher baseline\naccuracy, although the benefits appear to diminish with more complex MIL\narchitectures. Our code will be made publicly available upon acceptance.\n","authors":["Kleanthis Marios Papadopoulos","Tania Stathaki"],"pdf_url":"https://arxiv.org/pdf/2405.17446v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05154v1","updated":"2024-12-06T16:12:38Z","published":"2024-12-06T16:12:38Z","title":"Towards Flexible 3D Perception: Object-Centric Occupancy Completion\n Augments 3D Object Detection","summary":" While 3D object bounding box (bbox) representation has been widely used in\nautonomous driving perception, it lacks the ability to capture the precise\ndetails of an object's intrinsic geometry. Recently, occupancy has emerged as a\npromising alternative for 3D scene perception. However, constructing a\nhigh-resolution occupancy map remains infeasible for large scenes due to\ncomputational constraints. Recognizing that foreground objects only occupy a\nsmall portion of the scene, we introduce object-centric occupancy as a\nsupplement to object bboxes. This representation not only provides intricate\ndetails for detected objects but also enables higher voxel resolution in\npractical applications. We advance the development of object-centric occupancy\nperception from both data and algorithm perspectives. On the data side, we\nconstruct the first object-centric occupancy dataset from scratch using an\nautomated pipeline. From the algorithmic standpoint, we introduce a novel\nobject-centric occupancy completion network equipped with an implicit shape\ndecoder that manages dynamic-size occupancy generation. This network accurately\npredicts the complete object-centric occupancy volume for inaccurate object\nproposals by leveraging temporal information from long sequences. Our method\ndemonstrates robust performance in completing object shapes under noisy\ndetection and tracking conditions. 
Additionally, we show that our occupancy\nfeatures significantly enhance the detection results of state-of-the-art 3D\nobject detectors, especially for incomplete or distant objects in the Waymo\nOpen Dataset.\n","authors":["Chaoda Zheng","Feng Wang","Naiyan Wang","Shuguang Cui","Zhen Li"],"pdf_url":"https://arxiv.org/pdf/2412.05154v1.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.05150v1","updated":"2024-12-06T16:08:09Z","published":"2024-12-06T16:08:09Z","title":"BIAS: A Body-based Interpretable Active Speaker Approach","summary":" State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on\naudio and facial features to perform, which is not a sustainable approach in\nwild scenarios. Although these methods achieve good results in the standard\nAVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the\nlimitations of such models and raised the need for new approaches. As such, we\npropose BIAS, a model that, for the first time, combines audio, face, and body\ninformation, to accurately predict active speakers in varying/challenging\nconditions. Additionally, we design BIAS to provide interpretability by\nproposing a novel use for Squeeze-and-Excitation blocks, namely in attention\nheatmaps creation and feature importance assessment. For a full\ninterpretability setup, we annotate an ASD-related actions dataset (ASD-Text)\nto finetune a ViT-GPT2 for text scene description to complement BIAS\ninterpretability. The results show that BIAS is state-of-the-art in challenging\nconditions where body-based features are of utmost importance (Columbia,\nopen-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker,\nwhere face is more influential than body for ASD. BIAS interpretability also\nshows the features/aspects more relevant towards ASD prediction in varying\nsettings, making it a strong baseline for further developments in interpretable\nASD models, and is available at https://github.com/Tiago-Roxo/BIAS.\n","authors":["Tiago Roxo","Joana C. Costa","Pedro R. M. Inácio","Hugo Proença"],"pdf_url":"https://arxiv.org/pdf/2412.05150v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05270v2","updated":"2024-12-06T16:07:47Z","published":"2024-10-07T17:59:59Z","title":"Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia","summary":" We consider the problem of adapting a contrastively pretrained\nvision-language model like CLIP (Radford et al., 2021) for few-shot\nclassification. The literature addresses this problem by learning a linear\nclassifier of the frozen visual features, optimizing word embeddings, or\nlearning external feature adapters. This paper introduces an alternative way\nfor CLIP adaptation without adding 'external' parameters to optimize. We find\nthat simply fine-tuning the last projection matrix of the vision encoder leads\nto performance better than all baselines. Furthermore, we show that\nregularizing training with the distance between the fine-tuned and pretrained\nmatrices adds reliability for adapting CLIP. This simple approach, coined\nProLIP, yields state-of-the-art performance on 11 few-shot classification\nbenchmarks, few-shot domain generalization, cross-dataset transfer, base-to-new\nclass generalization, and test-time adaptation. 
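As a rough illustration of the recipe summarized above (train only the last visual projection and penalize its distance to the pretrained weights), here is a minimal PyTorch sketch. It assumes an open_clip/OpenAI-CLIP-style model that exposes the final visual projection as model.visual.proj and precomputed, L2-normalized class text embeddings; the hyper-parameters are illustrative, not the paper's.

import torch
import torch.nn.functional as F

def finetune_last_projection(model, loader, text_feats, lambda_reg=1.0, lr=1e-4, epochs=5):
    # Train only the final visual projection matrix; freeze everything else.
    proj = model.visual.proj                    # assumed nn.Parameter of the vision tower
    pretrained = proj.detach().clone()          # frozen copy used for regularization
    for p in model.parameters():
        p.requires_grad_(p is proj)
    opt = torch.optim.Adam([proj], lr=lr)
    for _ in range(epochs):
        for images, labels in loader:           # few-shot support set
            feats = model.encode_image(images)
            feats = feats / feats.norm(dim=-1, keepdim=True)
            logits = 100.0 * feats @ text_feats.t()                      # cosine-similarity classifier
            loss = F.cross_entropy(logits, labels)
            loss = loss + lambda_reg * (proj - pretrained).pow(2).sum()  # stay near pretrained weights
            opt.zero_grad(); loss.backward(); opt.step()
    return model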
Code will be made available at:\nhttps://github.com/astra-vision/ProLIP .\n","authors":["Mohammad Fahes","Tuan-Hung Vu","Andrei Bursuc","Patrick Pérez","Raoul de Charette"],"pdf_url":"https://arxiv.org/pdf/2410.05270v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05148v1","updated":"2024-12-06T16:04:56Z","published":"2024-12-06T16:04:56Z","title":"LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style\n Conditioned Image Generation","summary":" Recent advancements in image generation models have enabled personalized\nimage creation with both user-defined subjects (content) and styles. Prior\nworks achieved personalization by merging corresponding low-rank adaptation\nparameters (LoRAs) through optimization-based methods, which are\ncomputationally demanding and unsuitable for real-time use on\nresource-constrained devices like smartphones. To address this, we introduce\nLoRA.rar, a method that not only improves image quality but also achieves a\nremarkable speedup of over $4000\\times$ in the merging process. LoRA.rar\npre-trains a hypernetwork on a diverse set of content-style LoRA pairs,\nlearning an efficient merging strategy that generalizes to new, unseen\ncontent-style pairs, enabling fast, high-quality personalization. Moreover, we\nidentify limitations in existing evaluation metrics for content-style quality\nand propose a new protocol using multimodal large language models (MLLM) for\nmore accurate assessment. Our method significantly outperforms the current\nstate of the art in both content and style fidelity, as validated by MLLM\nassessments and human evaluations.\n","authors":["Donald Shenaj","Ondrej Bohdal","Mete Ozay","Pietro Zanuttigh","Umberto Michieli"],"pdf_url":"https://arxiv.org/pdf/2412.05148v1.pdf","comment":"17 pages, 20 figures"},{"id":"http://arxiv.org/abs/2412.05134v1","updated":"2024-12-06T15:47:53Z","published":"2024-12-06T15:47:53Z","title":"How to Squeeze An Explanation Out of Your Model","summary":" Deep learning models are widely used nowadays for their reliability in\nperforming various tasks. However, they do not typically provide the reasoning\nbehind their decision, which is a significant drawback, particularly for more\nsensitive areas such as biometrics, security and healthcare. The most commonly\nused approaches to provide interpretability create visual attention heatmaps of\nregions of interest on an image based on models gradient backpropagation.\nAlthough this is a viable approach, current methods are targeted toward image\nsettings and default/standard deep learning models, meaning that they require\nsignificant adaptations to work on video/multi-modal settings and custom\narchitectures. This paper proposes an approach for interpretability that is\nmodel-agnostic, based on a novel use of the Squeeze and Excitation (SE) block\nthat creates visual attention heatmaps. By including an SE block prior to the\nclassification layer of any model, we are able to retrieve the most influential\nfeatures via SE vector manipulation, one of the key components of the SE block.\nOur results show that this new SE-based interpretability can be applied to\nvarious models in image and video/multi-modal settings, namely biometrics of\nfacial features with CelebA and behavioral biometrics using Active Speaker\nDetection datasets. 
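A minimal sketch of the general mechanism described above: place a Squeeze-and-Excitation block in front of the classifier and read its excitation vector as a per-channel importance score. The module below is a generic illustration, not the paper's exact architecture.

import torch
import torch.nn as nn

class SEHead(nn.Module):
    # SE block before a classifier; the excitation vector doubles as an explanation.
    def __init__(self, channels, num_classes, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feats, return_weights=False):
        # feats: (batch, channels) pooled backbone features
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(feats))))  # excitation vector in [0, 1]
        logits = self.classifier(feats * w)                       # re-weighted features
        return (logits, w) if return_weights else logits

head = SEHead(channels=512, num_classes=2)
logits, w = head(torch.randn(4, 512), return_weights=True)
top_channels = w[0].topk(5).indices   # most influential feature channels for sample 0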
Furthermore, our proposal does not compromise model\nperformance toward the original task, and has competitive results with current\ninterpretability approaches in state-of-the-art object datasets, highlighting\nits robustness to perform in varying data aside from the biometric context.\n","authors":["Tiago Roxo","Joana C. Costa","Pedro R. M. Inácio","Hugo Proença"],"pdf_url":"https://arxiv.org/pdf/2412.05134v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04384v2","updated":"2024-12-06T15:43:40Z","published":"2024-12-05T17:59:58Z","title":"GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D\n Occupancy Prediction","summary":" 3D semantic occupancy prediction is an important task for robust\nvision-centric autonomous driving, which predicts fine-grained geometry and\nsemantics of the surrounding scene. Most existing methods leverage dense\ngrid-based scene representations, overlooking the spatial sparsity of the\ndriving scenes. Although 3D semantic Gaussian serves as an object-centric\nsparse alternative, most of the Gaussians still describe the empty region with\nlow efficiency. To address this, we propose a probabilistic Gaussian\nsuperposition model which interprets each Gaussian as a probability\ndistribution of its neighborhood being occupied and conforms to probabilistic\nmultiplication to derive the overall geometry. Furthermore, we adopt the exact\nGaussian mixture model for semantics calculation to avoid unnecessary\noverlapping of Gaussians. To effectively initialize Gaussians in non-empty\nregion, we design a distribution-based initialization module which learns the\npixel-aligned occupancy distribution instead of the depth of surfaces. We\nconduct extensive experiments on nuScenes and KITTI-360 datasets and our\nGaussianFormer-2 achieves state-of-the-art performance with high efficiency.\nCode: https://github.com/huang-yh/GaussianFormer.\n","authors":["Yuanhui Huang","Amonnut Thammatadatrakoon","Wenzhao Zheng","Yunpeng Zhang","Dalong Du","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04384v2.pdf","comment":"Code is available at: https://github.com/huang-yh/GaussianFormer"},{"id":"http://arxiv.org/abs/2412.04380v2","updated":"2024-12-06T15:43:38Z","published":"2024-12-05T17:57:09Z","title":"EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online\n Scene Understanding","summary":" 3D occupancy prediction provides a comprehensive description of the\nsurrounding scenes and has become an essential task for 3D perception. Most\nexisting methods focus on offline perception from one or a few views and cannot\nbe applied to embodied agents which demands to gradually perceive the scene\nthrough progressive embodied exploration. In this paper, we formulate an\nembodied 3D occupancy prediction task to target this practical scenario and\npropose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize\nthe global scene with uniform 3D semantic Gaussians and progressively update\nlocal regions observed by the embodied agent. For each update, we extract\nsemantic and structural features from the observed image and efficiently\nincorporate them via deformable cross-attention to refine the regional\nGaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global\n3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown\n(i.e., uniformly distributed) environment and maintains an explicit global\nmemory of it with 3D Gaussians. 
It gradually gains knowledge through the local\nrefinement of regional Gaussians, which is consistent with how humans\nunderstand new scenes through embodied exploration. We reorganize an\nEmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the\nevaluation of the embodied 3D occupancy prediction task. Experiments\ndemonstrate that our EmbodiedOcc outperforms existing local prediction methods\nand accomplishes the embodied occupancy prediction with high accuracy and\nstrong expandability. Code: https://github.com/YkiWu/EmbodiedOcc.\n","authors":["Yuqi Wu","Wenzhao Zheng","Sicheng Zuo","Yuanhui Huang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04380v2.pdf","comment":"Code: https://github.com/YkiWu/EmbodiedOcc"},{"id":"http://arxiv.org/abs/2410.14462v3","updated":"2024-12-06T15:39:13Z","published":"2024-10-18T13:44:29Z","title":"LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian\n Splatting scenes","summary":" We address the problem of extending the capabilities of vision foundation\nmodels such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a\nnovel method to uplift 2D image features into 3D Gaussian Splatting scenes.\nUnlike traditional approaches that rely on minimizing a reconstruction loss,\nour method employs a simpler and more efficient feature aggregation technique,\naugmented by a graph diffusion mechanism. Graph diffusion enriches features\nfrom a given model, such as CLIP, by leveraging 3D geometry and pairwise\nsimilarities induced by another strong model such as DINOv2. Our approach\nachieves performance comparable to the state of the art on multiple downstream\ntasks while delivering significant speed-ups. Notably, we obtain competitive\nsegmentation results using generic DINOv2 features, despite DINOv2 not being\ntrained on millions of annotated segmentation masks like SAM. When applied to\nCLIP features, our method demonstrates strong performance in open-vocabulary\nobject detection tasks, highlighting the versatility of our approach.\n","authors":["Juliette Marrie","Romain Menegaux","Michael Arbel","Diane Larlus","Julien Mairal"],"pdf_url":"https://arxiv.org/pdf/2410.14462v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18857v2","updated":"2024-12-06T15:20:28Z","published":"2024-10-24T15:42:25Z","title":"Probabilistic Language-Image Pre-Training","summary":" Vision-language models (VLMs) embed aligned image-text pairs into a joint\nspace but often rely on deterministic embeddings, assuming a one-to-one\ncorrespondence between images and texts. This oversimplifies real-world\nrelationships, which are inherently many-to-many, with multiple captions\ndescribing a single image and vice versa. We introduce Probabilistic\nLanguage-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained\non a billion-scale image-text dataset using only probabilistic objectives,\nachieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot\naccuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an\n\"uncertainty token\" without extra parameters. We also introduce a novel\ninclusion loss that enforces distributional inclusion relationships between\nimage-text pairs and between original and masked inputs. Experiments\ndemonstrate that, by leveraging uncertainty estimates, ProLIP benefits\ndownstream tasks and aligns with intuitive notions of uncertainty, e.g.,\nshorter texts being more uncertain and more general inputs including specific\nones. 
Utilizing text uncertainties, we further improve ImageNet accuracy from\n74.6% to 75.8% (under a few-shot setting), supporting the practical advantages\nof our probabilistic approach. The code is available at\nhttps://github.com/naver-ai/prolip\n","authors":["Sanghyuk Chun","Wonjae Kim","Song Park","Sangdoo Yun"],"pdf_url":"https://arxiv.org/pdf/2410.18857v2.pdf","comment":"Code: https://github.com/naver-ai/prolip HuggingFace Hub:\n https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291\n 31 pages, 4.29 MB"},{"id":"http://arxiv.org/abs/2406.11933v4","updated":"2024-12-06T15:10:36Z","published":"2024-06-17T15:41:57Z","title":"Scaling Efficient Masked Image Modeling on Large Remote Sensing Dataset","summary":" Masked Image Modeling (MIM) has become an essential method for building\nfoundational visual models in remote sensing (RS). However, the limitations in\nsize and diversity of existing RS datasets restrict the ability of MIM methods\nto learn generalizable representations. Additionally, conventional MIM\ntechniques, which require reconstructing all tokens, introduce unnecessary\ncomputational overhead. To address these issues, we present a new pre-training\npipeline for RS models, featuring the creation of a large-scale RS dataset and\nan efficient MIM approach. We curated a high-quality dataset named\nOpticalRS-13M by collecting publicly available RS datasets and processing them\nthrough exclusion, slicing, and deduplication. OpticalRS-13M comprises 13\nmillion optical images covering various RS tasks, such as object detection and\npixel segmentation. To enhance efficiency, we propose SelectiveMAE, a\npre-training method that dynamically encodes and reconstructs semantically rich\npatch tokens, thereby reducing the inefficiencies of traditional MIM models\ncaused by redundant background pixels in RS images. Extensive experiments\ndemonstrate that OpticalRS-13M significantly improves classification,\ndetection, and segmentation performance, while SelectiveMAE increases training\nefficiency over 2 times. This highlights the effectiveness and scalability of\nour pipeline in developing RS foundational models.\n","authors":["Fengxiang Wang","Hongzhen Wang","Di Wang","Zonghao Guo","Zhenyu Zhong","Long Lan","Jing Zhang","Zhiyuan Liu","Maosong Sun"],"pdf_url":"https://arxiv.org/pdf/2406.11933v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05101v1","updated":"2024-12-06T14:59:00Z","published":"2024-12-06T14:59:00Z","title":"The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven\n Image Generation","summary":" Text-to-image synthesis (T2I) has advanced remarkably with the emergence of\nlarge-scale diffusion models. In the conventional setup, the text prompt\nprovides explicit, user-defined guidance, directing the generation process by\ndenoising a randomly sampled Gaussian noise. In this work, we reveal that the\noften-overlooked noise itself encodes inherent generative tendencies, acting as\na \"silent prompt\" that implicitly guides the output. This implicit guidance,\nembedded in the noise scheduler design of diffusion model formulations and\ntheir training stages, generalizes across a wide range of T2I models and\nbackbones. Building on this insight, we introduce NoiseQuery, a novel strategy\nthat selects optimal initial noise from a pre-built noise library to meet\ndiverse user needs. 
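Purely as a sketch of the selection idea, and not the paper's actual scoring, one can imagine a noise library in which each seed is tagged offline with attribute scores measured from images it tends to produce, and a query simply picks the closest entry; every name and value below is a hypothetical placeholder.

import torch

# Hypothetical noise library: each entry stores a seed plus cached attribute
# scores (e.g., warmth, sharpness) measured once from images generated with it.
library = [
    {"seed": 0, "attrs": torch.tensor([0.2, 0.8])},
    {"seed": 1, "attrs": torch.tensor([0.9, 0.1])},
    {"seed": 2, "attrs": torch.tensor([0.5, 0.5])},
]

def select_noise(target_attrs, shape=(1, 4, 64, 64)):
    # Pick the library seed whose cached attributes best match the request,
    # then reproduce its Gaussian noise deterministically from the seed.
    best = min(library, key=lambda e: torch.dist(e["attrs"], target_attrs))
    gen = torch.Generator().manual_seed(best["seed"])
    return torch.randn(shape, generator=gen)

noise = select_noise(torch.tensor([0.85, 0.2]))  # request a "warm, soft" starting noise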
Our approach not only enhances high-level semantic\nalignment with text prompts, but also allows for nuanced adjustments of\nlow-level visual attributes, such as texture, sharpness, shape, and color,\nwhich are typically challenging to control through text alone. Extensive\nexperiments across various models and target attributes demonstrate the strong\nperformance and zero-shot transferability of our approach, requiring no\nadditional optimization.\n","authors":["Ruoyu Wang","Huayang Huang","Ye Zhu","Olga Russakovsky","Yu Wu"],"pdf_url":"https://arxiv.org/pdf/2412.05101v1.pdf","comment":"18 pages, 18 figures, 6 tables"},{"id":"http://arxiv.org/abs/2309.07846v4","updated":"2024-12-06T14:53:06Z","published":"2023-09-14T16:40:44Z","title":"MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image\n Acquisition Systems","summary":" Neural Radiance Fields (NeRF) use multi-view images for 3D scene\nrepresentation, demonstrating remarkable performance. As one of the primary\nsources of multi-view images, multi-camera systems encounter challenges such as\nvarying intrinsic parameters and frequent pose changes. Most previous\nNeRF-based methods assume a unique camera and rarely consider multi-camera\nscenarios. Besides, some NeRF methods that can optimize intrinsic and extrinsic\nparameters still remain susceptible to suboptimal solutions when these\nparameters are poor initialized. In this paper, we propose MC-NeRF, a method\nthat enables joint optimization of both intrinsic and extrinsic parameters\nalongside NeRF. The method also supports each image corresponding to\nindependent camera parameters. First, we tackle coupling issue and the\ndegenerate case that arise from the joint optimization between intrinsic and\nextrinsic parameters. Second, based on the proposed solutions, we introduce an\nefficient calibration image acquisition scheme for multi-camera systems,\nincluding the design of calibration object. Finally, we present an end-to-end\nnetwork with training sequence that enables the estimation of intrinsic and\nextrinsic parameters, along with the rendering network. Furthermore,\nrecognizing that most existing datasets are designed for a unique camera, we\nconstruct a real multi-camera image acquisition system and create a\ncorresponding new dataset, which includes both simulated data and real-world\ncaptured images. Experiments confirm the effectiveness of our method when each\nimage corresponds to different camera parameters. Specifically, we use\nmulti-cameras, each with different intrinsic and extrinsic parameters in\nreal-world system, to achieve 3D scene representation without providing initial\nposes.\n","authors":["Yu Gao","Lutong Su","Hao Liang","Yufeng Yue","Yi Yang","Mengyin Fu"],"pdf_url":"https://arxiv.org/pdf/2309.07846v4.pdf","comment":"This manuscript is currently under review"},{"id":"http://arxiv.org/abs/2407.10921v5","updated":"2024-12-06T14:51:41Z","published":"2024-07-15T17:22:16Z","title":"Leveraging Bi-Focal Perspectives and Granular Feature Integration for\n Accurate Reliable Early Alzheimer's Detection","summary":" Alzheimer's disease (AD) is the most common neurodegeneration, annually\ndiagnosed in millions of patients. The present medicine scenario still finds\nchallenges in the exact diagnosis and classification of AD through neuroimaging\ndata. Traditional CNNs can extract a good amount of low-level information in an\nimage but fail to extract high-level minuscule particles, which is a\nsignificant challenge in detecting AD from MRI scans. 
To overcome this, we\npropose a novel Granular Feature Integration method to combine information\nextraction at different scales combined with an efficient information flow,\nenabling the model to capture both broad and fine-grained features\nsimultaneously. We also propose a Bi-Focal Perspective mechanism to highlight\nthe subtle neurofibrillary tangles and amyloid plaques in the MRI scans,\nensuring that critical pathological markers are accurately identified. Our\nmodel achieved an F1-Score of 99.31%, precision of 99.24%, and recall of\n99.51%. These scores prove that our model is significantly better than the\nstate-of-the-art (SOTA) CNNs in existence.\n","authors":["Pandiyaraju V","Shravan Venkatraman","Abeshek A","Pavan Kumar S","Aravintakshan S A"],"pdf_url":"https://arxiv.org/pdf/2407.10921v5.pdf","comment":"14 pages, 12 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.05095v1","updated":"2024-12-06T14:50:38Z","published":"2024-12-06T14:50:38Z","title":"SoPo: Text-to-Motion Generation Using Semi-Online Preference\n Optimization","summary":" Text-to-motion generation is essential for advancing the creative industry\nbut often presents challenges in producing consistent, realistic motions. To\naddress this, we focus on fine-tuning text-to-motion models to consistently\nfavor high-quality, human-preferred motions, a critical yet largely unexplored\nproblem. In this work, we theoretically investigate the DPO under both online\nand offline settings, and reveal their respective limitation: overfitting in\noffline DPO, and biased sampling in online DPO. Building on our theoretical\ninsights, we introduce Semi-online Preference Optimization (SoPo), a DPO-based\nmethod for training text-to-motion models using \"semi-online\" data pair,\nconsisting of unpreferred motion from online distribution and preferred motion\nin offline datasets. This method leverages both online and offline DPO,\nallowing each to compensate for the other's limitations. Extensive experiments\ndemonstrate that SoPo outperforms other preference alignment methods, with an\nMM-Dist of 3.25% (vs e.g. 0.76% of MoDiPO) on the MLD model, 2.91% (vs e.g.\n0.66% of MoDiPO) on MDM model, respectively. Additionally, the MLD model\nfine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM\nDist. Visualization results also show the efficacy of our SoPo in preference\nalignment. Our project page is https://sopo-motion.github.io.\n","authors":["Xiaofeng Tan","Hongsong Wang","Xin Geng","Pan Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.05095v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.02058v2","updated":"2024-12-06T14:49:23Z","published":"2024-06-04T07:42:33Z","title":"OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary\n Understanding","summary":" This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting\n(3DGS) capable of 3D point-level open vocabulary understanding. Our primary\nmotivation stems from observing that existing 3DGS-based open vocabulary\nmethods mainly focus on 2D pixel-level parsing. These methods struggle with 3D\npoint-level tasks due to weak feature expressiveness and inaccurate 2D-3D\nfeature associations. To ensure robust feature presentation and 3D point-level\nunderstanding, we first employ SAM masks without cross-frame associations to\ntrain instance features with 3D consistency. These features exhibit both\nintra-object consistency and inter-object distinction. 
Then, we propose a\ntwo-stage codebook to discretize these features from coarse to fine levels. At\nthe coarse level, we consider the positional information of 3D points to\nachieve location-based clustering, which is then refined at the fine level.\nFinally, we introduce an instance-level 3D-2D feature association method that\nlinks 3D points to 2D masks, which are further associated with 2D CLIP\nfeatures. Extensive experiments, including open vocabulary-based 3D object\nselection, 3D point cloud understanding, click-based 3D object selection, and\nablation studies, demonstrate the effectiveness of our proposed method. The\nsource code is available at our project page:\nhttps://3d-aigc.github.io/OpenGaussian\n","authors":["Yanmin Wu","Jiarui Meng","Haijie Li","Chenming Wu","Yahao Shi","Xinhua Cheng","Chen Zhao","Haocheng Feng","Errui Ding","Jingdong Wang","Jian Zhang"],"pdf_url":"https://arxiv.org/pdf/2406.02058v2.pdf","comment":"NeurIPS2024"},{"id":"http://arxiv.org/abs/2412.05084v1","updated":"2024-12-06T14:42:50Z","published":"2024-12-06T14:42:50Z","title":"Reconstructing Quantitative Cerebral Perfusion Images Directly From\n Measured Sinogram Data Acquired Using C-arm Cone-Beam CT","summary":" To shorten the door-to-puncture time for better treating patients with acute\nischemic stroke, it is highly desired to obtain quantitative cerebral perfusion\nimages using C-arm cone-beam computed tomography (CBCT) equipped in the\ninterventional suite. However, limited by the slow gantry rotation speed, the\ntemporal resolution and temporal sampling density of typical C-arm CBCT are\nmuch poorer than those of multi-detector-row CT in the diagnostic imaging\nsuite. The current quantitative perfusion imaging includes two cascaded steps:\ntime-resolved image reconstruction and perfusion parametric estimation. For\ntime-resolved image reconstruction, the technical challenge imposed by poor\ntemporal resolution and poor sampling density causes inaccurate quantification\nof the temporal variation of cerebral artery and tissue attenuation values. For\nperfusion parametric estimation, it remains a technical challenge to\nappropriately design the handcrafted regularization for better solving the\nassociated deconvolution problem. These two challenges together prevent\nobtaining quantitatively accurate perfusion images using C-arm CBCT. The\npurpose of this work is to simultaneously address these two challenges by\ncombining the two cascaded steps into a single joint optimization problem and\nreconstructing quantitative perfusion images directly from the measured\nsinogram data. In the developed direct cerebral perfusion parametric image\nreconstruction technique, TRAINER in short, the quantitative perfusion images\nhave been represented as a subject-specific conditional generative model\ntrained under the constraint of the time-resolved CT forward model, perfusion\nconvolutional model, and the subject's own measured sinogram data. 
Results\nshown in this paper demonstrated that using TRAINER, quantitative cerebral\nperfusion images can be accurately obtained using C-arm CBCT in the\ninterventional suite.\n","authors":["Haotian Zhao","Ruifeng Chen","Jing Yan","Juan Feng","Jun Xiang","Yang Chen","Dong Liang","Yinsheng Li"],"pdf_url":"https://arxiv.org/pdf/2412.05084v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05081v1","updated":"2024-12-06T14:39:06Z","published":"2024-12-06T14:39:06Z","title":"Spinal ligaments detection on vertebrae meshes using registration and 3D\n edge detection","summary":" Spinal ligaments are crucial elements in the complex biomechanical simulation\nmodels as they transfer forces on the bony structure, guide and limit movements\nand stabilize the spine. The spinal ligaments encompass seven major groups\nbeing responsible for maintaining functional interrelationships among the other\nspinal components. Determination of the ligament origin and insertion points on\nthe 3D vertebrae models is an essential step in building accurate and complex\nspine biomechanical models. In our paper, we propose a pipeline that is able to\ndetect 66 spinal ligament attachment points by using a step-wise approach. Our\nmethod incorporates a fast vertebra registration that strategically extracts\nonly 15 3D points to compute the transformation, and edge detection for a\nprecise projection of the registered ligaments onto any given patient-specific\nvertebra model. Our method shows high accuracy, particularly in identifying\nlandmarks on the anterior part of the vertebra with an average distance of 2.24\nmm for anterior longitudinal ligament and 1.26 mm for posterior longitudinal\nligament landmarks. The landmark detection requires approximately 3.0 seconds\nper vertebra, providing a substantial improvement over existing methods.\nClinical relevance: using the proposed method, the required landmarks that\nrepresent origin and insertion points for forces in the biomechanical spine\nmodels can be localized automatically in an accurate and time-efficient manner.\n","authors":["Ivanna Kramer","Lara Blomenkamp","Kevin Weirauch","Sabine Bauer","Dietrich Paulus"],"pdf_url":"https://arxiv.org/pdf/2412.05081v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05076v1","updated":"2024-12-06T14:34:32Z","published":"2024-12-06T14:34:32Z","title":"Improving analytical color and texture similarity estimation methods for\n dataset-agnostic person reidentification","summary":" This paper studies a combined person reidentification (re-id) method that\nuses human parsing, analytical feature extraction and similarity estimation\nschemes. One of its prominent features is its low computational requirements so\nit can be implemented on edge devices. The method allows direct comparison of\nspecific image regions using interpretable features which consist of color and\ntexture channels. It is proposed to analyze and compare colors in CIE-Lab color\nspace using histogram smoothing for noise reduction. A novel pre-configured\nlatent space (LS) supervised autoencoder (SAE) is proposed for texture analysis\nwhich encodes input textures as LS points. This allows to obtain more accurate\nsimilarity measures compared to simplistic label comparison. The proposed\nmethod also does not rely upon photos or other re-id data for training, which\nmakes it completely re-id dataset-agnostic. The viability of the proposed\nmethod is verified by computing rank-1, rank-10, and mAP re-id metrics on\nMarket1501 dataset. 
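A minimal sketch of the color part of such a comparison, assuming OpenCV and SciPy: convert a region to CIE-Lab, build smoothed per-channel histograms, and compare them. The exact smoothing and similarity measure used in the paper may differ.

import cv2
import numpy as np
from scipy.ndimage import gaussian_filter1d

def lab_histogram(region_bgr, bins=32, sigma=1.5):
    # Smoothed per-channel histogram of an image region in CIE-Lab space.
    lab = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2LAB)
    hists = []
    for ch in range(3):
        h, _ = np.histogram(lab[..., ch], bins=bins, range=(0, 256), density=True)
        hists.append(gaussian_filter1d(h, sigma))   # smoothing reduces histogram noise
    return np.concatenate(hists)

def color_similarity(region_a, region_b):
    ha, hb = lab_histogram(region_a), lab_histogram(region_b)
    # Cosine similarity between smoothed histograms (1.0 = identical color profile).
    return float(np.dot(ha, hb) / (np.linalg.norm(ha) * np.linalg.norm(hb) + 1e-8))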
The results are comparable to those of conventional deep\nlearning methods and the potential ways to further improve the method are\ndiscussed.\n","authors":["Nikita Gabdullin"],"pdf_url":"https://arxiv.org/pdf/2412.05076v1.pdf","comment":"8 pages, 2 figures, 3 tables, 3 equations"},{"id":"http://arxiv.org/abs/2412.05074v1","updated":"2024-12-06T14:32:25Z","published":"2024-12-06T14:32:25Z","title":"LoFi: Vision-Aided Label Generator for Wi-Fi Localization and Tracking","summary":" Wi-Fi localization and tracking has shown immense potential due to its\nprivacy-friendliness, wide coverage, permeability, independence from lighting\nconditions, and low cost. Current methods can be broadly categorized as\nmodel-based and data-driven approaches, where data-driven methods show better\nperformance and have less requirement for specialized devices, but struggle\nwith limited datasets for training. Due to limitations in current data\ncollection methods, most datasets only provide coarse-grained ground truth (GT)\nor limited amount of label points, which greatly hinders the development of\ndata-driven methods. Even though lidar can provide accurate GT, their high cost\nmakes them inaccessible to many users. To address these challenges, we propose\nLoFi, a vision-aided label generator for Wi-Fi localization and tracking, which\ncan generate ground truth position coordinates solely based on 2D images. The\neasy and quick data collection method also helps data-driven based methods\ndeploy in practice, since Wi-Fi is a low-generalization modality and when using\nrelevant methods, it always requires fine-tuning the model using newly\ncollected data. Based on our method, we also collect a Wi-Fi tracking and\nlocalization dataset using ESP32-S3 and a webcam. To facilitate future\nresearch, we will make our code and dataset publicly available upon\npublication.\n","authors":["Zijian Zhao","Tingwei Chen","Fanyi Meng","Zhijie Cai","Hang Li","Xiaoyang Li","Guangxu Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.05074v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05066v1","updated":"2024-12-06T14:23:56Z","published":"2024-12-06T14:23:56Z","title":"BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction\n with Articulated Objects","summary":" We present BimArt, a novel generative approach for synthesizing 3D bimanual\nhand interactions with articulated objects. Unlike prior works, we do not rely\non a reference grasp, a coarse hand trajectory, or separate modes for grasping\nand articulating. To achieve this, we first generate distance-based contact\nmaps conditioned on the object trajectory with an articulation-aware feature\nrepresentation, revealing rich bimanual patterns for manipulation. The learned\ncontact prior is then used to guide our hand motion generator, producing\ndiverse and realistic bimanual motions for object movement and articulation.\nOur work offers key insights into feature representation and contact prior for\narticulated objects, demonstrating their effectiveness in taming the complex,\nhigh-dimensional space of bimanual hand-object interactions. 
Through\ncomprehensive quantitative experiments, we demonstrate a clear step towards\nsimplified and high-quality hand-object animations that excel over the\nstate-of-the-art in motion quality and diversity.\n","authors":["Wanyue Zhang","Rishabh Dabral","Vladislav Golyanik","Vasileios Choutas","Eduardo Alvarado","Thabo Beeler","Marc Habermann","Christian Theobalt"],"pdf_url":"https://arxiv.org/pdf/2412.05066v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05065v1","updated":"2024-12-06T14:23:42Z","published":"2024-12-06T14:23:42Z","title":"Reconstruction of 3D lumbar spine models from incomplete segmentations\n using landmark detection","summary":" Patient-specific 3D spine models serve as a foundation for spinal treatment\nand surgery planning as well as analysis of loading conditions in biomechanical\nand biomedical research. Despite advancements in imaging technologies, the\nreconstruction of complete 3D spine models often faces challenges due to\nlimitations in imaging modalities such as planar X-Ray and missing certain\nspinal structures, such as the spinal or transverse processes, in volumetric\nmedical images and resulting segmentations. In this study, we present a novel\naccurate and time-efficient method to reconstruct complete 3D lumbar spine\nmodels from incomplete 3D vertebral bodies obtained from segmented magnetic\nresonance images (MRI). In our method, we use an affine transformation to align\nartificial vertebra models with patient-specific incomplete vertebrae. The\ntransformation matrix is derived from vertebra landmarks, which are\nautomatically detected on the vertebra endplates. The results of our evaluation\ndemonstrate the high accuracy of the performed registration, achieving an\naverage point-to-model distance of 1.95 mm. Additionally, in assessing the\nmorphological properties of the vertebrae and intervertebral characteristics,\nour method demonstrated a mean absolute error (MAE) of 3.4{\\deg} in the angles\nof functional spine units (FSUs), emphasizing its effectiveness in maintaining\nimportant spinal features throughout the transformation process of individual\nvertebrae. Our method achieves the registration of the entire lumbar spine,\nspanning segments L1 to L5, in just 0.14 seconds, showcasing its\ntime-efficiency. Clinical relevance: the fast and accurate reconstruction of\nspinal models from incomplete input data such as segmentations provides a\nfoundation for many applications in spine diagnostics, treatment planning, and\nthe development of spinal healthcare solutions.\n","authors":["Lara Blomenkamp","Ivanna Kramer","Sabine Bauer","Kevin Weirauch","Dietrich Paulus"],"pdf_url":"https://arxiv.org/pdf/2412.05065v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.07057v2","updated":"2024-12-06T14:21:06Z","published":"2024-06-11T08:38:13Z","title":"MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal\n Large Language Models","summary":" Despite the superior capabilities of Multimodal Large Language Models (MLLMs)\nacross diverse tasks, they still face significant trustworthiness challenges.\nYet, current literature on the assessment of trustworthy MLLMs remains limited,\nlacking a holistic evaluation to offer thorough insights into future\nimprovements. In this work, we establish MultiTrust, the first comprehensive\nand unified benchmark on the trustworthiness of MLLMs across five primary\naspects: truthfulness, safety, robustness, fairness, and privacy. 
Our benchmark\nemploys a rigorous evaluation strategy that addresses both multimodal risks and\ncross-modal impacts, encompassing 32 diverse tasks with self-curated datasets.\nExtensive experiments with 21 modern MLLMs reveal some previously unexplored\ntrustworthiness issues and risks, highlighting the complexities introduced by\nthe multimodality and underscoring the necessity for advanced methodologies to\nenhance their reliability. For instance, typical proprietary models still\nstruggle with the perception of visually confusing images and are vulnerable to\nmultimodal jailbreaking and adversarial attacks; MLLMs are more inclined to\ndisclose privacy in text and reveal ideological and cultural biases even when\npaired with irrelevant images in inference, indicating that the multimodality\namplifies the internal risks from base LLMs. Additionally, we release a\nscalable toolbox for standardized trustworthiness research, aiming to\nfacilitate future advancements in this important field. Code and resources are\npublicly available at: https://multi-trust.github.io/.\n","authors":["Yichi Zhang","Yao Huang","Yitong Sun","Chang Liu","Zhe Zhao","Zhengwei Fang","Yifan Wang","Huanran Chen","Xiao Yang","Xingxing Wei","Hang Su","Yinpeng Dong","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2406.07057v2.pdf","comment":"100 pages, 84 figures, 33 tables"},{"id":"http://arxiv.org/abs/2407.04513v2","updated":"2024-12-06T14:20:26Z","published":"2024-07-05T13:54:15Z","title":"LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing\n Layer Execution Order","summary":" Due to their architecture and how they are trained, artificial neural\nnetworks are typically not robust toward pruning or shuffling layers at test\ntime. However, such properties would be desirable for different applications,\nsuch as distributed neural network architectures where the order of execution\ncannot be guaranteed or parts of the network can fail during inference. In this\nwork, we address these issues through a number of training approaches for\nvision transformers whose most important component is randomizing the execution\norder of attention modules at training time. With our proposed approaches,\nvision transformers are capable to adapt to arbitrary layer execution orders at\ntest time assuming one tolerates a reduction (about 20\\%) in accuracy at the\nsame model size. We analyse the feature representations of our trained models\nas well as how each layer contributes to the models prediction based on its\nposition during inference. Our analysis shows that layers learn to contribute\ndifferently based on their position in the network. Finally, we layer-prune our\nmodels at test time and find that their performance declines gracefully. Code\navailable at https://github.com/matfrei/layershuffle.\n","authors":["Matthias Freiberger","Peter Kun","Anders Sundnes Løvlie","Sebastian Risi"],"pdf_url":"https://arxiv.org/pdf/2407.04513v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05053v1","updated":"2024-12-06T14:08:08Z","published":"2024-12-06T14:08:08Z","title":"EvTTC: An Event Camera Dataset for Time-to-Collision Estimation","summary":" Time-to-Collision (TTC) estimation lies in the core of the forward collision\nwarning (FCW) functionality, which is key to all Automatic Emergency Braking\n(AEB) systems. 
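For context, the quantity being estimated is simple under a constant-velocity assumption; the second variant below is the classic vision-only approximation from apparent size growth, shown only as an illustration rather than the dataset's ground-truth pipeline.

def ttc_from_range(range_m, closing_speed_mps):
    # Constant-velocity TTC: remaining distance divided by closing speed.
    return float('inf') if closing_speed_mps <= 0 else range_m / closing_speed_mps

def ttc_from_scale(h_prev, h_curr, dt):
    # Vision-only TTC from the apparent size growth of the lead object between
    # two frames dt seconds apart: TTC ~= dt / (s - 1) with s = h_curr / h_prev.
    s = h_curr / h_prev
    return float('inf') if s <= 1.0 else dt / (s - 1.0)

print(ttc_from_range(20.0, 8.0))        # lead vehicle 20 m ahead, closing at 8 m/s -> 2.5 s
print(ttc_from_scale(80.0, 84.0, 0.05)) # box grew from 80 px to 84 px in 0.05 s -> ~1.0 s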
Although the success of solutions using frame-based cameras\n(e.g., Mobileye's solutions) has been witnessed in normal situations, some\nextreme cases, such as the sudden variation in the relative speed of leading\nvehicles and the sudden appearance of pedestrians, still pose significant risks\nthat cannot be handled. This is due to the inherent imaging principles of\nframe-based cameras, where the time interval between adjacent exposures\nintroduces considerable system latency to AEB. Event cameras, as a novel\nbio-inspired sensor, offer ultra-high temporal resolution and can\nasynchronously report brightness changes at the microsecond level. To explore\nthe potential of event cameras in the above-mentioned challenging cases, we\npropose EvTTC, which is, to the best of our knowledge, the first multi-sensor\ndataset focusing on TTC tasks under high-relative-speed scenarios. EvTTC\nconsists of data collected using standard cameras and event cameras, covering\nvarious potential collision scenarios in daily driving and involving multiple\ncollision objects. Additionally, LiDAR and GNSS/INS measurements are provided\nfor the calculation of ground-truth TTC. Considering the high cost of testing\nTTC algorithms on full-scale mobile platforms, we also provide a small-scale\nTTC testbed for experimental validation and data augmentation. All the data and\nthe design of the testbed are open sourced, and they can serve as a benchmark\nthat will facilitate the development of vision-based TTC techniques.\n","authors":["Kaizhen Sun","Jinghang Li","Kuan Dai","Bangyan Liao","Wei Xiong","Yi Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.05053v1.pdf","comment":"8 pages, 7 figures, 5 tables"},{"id":"http://arxiv.org/abs/2411.00499v2","updated":"2024-12-06T14:02:59Z","published":"2024-11-01T10:25:25Z","title":"Cross-modal semantic segmentation for indoor environmental perception\n using single-chip millimeter-wave radar raw data","summary":" In the context of firefighting and rescue operations, a cross-modal semantic\nsegmentation model based on a single-chip millimeter-wave (mmWave) radar for\nindoor environmental perception is proposed and discussed. To efficiently\nobtain high-quality labels, an automatic label generation method utilizing\nLiDAR point clouds and occupancy grid maps is introduced. The proposed\nsegmentation model is based on U-Net. A spatial attention module is\nincorporated, which enhances the performance of the model. The results\ndemonstrate that cross-modal semantic segmentation provides a more intuitive\nand accurate representation of indoor environments. Unlike traditional methods,\nthe model's segmentation performance is minimally affected by azimuth. Although\nperformance declines with increasing distance, this can be mitigated by a\nwell-designed model. Additionally, it was found that using raw ADC data as\ninput is ineffective; compared to RA tensors, RD tensors are more suitable for\nthe proposed model.\n","authors":["Hairuo Hu","Haiyong Cong","Zhuyu Shao","Yubo Bi","Jinghao Liu"],"pdf_url":"https://arxiv.org/pdf/2411.00499v2.pdf","comment":"5291 words, 17 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.03517v2","updated":"2024-12-06T13:56:50Z","published":"2024-12-05T17:58:03Z","title":"NVComposer: Boosting Generative Novel View Synthesis with Multiple\n Sparse and Unposed Images","summary":" Recent advancements in generative models have significantly improved novel\nview synthesis (NVS) from multi-view data. 
However, existing methods depend on\nexternal multi-view alignment processes, such as explicit pose estimation or\npre-reconstruction, which limits their flexibility and accessibility,\nespecially when alignment is unstable due to insufficient overlap or occlusions\nbetween views. In this paper, we propose NVComposer, a novel approach that\neliminates the need for explicit external alignment. NVComposer enables the\ngenerative model to implicitly infer spatial and geometric relationships\nbetween multiple conditional views by introducing two key components: 1) an\nimage-pose dual-stream diffusion model that simultaneously generates target\nnovel views and condition camera poses, and 2) a geometry-aware feature\nalignment module that distills geometric priors from dense stereo models during\ntraining. Extensive experiments demonstrate that NVComposer achieves\nstate-of-the-art performance in generative multi-view NVS tasks, removing the\nreliance on external alignment and thus improving model accessibility. Our\napproach shows substantial improvements in synthesis quality as the number of\nunposed input views increases, highlighting its potential for more flexible and\naccessible generative NVS systems. Our project page is available at\nhttps://lg-li.github.io/project/nvcomposer\n","authors":["Lingen Li","Zhaoyang Zhang","Yaowei Li","Jiale Xu","Wenbo Hu","Xiaoyu Li","Weihao Cheng","Jinwei Gu","Tianfan Xue","Ying Shan"],"pdf_url":"https://arxiv.org/pdf/2412.03517v2.pdf","comment":"Project Page: https://lg-li.github.io/project/nvcomposer"},{"id":"http://arxiv.org/abs/2410.09566v2","updated":"2024-12-06T13:53:44Z","published":"2024-10-12T15:27:57Z","title":"Bridging Text and Image for Artist Style Transfer via Contrastive\n Learning","summary":" Image style transfer has attracted widespread attention in the past few\nyears. Despite its remarkable results, it requires additional style images\navailable as references, making it less flexible and inconvenient. Using text\nis the most natural way to describe the style. More importantly, text can\ndescribe implicit abstract styles, like styles of specific artists or art\nmovements. In this paper, we propose a Contrastive Learning for Artistic Style\nTransfer (CLAST) that leverages advanced image-text encoders to control\narbitrary style transfer. We introduce a supervised contrastive training\nstrategy to effectively extract style descriptions from the image-text model\n(i.e., CLIP), which aligns stylization with the text description. To this end,\nwe also propose a novel and efficient adaLN based state space models that\nexplore style-content fusion. Finally, we achieve a text-driven image style\ntransfer. Extensive experiments demonstrate that our approach outperforms the\nstate-of-the-art methods in artistic style transfer. More importantly, it does\nnot require online fine-tuning and can render a 512x512 image in 0.03s.\n","authors":["Zhi-Song Liu","Li-Wen Wang","Jun Xiao","Vicky Kalogeiton"],"pdf_url":"https://arxiv.org/pdf/2410.09566v2.pdf","comment":"18 pages, 8 figures. 
arXiv admin note: substantial text overlap with\n arXiv:2202.13562"},{"id":"http://arxiv.org/abs/2412.05043v1","updated":"2024-12-06T13:49:10Z","published":"2024-12-06T13:49:10Z","title":"ReF-LDM: A Latent Diffusion Model for Reference-based Face Image\n Restoration","summary":" While recent works on blind face image restoration have successfully produced\nimpressive high-quality (HQ) images with abundant details from low-quality (LQ)\ninput images, the generated content may not accurately reflect the real\nappearance of a person. To address this problem, incorporating well-shot\npersonal images as additional reference inputs could be a promising strategy.\nInspired by the recent success of the Latent Diffusion Model (LDM), we propose\nReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned\non one LQ image and multiple HQ reference images. Our model integrates an\neffective and efficient mechanism, CacheKV, to leverage the reference images\nduring the generation process. Additionally, we design a timestep-scaled\nidentity loss, enabling our LDM-based model to focus on learning the\ndiscriminating features of human faces. Lastly, we construct FFHQ-Ref, a\ndataset consisting of 20,405 high-quality (HQ) face images with corresponding\nreference images, which can serve as both training and evaluation data for\nreference-based face restoration models.\n","authors":["Chi-Wei Hsiao","Yu-Lun Liu","Cheng-Kun Yang","Sheng-Po Kuo","Kevin Jou","Chia-Ping Chen"],"pdf_url":"https://arxiv.org/pdf/2412.05043v1.pdf","comment":"NeurIPS 2024, project page\n https://chiweihsiao.github.io/refldm.github.io/"},{"id":"http://arxiv.org/abs/2412.05042v1","updated":"2024-12-06T13:48:40Z","published":"2024-12-06T13:48:40Z","title":"Improving Post-Earthquake Crack Detection using Semi-Synthetic Generated\n Images","summary":" Following an earthquake, it is vital to quickly evaluate the safety of the\nimpacted areas. Damage detection systems, powered by computer vision and deep\nlearning, can assist experts in this endeavor. However, the lack of extensive,\nlabeled datasets poses a challenge to the development of these systems. In this\nstudy, we introduce a technique for generating semi-synthetic images to be used\nas data augmentation during the training of a damage detection system. We\nspecifically aim to generate images of cracks, which are a prevalent and\nindicative form of damage. The central concept is to employ parametric\nmeta-annotations to guide the process of generating cracks on 3D models of\nreal-word structures. The governing parameters of these meta-annotations can be\nadjusted iteratively to yield images that are optimally suited for improving\ndetectors' performance. 
Comparative evaluations demonstrated that a crack\ndetection system trained with a combination of real and semi-synthetic images\noutperforms a system trained on real images alone.\n","authors":["Piercarlo Dondi","Alessio Gullotti","Michele Inchingolo","Ilaria Senaldi","Chiara Casarotti","Luca Lombardi","Marco Piastra"],"pdf_url":"https://arxiv.org/pdf/2412.05042v1.pdf","comment":"Accepted at ECCV2024 Workshop: SyntheticData4CV 2024"},{"id":"http://arxiv.org/abs/2303.15649v3","updated":"2024-12-06T13:43:47Z","published":"2023-03-28T00:16:45Z","title":"StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing","summary":" A significant research effort is focused on exploiting the amazing capacities\nof pretrained diffusion models for the editing of images. They either finetune\nthe model or invert the image in the latent space of the pretrained model.\nHowever, they suffer from two problems: (1) Unsatisfying results for selected\nregions and unexpected changes in non-selected regions. (2) They require careful\ntext prompt editing where the prompt should include all visual objects in the\ninput image. To address this, we propose two improvements: (1) Only optimizing\nthe input of the value linear network in the cross-attention layers is\nsufficiently powerful to reconstruct a real image. (2) We propose attention\nregularization to preserve the object-like attention maps after reconstruction\nand editing, enabling us to obtain accurate style editing without invoking\nsignificant structural changes. We further improve the editing technique that\nis used for the unconditional branch of classifier-free guidance as used by\nP2P. Extensive experimental prompt-editing results on a variety of images\ndemonstrate qualitatively and quantitatively that our method has superior\nediting capabilities compared to existing and concurrent works. See our\naccompanying code in Stylediffusion:\n\\url{https://github.com/sen-mao/StyleDiffusion}.\n","authors":["Senmao Li","Joost van de Weijer","Taihang Hu","Fahad Shahbaz Khan","Qibin Hou","Yaxing Wang","Jian Yang","Ming-Ming Cheng"],"pdf_url":"https://arxiv.org/pdf/2303.15649v3.pdf","comment":"Accepted by Computational Visual Media"},{"id":"http://arxiv.org/abs/2412.05035v1","updated":"2024-12-06T13:39:36Z","published":"2024-12-06T13:39:36Z","title":"SMIC: Semantic Multi-Item Compression based on CLIP dictionary","summary":" Semantic compression, a compression scheme where the distortion metric,\ntypically MSE, is replaced with semantic fidelity metrics, tends to become more\nand more popular. Most recent semantic compression schemes rely on the\nfoundation model CLIP. In this work, we extend such a scheme to image\ncollection compression, where inter-item redundancy is taken into account\nduring the coding phase. For that purpose, we first show that CLIP's latent\nspace allows for easy semantic additions and subtractions. From this property,\nwe define a dictionary-based multi-item codec that outperforms state-of-the-art\ngenerative codecs in terms of compression rate, around $10^{-5}$ BPP per image,\nwhile not sacrificing semantic fidelity. 
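The addition/subtraction property mentioned above can be illustrated with a small sketch over any CLIP-style encoders (tokenization and preprocessing are assumed to be handled by the callables passed in); this is a generic illustration of latent-space arithmetic, not the paper's codec.

import torch

def semantic_edit(encode_image, encode_text, img, remove_txt, add_txt):
    # Semantic arithmetic in a CLIP-like latent space:
    # e(img) - e(remove_txt) + e(add_txt), with all embeddings L2-normalized.
    def norm(v):
        return v / v.norm(dim=-1, keepdim=True)
    z = norm(encode_image(img)) - norm(encode_text(remove_txt)) + norm(encode_text(add_txt))
    return norm(z)

def nearest(query, candidates):
    # Retrieve the candidate embedding closest to the edited query (cosine similarity).
    sims = torch.stack([query @ c for c in candidates])
    return int(sims.argmax())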
We also show that the learned\ndictionary is of a semantic nature and works as a semantic projector for the\nsemantic content of images.\n","authors":["Tom Bachard","Thomas Maugey"],"pdf_url":"https://arxiv.org/pdf/2412.05035v1.pdf","comment":"12 pages, 14 figures, 3 tables, journal paper, preprint"},{"id":"http://arxiv.org/abs/2411.16740v3","updated":"2024-12-06T13:10:23Z","published":"2024-11-23T18:14:42Z","title":"Document Haystacks: Vision-Language Reasoning Over Piles of 1000+\n Documents","summary":" Large multimodal models (LMMs) have achieved impressive progress in\nvision-language understanding, yet they face limitations in real-world\napplications requiring complex reasoning over a large number of images.\nExisting benchmarks for multi-image question-answering are limited in scope,\neach question is paired with only up to 30 images, which does not fully capture\nthe demands of large-scale retrieval tasks encountered in the real-world\nusages. To reduce these gaps, we introduce two document haystack benchmarks,\ndubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on\nlarge-scale visual document retrieval and understanding. Additionally, we\npropose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG)\nframework that leverages a suite of multimodal vision encoders, each optimized\nfor specific strengths, and a dedicated question-document relevance module.\nV-RAG sets a new standard, with a 9% and 11% improvement in Recall@1 on the\nchallenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively,\ncompared to the previous best baseline models. Additionally, integrating V-RAG\nwith LMMs enables them to efficiently operate across thousands of images,\nyielding significant improvements on our DocHaystack and InfoHaystack\nbenchmarks. Our code and datasets are available at\nhttps://github.com/Vision-CAIR/dochaystacks\n","authors":["Jun Chen","Dannong Xu","Junjie Fei","Chun-Mei Feng","Mohamed Elhoseiny"],"pdf_url":"https://arxiv.org/pdf/2411.16740v3.pdf","comment":"the correct arxiv version"},{"id":"http://arxiv.org/abs/2411.00769v3","updated":"2024-12-06T13:09:43Z","published":"2024-11-01T17:59:17Z","title":"GameGen-X: Interactive Open-world Game Video Generation","summary":" We introduce GameGen-X, the first diffusion transformer model specifically\ndesigned for both generating and interactively controlling open-world game\nvideos. This model facilitates high-quality, open-domain generation by\nsimulating an extensive array of game engine features, such as innovative\ncharacters, dynamic environments, complex actions, and diverse events.\nAdditionally, it provides interactive controllability, predicting and altering\nfuture content based on the current clip, thus allowing for gameplay\nsimulation. To realize this vision, we first collected and built an Open-World\nVideo Game Dataset from scratch. It is the first and largest dataset for\nopen-world game video generation and control, which comprises over a million\ndiverse gameplay video clips sampling from over 150 games with informative\ncaptions from GPT-4o. GameGen-X undergoes a two-stage training process,\nconsisting of foundation model pre-training and instruction tuning. Firstly,\nthe model was pre-trained via text-to-video generation and video continuation,\nendowing it with the capability for long-sequence, high-quality open-domain\ngame video generation. 
Further, to achieve interactive controllability, we\ndesigned InstructNet to incorporate game-related multi-modal control signal\nexperts. This allows the model to adjust latent representations based on user\ninputs, unifying character interaction and scene content control for the first\ntime in video generation. During instruction tuning, only the InstructNet is\nupdated while the pre-trained foundation model is frozen, enabling the\nintegration of interactive controllability without loss of diversity and\nquality of generated video content.\n","authors":["Haoxuan Che","Xuanhua He","Quande Liu","Cheng Jin","Hao Chen"],"pdf_url":"https://arxiv.org/pdf/2411.00769v3.pdf","comment":"Homepage: https://gamegen-x.github.io/ Github:\n https://github.com/GameGen-X/GameGen-X"},{"id":"http://arxiv.org/abs/2412.05012v1","updated":"2024-12-06T13:05:50Z","published":"2024-12-06T13:05:50Z","title":"SAMCL: Empowering SAM to Continually Learn from Dynamic Domains","summary":" Segment Anything Model (SAM) struggles with segmenting objects in the open\nworld, especially across diverse and dynamic domains. Continual segmentation\n(CS) is a potential technique to solve this issue, but a significant obstacle\nis the intractable balance between previous domains (stability) and new domains\n(plasticity) during CS. Furthermore, how to utilize two kinds of features of\nSAM, images and prompts, in an efficient and effective CS manner remains a\nsignificant hurdle. In this work, we propose a novel CS method, termed SAMCL,\nto address these challenges. It is the first study to empower SAM with the CS\nability across dynamic domains. SAMCL decouples stability and plasticity during\nCS by two components: $\\textit{AugModule}$ and $\\textit{Module Selector}$.\nSpecifically, SAMCL leverages individual $\\textit{AugModule}$ to effectively\nand efficiently learn new relationships between images and prompts in each\ndomain. $\\textit{Module Selector}$ selects the appropriate module during\ntesting, based on the inherent ability of SAM to distinguish between different\ndomains. These two components enable SAMCL to realize a task-agnostic method\nwithout any interference across different domains. Experimental results\ndemonstrate that SAMCL outperforms state-of-the-art methods, achieving an\nexceptionally low average forgetting of just $0.5$%, along with at least a\n$2.5$% improvement in transferring to unseen domains. Moreover, the tunable\nparameter consumption in AugModule is about $0.236$MB, marking at least a\n$23.3$% reduction compared to other fine-tuning methods.\n","authors":["Zeqing Wang","Kangye Ji","Di Wang","Fei Cheng"],"pdf_url":"https://arxiv.org/pdf/2412.05012v1.pdf","comment":"14 pages, 11 figures"},{"id":"http://arxiv.org/abs/2312.16118v2","updated":"2024-12-06T13:03:53Z","published":"2023-12-26T16:53:21Z","title":"Quantum-Hybrid Stereo Matching With Nonlinear Regularization and Spatial\n Pyramids","summary":" Quantum visual computing is advancing rapidly. This paper presents a new\nformulation for stereo matching with nonlinear regularizers and spatial\npyramids on quantum annealers as a maximum a posteriori inference problem that\nminimizes the energy of a Markov Random Field. Our approach is hybrid (i.e.,\nquantum-classical) and is compatible with modern D-Wave quantum annealers,\ni.e., it includes a quadratic unconstrained binary optimization (QUBO)\nobjective. 
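For readers unfamiliar with the term, a QUBO objective is simply a quadratic energy over binary variables, E(x) = x^T Q x. The toy below evaluates and brute-forces one such energy; in practice the D-Wave annealer, not enumeration, searches the much larger problems produced by the stereo formulation, and the matrix here is only an example.

import itertools
import numpy as np

# Toy QUBO: minimize E(x) = x^T Q x over binary vectors x.
# Diagonal entries act as linear biases; off-diagonal entries couple variables.
Q = np.array([[-1.0,  2.0,  0.0],
              [ 0.0, -1.0,  2.0],
              [ 0.0,  0.0, -1.0]])

def qubo_energy(x, Q):
    x = np.asarray(x, dtype=float)
    return float(x @ Q @ x)

# Brute force over all 2^n assignments (only viable for tiny toy problems).
best = min(itertools.product([0, 1], repeat=Q.shape[0]), key=lambda x: qubo_energy(x, Q))
print(best, qubo_energy(best, Q))   # -> (1, 0, 1) -2.0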
Previous quantum annealing techniques for stereo matching are\nlimited to using linear regularizers, and thus, they do not exploit the\nfundamental advantages of the quantum computing paradigm in solving\ncombinatorial optimization problems. In contrast, our method utilizes the full\npotential of quantum annealing for stereo matching, as nonlinear regularizers\ncreate optimization problems which are NP-hard. On the Middlebury benchmark, we\nachieve an improved root mean squared accuracy over the previous state of the\nart in quantum stereo matching of 2% and 22.5% when using different solvers.\n","authors":["Cameron Braunstein","Eddy Ilg","Vladislav Golyanik"],"pdf_url":"https://arxiv.org/pdf/2312.16118v2.pdf","comment":"26 pages, 15 figures. To be published in the International Conference\n on 3D Vision (3DV) 2024"},{"id":"http://arxiv.org/abs/2412.05010v1","updated":"2024-12-06T13:03:22Z","published":"2024-12-06T13:03:22Z","title":"Backdooring Outlier Detection Methods: A Novel Attack Approach","summary":" There have been several efforts in backdoor attacks, but these have primarily\nfocused on the closed-set performance of classifiers (i.e., classification).\nThis has left a gap in addressing the threat to classifiers' open-set\nperformance, referred to as outlier detection in the literature. Reliable\noutlier detection is crucial for deploying classifiers in critical real-world\napplications such as autonomous driving and medical image analysis. First, we\nshow that existing backdoor attacks fall short in affecting the open-set\nperformance of classifiers, as they have been specifically designed to confuse\nintra-closed-set decision boundaries. In contrast, an effective backdoor attack\nfor outlier detection needs to confuse the decision boundary between the closed\nand open sets. Motivated by this, in this study, we propose BATOD, a novel\nBackdoor Attack targeting the Outlier Detection task. Specifically, we design\ntwo categories of triggers to shift inlier samples to outliers and vice versa.\nWe evaluate BATOD using various real-world datasets and demonstrate its\nsuperior ability to degrade the open-set performance of classifiers compared to\nprevious attacks, both before and after applying defenses.\n","authors":["ZeinabSadat Taghavi","Hossein Mirzaei"],"pdf_url":"https://arxiv.org/pdf/2412.05010v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05003v1","updated":"2024-12-06T12:58:58Z","published":"2024-12-06T12:58:58Z","title":"SLayR: Scene Layout Generation with Rectified Flow","summary":" We introduce SLayR, Scene Layout Generation with Rectified flow.\nState-of-the-art text-to-image models achieve impressive results. However, they\ngenerate images end-to-end, exposing no fine-grained control over the process.\nSLayR presents a novel transformer-based rectified flow model for layout\ngeneration over a token space that can be decoded into bounding boxes and\ncorresponding labels, which can then be transformed into images using existing\nmodels. We show that established metrics for generated images are inconclusive\nfor evaluating their underlying scene layout, and introduce a new benchmark\nsuite, including a carefully designed repeatable human-evaluation procedure\nthat assesses the plausibility and variety of generated layouts. In contrast to\nprevious works, which perform well in either high variety or plausibility, we\nshow that our approach performs well on both of these axes at the same time. 
It\nis also at least 5x times smaller in the number of parameters and 37% faster\nthan the baselines. Our complete text-to-image pipeline demonstrates the added\nbenefits of an interpretable and editable intermediate representation.\n","authors":["Cameron Braunstein","Hevra Petekkaya","Jan Eric Lenssen","Mariya Toneva","Eddy Ilg"],"pdf_url":"https://arxiv.org/pdf/2412.05003v1.pdf","comment":"34 pages, 29 figures, 5 tables"},{"id":"http://arxiv.org/abs/2406.01299v2","updated":"2024-12-06T12:53:57Z","published":"2024-06-03T13:07:29Z","title":"Enhancing Dynamic CT Image Reconstruction with Neural Fields and Optical\n Flow","summary":" In this paper, we investigate image reconstruction for dynamic Computed\nTomography. The motion of the target with respect to the measurement\nacquisition rate leads to highly resolved in time but highly undersampled in\nspace measurements. Such problems pose a major challenge: not accounting for\nthe dynamics of the process leads to a poor reconstruction with non-realistic\nmotion. Variational approaches that penalize time evolution have been proposed\nto relate subsequent frames and improve image quality based on classical\ngrid-based discretizations. Neural fields have emerged as a novel way to\nparameterize the quantity of interest using a neural network with a\nlow-dimensional input, benefiting from being lightweight, continuous, and\nbiased towards smooth representations. The latter property has been exploited\nwhen solving dynamic inverse problems with neural fields by minimizing a\ndata-fidelity term only. We investigate and show the benefits of introducing\nexplicit motion regularizers for dynamic inverse problems based on partial\ndifferential equations, namely, the optical flow equation, for the optimization\nof neural fields. We compare it against its unregularized counterpart and show\nthe improvements in the reconstruction. We also compare neural fields against a\ngrid-based solver and show that the former outperforms the latter in terms of\nPSNR in this task.\n","authors":["Pablo Arratia","Matthias Ehrhardt","Lisa Kreusser"],"pdf_url":"https://arxiv.org/pdf/2406.01299v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09454v2","updated":"2024-12-06T12:37:08Z","published":"2024-08-18T12:28:26Z","title":"Retina-Inspired Object Motion Segmentation for Event-Cameras","summary":" Event-cameras have emerged as a revolutionary technology with a high temporal\nresolution that far surpasses standard active pixel cameras. This technology\ndraws biological inspiration from photoreceptors and the initial retinal\nsynapse. This research showcases the potential of additional retinal\nfunctionalities to extract visual features. We provide a domain-agnostic and\nefficient algorithm for ego-motion compensation based on Object Motion\nSensitivity (OMS), one of the multiple features computed within the mammalian\nretina. We develop a method based on experimental neuroscience that translates\nOMS' biological circuitry to a low-overhead algorithm to suppress camera motion\nbypassing the need for deep networks and learning. Our system processes event\ndata from dynamic scenes to perform pixel-wise object motion segmentation using\na real and synthetic dataset. This paper introduces a bio-inspired computer\nvision method that dramatically reduces the number of parameters by\n$\\text{10}^\\text{3}$ to $\\text{10}^\\text{6}$ orders of magnitude compared to\nprevious approaches. 
Our work paves the way for robust, high-speed, and\nlow-bandwidth decision-making for in-sensor computations.\n","authors":["Victoria Clerico","Shay Snyder","Arya Lohia","Md Abdullah-Al Kaiser","Gregory Schwartz","Akhilesh Jaiswal","Maryam Parsa"],"pdf_url":"https://arxiv.org/pdf/2408.09454v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.06126v2","updated":"2024-12-06T12:28:01Z","published":"2024-10-08T15:28:33Z","title":"$\\textit{X}^2$-DFD: A framework for e${X}$plainable and e${X}$tendable\n Deepfake Detection","summary":" Detecting deepfakes has become an important task. Most existing detection\nmethods provide only real/fake predictions without offering\nhuman-comprehensible explanations. Recent studies leveraging MLLMs for deepfake\ndetection have shown improvements in explainability. However, the performance\nof pre-trained MLLMs (e.g., LLaVA) remains limited due to a lack of\nunderstanding of their capabilities for this task and strategies to enhance\nthem. In this work, we empirically assess the strengths and weaknesses of MLLMs\nspecifically in deepfake detection via forgery features analysis. Building on\nthese assessments, we propose a novel framework called ${X}^2$-DFD, consisting\nof three core modules. The first module, Model Feature Assessment (MFA),\nmeasures the detection capabilities of forgery features intrinsic to MLLMs, and\ngives a descending ranking of these features. The second module, Strong Feature\nStrengthening (SFS), enhances the detection and explanation capabilities by\nfine-tuning the MLLM on a dataset constructed based on the top-ranked features.\nThe third module, Weak Feature Supplementing (WFS), improves the fine-tuned\nMLLM's capabilities on lower-ranked features by integrating external dedicated\ndeepfake detectors. To verify the effectiveness of this framework, we further\npresent a practical implementation, where an automated forgery features\ngeneration, evaluation, and ranking procedure is designed for MFA module; an\nautomated generation procedure of the fine-tuning dataset containing real and\nfake images with explanations based on top-ranked features is developed for SFS\nmodel; an external conventional deepfake detector focusing on blending\nartifact, which corresponds to a low detection capability in the pre-trained\nMLLM, is integrated for WFS module. Experiments show that our approach enhances\nboth detection and explanation performance.\n","authors":["Yize Chen","Zhiyuan Yan","Siwei Lyu","Baoyuan Wu"],"pdf_url":"https://arxiv.org/pdf/2410.06126v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04990v1","updated":"2024-12-06T12:27:07Z","published":"2024-12-06T12:27:07Z","title":"ETLNet: An Efficient TCN-BiLSTM Network for Road Anomaly Detection Using\n Smartphone Sensors","summary":" Road anomalies can be defined as irregularities on the road surface or in the\nsurface itself. Some may be intentional (such as speedbumps), accidental (such\nas materials falling off a truck), or the result of roads' excessive use or low\nor no maintenance, such as potholes. Despite their varying origins, these\nirregularities often harm vehicles substantially. Speed bumps are intentionally\nplaced for safety but are dangerous due to their non-standard shape, size, and\nlack of proper markings. Potholes are unintentional and can also cause severe\ndamage. To address the detection of these anomalies, we need an automated road\nmonitoring system. Today, various systems exist that use visual information to\ntrack these anomalies. 
Still, due to poor lighting conditions and improper or\nmissing markings, they may go undetected and have severe consequences for\npublic transport, automated vehicles, etc. In this paper, the Enhanced\nTemporal-BiLSTM Network (ETLNet) is introduced as a novel approach that\nintegrates two Temporal Convolutional Network (TCN) layers with a Bidirectional\nLong Short-Term Memory (BiLSTM) layer. This combination is tailored to detect\nanomalies effectively irrespective of lighting conditions, as it depends not on\nvisuals but smartphone inertial sensor data. Our methodology employs\naccelerometer and gyroscope sensors, typically in smartphones, to gather data\non road conditions. Empirical evaluations demonstrate that the ETLNet model\nmaintains an F1-score for detecting speed bumps of 99.3%. The ETLNet model's\nrobustness and efficiency significantly advance automated road surface\nmonitoring technologies.\n","authors":["Mohd Faiz Ansari","Rakshit Sandilya","Mohammed Javed","David Doermann"],"pdf_url":"https://arxiv.org/pdf/2412.04990v1.pdf","comment":"Presented in ICPR 2024, Kolkata, December 1-5, 2024 (First Workshop\n on Intelligent Mobility in Unstructured Environments)"},{"id":"http://arxiv.org/abs/2412.04986v1","updated":"2024-12-06T12:15:11Z","published":"2024-12-06T12:15:11Z","title":"Power Plant Detection for Energy Estimation using GIS with Remote\n Sensing, CNN & Vision Transformers","summary":" In this research, we propose a hybrid model for power plant detection to\nassist energy estimation applications, by pipelining GIS (Geographical\nInformation Systems) having Remote Sensing capabilities with CNN (Convolutional\nNeural Networks) and ViT (Vision Transformers). Our proposed approach enables\nreal-time analysis with multiple data types on a common map via the GIS,\nentails feature-extraction abilities due to the CNN, and captures long-range\ndependencies through the ViT. This hybrid approach is found to enhance\nclassification, thus helping in the monitoring and operational management of\npower plants; hence assisting energy estimation and sustainable energy planning\nin the future. It exemplifies adequate deployment of machine learning methods\nin conjunction with domain-specific approaches to enhance performance.\n","authors":["Blessing Austin-Gabriel","Cristian Noriega Monsalve","Aparna S. Varde"],"pdf_url":"https://arxiv.org/pdf/2412.04986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.19248v2","updated":"2024-12-06T12:02:37Z","published":"2024-04-30T04:12:36Z","title":"Transition Rate Scheduling for Quantization-Aware Training","summary":" Quantization-aware training (QAT) simulates a quantization process during\ntraining to lower bit-precision of weights/activations. It learns quantized\nweights indirectly by updating latent weights, i.e., full-precision inputs to a\nquantizer, using gradient-based optimizers. We claim that coupling a\nuser-defined learning rate (LR) with these optimizers is sub-optimal for QAT.\nQuantized weights transit discrete levels of a quantizer, only if corresponding\nlatent weights pass transition points, where the quantizer changes discrete\nstates. This suggests that the changes of quantized weights are affected by\nboth the LR for latent weights and their distributions. It is thus difficult to\ncontrol the degree of changes for quantized weights by scheduling the LR\nmanually. We conjecture that the degree of parameter changes in QAT is related\nto the number of quantized weights transiting discrete levels. 
Based on this,\nwe introduce a transition rate (TR) scheduling technique that controls the\nnumber of transitions of quantized weights explicitly. Instead of scheduling a\nLR for latent weights, we schedule a target TR of quantized weights, and update\nthe latent weights with a novel transition-adaptive LR (TALR), enabling\nconsidering the degree of changes for the quantized weights during QAT.\nExperimental results demonstrate the effectiveness of our approach on standard\nbenchmarks.\n","authors":["Junghyup Lee","Jeimin Jeon","Dohyung Kim","Bumsub Ham"],"pdf_url":"https://arxiv.org/pdf/2404.19248v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.20224v3","updated":"2024-12-06T11:34:57Z","published":"2024-05-29T04:59:27Z","title":"EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry\n Images","summary":" 3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D\nscene reconstruction and novel view synthesis. However, its training heavily\ndepends on high-quality, sharp images and accurate camera poses. Fulfilling\nthese requirements can be challenging in non-ideal real-world scenarios, where\nmotion-blurred images are commonly encountered in high-speed moving cameras or\nlow-light environments that require long exposure times. To address these\nchallenges, we introduce Event Stream Assisted Gaussian Splatting\n(EvaGaussians), a novel approach that integrates event streams captured by an\nevent camera to assist in reconstructing high-quality 3D-GS from blurry images.\nCapitalizing on the high temporal resolution and dynamic range offered by the\nevent camera, we leverage the event streams to explicitly model the formation\nprocess of motion-blurred images and guide the deblurring reconstruction of\n3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion\ntrajectories during the exposure time, our method can robustly facilitate the\nacquisition of high-fidelity novel views with intricate texture details. We\ncomprehensively evaluated our method and compared it with previous\nstate-of-the-art deblurring rendering methods. Both qualitative and\nquantitative comparisons demonstrate that our method surpasses existing\ntechniques in restoring fine details from blurry images and producing\nhigh-fidelity novel views.\n","authors":["Wangbo Yu","Chaoran Feng","Jiye Tang","Jiashu Yang","Zhenyu Tang","Xu Jia","Yuchao Yang","Li Yuan","Yonghong Tian"],"pdf_url":"https://arxiv.org/pdf/2405.20224v3.pdf","comment":"Project Page: https://www.falcary.com/EvaGaussians/"},{"id":"http://arxiv.org/abs/2412.02030v2","updated":"2024-12-06T11:22:17Z","published":"2024-12-02T23:20:35Z","title":"NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic\n Adversarial Training","summary":" We introduce NitroFusion, a fundamentally different approach to single-step\ndiffusion that achieves high-quality generation through a dynamic adversarial\nframework. While one-step methods offer dramatic speed advantages, they\ntypically suffer from quality degradation compared to their multi-step\ncounterparts. Just as a panel of art critics provides comprehensive feedback by\nspecializing in different aspects like composition, color, and technique, our\napproach maintains a large pool of specialized discriminator heads that\ncollectively guide the generation process. Each discriminator group develops\nexpertise in specific quality aspects at different noise levels, providing\ndiverse feedback that enables high-fidelity one-step generation. 
Our framework\ncombines: (i) a dynamic discriminator pool with specialized discriminator\ngroups to improve generation quality, (ii) strategic refresh mechanisms to\nprevent discriminator overfitting, and (iii) global-local discriminator heads\nfor multi-scale quality assessment, and unconditional/conditional training for\nbalanced generation. Additionally, our framework uniquely supports flexible\ndeployment through bottom-up refinement, allowing users to dynamically choose\nbetween 1-4 denoising steps with the same model for direct quality-speed\ntrade-offs. Through comprehensive experiments, we demonstrate that NitroFusion\nsignificantly outperforms existing single-step methods across multiple\nevaluation metrics, particularly excelling in preserving fine details and\nglobal consistency.\n","authors":["Dar-Yen Chen","Hmrishav Bandyopadhyay","Kai Zou","Yi-Zhe Song"],"pdf_url":"https://arxiv.org/pdf/2412.02030v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04955v1","updated":"2024-12-06T11:17:25Z","published":"2024-12-06T11:17:25Z","title":"MixedGaussianAvatar: Realistically and Geometrically Accurate Head\n Avatar via Mixed 2D-3D Gaussian Splatting","summary":" Reconstructing high-fidelity 3D head avatars is crucial in various\napplications such as virtual reality. The pioneering methods reconstruct\nrealistic head avatars with Neural Radiance Fields (NeRF), which have been\nlimited by training and rendering speed. Recent methods based on 3D Gaussian\nSplatting (3DGS) significantly improve the efficiency of training and\nrendering. However, the surface inconsistency of 3DGS results in subpar\ngeometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy\nat the expense of rendering fidelity. To leverage the benefits of both 2DGS and\n3DGS, we propose a novel method named MixedGaussianAvatar for realistically and\ngeometrically accurate head avatar reconstruction. Our main idea is to utilize\n2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric\naccuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model\nand connect additional 3D Gaussians to those 2D Gaussians where the rendering\nquality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation.\nThese 2D-3D Gaussians can then be animated using FLAME parameters. We further\nintroduce a progressive training strategy that first trains the 2D Gaussians\nand then fine-tunes the mixed 2D-3D Gaussians. We demonstrate the superiority\nof MixedGaussianAvatar through comprehensive experiments. The code will be\nreleased at: https://github.com/ChenVoid/MGA/.\n","authors":["Peng Chen","Xiaobao Wei","Qingpo Wuwu","Xinyi Wang","Xingyu Xiao","Ming Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04955v1.pdf","comment":"Project: https://chenvoid.github.io/MGA/"},{"id":"http://arxiv.org/abs/2412.04954v1","updated":"2024-12-06T11:14:03Z","published":"2024-12-06T11:14:03Z","title":"Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for\n Radiology Report Generation","summary":" We introduce a radiology-focused visual language model designed to generate\nradiology reports from chest X-rays. Building on previous findings that large\nlanguage models (LLMs) can acquire multimodal capabilities when aligned with\npretrained vision encoders, we demonstrate similar potential with chest X-ray\nimages. This integration enhances the ability of model to understand and\ndescribe chest X-ray images. 
Our model combines an image encoder with a\nfine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate\ndifferent sections of a radiology report with notable accuracy. The training\nprocess involves a two-stage approach: (i) initial alignment of chest X-ray\nfeatures with the LLM (ii) followed by fine-tuning for radiology report\ngeneration.\n","authors":["Xi Zhang","Zaiqiao Meng","Jake Lever","Edmond S. L. Ho"],"pdf_url":"https://arxiv.org/pdf/2412.04954v1.pdf","comment":"Accepted by BioNLP@ACL 2024"},{"id":"http://arxiv.org/abs/2412.04945v1","updated":"2024-12-06T11:05:30Z","published":"2024-12-06T11:05:30Z","title":"HOLa: HoloLens Object Labeling","summary":" In the context of medical Augmented Reality (AR) applications, object\ntracking is a key challenge and requires a significant amount of annotation\nmasks. As segmentation foundation models like the Segment Anything Model (SAM)\nbegin to emerge, zero-shot segmentation requires only minimal human\nparticipation obtaining high-quality object masks. We introduce a\nHoloLens-Object-Labeling (HOLa) Unity and Python application based on the\nSAM-Track algorithm that offers fully automatic single object annotation for\nHoloLens 2 while requiring minimal human participation. HOLa does not have to\nbe adjusted to a specific image appearance and could thus alleviate AR research\nin any application field. We evaluate HOLa for different degrees of image\ncomplexity in open liver surgery and in medical phantom experiments. Using HOLa\nfor image annotation can increase the labeling speed by more than 500 times\nwhile providing Dice scores between 0.875 and 0.982, which are comparable to\nhuman annotators. Our code is publicly available at:\nhttps://github.com/mschwimmbeck/HOLa\n","authors":["Michael Schwimmbeck","Serouj Khajarian","Konstantin Holzapfel","Johannes Schmidt","Stefanie Remmele"],"pdf_url":"https://arxiv.org/pdf/2412.04945v1.pdf","comment":"accepted by BMT 2024"},{"id":"http://arxiv.org/abs/2411.19050v2","updated":"2024-12-06T10:58:53Z","published":"2024-11-28T10:55:09Z","title":"I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt\n Generation for Text-Guided Multi-Mask Inpainting","summary":" Inpainting focuses on filling missing or corrupted regions of an image to\nblend seamlessly with its surrounding content and style. While conditional\ndiffusion models have proven effective for text-guided inpainting, we introduce\nthe novel task of multi-mask inpainting, where multiple regions are\nsimultaneously inpainted using distinct prompts. Furthermore, we design a\nfine-tuning procedure for multimodal LLMs, such as LLaVA, to generate\nmulti-mask prompts automatically using corrupted images as inputs. These models\ncan generate helpful and detailed prompt suggestions for filling the masked\nregions. The generated prompts are then fed to Stable Diffusion, which is\nfine-tuned for the multi-mask inpainting problem using rectified\ncross-attention, enforcing prompts onto their designated regions for filling.\nExperiments on digitized paintings from WikiArt and the Densely Captioned\nImages dataset demonstrate that our pipeline delivers creative and accurate\ninpainting results. 
Our code, data, and trained models are available at\nhttps://cilabuniba.github.io/i-dream-my-painting.\n","authors":["Nicola Fanelli","Gennaro Vessio","Giovanna Castellano"],"pdf_url":"https://arxiv.org/pdf/2411.19050v2.pdf","comment":"Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2412.04939v1","updated":"2024-12-06T10:53:47Z","published":"2024-12-06T10:53:47Z","title":"Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in\n Multimodal Large Language Models","summary":" Multimodal Large Language Models (MLLMs) have garnered significant attention\nrecently and demonstrate outstanding capabilities in various tasks such as OCR,\nVQA, captioning, $\\textit{etc}$. However, hallucination remains a persistent\nissue. While numerous methods have been proposed to mitigate hallucinations,\nachieving notable improvements, these methods primarily focus on mitigating\nhallucinations about $\\textbf{object/noun-related}$ concepts. Verb concepts,\ncrucial for understanding human actions, have been largely overlooked. In this\npaper, to the best of our knowledge, we are the $\\textbf{first}$ to investigate\nthe $\\textbf{verb hallucination}$ phenomenon of MLLMs from various\nperspectives. Our findings reveal that most state-of-the-art MLLMs suffer from\nsevere verb hallucination. To assess the effectiveness of existing mitigation\nmethods for object concept hallucination on verb hallucination, we evaluated\nthese methods and found that they do not effectively address verb\nhallucination. To address this issue, we propose a novel rich verb\nknowledge-based tuning method to mitigate verb hallucination. The experiment\nresults demonstrate that our method significantly reduces hallucinations\nrelated to verbs. $\\textit{Our code and data will be made publicly available}$.\n","authors":["Zehao Wang","Xinpeng Liu","Xiaoqian Wu","Yudonglin Zhang","Zhou Fang","Yifan Fang","Junfu Pu","Cewu Lu","Yong-Lu Li"],"pdf_url":"https://arxiv.org/pdf/2412.04939v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17531v2","updated":"2024-12-06T10:45:24Z","published":"2024-05-27T17:40:00Z","title":"Evolutive Rendering Models","summary":" The landscape of computer graphics has undergone significant transformations\nwith the recent advances of differentiable rendering models. These rendering\nmodels often rely on heuristic designs that may not fully align with the final\nrendering objectives. We address this gap by pioneering \\textit{evolutive\nrendering models}, a methodology where rendering models possess the ability to\nevolve and adapt dynamically throughout the rendering process. In particular,\nwe present a comprehensive learning framework that enables the optimization of\nthree principal rendering elements, including the gauge transformations, the\nray sampling mechanisms, and the primitive organization. Central to this\nframework is the development of differentiable versions of these rendering\nelements, allowing for effective gradient backpropagation from the final\nrendering objectives. A detailed analysis of gradient characteristics is\nperformed to facilitate a stable and goal-oriented elements evolution. 
Our\nextensive experiments demonstrate the large potential of evolutive rendering\nmodels for enhancing the rendering performance across various domains,\nincluding static and dynamic scene representations, generative modeling, and\ntexture mapping.\n","authors":["Fangneng Zhan","Hanxue Liang","Yifan Wang","Michael Niemeyer","Michael Oechsle","Adam Kortylewski","Cengiz Oztireli","Gordon Wetzstein","Christian Theobalt"],"pdf_url":"https://arxiv.org/pdf/2405.17531v2.pdf","comment":"Project page: https://fnzhan.com/Evolutive-Rendering-Models/"},{"id":"http://arxiv.org/abs/2408.09869v4","updated":"2024-12-06T10:44:56Z","published":"2024-08-19T10:20:06Z","title":"Docling Technical Report","summary":" We introduce Docling, an easy-to-use, self-contained, MIT-licensed,\nopen-source toolkit for document conversion, that can parse several types of\npopular document formats into a unified, richly structured representation. It\nis powered by state-of-the-art specialized AI models for layout analysis\n(DocLayNet) and table structure recognition (TableFormer), and runs efficiently\non commodity hardware in a small resource budget. Docling is released as a\nPython package and can be used as a Python API or as a CLI tool. Docling's\nmodular architecture and efficient document representation %, known as\nDoclingDocument, make it easy to implement extensions, new features, models,\nand customizations. Docling has been already integrated in other popular\nopen-source frameworks (e.g., LlamaIndex, LangChain, spaCy), making it a\nnatural fit for the processing of documents and the development of high-end\napplications. The open-source community has fully engaged in using, promoting,\nand developing for Docling, which gathered 10k stars on GitHub in less than a\nmonth and was reported as the No. 1 trending repository in GitHub worldwide in\nNovember 2024.\n","authors":["Nikolaos Livathinos","Christoph Auer","Maksym Lysak","Ahmed Nassar","Michele Dolfi","Panos Vagenas","Cesar Berrospi Ramis","Matteo Omenetti","Kasper Dinkla","Yusik Kim","Shubham Gupta","Rafael Teixeira de Lima","Valery Weber","Lucas Morin","Ingmar Meijer","Viktor Kuropiatnyk","Peter W. J. Staar"],"pdf_url":"https://arxiv.org/pdf/2408.09869v4.pdf","comment":"Submitted to AAAI 25: Workshop on Open-Source AI for Mainstream Use"},{"id":"http://arxiv.org/abs/2412.04935v1","updated":"2024-12-06T10:44:11Z","published":"2024-12-06T10:44:11Z","title":"Uncertainty-aware retinal layer segmentation in OCT through\n probabilistic signed distance functions","summary":" In this paper, we present a new approach for uncertainty-aware retinal layer\nsegmentation in Optical Coherence Tomography (OCT) scans using probabilistic\nsigned distance functions (SDF). Traditional pixel-wise and regression-based\nmethods primarily encounter difficulties in precise segmentation and lack of\ngeometrical grounding respectively. To address these shortcomings, our\nmethodology refines the segmentation by predicting a signed distance function\n(SDF) that effectively parameterizes the retinal layer shape via level set. We\nfurther enhance the framework by integrating probabilistic modeling, applying\nGaussian distributions to encapsulate the uncertainty in the shape\nparameterization. This ensures a robust representation of the retinal layer\nmorphology even in the presence of ambiguous input, imaging noise, and\nunreliable segmentations. Both quantitative and qualitative evaluations\ndemonstrate superior performance when compared to other methods. 
Additionally,\nwe conducted experiments on artificially distorted datasets with various noise\ntypes (shadowing, blinking, speckle, and motion) common in OCT scans to showcase\nthe effectiveness of our uncertainty estimation. Our findings demonstrate the\npossibility of obtaining reliable segmentation of retinal layers, as well as an\ninitial step towards the characterization of layer integrity, a key biomarker\nfor disease progression. Our code is available at\n\\url{https://github.com/niazoys/RLS_PSDF}.\n","authors":["Mohammad Mohaiminul Islam","Coen de Vente","Bart Liefers","Caroline Klaver","Erik J Bekkers","Clara I. Sánchez"],"pdf_url":"https://arxiv.org/pdf/2412.04935v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04931v1","updated":"2024-12-06T10:39:11Z","published":"2024-12-06T10:39:11Z","title":"DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object\n Detection","summary":" Object detection in poor-illumination environments is a challenging task as\nobjects are usually not clearly visible in RGB images. As infrared images\nprovide additional clear edge information that complements RGB images, fusing\nRGB and infrared images has the potential to enhance the detection ability in\npoor-illumination environments. However, existing works involving both visible\nand infrared images only focus on image fusion, instead of object detection.\nMoreover, they directly fuse the two kinds of image modalities, which ignores\nthe mutual interference between them. To fuse the two modalities to maximize\nthe advantages of cross-modality, we design a dual-enhancement-based\ncross-modality object detection network DEYOLO, in which semantic-spatial cross\nmodality and novel bi-directional decoupled focus modules are designed to\nachieve the detection-centered mutual enhancement of RGB-infrared (RGB-IR).\nSpecifically, a dual semantic enhancing channel weight assignment module (DECA)\nand a dual spatial enhancing pixel weight assignment module (DEPA) are first\nproposed to aggregate cross-modality information in the feature space to\nimprove the feature representation ability, such that feature fusion can aim at\nthe object detection task. Meanwhile, a dual-enhancement mechanism, including\nenhancements for two-modality fusion and single modality, is designed in both\nDECA and DEPA to reduce interference between the two kinds of image modalities.\nThen, a novel bi-directional decoupled focus is developed to enlarge the\nreceptive field of the backbone network in different directions, which improves\nthe representation quality of DEYOLO. Extensive experiments on M3FD and LLVIP\nshow that our approach outperforms SOTA object detection algorithms by a clear\nmargin. Our code is available at https://github.com/chips96/DEYOLO.\n","authors":["Yishuo Chen","Boran Wang","Xinyu Guo","Wenbin Zhu","Jiasheng He","Xiaobin Liu","Jing Yuan"],"pdf_url":"https://arxiv.org/pdf/2412.04931v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02865v3","updated":"2024-12-06T10:38:02Z","published":"2024-12-03T22:00:12Z","title":"Memory-efficient Continual Learning with Neural Collapse Contrastive","summary":" Contrastive learning has significantly improved representation quality,\nenhancing knowledge transfer across tasks in continual learning (CL). However,\ncatastrophic forgetting remains a key challenge, as contrastive-based methods\nprimarily focus on \"soft relationships\" or \"softness\" between samples, which\nshift with changing data distributions and lead to representation overlap\nacross tasks. 
Recently, the newly identified Neural Collapse phenomenon has\nshown promise in CL by focusing on \"hard relationships\" or \"hardness\" between\nsamples and fixed prototypes. However, this approach overlooks \"softness\",\ncrucial for capturing intra-class variability, and this rigid focus can also\npull old class representations toward current ones, increasing forgetting.\nBuilding on these insights, we propose Focal Neural Collapse Contrastive\n(FNC^2), a novel representation learning loss that effectively balances both\nsoft and hard relationships. Additionally, we introduce the Hardness-Softness\nDistillation (HSD) loss to progressively preserve the knowledge gained from\nthese relationships across tasks. Our method outperforms state-of-the-art\napproaches, particularly in minimizing memory reliance. Remarkably, even\nwithout the use of memory, our approach rivals rehearsal-based methods,\noffering a compelling solution for data privacy concerns.\n","authors":["Trung-Anh Dang","Vincent Nguyen","Ngoc-Son Vu","Christel Vrain"],"pdf_url":"https://arxiv.org/pdf/2412.02865v3.pdf","comment":"Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2412.04930v1","updated":"2024-12-06T10:35:45Z","published":"2024-12-06T10:35:45Z","title":"Video Decomposition Prior: A Methodology to Decompose Videos into Layers","summary":" In the evolving landscape of video enhancement and editing methodologies, a\nmajority of deep learning techniques often rely on extensive datasets of\nobserved input and ground truth sequence pairs for optimal performance. Such\nreliance often falters when acquiring data becomes challenging, especially in\ntasks like video dehazing and relighting, where replicating identical motions\nand camera angles in both corrupted and ground truth sequences is complicated.\nMoreover, these conventional methodologies perform best when the test\ndistribution closely mirrors the training distribution. Recognizing these\nchallenges, this paper introduces a novel video decomposition prior\n`\\texttt{VDP}' framework which derives inspiration from professional video\nediting practices. Our methodology does not mandate task-specific external data\ncorpus collection; instead, it pivots to utilizing the motion and appearance of the\ninput video. The \\texttt{VDP} framework decomposes a video sequence into a set of\nmultiple RGB layers and associated opacity levels. These layers are then\nmanipulated individually to obtain the desired results. We address tasks such\nas video object segmentation, dehazing, and relighting. Moreover, we introduce\na novel logarithmic video decomposition formulation for video relighting tasks,\nsetting a new benchmark over the existing methodologies. We observe the\nproperty of relighting emerge as we optimize for our novel relighting\ndecomposition formulation. We evaluate our approach on standard video datasets\nlike DAVIS, REVIDE, \\& SDSD and show qualitative results on a diverse array of\ninternet videos. Project Page -\nhttps://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video\nresults.\n","authors":["Gaurav Shrivastava","Ser-Nam Lim","Abhinav Shrivastava"],"pdf_url":"https://arxiv.org/pdf/2412.04930v1.pdf","comment":"Project Page -\n https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video\n results. 
Extended version of ICLR publication"},{"id":"http://arxiv.org/abs/2412.04929v1","updated":"2024-12-06T10:34:50Z","published":"2024-12-06T10:34:50Z","title":"Continuous Video Process: Modeling Videos as Continuous\n Multi-Dimensional Processes for Video Prediction","summary":" Diffusion models have made significant strides in image generation, mastering\ntasks such as unconditional image synthesis, text-image translation, and\nimage-to-image conversions. However, their capability falls short in the realm\nof video prediction, mainly because they treat videos as a collection of\nindependent images, relying on external constraints such as temporal attention\nmechanisms to enforce temporal coherence. In our paper, we introduce a novel\nmodel class, that treats video as a continuous multi-dimensional process rather\nthan a series of discrete frames. We also report a reduction of 75\\% sampling\nsteps required to sample a new frame thus making our framework more efficient\nduring the inference time. Through extensive experimentation, we establish\nstate-of-the-art performance in video prediction, validated on benchmark\ndatasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project\npage https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results.}\n","authors":["Gaurav Shrivastava","Abhinav Shrivastava"],"pdf_url":"https://arxiv.org/pdf/2412.04929v1.pdf","comment":"Navigate to the project page\n https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results.\n Extended version of published CVPR paper"},{"id":"http://arxiv.org/abs/2402.12185v3","updated":"2024-12-06T10:34:47Z","published":"2024-02-19T14:48:23Z","title":"ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for\n Complicated Chart Reasoning","summary":" Recently, many versatile Multi-modal Large Language Models (MLLMs) have\nemerged continuously. However, their capacity to query information depicted in\nvisual charts and engage in reasoning based on the queried contents remains\nunder-explored. In this paper, to comprehensively and rigorously benchmark the\nability of the off-the-shelf MLLMs in the chart domain, we construct ChartX, a\nmulti-modal evaluation set covering 18 chart types, 7 chart tasks, 22\ndisciplinary topics, and high-quality chart data. Besides, we develop ChartVLM\nto offer a new perspective on handling multi-modal tasks that strongly depend\non interpretable patterns, such as reasoning tasks in the field of charts or\ngeometric images. We evaluate the chart-related ability of mainstream MLLMs and\nour ChartVLM on the proposed ChartX evaluation set. Extensive experiments\ndemonstrate that ChartVLM surpasses both versatile and chart-related large\nmodels, achieving results comparable to GPT-4V. We believe that our study can\npave the way for further exploration in creating a more comprehensive chart\nevaluation set and developing more interpretable multi-modal models. 
Both\nChartX and ChartVLM are available at:\nhttps://github.com/UniModal4Reasoning/ChartVLM\n","authors":["Renqiu Xia","Bo Zhang","Hancheng Ye","Xiangchao Yan","Qi Liu","Hongbin Zhou","Zijun Chen","Min Dou","Botian Shi","Junchi Yan","Yu Qiao"],"pdf_url":"https://arxiv.org/pdf/2402.12185v3.pdf","comment":"Code and dataset are available for downloading at:\n https://github.com/UniModal4Reasoning/ChartVLM 25 pages, 15 figures"},{"id":"http://arxiv.org/abs/2407.09392v3","updated":"2024-12-06T10:32:41Z","published":"2024-07-12T16:16:24Z","title":"Open-Canopy: A Country-Scale Benchmark for Canopy Height Estimation at\n Very High Resolution","summary":" Estimating canopy height and its changes at meter resolution from satellite\nimagery is a significant challenge in computer vision with critical\nenvironmental applications. However, the lack of open-access datasets at this\nresolution hinders the reproducibility and evaluation of models. We introduce\nOpen-Canopy, the first open-access, country-scale benchmark for very\nhigh-resolution (1.5 m) canopy height estimation, covering over 87,000 km$^2$\nacross France with 1.5 m resolution satellite imagery and aerial LiDAR data.\nAdditionally, we present Open-Canopy-$\\Delta$, a benchmark for canopy height\nchange detection between images from different years at tree level-a\nchallenging task for current computer vision models. We evaluate\nstate-of-the-art architectures on these benchmarks, highlighting significant\nchallenges and opportunities for improvement. Our datasets and code are\npublicly available at https://github.com/fajwel/Open-Canopy.\n","authors":["Fajwel Fogel","Yohann Perron","Nikola Besic","Laurent Saint-André","Agnès Pellissier-Tanon","Martin Schwartz","Thomas Boudras","Ibrahim Fayad","Alexandre d'Aspremont","Loic Landrieu","Philippe Ciais"],"pdf_url":"https://arxiv.org/pdf/2407.09392v3.pdf","comment":"25 pages, 6+6 figures, Submitted to CVPR25"},{"id":"http://arxiv.org/abs/2412.04925v1","updated":"2024-12-06T10:26:51Z","published":"2024-12-06T10:26:51Z","title":"$S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization\n of Vision-Language Models","summary":" Recently, many studies have been conducted to enhance the zero-shot\ngeneralization ability of vision-language models (e.g., CLIP) by addressing the\nsemantic misalignment between image and text embeddings in downstream tasks.\nAlthough many efforts have been made, existing methods barely consider the fact\nthat a class of images can be described by notably different textual concepts\ndue to well-known lexical variation in natural language processing, which\nheavily affects the zero-shot generalization of CLIP. Therefore, this paper\nproposes a \\textbf{S}ynonymous \\textbf{S}emantic \\textbf{S}pace ($S^3$) for\neach image class, rather than relying on a single textual concept, achieving\nmore stable semantic alignment and improving the zero-shot generalization of\nCLIP. Specifically, our $S^3$ method first generates several synonymous\nconcepts based on the label of each class by using large language models, and\nconstructs a continuous yet compact synonymous semantic space based on the\nVietoris-Rips complex of the generated synonymous concepts. Furthermore, we\nexplore the effect of several point-to-space metrics on our $S^3$, while\npresenting a point-to-local-center metric to compute similarity between image\nembeddings and the synonymous semantic space of each class, accomplishing\neffective zero-shot predictions. 
Extensive experiments are conducted across 17\nbenchmarks, including fine-grained zero-shot classification, natural\ndistribution zero-shot classification, and open-vocabulary segmentation, and\nthe results show that our $S^3$ outperforms state-of-the-art methods.\n","authors":["Xiaojie Yin","Qilong Wang","Bing Cao","Qinghua Hu"],"pdf_url":"https://arxiv.org/pdf/2412.04925v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2304.02278v5","updated":"2024-12-06T10:13:10Z","published":"2023-04-05T07:50:16Z","title":"SCMM: Calibrating Cross-modal Representations for Text-Based Person\n Search","summary":" Text-Based Person Search (TBPS) is a crucial task in the Internet of Things\n(IoT) domain that enables accurate retrieval of target individuals from\nlarge-scale galleries with only given textual caption. For cross-modal TBPS\ntasks, it is critical to obtain well-distributed representation in the common\nembedding space to reduce the inter-modal gap. Furthermore, learning detailed\nimage-text correspondences is essential to discriminate similar targets and\nenable fine-grained search. To address these challenges, we present a simple\nyet effective method named Sew Calibration and Masked Modeling (SCMM) that\ncalibrates cross-modal representations by learning compact and well-aligned\nembeddings. SCMM introduces two novel losses for fine-grained cross-modal\nrepresentations: Sew calibration loss that aligns image and text features based\non textual caption quality, and Masked Caption Modeling (MCM) loss that\nestablishes detailed relationships between textual and visual parts. This\ndual-pronged strategy enhances feature alignment and cross-modal\ncorrespondences, enabling accurate distinction of similar individuals while\nmaintaining a streamlined dual-encoder architecture for real-time inference,\nwhich is essential for resource-limited sensors and IoT systems. Extensive\nexperiments on three popular TBPS benchmarks demonstrate the superiority of\nSCMM, achieving 73.81%, 64.25%, and 57.35% Rank-1 accuracy on CUHK-PEDES,\nICFG-PEDES, and RSTPReID, respectively.\n","authors":["Jing Liu","Donglai Wei","Yang Liu","Sipeng Zhang","Tong Yang","Victor C. M. Leung"],"pdf_url":"https://arxiv.org/pdf/2304.02278v5.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2412.04915v1","updated":"2024-12-06T10:12:10Z","published":"2024-12-06T10:12:10Z","title":"Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video\n Object Detection","summary":" The primary challenge in Video Object Detection (VOD) is effectively\nexploiting temporal information to enhance object representations. Traditional\nstrategies, such as aggregating region proposals, often suffer from feature\nvariance due to the inclusion of background information. We introduce a novel\ninstance mask-based feature aggregation approach, significantly refining this\nprocess and deepening the understanding of object dynamics across video frames.\nWe present FAIM, a new VOD method that enhances temporal Feature Aggregation by\nleveraging Instance Mask features. In particular, we propose the lightweight\nInstance Feature Extraction Module (IFEM) to learn instance mask features and\nthe Temporal Instance Classification Aggregation Module (TICAM) to aggregate\ninstance mask and classification features across video frames. 
Using YOLOX as a\nbase detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on\na single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off.\nAdditional experiments on multiple datasets validate that our approach is\nrobust, method-agnostic, and effective in multi-object tracking, demonstrating\nits broader applicability to video understanding tasks.\n","authors":["Khurram Azeem Hashmi","Talha Uddin Sheikh","Didier Stricker","Muhammad Zeshan Afzal"],"pdf_url":"https://arxiv.org/pdf/2412.04915v1.pdf","comment":"To appear in WACV 2025"},{"id":"http://arxiv.org/abs/2412.04912v1","updated":"2024-12-06T10:08:55Z","published":"2024-12-06T10:08:55Z","title":"UniMIC: Towards Universal Multi-modality Perceptual Image Compression","summary":" We present UniMIC, a universal multi-modality image compression framework,\nintending to unify the rate-distortion-perception (RDP) optimization for\nmultiple image codecs simultaneously through excavating cross-modality\ngenerative priors. Unlike most existing works that need to design and optimize\nimage codecs from scratch, our UniMIC introduces the visual codec repository,\nwhich incorporates amounts of representative image codecs and directly uses\nthem as the basic codecs for various practical applications. Moreover, we\npropose multi-grained textual coding, where variable-length content prompt and\ncompression prompt are designed and encoded to assist the perceptual\nreconstruction through the multi-modality conditional generation. In\nparticular, a universal perception compensator is proposed to improve the\nperception quality of decoded images from all basic codecs at the decoder side\nby reusing text-assisted diffusion priors from stable diffusion. With the\ncooperation of the above three strategies, our UniMIC achieves a significant\nimprovement of RDP optimization for different compression codecs, e.g.,\ntraditional and learnable codecs, and different compression costs, e.g.,\nultra-low bitrates. The code will be available in\nhttps://github.com/Amygyx/UniMIC .\n","authors":["Yixin Gao","Xin Li","Xiaohan Pan","Runsen Feng","Zongyu Guo","Yiting Lu","Yulin Ren","Zhibo Chen"],"pdf_url":"https://arxiv.org/pdf/2412.04912v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04903v1","updated":"2024-12-06T09:59:47Z","published":"2024-12-06T09:59:47Z","title":"EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation","summary":" Multimodal large language models (MLLMs) have achieved remarkable progress on\nvarious visual question answering and reasoning tasks leveraging instruction\nfine-tuning specific datasets. They can also learn from preference data\nannotated by human to enhance their reasoning ability and mitigate\nhallucinations. Most of preference data is generated from the model itself.\nHowever, existing methods require high-quality critical labels, which are\ncostly and rely on human or proprietary models like GPT-4V. In this work, we\npropose Enhancing Alignment in MLLMs via Critical Observation (EACO), which\naligns MLLMs by self-generated preference data using only 5k images\neconomically. Our approach begins with collecting and refining a Scoring\nEvaluation Instruction-tuning dataset to train a critical evaluation model,\ntermed the Critic. This Critic observes model responses across multiple\ndimensions, selecting preferred and non-preferred outputs for refined Direct\nPreference Optimization (DPO) tuning. 
To further enhance model performance, we\nemploy an additional supervised fine-tuning stage after preference tuning. EACO\nreduces the overall hallucinations by 65.6% on HallusionBench and improves the\nreasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement\nover LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also\nshows the potential critical ability in open-source MLLMs, demonstrating that\nEACO is a viable path to boost the competence of MLLMs.\n","authors":["Yongxin Wang","Meng Cao","Haokun Lin","Mingfei Han","Liang Ma","Jin Jiang","Yuhao Cheng","Xiaodan Liang"],"pdf_url":"https://arxiv.org/pdf/2412.04903v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2412.04898v1","updated":"2024-12-06T09:56:49Z","published":"2024-12-06T09:56:49Z","title":"Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised\n Pretraining with Pseudo-Label Refinement","summary":" Deep learning models rely heavily on large volumes of labeled data to achieve\nhigh performance. However, real-world datasets often contain noisy labels due\nto human error, ambiguity, or resource constraints during the annotation\nprocess. Instance-dependent label noise (IDN), where the probability of a label\nbeing corrupted depends on the input features, poses a significant challenge\nbecause it is more prevalent and harder to address than instance-independent\nnoise. In this paper, we propose a novel hybrid framework that combines\nself-supervised learning using SimCLR with iterative pseudo-label refinement to\nmitigate the effects of IDN. The self-supervised pre-training phase enables the\nmodel to learn robust feature representations without relying on potentially\nnoisy labels, establishing a noise-agnostic foundation. Subsequently, we employ\nan iterative training process with pseudo-label refinement, where confidently\npredicted samples are identified through a multistage approach and their labels\nare updated to improve label quality progressively. We evaluate our method on\nthe CIFAR-10 and CIFAR-100 datasets augmented with synthetic instance-dependent\nnoise at varying noise levels. Experimental results demonstrate that our\napproach significantly outperforms several state-of-the-art methods,\nparticularly under high noise conditions, achieving notable improvements in\nclassification accuracy and robustness. Our findings suggest that integrating\nself-supervised learning with iterative pseudo-label refinement offers an\neffective strategy for training deep neural networks on noisy datasets\nafflicted by instance-dependent label noise.\n","authors":["Gouranga Bala","Anuj Gupta","Subrat Kumar Behera","Amit Sethi"],"pdf_url":"https://arxiv.org/pdf/2412.04898v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04896v1","updated":"2024-12-06T09:55:37Z","published":"2024-12-06T09:55:37Z","title":"Comprehensive Analysis and Improvements in Pansharpening Using Deep\n Learning","summary":" Pansharpening is a crucial task in remote sensing, enabling the generation of\nhigh-resolution multispectral images by fusing low-resolution multispectral\ndata with high-resolution panchromatic images. This paper provides a\ncomprehensive analysis of traditional and deep learning-based pansharpening\nmethods. While state-of-the-art deep learning methods have significantly\nimproved image quality, issues like spectral distortions persist. 
To address\nthis, we propose enhancements to the PSGAN framework by introducing novel\nregularization techniques for the generator loss function. Experimental results\non images from the Worldview-3 dataset demonstrate that the proposed\nmodifications improve spectral fidelity and achieve superior performance across\nmultiple quantitative metrics while delivering visually superior results.\n","authors":["Mahek Kantharia","Neeraj Badal","Zankhana Shah"],"pdf_url":"https://arxiv.org/pdf/2412.04896v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2211.04700v3","updated":"2024-12-06T09:46:09Z","published":"2022-11-09T06:18:18Z","title":"Noise Self-Regression: A New Learning Paradigm to Enhance Low-Light\n Images Without Task-Related Data","summary":" Deep learning-based low-light image enhancement (LLIE) is a task of\nleveraging deep neural networks to enhance the image illumination while keeping\nthe image content unchanged. From the perspective of training data, existing\nmethods complete the LLIE task driven by one of the following three data types:\npaired data, unpaired data and zero-reference data. Each type of these\ndata-driven methods has its own advantages, e.g., zero-reference data-based\nmethods have very low requirements on training data and can meet the human\nneeds in many scenarios. In this paper, we leverage pure Gaussian noise to\ncomplete the LLIE task, which further reduces the requirements for training\ndata in LLIE tasks and can be used as another alternative in practical use.\nSpecifically, we propose Noise SElf-Regression (NoiSER) without access to any\ntask-related data, simply learns a convolutional neural network equipped with\nan instance-normalization layer by taking a random noise image,\n$\\mathcal{N}(0,\\sigma^2)$ for each pixel, as both input and output for each\ntraining pair, and then the low-light image is fed to the trained network for\npredicting the normal-light image. Technically, an intuitive explanation for\nits effectiveness is as follows: 1) the self-regression reconstructs the\ncontrast between adjacent pixels of the input image, 2) the\ninstance-normalization layer may naturally remediate the overall\nmagnitude/lighting of the input image, and 3) the $\\mathcal{N}(0,\\sigma^2)$\nassumption for each pixel enforces the output image to follow the well-known\ngray-world hypothesis when the image size is big enough. Compared to current\nstate-of-the-art LLIE methods with access to different task-related data,\nNoiSER is highly competitive in enhancement quality, yet with a much smaller\nmodel size, and much lower training and inference cost. Besides, NoiSER also\nexcels in mitigating overexposure and handling joint tasks.\n","authors":["Zhao Zhang","Suiyi Zhao","Xiaojie Jin","Mingliang Xu","Yi Yang","Shuicheng Yan","Meng Wang"],"pdf_url":"https://arxiv.org/pdf/2211.04700v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04887v1","updated":"2024-12-06T09:31:12Z","published":"2024-12-06T09:31:12Z","title":"Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large\n Scene Reconstruction","summary":" 3D Gaussian Splatting has demonstrated notable success in large-scale scene\nreconstruction, but challenges persist due to high training memory consumption\nand storage overhead. Hybrid representations that integrate implicit and\nexplicit features offer a way to mitigate these limitations. 
However, when\napplied in parallelized block-wise training, two critical issues arise since\nreconstruction accuracy deteriorates due to reduced data diversity when\ntraining each block independently, and parallel training restricts the number\nof divided blocks to the available number of GPUs. To address these issues, we\npropose Momentum-GS, a novel approach that leverages momentum-based\nself-distillation to promote consistency and accuracy across the blocks while\ndecoupling the number of blocks from the physical GPU count. Our method\nmaintains a teacher Gaussian decoder updated with momentum, ensuring a stable\nreference during training. This teacher provides each block with global\nguidance in a self-distillation manner, promoting spatial consistency in\nreconstruction. To further ensure consistency across the blocks, we incorporate\nblock weighting, dynamically adjusting each block's weight according to its\nreconstruction accuracy. Extensive experiments on large-scale scenes show that\nour method consistently outperforms existing techniques, achieving a 12.8%\nimprovement in LPIPS over CityGaussian with much fewer divided blocks and\nestablishing a new state of the art. Project page:\nhttps://jixuan-fan.github.io/Momentum-GS_Page/\n","authors":["Jixuan Fan","Wanhua Li","Yifei Han","Yansong Tang"],"pdf_url":"https://arxiv.org/pdf/2412.04887v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04884v1","updated":"2024-12-06T09:26:22Z","published":"2024-12-06T09:26:22Z","title":"AI-Driven Non-Invasive Detection and Staging of Steatosis in Fatty Liver\n Disease Using a Novel Cascade Model and Information Fusion Techniques","summary":" Non-alcoholic fatty liver disease (NAFLD) is one of the most widespread liver\ndisorders on a global scale, posing a significant threat of progressing to more\nsevere conditions like nonalcoholic steatohepatitis (NASH), liver fibrosis,\ncirrhosis, and hepatocellular carcinoma. Diagnosing and staging NAFLD presents\nchallenges due to its non-specific symptoms and the invasive nature of liver\nbiopsies. Our research introduces a novel artificial intelligence cascade model\nemploying ensemble learning and feature fusion techniques. We developed a\nnon-invasive, robust, and reliable diagnostic artificial intelligence tool that\nutilizes anthropometric and laboratory parameters, facilitating early detection\nand intervention in NAFLD progression. Our novel artificial intelligence\nachieved an 86% accuracy rate for the NASH steatosis staging task (non-NASH,\nsteatosis grade 1, steatosis grade 2, and steatosis grade 3) and an impressive\n96% AUC-ROC for distinguishing between NASH (steatosis grade 1, grade 2, and\ngrade3) and non-NASH cases, outperforming current state-of-the-art models. 
This\nnotable improvement in diagnostic performance underscores the potential\napplication of artificial intelligence in the early diagnosis and treatment of\nNAFLD, leading to better patient outcomes and a reduced healthcare burden\nassociated with advanced liver disease.\n","authors":["Niloufar Delfan","Pardis Ketabi Moghadam","Mohammad Khoshnevisan","Mehdi Hosseini Chagahi","Behzad Hatami","Melika Asgharzadeh","Mohammadreza Zali","Behzad Moshiri","Amin Momeni Moghaddam","Mohammad Amin Khalafi","Khosrow Dehnad"],"pdf_url":"https://arxiv.org/pdf/2412.04884v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04880v1","updated":"2024-12-06T09:23:31Z","published":"2024-12-06T09:23:31Z","title":"MozzaVID: Mozzarella Volumetric Image Dataset","summary":" Influenced by the complexity of volumetric imaging, there is a shortage of\nestablished datasets useful for benchmarking volumetric deep-learning models.\nAs a consequence, new and existing models are not easily comparable, limiting\nthe development of architectures optimized specifically for volumetric data. To\ncounteract this trend, we introduce MozzaVID - a large, clean, and versatile\nvolumetric classification dataset. Our dataset contains X-ray computed\ntomography (CT) images of mozzarella microstructure and enables the\nclassification of 25 cheese types and 149 cheese samples. We provide data in\nthree different resolutions, resulting in three dataset instances containing\nfrom 591 to 37,824 images. While being general-purpose, the dataset also\nfacilitates investigating mozzarella structure properties. The structure of\nfood directly affects its functional properties and thus its consumption\nexperience. Understanding food structure helps tune the production and\nmimicking it enables sustainable alternatives to animal-derived food products.\nThe complex and disordered nature of food structures brings a unique challenge,\nwhere a choice of appropriate imaging method, scale, and sample size is not\ntrivial. With this dataset we aim to address these complexities, contributing\nto more robust structural analysis models. The dataset can be downloaded from:\nhttps://archive.compute.dtu.dk/files/public/projects/MozzaVID/.\n","authors":["Pawel Tomasz Pieta","Peter Winkel Rasmussen","Anders Bjorholm Dahl","Jeppe Revall Frisvad","Siavash Arjomand Bigdeli","Carsten Gundlach","Anders Nymark Christensen"],"pdf_url":"https://arxiv.org/pdf/2412.04880v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04879v1","updated":"2024-12-06T09:20:59Z","published":"2024-12-06T09:20:59Z","title":"Automatic Tissue Differentiation in Parotidectomy using Hyperspectral\n Imaging","summary":" In head and neck surgery, continuous intraoperative tissue differentiation is\nof great importance to avoid injury to sensitive structures such as nerves and\nvessels. Hyperspectral imaging (HSI) with neural network analysis could support\nthe surgeon in tissue differentiation. A 3D Convolutional Neural Network with\nhyperspectral data in the range of $400-1000$ nm is used in this work. The\nacquisition system consisted of two multispectral snapshot cameras creating a\nstereo-HSI-system. For the analysis, 27 images with annotations of glandular\ntissue, nerve, muscle, skin and vein in 18 patients undergoing parotidectomy\nare included. Three patients are removed for evaluation following the\nleave-one-subject-out principle. The remaining images are used for training,\nwith the data randomly divided into a training group and a validation group. 
In\nthe validation, an overall accuracy of $98.7\\%$ is achieved, indicating robust\ntraining. In the evaluation on the excluded patients, an overall accuracy of\n$83.4\\%$ has been achieved showing good detection and identification abilities.\nThe results clearly show that it is possible to achieve robust intraoperative\ntissue differentiation using hyperspectral imaging. Especially the high\nsensitivity in parotid or nerve tissue is of clinical importance. It is\ninteresting to note that vein was often confused with muscle. This requires\nfurther analysis and shows that a very good and comprehensive data basis is\nessential. This is a major challenge, especially in surgery.\n","authors":["Eric L. Wisotzky","Alexander Schill","Anna Hilsmann","Peter Eisert","Michael Knoke"],"pdf_url":"https://arxiv.org/pdf/2412.04879v1.pdf","comment":"Accepted and presented at 58th Annual Conference of the German\n Society for Biomedical Engineering in press at Current Directions in\n Biomedical Engineering"},{"id":"http://arxiv.org/abs/2408.15038v2","updated":"2024-12-06T09:16:51Z","published":"2024-08-27T13:07:09Z","title":"Interactive Occlusion Boundary Estimation through Exploitation of\n Synthetic Data","summary":" Occlusion boundaries (OBs) geometrically localize the occlusion events in a\n2D image, and contain useful information for addressing various scene\nunderstanding problems. To advance their study, we have led the investigation\nin the following three aspects. Firstly, we have studied interactive estimation\nof OBs, which is the first in the literature, and proposed an efficient\ndeep-network-based method using multiple-scribble intervention, named DNMMSI,\nwhich significantly improves the performance over the state-of-the-art\nfully-automatic methods. Secondly, we propose to exploit the synthetic\nbenchmark for the training, thanks to the particularity that OBs are determined\ngeometrically and unambiguously from the 3D scene. To this end, we have\ndeveloped an efficient tool, named Mesh2OB, for the automatic generation of 2D\nimages together with their ground-truth OBs, using which we have constructed a\nsynthetic benchmark, named OB-FUTURE. Abundant experimental results demonstrate\nthat leveraging such a synthetic benchmark for training achieves promising\nperformance, even without the use of domain adaptation techniques. Finally, to\nachieve a more compelling and robust evaluation in OB-related research, we have\ncreated a real-world benchmark OB-LabName, consisting of 120 high-resolution\nimages together with their ground-truth OBs, with precision surpassing that of\nprevious benchmarks. We will release DNMMSI with pre-trained parameters,\nMesh2OB, OB-FUTURE, and OB-LabName to support further research.\n","authors":["Lintao Xu","Chaohui Wang"],"pdf_url":"https://arxiv.org/pdf/2408.15038v2.pdf","comment":"11 pages, 4 figures, 8 tables"},{"id":"http://arxiv.org/abs/2409.18017v3","updated":"2024-12-06T09:14:41Z","published":"2024-09-26T16:25:48Z","title":"Transferring disentangled representations: bridging the gap between\n synthetic and real images","summary":" Developing meaningful and efficient representations that separate the\nfundamental structure of the data generation mechanism is crucial in\nrepresentation learning. 
However, Disentangled Representation Learning has not\nfully shown its potential on real images, because of correlated generative\nfactors, their resolution and limited access to ground truth labels.\nSpecifically on the latter, we investigate the possibility of leveraging\nsynthetic data to learn general-purpose disentangled representations applicable\nto real data, discussing the effect of fine-tuning and what properties of\ndisentanglement are preserved after the transfer. We provide an extensive\nempirical study to address these issues. In addition, we propose a new\ninterpretable intervention-based metric, to measure the quality of factors\nencoding in the representation. Our results indicate that some level of\ndisentanglement, transferring a representation from synthetic to real data, is\npossible and effective.\n","authors":["Jacopo Dapueto","Nicoletta Noceti","Francesca Odone"],"pdf_url":"https://arxiv.org/pdf/2409.18017v3.pdf","comment":"Accepted to NeurIPS, 2024"},{"id":"http://arxiv.org/abs/2412.04867v1","updated":"2024-12-06T09:01:10Z","published":"2024-12-06T09:01:10Z","title":"MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection\n Dataset for Tiny Objects","summary":" We present MANTA, a visual-text anomaly detection dataset for tiny objects.\nThe visual component comprises over 137.3K images across 38 object categories\nspanning five typical domains, of which 8.6K images are labeled as anomalous\nwith pixel-level annotations. Each image is captured from five distinct\nviewpoints to ensure comprehensive object coverage. The text component consists\nof two subsets: Declarative Knowledge, including 875 words that describe common\nanomalies across various domains and specific categories, with detailed\nexplanations for < what, why, how>, including causes and visual\ncharacteristics; and Constructivist Learning, providing 2K multiple-choice\nquestions with varying levels of difficulty, each paired with images and\ncorresponded answer explanations. We also propose a baseline for visual-text\ntasks and conduct extensive benchmarking experiments to evaluate advanced\nmethods across different settings, highlighting the challenges and efficacy of\nour dataset.\n","authors":["Lei Fan","Dongdong Fan","Zhiguang Hu","Yiwen Ding","Donglin Di","Kai Yi","Maurice Pagnucco","Yang Song"],"pdf_url":"https://arxiv.org/pdf/2412.04867v1.pdf","comment":"https://grainnet.github.io/MANTA"},{"id":"http://arxiv.org/abs/2412.04086v2","updated":"2024-12-06T09:00:39Z","published":"2024-12-05T11:48:54Z","title":"BodyMetric: Evaluating the Realism of Human Bodies in Text-to-Image\n Generation","summary":" Accurately generating images of human bodies from text remains a challenging\nproblem for state of the art text-to-image models. Commonly observed\nbody-related artifacts include extra or missing limbs, unrealistic poses,\nblurred body parts, etc. Currently, evaluation of such artifacts relies heavily\non time-consuming human judgments, limiting the ability to benchmark models at\nscale. We address this by proposing BodyMetric, a learnable metric that\npredicts body realism in images. BodyMetric is trained on realism labels and\nmulti-modal signals including 3D body representations inferred from the input\nimage, and textual descriptions. In order to facilitate this approach, we\ndesign an annotation pipeline to collect expert ratings on human body realism\nleading to a new dataset for this task, namely, BodyRealism. 
Ablation studies\nsupport our architectural choices for BodyMetric and the importance of\nleveraging a 3D human body prior in capturing body-related artifacts in 2D\nimages. In comparison to concurrent metrics which evaluate general user\npreference in images, BodyMetric specifically reflects body-related artifacts.\nWe demonstrate the utility of BodyMetric through applications that were\npreviously infeasible at scale. In particular, we use BodyMetric to benchmark\nthe generation ability of text-to-image models to produce realistic human\nbodies. We also demonstrate the effectiveness of BodyMetric in ranking\ngenerated images based on the predicted realism scores.\n","authors":["Nefeli Andreou","Varsha Vivek","Ying Wang","Alex Vorobiov","Tiffany Deng","Raja Bala","Larry Davis","Betty Mohler Tesch"],"pdf_url":"https://arxiv.org/pdf/2412.04086v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04855v1","updated":"2024-12-06T08:47:14Z","published":"2024-12-06T08:47:14Z","title":"GS-Matching: Reconsidering Feature Matching task in Point Cloud\n Registration","summary":" Traditional point cloud registration (PCR) methods for feature matching often\nemploy the nearest neighbor policy. This leads to many-to-one matches and\nnumerous potential inliers without any corresponding point. Recently, some\napproaches have framed the feature matching task as an assignment problem to\nachieve optimal one-to-one matches. We argue that the transition to the\nAssignment problem is not reliable for general correspondence-based PCR. In\nthis paper, we propose a heuristics stable matching policy called GS-matching,\ninspired by the Gale-Shapley algorithm. Compared to the other matching\npolicies, our method can perform efficiently and find more non-repetitive\ninliers under low overlapping conditions. Furthermore, we employ the\nprobability theory to analyze the feature matching task, providing new insights\ninto this research problem. Extensive experiments validate the effectiveness of\nour matching policy, achieving better registration recall on multiple datasets.\n","authors":["Yaojie Zhang","Tianlun Huang","Weijun Wang","Wei Feng"],"pdf_url":"https://arxiv.org/pdf/2412.04855v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04852v1","updated":"2024-12-06T08:44:18Z","published":"2024-12-06T08:44:18Z","title":"SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image\n Diffusion Models","summary":" Recent advances in large-scale text-to-image (T2I) diffusion models have\nenabled a variety of downstream applications, including style customization,\nsubject-driven personalization, and conditional generation. As T2I models\nrequire extensive data and computational resources for training, they\nconstitute highly valued intellectual property (IP) for their legitimate\nowners, yet making them incentive targets for unauthorized fine-tuning by\nadversaries seeking to leverage these models for customized, usually profitable\napplications. Existing IP protection methods for diffusion models generally\ninvolve embedding watermark patterns and then verifying ownership through\ngenerated outputs examination, or inspecting the model's feature space.\nHowever, these techniques are inherently ineffective in practical scenarios\nwhen the watermarked model undergoes fine-tuning, and the feature space is\ninaccessible during verification ((i.e., black-box setting). The model is prone\nto forgetting the previously learned watermark knowledge when it adapts to a\nnew task. 
To address this challenge, we propose SleeperMark, a novel framework\ndesigned to embed resilient watermarks into T2I diffusion models. SleeperMark\nexplicitly guides the model to disentangle the watermark information from the\nsemantic concepts it learns, allowing the model to retain the embedded\nwatermark while continuing to be fine-tuned to new downstream tasks. Our\nextensive experiments demonstrate the effectiveness of SleeperMark across\nvarious types of diffusion models, including latent diffusion models (e.g.,\nStable Diffusion) and pixel diffusion models (e.g., DeepFloyd-IF), showing\nrobustness against downstream fine-tuning and various attacks at both the image\nand model levels, with minimal impact on the model's generative capability. The\ncode is available at https://github.com/taco-group/SleeperMark.\n","authors":["Zilan Wang","Junfeng Guo","Jiacheng Zhu","Yiming Li","Heng Huang","Muhao Chen","Zhengzhong Tu"],"pdf_url":"https://arxiv.org/pdf/2412.04852v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04842v1","updated":"2024-12-06T08:27:53Z","published":"2024-12-06T08:27:53Z","title":"UniMLVG: Unified Framework for Multi-view Long Video Generation with\n Comprehensive Control Capabilities for Autonomous Driving","summary":" The creation of diverse and realistic driving scenarios has become essential\nto enhance perception and planning capabilities of the autonomous driving\nsystem. However, generating long-duration, surround-view consistent driving\nvideos remains a significant challenge. To address this, we present UniMLVG, a\nunified framework designed to generate extended street multi-perspective videos\nunder precise control. By integrating single- and multi-view driving videos\ninto the training data, our approach updates cross-frame and cross-view modules\nacross three stages with different training objectives, substantially boosting\nthe diversity and quality of generated visual content. Additionally, we employ\nthe explicit viewpoint modeling in multi-view video generation to effectively\nimprove motion transition consistency. Capable of handling various input\nreference formats (e.g., text, images, or video), our UniMLVG generates\nhigh-quality multi-view videos according to the corresponding condition\nconstraints such as 3D bounding boxes or frame-level text descriptions.\nCompared to the best models with similar capabilities, our framework achieves\nimprovements of 21.4% in FID and 36.5% in FVD.\n","authors":["Rui Chen","Zehuan Wu","Yichen Liu","Yuxin Guo","Jingcheng Ni","Haifeng Xia","Siyu Xia"],"pdf_url":"https://arxiv.org/pdf/2412.04842v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.02293v4","updated":"2024-12-06T08:22:41Z","published":"2024-11-04T17:21:42Z","title":"Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and\n Image-to-3D Generation","summary":" While 3D generative models have greatly improved artists' workflows, the\nexisting diffusion models for 3D generation suffer from slow generation and\npoor generalization. To address this issue, we propose a two-stage approach\nnamed Hunyuan3D-1.0 including a lite version and a standard version, that both\nsupport text- and image-conditioned generation. In the first stage, we employ a\nmulti-view diffusion model that efficiently generates multi-view RGB in\napproximately 4 seconds. These multi-view images capture rich details of the 3D\nasset from different viewpoints, relaxing the tasks from single-view to\nmulti-view reconstruction. 
In the second stage, we introduce a feed-forward\nreconstruction model that rapidly and faithfully reconstructs the 3D asset\ngiven the generated multi-view images in approximately 7 seconds. The\nreconstruction network learns to handle noises and in-consistency introduced by\nthe multi-view diffusion and leverages the available information from the\ncondition image to efficiently recover the 3D structure. Our framework involves\nthe text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to\nsupport both text- and image-conditioned 3D generation. Our standard version\nhas 3x more parameters than our lite and other existing model. Our\nHunyuan3D-1.0 achieves an impressive balance between speed and quality,\nsignificantly reducing generation time while maintaining the quality and\ndiversity of the produced assets.\n","authors":["Xianghui Yang","Huiwen Shi","Bowen Zhang","Fan Yang","Jiacheng Wang","Hongxu Zhao","Xinhai Liu","Xinzhou Wang","Qingxiang Lin","Jiaao Yu","Lifu Wang","Zhuo Chen","Sicong Liu","Yuhong Liu","Yong Yang","Di Wang","Jie Jiang","Chunchao Guo"],"pdf_url":"https://arxiv.org/pdf/2411.02293v4.pdf","comment":"Technical Report; 3D Generation"},{"id":"http://arxiv.org/abs/2412.04835v1","updated":"2024-12-06T08:04:02Z","published":"2024-12-06T08:04:02Z","title":"Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards\n for Visuomotor Robot Policy Alignment","summary":" Visuomotor robot policies, increasingly pre-trained on large-scale datasets,\npromise significant advancements across robotics domains. However, aligning\nthese policies with end-user preferences remains a challenge, particularly when\nthe preferences are hard to specify. While reinforcement learning from human\nfeedback (RLHF) has become the predominant mechanism for alignment in\nnon-embodied domains like large language models, it has not seen the same\nsuccess in aligning visuomotor policies due to the prohibitive amount of human\nfeedback required to learn visual reward functions. To address this limitation,\nwe propose Representation-Aligned Preference-based Learning (RAPL), an\nobservation-only method for learning visual rewards from significantly less\nhuman preference feedback. Unlike traditional RLHF, RAPL focuses human feedback\non fine-tuning pre-trained vision encoders to align with the end-user's visual\nrepresentation and then constructs a dense visual reward via feature matching\nin this aligned representation space. We first validate RAPL through simulation\nexperiments in the X-Magical benchmark and Franka Panda robotic manipulation,\ndemonstrating that it can learn rewards aligned with human preferences, more\nefficiently uses preference data, and generalizes across robot embodiments.\nFinally, our hardware experiments align pre-trained Diffusion Policies for\nthree object manipulation tasks. We find that RAPL can fine-tune these policies\nwith 5x less real human preference data, taking the first step towards\nminimizing human feedback while maximizing visuomotor robot policy alignment.\n","authors":["Ran Tian","Yilin Wu","Chenfeng Xu","Masayoshi Tomizuka","Jitendra Malik","Andrea Bajcsy"],"pdf_url":"https://arxiv.org/pdf/2412.04835v1.pdf","comment":"Submitted to IJRR, this paper is an extended journal version of the\n conference paper arXiv:2310.07932 with new results and discussion. 
arXiv\n admin note: substantial text overlap with arXiv:2310.07932"},{"id":"http://arxiv.org/abs/2410.13370v2","updated":"2024-12-06T07:58:07Z","published":"2024-10-17T09:22:53Z","title":"MagicTailor: Component-Controllable Personalization in Text-to-Image\n Diffusion Models","summary":" Recent text-to-image models generate high-quality images from text prompts\nbut lack precise control over specific components within visual concepts.\nTherefore, we introduce component-controllable personalization, a new task that\nallows users to customize and reconfigure individual components within\nconcepts. This task faces two challenges: semantic pollution, where undesirable\nelements distort the concept, and semantic imbalance, which leads to\ndisproportionate learning of the target concept and component. To address\nthese, we design MagicTailor, a framework that uses Dynamic Masked Degradation\nto adaptively perturb unwanted visual semantics and Dual-Stream Balancing for\nmore balanced learning of desired visual semantics. The experimental results\nshow that MagicTailor outperforms existing methods in this task and enables\nmore personalized, nuanced, and creative image generation.\n","authors":["Donghao Zhou","Jiancheng Huang","Jinbin Bai","Jiaze Wang","Hao Chen","Guangyong Chen","Xiaowei Hu","Pheng-Ann Heng"],"pdf_url":"https://arxiv.org/pdf/2410.13370v2.pdf","comment":"Project page: https://correr-zhou.github.io/MagicTailor"},{"id":"http://arxiv.org/abs/2412.04831v1","updated":"2024-12-06T07:54:34Z","published":"2024-12-06T07:54:34Z","title":"Customized Generation Reimagined: Fidelity and Editability Harmonized","summary":" Customized generation aims to incorporate a novel concept into a pre-trained\ntext-to-image model, enabling new generations of the concept in novel contexts\nguided by textual prompts. However, customized generation suffers from an\ninherent trade-off between concept fidelity and editability, i.e., between\nprecisely modeling the concept and faithfully adhering to the prompts. Previous\nmethods reluctantly seek a compromise and struggle to achieve both high concept\nfidelity and ideal prompt alignment simultaneously. In this paper, we propose a\nDivide, Conquer, then Integrate (DCI) framework, which performs a surgical\nadjustment in the early stage of denoising to liberate the fine-tuned model\nfrom the fidelity-editability trade-off at inference. The two conflicting\ncomponents in the trade-off are decoupled and individually conquered by two\ncollaborative branches, which are then selectively integrated to preserve high\nconcept fidelity while achieving faithful prompt adherence. To obtain a better\nfine-tuned model, we introduce an Image-specific Context Optimization} (ICO)\nstrategy for model customization. ICO replaces manual prompt templates with\nlearnable image-specific contexts, providing an adaptive and precise\nfine-tuning direction to promote the overall performance. 
Extensive experiments\ndemonstrate the effectiveness of our method in reconciling the\nfidelity-editability trade-off.\n","authors":["Jian Jin","Yang Shen","Zhenyong Fu","Jian Yang"],"pdf_url":"https://arxiv.org/pdf/2412.04831v1.pdf","comment":"18 pages, 12 figures, ECCV 2024"},{"id":"http://arxiv.org/abs/2412.02545v2","updated":"2024-12-06T07:46:47Z","published":"2024-12-03T16:37:23Z","title":"ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer","summary":" Shadows introduce challenges such as reduced brightness, texture\ndeterioration, and color distortion in images, complicating a holistic\nsolution. This study presents ShadowHack, a divide-and-conquer strategy that\ntackles these complexities by decomposing the original task into luminance\nrecovery and color remedy. To brighten shadow regions and repair the corrupted\ntextures in the luminance space, we customize LRNet, a U-shaped network with a\nrectified outreach attention module, to enhance information interaction and\nrecalibrate contaminated attention maps. With luminance recovered, CRNet then\nleverages cross-attention mechanisms to revive vibrant colors, producing\nvisually compelling results. Extensive experiments on multiple datasets are\nconducted to demonstrate the superiority of ShadowHack over existing\nstate-of-the-art solutions both quantitatively and qualitatively, highlighting\nthe effectiveness of our design. Our code will be made publicly available at\nhttps://github.com/lime-j/ShadowHack\n","authors":["Jin Hu","Mingjia Li","Xiaojie Guo"],"pdf_url":"https://arxiv.org/pdf/2412.02545v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2407.04346v3","updated":"2024-12-06T07:43:34Z","published":"2024-07-05T08:37:10Z","title":"MobileFlow: A Multimodal LLM For Mobile GUI Agent","summary":" Currently, the integration of mobile Graphical User Interfaces (GUIs) is\nubiquitous in most people's daily lives. And the ongoing evolution of\nmultimodal large-scale models, such as GPT-4v, Qwen-VL-Max, has significantly\nbolstered the capabilities of GUI comprehension and user action analysis,\nshowcasing the potentiality of intelligent GUI assistants. However, current GUI\nAgents often need to access page layout information through calling system\nAPIs, which may pose privacy risks. Fixing GUI (such as mobile interfaces) to a\ncertain low resolution might result in the loss of fine-grained image details.\nAt the same time, the multimodal large models built for GUI Agents currently\nhave poor understanding and decision-making abilities for Chinese GUI\ninterfaces, making them difficult to apply to a large number of Chinese apps.\nThis paper introduces MobileFlow, a multimodal large language model\nmeticulously crafted for mobile GUI agents. Transforming from the open-source\nmodel Qwen-VL-Chat into GUI domain, MobileFlow contains approximately 21\nbillion parameters and is equipped with novel hybrid visual encoders, making it\npossible for variable resolutions of image inputs and good support for\nmultilingual GUI. By incorporating Mixture of Experts (MoE) expansions and\npioneering alignment training strategies, MobileFlow has the capacity to fully\ninterpret image data and comprehend user instructions for GUI interaction\ntasks. 
Finally, MobileFlow outperforms Qwen-VL-Max and GPT-4v in terms of task\nexecution by GUI agents on both public and our proposed evaluation metrics, and\nhas been successfully deployed in real-world business contexts, proving its\neffectiveness for practical applications.\n","authors":["Songqin Nong","Jiali Zhu","Rui Wu","Jiongchao Jin","Shuo Shan","Xiutian Huang","Wenhao Xu"],"pdf_url":"https://arxiv.org/pdf/2407.04346v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04828v1","updated":"2024-12-06T07:43:28Z","published":"2024-12-06T07:43:28Z","title":"DAug: Diffusion-based Channel Augmentation for Radiology Image Retrieval\n and Classification","summary":" Medical image understanding requires meticulous examination of fine visual\ndetails, with particular regions requiring additional attention. While\nradiologists build such expertise over years of experience, it is challenging\nfor AI models to learn where to look with limited amounts of training data.\nThis limitation results in unsatisfying robustness in medical image\nunderstanding. To address this issue, we propose Diffusion-based Feature\nAugmentation (DAug), a portable method that improves a perception model's\nperformance with a generative model's output. Specifically, we extend a\nradiology image to multiple channels, with the additional channels being the\nheatmaps of regions where diseases tend to develop. A diffusion-based\nimage-to-image translation model was used to generate such heatmaps conditioned\non selected disease classes. Our method is motivated by the fact that\ngenerative models learn the distribution of normal and abnormal images, and\nsuch knowledge is complementary to image understanding tasks. In addition, we\npropose the Image-Text-Class Hybrid Contrastive learning to utilize both text\nand class labels. With two novel approaches combined, our method surpasses\nbaseline models without changing the model architecture, and achieves\nstate-of-the-art performance on both medical image retrieval and classification\ntasks.\n","authors":["Ying Jin","Zhuoran Zhou","Haoquan Fang","Jenq-Neng Hwang"],"pdf_url":"https://arxiv.org/pdf/2412.04828v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04827v1","updated":"2024-12-06T07:42:48Z","published":"2024-12-06T07:42:48Z","title":"PanoDreamer: 3D Panorama Synthesis from a Single Image","summary":" In this paper, we present PanoDreamer, a novel method for producing a\ncoherent 360$^\\circ$ 3D scene from a single input image. Unlike existing\nmethods that generate the scene sequentially, we frame the problem as\nsingle-image panorama and depth estimation. Once the coherent panoramic image\nand its corresponding depth are obtained, the scene can be reconstructed by\ninpainting the small occluded regions and projecting them into 3D space. Our\nkey contribution is formulating single-image panorama and depth estimation as\ntwo optimization tasks and introducing alternating minimization strategies to\neffectively solve their objectives. 
We demonstrate that our approach\noutperforms existing techniques in single-image 360$^\\circ$ scene\nreconstruction in terms of consistency and overall quality.\n","authors":["Avinash Paliwal","Xilong Zhou","Andrii Tsarov","Nima Khademi Kalantari"],"pdf_url":"https://arxiv.org/pdf/2412.04827v1.pdf","comment":"Project page: https://people.engr.tamu.edu/nimak/Papers/PanoDreamer,\n Code: https://github.com/avinashpaliwal/PanoDreamer"},{"id":"http://arxiv.org/abs/2412.04826v1","updated":"2024-12-06T07:42:47Z","published":"2024-12-06T07:42:47Z","title":"Pushing Rendering Boundaries: Hard Gaussian Splatting","summary":" 3D Gaussian Splatting (3DGS) has demonstrated impressive Novel View Synthesis\n(NVS) results in a real-time rendering manner. During training, it relies\nheavily on the average magnitude of view-space positional gradients to grow\nGaussians to reduce rendering loss. However, this average operation smooths the\npositional gradients from different viewpoints and rendering errors from\ndifferent pixels, hindering the growth and optimization of many defective\nGaussians. This leads to strong spurious artifacts in some areas. To address\nthis problem, we propose Hard Gaussian Splatting, dubbed HGS, which considers\nmulti-view significant positional gradients and rendering errors to grow hard\nGaussians that fill the gaps of classical Gaussian Splatting on 3D scenes, thus\nachieving superior NVS results. In detail, we present positional gradient\ndriven HGS, which leverages multi-view significant positional gradients to\nuncover hard Gaussians. Moreover, we propose rendering error guided HGS, which\nidentifies noticeable pixel rendering errors and potentially over-large\nGaussians to jointly mine hard Gaussians. By growing and optimizing these hard\nGaussians, our method helps to resolve blurring and needle-like artifacts.\nExperiments on various datasets demonstrate that our method achieves\nstate-of-the-art rendering quality while maintaining real-time efficiency.\n","authors":["Qingshan Xu","Jiequan Cui","Xuanyu Yi","Yuxuan Wang","Yuan Zhou","Yew-Soon Ong","Hanwang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.04826v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19772v2","updated":"2024-12-06T07:24:10Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. 
Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v2.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2412.04814v1","updated":"2024-12-06T07:16:14Z","published":"2024-12-06T07:16:14Z","title":"LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment","summary":" Recent advancements in text-to-video (T2V) generative models have shown\nimpressive capabilities. However, these models are still inadequate in aligning\nsynthesized videos with human preferences (e.g., accurately reflecting text\ndescriptions), which is particularly difficult to address, as human preferences\nare inherently subjective and challenging to formalize as objective functions.\nTherefore, this paper proposes LiFT, a novel fine-tuning method leveraging\nhuman feedback for T2V model alignment. Specifically, we first construct a\nHuman Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k\nhuman annotations, each including a score and its corresponding rationale.\nBased on this, we train a reward model LiFT-Critic to learn reward function\neffectively, which serves as a proxy for human judgment, measuring the\nalignment between given videos and human expectations. Lastly, we leverage the\nlearned reward function to align the T2V model by maximizing the\nreward-weighted likelihood. As a case study, we apply our pipeline to\nCogVideoX-2B, showing that the fine-tuned model outperforms the CogVideoX-5B\nacross all 16 metrics, highlighting the potential of human feedback in\nimproving the alignment and quality of synthesized videos.\n","authors":["Yibin Wang","Zhiyu Tan","Junyan Wang","Xiaomeng Yang","Cheng Jin","Hao Li"],"pdf_url":"https://arxiv.org/pdf/2412.04814v1.pdf","comment":"project page: https://codegoat24.github.io/LiFT"},{"id":"http://arxiv.org/abs/2309.07668v2","updated":"2024-12-06T07:11:33Z","published":"2023-09-14T12:30:48Z","title":"ChromaDistill: Colorizing Monochrome Radiance Fields with Knowledge\n Distillation","summary":" Colorization is a well-explored problem in the domains of image and video\nprocessing. However, extending colorization to 3D scenes presents significant\nchallenges. Recent Neural Radiance Field (NeRF) and Gaussian-Splatting(3DGS)\nmethods enable high-quality novel-view synthesis for multi-view images.\nHowever, the question arises: How can we colorize these 3D representations?\nThis work presents a method for synthesizing colorized novel views from input\ngrayscale multi-view images. Using image or video colorization methods to\ncolorize novel views from these 3D representations naively will yield output\nwith severe inconsistencies. We introduce a novel method to use powerful image\ncolorization models for colorizing 3D representations. We propose a\ndistillation-based method that transfers color from these networks trained on\nnatural images to the target 3D representation. Notably, this strategy does not\nadd any additional weights or computational overhead to the original\nrepresentation during inference. Extensive experiments demonstrate that our\nmethod produces high-quality colorized views for indoor and outdoor scenes,\nshowcasing significant cross-view consistency advantages over baseline\napproaches. 
Our method is agnostic to the underlying 3D representation and\neasily generalizable to NeRF and 3DGS methods. Further, we validate the\nefficacy of our approach in several diverse applications: 1.) Infra-Red (IR)\nmulti-view images and 2.) Legacy grayscale multi-view image sequences. Project\nWebpage: https://val.cds.iisc.ac.in/chroma-distill.github.io/\n","authors":["Ankit Dhiman","R Srinath","Srinjay Sarkar","Lokesh R Boregowda","R Venkatesh Babu"],"pdf_url":"https://arxiv.org/pdf/2309.07668v2.pdf","comment":"WACV 2025, AI3DCC @ ICCV 2023"},{"id":"http://arxiv.org/abs/2412.02565v2","updated":"2024-12-06T07:08:56Z","published":"2024-12-03T16:53:58Z","title":"SJTU:Spatial judgments in multimodal models towards unified segmentation\n through coordinate detection","summary":" Despite significant advances in vision-language understanding, implementing\nimage segmentation within multimodal architectures remains a fundamental\nchallenge in modern artificial intelligence systems. Existing vision-language\nmodels, which primarily rely on backbone architectures or CLIP-based embedding\nlearning, demonstrate inherent limitations in fine-grained spatial localization\nand operational capabilities. This paper introduces SJTU: Spatial Judgments in\nMultimodal Models - Towards Unified Segmentation through Coordinate Detection,\na framework that leverages spatial coordinate understanding to bridge\nvision-language interaction and precise segmentation, enabling accurate target\nidentification through natural language instructions. The framework presents an\napproach for integrating segmentation techniques with vision-language models\nthrough spatial inference in multimodal space. By utilizing normalized\ncoordinate detection for bounding boxes and transforming them into actionable\nsegmentation outputs, we establish a connection between spatial and language\nrepresentations in multimodal architectures. Experimental results demonstrate\nsuperior performance across benchmark datasets, achieving IoU scores of 0.5958\non COCO 2017 and 0.6758 on Pascal VOC. Testing on a single NVIDIA RTX 3090 GPU\nwith 512x512 resolution images yields an average inference time of 7 seconds\nper image, demonstrating the framework's effectiveness in both accuracy and\npractical deployability. The project code is available at\nhttps://github.com/jw-chae/SJTU\n","authors":["Joongwon Chae","Zhenyu Wang","Peiwu Qin"],"pdf_url":"https://arxiv.org/pdf/2412.02565v2.pdf","comment":"15 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.04812v1","updated":"2024-12-06T07:06:21Z","published":"2024-12-06T07:06:21Z","title":"Automatic Prediction of Stroke Treatment Outcomes: Latest Advances and\n Perspectives","summary":" Stroke is a major global health problem that causes mortality and morbidity.\nPredicting the outcomes of stroke intervention can facilitate clinical\ndecision-making and improve patient care. Engaging and developing deep learning\ntechniques can help to analyse large and diverse medical data, including brain\nscans, medical reports and other sensor information, such as EEG, ECG, EMG and\nso on. Despite the common data standardisation challenge within medical image\nanalysis domain, the future of deep learning in stroke outcome prediction lie\nin using multimodal information, including final infarct data, to achieve\nbetter prediction of long-term functional outcomes. 
This article provides a\nbroad review of recent advances and applications of deep learning in the\nprediction of stroke outcomes, including (i) the data and models used, (ii) the\nprediction tasks and measures of success, (iii) the current challenges and\nlimitations, and (iv) future directions and potential benefits. This\ncomprehensive review aims to provide researchers, clinicians, and policy makers\nwith an up-to-date understanding of this rapidly evolving and promising field.\n","authors":["Zeynel A. Samak","Philip Clatworthy","Majid Mirmehdi"],"pdf_url":"https://arxiv.org/pdf/2412.04812v1.pdf","comment":"The paper is under consideration at Biomedical Engineering Letters\n (Springer)"},{"id":"http://arxiv.org/abs/2407.07504v4","updated":"2024-12-06T07:05:16Z","published":"2024-07-10T09:42:41Z","title":"Pan-cancer Histopathology WSI Pre-training with Position-aware Masked\n Autoencoder","summary":" Large-scale pre-training models have promoted the development of\nhistopathology image analysis. However, existing self-supervised methods for\nhistopathology images primarily focus on learning patch features, while there\nis a notable gap in the availability of pre-training models specifically\ndesigned for WSI-level feature learning. In this paper, we propose a novel\nself-supervised learning framework for pan-cancer WSI-level representation\npre-training with the designed position-aware masked autoencoder (PAMA).\nMeanwhile, we propose the position-aware cross-attention (PACA) module with a\nkernel reorientation (KRO) strategy and an anchor dropout (AD) mechanism. The\nKRO strategy can capture the complete semantic structure and eliminate\nambiguity in WSIs, and the AD contributes to enhancing the robustness and\ngeneralization of the model. We evaluated our method on 7 large-scale datasets\nfrom multiple organs for pan-cancer classification tasks. The results have\ndemonstrated the effectiveness and generalization of PAMA in discriminative WSI\nrepresentation learning and pan-cancer WSI pre-training. The proposed method\nwas also compared with 8 WSI analysis methods. The experimental results have\nindicated that our proposed PAMA is superior to the state-of-the-art methods.\nThe code and checkpoints are available at https://github.com/WkEEn/PAMA.\n","authors":["Kun Wu","Zhiguo Jiang","Kunming Tang","Jun Shi","Fengying Xie","Wei Wang","Haibo Wu","Yushan Zheng"],"pdf_url":"https://arxiv.org/pdf/2407.07504v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.17777v3","updated":"2024-12-06T06:58:30Z","published":"2024-09-26T12:15:13Z","title":"Harnessing Shared Relations via Multimodal Mixup Contrastive Learning\n for Multimodal Classification","summary":" Deep multimodal learning has shown remarkable success by leveraging\ncontrastive learning to capture explicit one-to-one relations across\nmodalities. However, real-world data often exhibits shared relations beyond\nsimple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive\nLearning approach to capture nuanced shared relations inherent in multimodal\ndata. Our key contribution is a Mixup-based contrastive loss that learns robust\nrepresentations by aligning mixed samples from one modality with their\ncorresponding samples from other modalities thereby capturing shared relations\nbetween them. 
For multimodal classification tasks, we introduce a framework\nthat integrates a fusion module with unimodal prediction modules for auxiliary\nsupervision during training, complemented by our proposed Mixup-based\ncontrastive loss. Through extensive experiments on diverse datasets (N24News,\nROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures\nshared multimodal relations and generalizes across domains. It outperforms\nstate-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving\ncomparable performance on Food-101. Our work highlights the significance of\nlearning shared relations for robust multimodal learning, opening up promising\navenues for future research. Our code is publicly available at\nhttps://github.com/RaghavSinghal10/M3CoL.\n","authors":["Raja Kumar","Raghav Singhal","Pranamya Kulkarni","Deval Mehta","Kshitij Jadhav"],"pdf_url":"https://arxiv.org/pdf/2409.17777v3.pdf","comment":"RK and RS contributed equally to this work, 20 Pages, 8 Figures, 9\n Tables. Another version of the paper accepted at NeurIPS 2024 Workshop on\n Unifying Representations in Neural Models (UniReps)"},{"id":"http://arxiv.org/abs/2408.09181v2","updated":"2024-12-06T06:41:47Z","published":"2024-08-17T12:11:22Z","title":"PADetBench: Towards Benchmarking Physical Attacks against Object\n Detection","summary":" Physical attacks against object detection have gained increasing attention\ndue to their significant practical implications. However, conducting physical\nexperiments is extremely time-consuming and labor-intensive. Moreover, physical\ndynamics and cross-domain transformation are challenging to strictly regulate\nin the real world, leading to unaligned evaluation and comparison, severely\nhindering the development of physically robust models. To accommodate these\nchallenges, we explore utilizing realistic simulation to thoroughly and\nrigorously benchmark physical attacks with fairness under controlled physical\ndynamics and cross-domain transformation. This resolves the problem of\ncapturing identical adversarial images that cannot be achieved in the real\nworld. Our benchmark includes 20 physical attack methods, 48 object detectors,\ncomprehensive physical dynamics, and evaluation metrics. We also provide\nend-to-end pipelines for dataset generation, detection, evaluation, and further\nanalysis. In addition, we perform 8064 groups of evaluation based on our\nbenchmark, which includes both overall evaluation and further detailed ablation\nstudies for controlled physical dynamics. Through these experiments, we provide\nin-depth analyses of physical attack performance and physical adversarial\nrobustness, draw valuable observations, and discuss potential directions for\nfuture research.\n Codebase: https://github.com/JiaweiLian/Benchmarking_Physical_Attack\n","authors":["Jiawei Lian","Jianhong Pan","Lefan Wang","Yi Wang","Lap-Pui Chau","Shaohui Mei"],"pdf_url":"https://arxiv.org/pdf/2408.09181v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04802v1","updated":"2024-12-06T06:22:43Z","published":"2024-12-06T06:22:43Z","title":"Modality Decoupling is All You Need: A Simple Solution for Unsupervised\n Hyperspectral Image Fusion","summary":" Hyperspectral Image Fusion (HIF) aims to fuse low-resolution hyperspectral\nimages (LR-HSIs) and high-resolution multispectral images (HR-MSIs) to\nreconstruct high spatial and high spectral resolution images. 
Current methods\ntypically apply direct fusion from the two modalities without valid\nsupervision, failing to fully perceive the deep modality-complementary\ninformation and hence, resulting in a superficial understanding of\ninter-modality connections. To bridge this gap, we propose a simple and\neffective solution for unsupervised HIF with an assumption that modality\ndecoupling is essential for HIF. We introduce the modality clustering loss that\nensures clear guidance of the modality, decoupling towards modality-shared\nfeatures while steering clear of modality-complementary ones. Also, we propose\nan end-to-end Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework\nthat decouples shared and complementary information across modalities and\naggregates a concise representation of the LR-HSI and HR-MSI to reduce the\nmodality redundancy. Systematic experiments over multiple datasets demonstrate\nthat our simple and effective approach consistently outperforms the existing\nHIF methods while requiring considerably fewer parameters with reduced\ninference time.\n","authors":["Songcheng Du","Yang Zou","Zixu Wang","Xingyuan Li","Ying Li","Qiang Shen"],"pdf_url":"https://arxiv.org/pdf/2412.04802v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01160v3","updated":"2024-12-06T05:48:04Z","published":"2024-12-02T06:00:27Z","title":"ControlFace: Harnessing Facial Parametric Control for Face Rigging","summary":" Manipulation of facial images to meet specific controls such as pose,\nexpression, and lighting, also known as face rigging, is a complex task in\ncomputer vision. Existing methods are limited by their reliance on image\ndatasets, which necessitates individual-specific fine-tuning and limits their\nability to retain fine-grained identity and semantic details, reducing\npractical usability. To overcome these limitations, we introduce ControlFace, a\nnovel face rigging method conditioned on 3DMM renderings that enables flexible,\nhigh-fidelity control. We employ a dual-branch U-Nets: one, referred to as\nFaceNet, captures identity and fine details, while the other focuses on\ngeneration. To enhance control precision, the control mixer module encodes the\ncorrelated features between the target-aligned control and reference-aligned\ncontrol, and a novel guidance method, reference control guidance, steers the\ngeneration process for better control adherence. By training on a facial video\ndataset, we fully utilize FaceNet's rich representations while ensuring control\nadherence. Extensive experiments demonstrate ControlFace's superior performance\nin identity preservation and control precision, highlighting its practicality.\nPlease see the project website: https://cvlab-kaist.github.io/ControlFace/.\n","authors":["Wooseok Jang","Youngjun Hong","Geonho Cha","Seungryong Kim"],"pdf_url":"https://arxiv.org/pdf/2412.01160v3.pdf","comment":"project website: https://cvlab-kaist.github.io/ControlFace/"},{"id":"http://arxiv.org/abs/2412.04789v1","updated":"2024-12-06T05:47:55Z","published":"2024-12-06T05:47:55Z","title":"DrIFT: Autonomous Drone Dataset with Integrated Real and Synthetic Data,\n Flexible Views, and Transformed Domains","summary":" Dependable visual drone detection is crucial for the secure integration of\ndrones into the airspace. However, drone detection accuracy is significantly\naffected by domain shifts due to environmental changes, varied points of view,\nand background shifts. 
To address these challenges, we present the DrIFT\ndataset, specifically developed for visual drone detection under domain shifts.\nDrIFT includes fourteen distinct domains, each characterized by shifts in point\nof view, synthetic-to-real data, season, and adverse weather. DrIFT uniquely\nemphasizes background shift by providing background segmentation maps to enable\nbackground-wise metrics and evaluation. Our new uncertainty estimation metric,\nMCDO-map, features lower postprocessing complexity, surpassing traditional\nmethods. We use the MCDO-map in our uncertainty-aware unsupervised domain\nadaptation method, demonstrating superior performance to SOTA unsupervised\ndomain adaptation techniques. The dataset is available at:\nhttps://github.com/CARG-uOttawa/DrIFT.git.\n","authors":["Fardad Dadboud","Hamid Azad","Varun Mehta","Miodrag Bolic","Iraj Mntegh"],"pdf_url":"https://arxiv.org/pdf/2412.04789v1.pdf","comment":"WACV2025"},{"id":"http://arxiv.org/abs/2412.04786v1","updated":"2024-12-06T05:31:42Z","published":"2024-12-06T05:31:42Z","title":"Slicing Vision Transformer for Flexible Inference","summary":" Vision Transformers (ViT) is known for its scalability. In this work, we\ntarget to scale down a ViT to fit in an environment with dynamic-changing\nresource constraints. We observe that smaller ViTs are intrinsically the\nsub-networks of a larger ViT with different widths. Thus, we propose a general\nframework, named Scala, to enable a single network to represent multiple\nsmaller ViTs with flexible inference capability, which aligns with the inherent\ndesign of ViT to vary from widths. Concretely, Scala activates several subnets\nduring training, introduces Isolated Activation to disentangle the smallest\nsub-network from other subnets, and leverages Scale Coordination to ensure each\nsub-network receives simplified, steady, and accurate learning objectives.\nComprehensive empirical validations on different tasks demonstrate that with\nonly one-shot training, Scala learns slimmable representation without modifying\nthe original ViT structure and matches the performance of Separate Training.\nCompared with the prior art, Scala achieves an average improvement of 1.6% on\nImageNet-1K with fewer parameters.\n","authors":["Yitian Zhang","Huseyin Coskun","Xu Ma","Huan Wang","Ke Ma"," Xi"," Chen","Derek Hao Hu","Yun Fu"],"pdf_url":"https://arxiv.org/pdf/2412.04786v1.pdf","comment":"Accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.03225v2","updated":"2024-12-06T05:24:39Z","published":"2024-12-04T11:23:15Z","title":"MaterialPicker: Multi-Modal Material Generation with Diffusion\n Transformers","summary":" High-quality material generation is key for virtual environment authoring and\ninverse rendering. We propose MaterialPicker, a multi-modal material generator\nleveraging a Diffusion Transformer (DiT) architecture, improving and\nsimplifying the creation of high-quality materials from text prompts and/or\nphotographs. Our method can generate a material based on an image crop of a\nmaterial sample, even if the captured surface is distorted, viewed at an angle\nor partially occluded, as is often the case in photographs of natural scenes.\nWe further allow the user to specify a text prompt to provide additional\nguidance for the generation. We finetune a pre-trained DiT-based video\ngenerator into a material generator, where each material map is treated as a\nframe in a video sequence. 
We evaluate our approach both quantitatively and\nqualitatively and show that it enables more diverse material generation and\nbetter distortion correction than previous work.\n","authors":["Xiaohe Ma","Valentin Deschaintre","Miloš Hašan","Fujun Luan","Kun Zhou","Hongzhi Wu","Yiwei Hu"],"pdf_url":"https://arxiv.org/pdf/2412.03225v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04783v1","updated":"2024-12-06T05:20:08Z","published":"2024-12-06T05:20:08Z","title":"KNN-MMD: Cross Domain Wi-Fi Sensing Based on Local Distribution\n Alignment","summary":" As a key technology in Integrated Sensing and Communications (ISAC), Wi-Fi\nsensing has gained widespread application in various settings such as homes,\noffices, and public spaces. By analyzing the patterns of Channel State\nInformation (CSI), we can obtain information about people's actions for tasks\nlike person identification, gesture recognition, and fall detection. However,\nthe CSI is heavily influenced by the environment, such that even minor\nenvironmental changes can significantly alter the CSI patterns. This will cause\nthe performance deterioration and even failure when applying the Wi-Fi sensing\nmodel trained in one environment to another. To address this problem, we\nintroduce a K-Nearest Neighbors Maximum Mean Discrepancy (KNN-MMD) model, a\nfew-shot method for cross-domain Wi-Fi sensing. We propose a local distribution\nalignment method within each category, which outperforms traditional Domain\nAdaptation (DA) methods based on global alignment. Besides, our method can\ndetermine when to stop training, which cannot be realized by most DA methods.\nAs a result, our method is more stable and can be better used in practice. The\neffectiveness of our method are evaluated in several cross-domain Wi-Fi sensing\ntasks, including gesture recognition, person identification, fall detection,\nand action recognition, using both a public dataset and a self-collected\ndataset. In one-shot scenario, our method achieves accuracy of 93.26%, 81.84%,\n77.62%, and 75.30% in the four tasks respectively. To facilitate future\nresearch, we will make our code and dataset publicly available upon\npublication.\n","authors":["Zijian Zhao","Zhijie Cai","Tingwei Chen","Xiaoyang Li","Hang Li","Guangxu Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.04783v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04237v2","updated":"2024-12-06T05:16:57Z","published":"2024-12-05T15:17:06Z","title":"VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction","summary":" Large language models (LLMs) have proven effective for layout generation due\nto their ability to produce structure-description languages, such as HTML or\nJSON, even without access to visual information. Recently, LLM providers have\nevolved these models into large vision-language models (LVLM), which shows\nprominent multi-modal understanding capabilities. Then, how can we leverage\nthis multi-modal power for layout generation? To answer this, we propose\nVisual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based\ncontent-aware layout generation. In our method, LVLMs iteratively refine their\noutputs with reference to rendered layout images, which are visualized as\ncolored bounding boxes on poster backgrounds. In experiments, we demonstrate\nthat our method combined with the Gemini. 
Without any additional training,\nVASCAR achieves state-of-the-art (SOTA) layout generation quality outperforming\nboth existing layout-specific generative models and other LLM-based methods.\n","authors":["Jiahao Zhang","Ryota Yoshihashi","Shunsuke Kitada","Atsuki Osanai","Yuta Nakashima"],"pdf_url":"https://arxiv.org/pdf/2412.04237v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04776v1","updated":"2024-12-06T04:39:41Z","published":"2024-12-06T04:39:41Z","title":"Megatron: Evasive Clean-Label Backdoor Attacks against Vision\n Transformer","summary":" Vision transformers have achieved impressive performance in various\nvision-related tasks, but their vulnerability to backdoor attacks is\nunder-explored. A handful of existing works focus on dirty-label attacks with\nwrongly-labeled poisoned training samples, which may fail if a benign model\ntrainer corrects the labels. In this paper, we propose Megatron, an evasive\nclean-label backdoor attack against vision transformers, where the attacker\ninjects the backdoor without manipulating the data-labeling process. To\ngenerate an effective trigger, we customize two loss terms based on the\nattention mechanism used in transformer networks, i.e., latent loss and\nattention diffusion loss. The latent loss aligns the last attention layer\nbetween triggered samples and clean samples of the target label. The attention\ndiffusion loss emphasizes the attention diffusion area that encompasses the\ntrigger. A theoretical analysis is provided to underpin the rationale behind\nthe attention diffusion loss. Extensive experiments on CIFAR-10, GTSRB,\nCIFAR-100, and Tiny ImageNet demonstrate the effectiveness of Megatron.\nMegatron can achieve attack success rates of over 90% even when the position of\nthe trigger is slightly shifted during testing. Furthermore, Megatron achieves\nbetter evasiveness than baselines regarding both human visual inspection and\ndefense strategies (i.e., DBAVT, BAVT, Beatrix, TeCo, and SAGE).\n","authors":["Xueluan Gong","Bowei Tian","Meng Xue","Shuike Li","Yanjiao Chen","Qian Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04776v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.09614v4","updated":"2024-12-06T04:35:45Z","published":"2023-11-16T06:58:46Z","title":"Comprehensive framework for evaluation of deep neural networks in\n detection and quantification of lymphoma from PET/CT images: clinical\n insights, pitfalls, and observer agreement analyses","summary":" This study addresses critical gaps in automated lymphoma segmentation from\nPET/CT images, focusing on issues often overlooked in existing literature.\nWhile deep learning has been applied for lymphoma lesion segmentation, few\nstudies incorporate out-of-distribution testing, raising concerns about model\ngeneralizability across diverse imaging conditions and patient populations. We\nhighlight the need to compare model performance with expert human annotators,\nincluding intra- and inter-observer variability, to understand task difficulty\nbetter. Most approaches focus on overall segmentation accuracy but overlook\nlesion-specific measures important for precise lesion detection and disease\nquantification. To address these gaps, we propose a clinically relevant\nframework for evaluating deep segmentation networks. 
Using this lesion\nmeasure-specific evaluation, we assess the performance of four deep networks\n(ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from\nmulti-institutional datasets, covering various lymphoma subtypes and lesion\ncharacteristics. Beyond standard metrics like the Dice similarity coefficient,\nwe evaluate clinical lesion measures and their prediction errors. We also\nintroduce detection criteria for lesion localization and propose a new\ndetection Criterion 3 based on metabolic characteristics. We show that networks\nperform better on large, intense lesions with higher metabolic activity.\nFinally, we compare network performance to physicians via intra- and\ninter-observer variability analyses, demonstrating that network errors closely\nresemble those made by experts, i.e., the small and faint lesions remain\nchallenging for both humans and networks. This study aims to improve automated\nlesion segmentation's clinical relevance, supporting better treatment decisions\nfor lymphoma patients. The code is available at:\nhttps://github.com/microsoft/lymphoma-segmentation-dnn.\n","authors":["Shadab Ahamed","Yixi Xu","Sara Kurkowska","Claire Gowdy","Joo H. O","Ingrid Bloise","Don Wilson","Patrick Martineau","François Bénard","Fereshteh Yousefirizi","Rahul Dodhia","Juan M. Lavista","William B. Weeks","Carlos F. Uribe","Arman Rahmim"],"pdf_url":"https://arxiv.org/pdf/2311.09614v4.pdf","comment":"32 pages, 15 figures, 5 tables"},{"id":"http://arxiv.org/abs/2412.04769v1","updated":"2024-12-06T04:31:09Z","published":"2024-12-06T04:31:09Z","title":"Revitalizing Reconstruction Models for Multi-class Anomaly Detection via\n Class-Aware Contrastive Learning","summary":" For anomaly detection (AD), early approaches often train separate models for\nindividual classes, yielding high performance but posing challenges in\nscalability and resource management. Recent efforts have shifted toward\ntraining a single model capable of handling multiple classes. However, directly\nextending early AD methods to multi-class settings often results in degraded\nperformance. In this paper, we analyze this degradation observed in\nreconstruction-based methods, identifying two key issues: catastrophic\nforgetting and inter-class confusion. To this end, we propose a plug-and-play\nmodification by incorporating class-aware contrastive learning (CL). By\nexplicitly leveraging raw object category information (e.g., carpet or wood) as\nsupervised signals, we apply local CL to fine-tune multiscale features and\nglobal CL to learn more compact feature representations of normal patterns,\nthereby effectively adapting the models to multi-class settings. Experiments\nacross four datasets (over 60 categories) verify the effectiveness of our\napproach, yielding significant improvements and superior performance compared\nto advanced methods. 
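Since the lymphoma evaluation framework above starts from the Dice similarity coefficient before moving to lesion-specific measures, a minimal NumPy version of Dice for binary masks is sketched below; the epsilon handling is an assumption and this is not the authors' evaluation code.

```python
# Dice similarity coefficient for binary masks (standard definition).
import numpy as np

def dice_coefficient(pred, gt, eps=1e-7):
    pred, gt = pred.astype(bool), gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    return (2.0 * intersection + eps) / (pred.sum() + gt.sum() + eps)
```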
Notably, ablation studies show that even using\npseudo-class labels can achieve comparable performance.\n","authors":["Lei Fan","Junjie Huang","Donglin Di","Anyang Su","Maurice Pagnucco","Yang Song"],"pdf_url":"https://arxiv.org/pdf/2412.04769v1.pdf","comment":"https://lgc-ad.github.io/"},{"id":"http://arxiv.org/abs/2407.02794v2","updated":"2024-12-06T04:24:02Z","published":"2024-07-03T03:42:33Z","title":"Euler's Elastica Based Cartoon-Smooth-Texture Image Decomposition","summary":" We propose a novel model for decomposing grayscale images into three distinct\ncomponents: the structural part, representing sharp boundaries and regions with\nstrong light-to-dark transitions; the smooth part, capturing soft shadows and\nshades; and the oscillatory part, characterizing textures and noise. To capture\nthe homogeneous structures, we introduce a combination of $L^0$-gradient and\ncurvature regularization on level lines. This new regularization term enforces\nstrong sparsity on the image gradient while reducing the undesirable staircase\neffects as well as preserving the geometry of contours. For the smoothly\nvarying component, we utilize the $L^2$-norm of the Laplacian that favors\nisotropic smoothness. To capture the oscillation, we use the inverse Sobolev\nseminorm. To solve the associated minimization problem, we design an efficient\noperator-splitting algorithm. Our algorithm effectively addresses the\nchallenging non-convex non-smooth problem by separating it into sub-problems.\nEach sub-problem can be solved either directly using closed-form solutions or\nefficiently using the Fast Fourier Transform (FFT). We provide systematic\nexperiments, including ablation and comparison studies, to analyze our model's\nbehaviors and demonstrate its effectiveness as well as efficiency.\n","authors":["Roy Y. He","Hao Liu"],"pdf_url":"https://arxiv.org/pdf/2407.02794v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04766v1","updated":"2024-12-06T04:18:49Z","published":"2024-12-06T04:18:49Z","title":"DAWN-SI: Data-Aware and Noise-Informed Stochastic Interpolation for\n Solving Inverse Problems","summary":" Inverse problems, which involve estimating parameters from incomplete or\nnoisy observations, arise in various fields such as medical imaging,\ngeophysics, and signal processing. These problems are often ill-posed,\nrequiring regularization techniques to stabilize the solution. In this work, we\nemploy $\\textit{Stochastic Interpolation}$ (SI), a generative framework that\nintegrates both deterministic and stochastic processes to map a simple\nreference distribution, such as a Gaussian, to the target distribution. Our\nmethod $\\textbf{DAWN-SI}$: $\\textbf{D}$ata-$\\textbf{AW}$are and\n$\\textbf{N}$oise-informed $\\textbf{S}$tochastic $\\textbf{I}$nterpolation\nincorporates data and noise embedding, allowing the model to access\nrepresentations about the measured data explicitly and also account for noise\nin the observations, making it particularly robust in scenarios where data is\nnoisy or incomplete. By learning a time-dependent velocity field, SI not only\nprovides accurate solutions but also enables uncertainty quantification by\ngenerating multiple plausible outcomes. Unlike pre-trained diffusion models,\nwhich may struggle in highly ill-posed settings, our approach is trained\nspecifically for each inverse problem and adapts to varying noise levels. 
We\nvalidate the effectiveness and robustness of our method through extensive\nnumerical experiments on tasks such as image deblurring and tomography.\n","authors":["Shadab Ahamed","Eldad Haber"],"pdf_url":"https://arxiv.org/pdf/2412.04766v1.pdf","comment":"20 pages, 11 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.03962v2","updated":"2024-12-06T04:11:24Z","published":"2024-12-05T08:26:13Z","title":"Local Curvature Smoothing with Stein's Identity for Efficient Score\n Matching","summary":" The training of score-based diffusion models (SDMs) is based on score\nmatching. The challenge of score matching is that it includes a computationally\nexpensive Jacobian trace. While several methods have been proposed to avoid\nthis computation, each has drawbacks, such as instability during training and\napproximating the learning as learning a denoising vector field rather than a\ntrue score. We propose a novel score matching variant, local curvature\nsmoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by\napplying Stein's identity, enabling regularization effectiveness and efficient\ncomputation. We show that LCSS surpasses existing methods in sample generation\nperformance and matches the performance of denoising score matching, widely\nadopted by most SDMs, in evaluations such as FID, Inception score, and bits per\ndimension. Furthermore, we show that LCSS enables realistic image generation\neven at a high resolution of $1024 \\times 1024$.\n","authors":["Genki Osada","Makoto Shing","Takashi Nishide"],"pdf_url":"https://arxiv.org/pdf/2412.03962v2.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2405.01124v4","updated":"2024-12-06T03:53:11Z","published":"2024-05-02T09:38:07Z","title":"Investigating Self-Supervised Image Denoising with Denaturation","summary":" Self-supervised learning for image denoising problems in the presence of\ndenaturation for noisy data is a crucial approach in machine learning. However,\ntheoretical understanding of the performance of the approach that uses\ndenatured data is lacking. To provide better understanding of the approach, in\nthis paper, we analyze a self-supervised denoising algorithm that uses\ndenatured data in depth through theoretical analysis and numerical experiments.\nThrough the theoretical analysis, we discuss that the algorithm finds desired\nsolutions to the optimization problem with the population risk, while the\nguarantee for the empirical risk depends on the hardness of the denoising task\nin terms of denaturation levels. We also conduct several experiments to\ninvestigate the performance of an extended algorithm in practice. The results\nindicate that the algorithm training with denatured images works, and the\nempirical performance aligns with the theoretical results. These results\nsuggest several insights for further improvement of self-supervised image\ndenoising that uses denatured data in future directions.\n","authors":["Hiroki Waida","Kimihiro Yamazaki","Atsushi Tokuhisa","Mutsuyo Wada","Yuichiro Wada"],"pdf_url":"https://arxiv.org/pdf/2405.01124v4.pdf","comment":"The PDF v3 has a wrong license, while v4 has a correct license"},{"id":"http://arxiv.org/abs/2412.04755v1","updated":"2024-12-06T03:40:21Z","published":"2024-12-06T03:40:21Z","title":"Latent Space Characterization of Autoencoder Variants","summary":" Understanding the latent spaces learned by deep learning models is crucial in\nexploring how they represent and generate complex data. 
Autoencoders (AEs) have\nplayed a key role in the area of representation learning, with numerous\nregularization techniques and training principles developed not only to enhance\ntheir ability to learn compact and robust representations, but also to reveal\nhow different architectures influence the structure and smoothness of the\nlower-dimensional non-linear manifold. We strive to characterize the structure\nof the latent spaces learned by different autoencoders including convolutional\nautoencoders (CAEs), denoising autoencoders (DAEs), and variational\nautoencoders (VAEs) and how they change with the perturbations in the input. By\ncharacterizing the matrix manifolds corresponding to the latent spaces, we\nprovide an explanation for the well-known observation that the latent spaces of\nCAE and DAE form non-smooth manifolds, while that of VAE forms a smooth\nmanifold. We also map the points of the matrix manifold to a Hilbert space\nusing distance preserving transforms and provide an alternate view in terms of\nthe subspaces generated in the Hilbert space as a function of the distortion in\nthe input. The results show that the latent manifolds of CAE and DAE are\nstratified with each stratum being a smooth product manifold, while the\nmanifold of VAE is a smooth product manifold of two symmetric positive definite\nmatrices and a symmetric positive semi-definite matrix.\n","authors":["Anika Shrivastava","Renu Rameshan","Samar Agnihotri"],"pdf_url":"https://arxiv.org/pdf/2412.04755v1.pdf","comment":"8 pages, 6 figures, and 1 table"},{"id":"http://arxiv.org/abs/2412.04749v1","updated":"2024-12-06T03:25:01Z","published":"2024-12-06T03:25:01Z","title":"Machine learning algorithms to predict the risk of rupture of\n intracranial aneurysms: a systematic review","summary":" Purpose: Subarachnoid haemorrhage is a potentially fatal consequence of\nintracranial aneurysm rupture, however, it is difficult to predict if aneurysms\nwill rupture. Prophylactic treatment of an intracranial aneurysm also involves\nrisk, hence identifying rupture-prone aneurysms is of substantial clinical\nimportance. This systematic review aims to evaluate the performance of machine\nlearning algorithms for predicting intracranial aneurysm rupture risk.\n Methods: MEDLINE, Embase, Cochrane Library and Web of Science were searched\nuntil December 2023. Studies incorporating any machine learning algorithm to\npredict the risk of rupture of an intracranial aneurysm were included. Risk of\nbias was assessed using the Prediction Model Risk of Bias Assessment Tool\n(PROBAST). PROSPERO registration: CRD42023452509. Results: Out of 10,307\nrecords screened, 20 studies met the eligibility criteria for this review\nincorporating a total of 20,286 aneurysm cases. The machine learning models\ngave a 0.66-0.90 range for performance accuracy. The models were compared to\ncurrent clinical standards in six studies and gave mixed results. Most studies\nposed high or unclear risks of bias and concerns for applicability, limiting\nthe inferences that can be drawn from them. There was insufficient homogenous\ndata for a meta-analysis.\n Conclusions: Machine learning can be applied to predict the risk of rupture\nfor intracranial aneurysms. However, the evidence does not comprehensively\ndemonstrate superiority to existing practice, limiting its role as a clinical\nadjunct. 
Further prospective multicentre studies of recent machine learning\ntools are needed to prove clinical validation before they are implemented in\nthe clinic.\n","authors":["Karan Daga","Siddharth Agarwal","Zaeem Moti","Matthew BK Lee","Munaib Din","David Wood","Marc Modat","Thomas C Booth"],"pdf_url":"https://arxiv.org/pdf/2412.04749v1.pdf","comment":"Clin Neuroradiol (2024)"},{"id":"http://arxiv.org/abs/2412.04748v1","updated":"2024-12-06T03:20:36Z","published":"2024-12-06T03:20:36Z","title":"Decomposed Distribution Matching in Dataset Condensation","summary":" Dataset Condensation (DC) aims to reduce deep neural networks training\nefforts by synthesizing a small dataset such that it will be as effective as\nthe original large dataset. Conventionally, DC relies on a costly bi-level\noptimization which prohibits its practicality. Recent research formulates DC as\na distribution matching problem which circumvents the costly bi-level\noptimization. However, this efficiency sacrifices the DC performance. To\ninvestigate this performance degradation, we decomposed the dataset\ndistribution into content and style. Our observations indicate two major\nshortcomings of: 1) style discrepancy between original and condensed data, and\n2) limited intra-class diversity of condensed dataset. We present a simple yet\neffective method to match the style information between original and condensed\ndata, employing statistical moments of feature maps as well-established style\nindicators. Moreover, we enhance the intra-class diversity by maximizing the\nKullback-Leibler divergence within each synthetic class, i.e., content. We\ndemonstrate the efficacy of our method through experiments on diverse datasets\nof varying size and resolution, achieving improvements of up to 4.1% on\nCIFAR10, 4.2% on CIFAR100, 4.3% on TinyImageNet, 2.0% on ImageNet-1K, 3.3% on\nImageWoof, 2.5% on ImageNette, and 5.5% in continual learning accuracy.\n","authors":["Sahar Rahimi Malakshan","Mohammad Saeed Ebrahimi Saadabadi","Ali Dabouei","Nasser M. Nasrabadi"],"pdf_url":"https://arxiv.org/pdf/2412.04748v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04739v1","updated":"2024-12-06T02:59:36Z","published":"2024-12-06T02:59:36Z","title":"Fair Diagnosis: Leveraging Causal Modeling to Mitigate Medical Bias","summary":" In medical image analysis, model predictions can be affected by sensitive\nattributes, such as race and gender, leading to fairness concerns and potential\nbiases in diagnostic outcomes. To mitigate this, we present a causal modeling\nframework, which aims to reduce the impact of sensitive attributes on\ndiagnostic predictions. Our approach introduces a novel fairness criterion,\n\\textbf{Diagnosis Fairness}, and a unique fairness metric, leveraging\npath-specific fairness to control the influence of demographic attributes,\nensuring that predictions are primarily informed by clinically relevant\nfeatures rather than sensitive attributes. By incorporating adversarial\nperturbation masks, our framework directs the model to focus on critical image\nregions, suppressing bias-inducing information. Experimental results across\nmultiple datasets demonstrate that our framework effectively reduces bias\ndirectly associated with sensitive attributes while preserving diagnostic\naccuracy. 
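The style-matching idea in the dataset condensation entry above relies on statistical moments of feature maps as style indicators; a hedged sketch of such a loss (channel-wise mean and standard deviation matching, with hypothetical tensor shapes and without the authors' exact weighting or KL-based diversity term) might look like this:

```python
# Illustrative style-statistics matching between real and condensed batches.
import torch

def channel_moments(feat):               # feat: (N, C, H, W) feature maps
    return feat.mean(dim=(2, 3)), feat.std(dim=(2, 3)) + 1e-6

def style_matching_loss(feat_real, feat_syn):
    mu_r, std_r = channel_moments(feat_real)
    mu_s, std_s = channel_moments(feat_syn)
    return ((mu_r.mean(0) - mu_s.mean(0)) ** 2).sum() + \
           ((std_r.mean(0) - std_s.mean(0)) ** 2).sum()
```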
Our findings suggest that causal modeling can enhance both fairness\nand interpretability in AI-powered clinical decision support systems.\n","authors":["Bowei Tian","Yexiao He","Meng Liu","Yucong Dai","Ziyao Wang","Shwai He","Guoheng Sun","Zheyu Shen","Wanghao Ye","Yongkai Wu","Ang Li"],"pdf_url":"https://arxiv.org/pdf/2412.04739v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.11299v3","updated":"2024-12-06T02:45:11Z","published":"2024-09-17T15:52:40Z","title":"TTT-Unet: Enhancing U-Net with Test-Time Training Layers for Biomedical\n Image Segmentation","summary":" Biomedical image segmentation is crucial for accurately diagnosing and\nanalyzing various diseases. However, Convolutional Neural Networks (CNNs) and\nTransformers, the most commonly used architectures for this task, struggle to\neffectively capture long-range dependencies due to the inherent locality of\nCNNs and the computational complexity of Transformers. To address this\nlimitation, we introduce TTT-Unet, a novel framework that integrates Test-Time\nTraining (TTT) layers into the traditional U-Net architecture for biomedical\nimage segmentation. TTT-Unet dynamically adjusts model parameters during the\ntesting time, enhancing the model's ability to capture both local and\nlong-range features. We evaluate TTT-Unet on multiple medical imaging datasets,\nincluding 3D abdominal organ segmentation in CT and MR images, instrument\nsegmentation in endoscopy images, and cell segmentation in microscopy images.\nThe results demonstrate that TTT-Unet consistently outperforms state-of-the-art\nCNN-based and Transformer-based segmentation models across all tasks. The code\nis available at https://github.com/rongzhou7/TTT-Unet.\n","authors":["Rong Zhou","Zhengqing Yuan","Zhiling Yan","Weixiang Sun","Kai Zhang","Yiwei Li","Yanfang Ye","Xiang Li","Lifang He","Lichao Sun"],"pdf_url":"https://arxiv.org/pdf/2409.11299v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04729v1","updated":"2024-12-06T02:39:50Z","published":"2024-12-06T02:39:50Z","title":"Espresso: High Compression For Rich Extraction From Videos for Your\n Vision-Language Model","summary":" Most of the current vision-language models (VLMs) for videos struggle to\nunderstand videos longer than a few seconds. This is primarily due to the fact\nthat they do not scale to utilizing a large number of frames. In order to\naddress this limitation, we propose Espresso, a novel method that extracts and\ncompresses spatial and temporal information separately. Through extensive\nevaluations, we show that spatial and temporal compression in Espresso each\nhave a positive impact on the long-form video understanding capabilities; when\ncombined, their positive impact increases. Furthermore, we show that Espresso's\nperformance scales well with more training data, and that Espresso is far more\neffective than the existing projectors for VLMs in long-form video\nunderstanding. Moreover, we devise a more difficult evaluation setting for\nEgoSchema called \"needle-in-a-haystack\" that multiplies the lengths of the\ninput videos. 
Espresso achieves SOTA performance on this task, outperforming\nthe SOTA VLMs that have been trained on much more training data.\n","authors":["Keunwoo Peter Yu","Achal Dave","Rares Ambrus","Jean Mercat"],"pdf_url":"https://arxiv.org/pdf/2412.04729v1.pdf","comment":"11 pages"},{"id":"http://arxiv.org/abs/2412.04727v1","updated":"2024-12-06T02:35:44Z","published":"2024-12-06T02:35:44Z","title":"Learning to Translate Noise for Robust Image Denoising","summary":" Deep learning-based image denoising techniques often struggle with poor\ngeneralization performance to out-of-distribution real-world noise. To tackle\nthis challenge, we propose a novel noise translation framework that performs\ndenoising on an image with translated noise rather than directly denoising an\noriginal noisy image. Specifically, our approach translates complex, unknown\nreal-world noise into Gaussian noise, which is spatially uncorrelated and\nindependent of image content, through a noise translation network. The\ntranslated noisy images are then processed by an image denoising network\npretrained to effectively remove Gaussian noise, enabling robust and consistent\ndenoising performance. We also design well-motivated loss functions and\narchitectures for the noise translation network by leveraging the mathematical\nproperties of Gaussian noise. Experimental results demonstrate that the\nproposed method substantially improves robustness and generalizability,\noutperforming state-of-the-art methods across diverse benchmarks. Visualized\ndenoising results and the source code are available on our project page.\n","authors":["Inju Ha","Donghun Ryou","Seonguk Seo","Bohyung Han"],"pdf_url":"https://arxiv.org/pdf/2412.04727v1.pdf","comment":"The project page is available at\n https://hij1112.github.io/learning-to-translate-noise/"},{"id":"http://arxiv.org/abs/2411.16169v2","updated":"2024-12-06T02:32:23Z","published":"2024-11-25T07:55:57Z","title":"Local and Global Feature Attention Fusion Network for Face Recognition","summary":" Recognition of low-quality face images remains a challenge due to invisible\nor deformed partial facial regions. For low-quality images dominated by\nmissing partial facial regions, local region similarity contributes more to\nface recognition (FR). Conversely, in cases dominated by local face\ndeformation, excessive attention to local regions may lead to misjudgments,\nwhile global features exhibit better robustness. However, most of the existing\nFR methods neglect the bias in feature quality of low-quality images introduced\nby different factors. To address this issue, we propose a Local and Global\nFeature Attention Fusion (LGAF) network based on feature quality. The network\nadaptively allocates attention between local and global features according to\nfeature quality and obtains more discriminative and high-quality face features\nthrough local and global information complementarity. In addition, to\neffectively obtain fine-grained information at various scales and increase the\nseparability of facial features in high-dimensional space, we introduce a\nMulti-Head Multi-Scale Local Feature Extraction (MHMS) module. 
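Conceptually, the noise-translation entry above decouples noise mapping from denoising; at inference time this reduces to two stages, sketched below with hypothetical pretrained modules (`translator`, `gaussian_denoiser`) that are assumptions rather than the authors' released models.

```python
# Two-stage inference sketch: translate unknown real-world noise towards
# Gaussian noise, then apply a denoiser pretrained for Gaussian removal.
import torch

def denoise_via_noise_translation(noisy, translator, gaussian_denoiser):
    with torch.no_grad():
        translated = translator(noisy)        # noise now approx. Gaussian
        return gaussian_denoiser(translated)
```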
Experimental\nresults demonstrate that the LGAF achieves the best average performance on $4$\nvalidation sets (CFP-FP, CPLFW, AgeDB, and CALFW), and its performance on\nTinyFace and SCFace surpasses that of state-of-the-art (SoTA) methods.\n","authors":["Wang Yu","Wei Wei"],"pdf_url":"https://arxiv.org/pdf/2411.16169v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00091v2","updated":"2024-12-06T02:26:29Z","published":"2024-11-27T12:41:23Z","title":"Graph Canvas for Controllable 3D Scene Generation","summary":" Spatial intelligence is foundational to AI systems that interact with the\nphysical world, particularly in 3D scene generation and spatial comprehension.\nCurrent methodologies for 3D scene generation often rely heavily on predefined\ndatasets, and struggle to adapt dynamically to changing spatial relationships.\nIn this paper, we introduce GraphCanvas3D, a programmable, extensible, and\nadaptable framework for controllable 3D scene generation. Leveraging in-context\nlearning, GraphCanvas3D enables dynamic adaptability without the need for\nretraining, supporting flexible and customizable scene creation. Our framework\nemploys hierarchical, graph-driven scene descriptions, representing spatial\nelements as graph nodes and establishing coherent relationships among objects\nin 3D environments. Unlike conventional approaches, which are constrained in\nadaptability and often require predefined input masks or retraining for\nmodifications, GraphCanvas3D allows for seamless object manipulation and scene\nadjustments on the fly. Additionally, GraphCanvas3D supports 4D scene\ngeneration, incorporating temporal dynamics to model changes over time.\nExperimental results and user studies demonstrate that GraphCanvas3D enhances\nusability, flexibility, and adaptability for scene generation. Our code and\nmodels are available on the project website:\nhttps://github.com/ILGLJ/Graph-Canvas.\n","authors":["Libin Liu","Shen Chen","Sen Jia","Jingzhe Shi","Zhongyu Jiang","Can Jin","Wu Zongkai","Jenq-Neng Hwang","Lei Li"],"pdf_url":"https://arxiv.org/pdf/2412.00091v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03704v2","updated":"2024-12-06T02:21:48Z","published":"2024-12-04T20:35:07Z","title":"Scaling Inference-Time Search with Vision Value Model for Improved\n Visual Comprehension","summary":" Despite significant advancements in vision-language models (VLMs), effective\napproaches to enhance response quality by scaling inference-time computation\nare still lacking. This capability is known to be a core step towards\nself-improving models in recent large language model studies. In this\npaper, we present the Vision Value Model (VisVM), which can guide VLM inference-time\nsearch to generate responses with better visual comprehension. Specifically,\nVisVM not only evaluates the generated sentence quality in the current search\nstep, but also anticipates the quality of subsequent sentences that may result\nfrom the current step, thus providing a long-term value. In this way, VisVM\nsteers VLMs away from generating sentences prone to hallucinations or\ninsufficient detail, thereby producing higher quality responses. Experimental\nresults demonstrate that VisVM-guided search significantly enhances VLMs'\nability to generate descriptive captions with richer visual details and fewer\nhallucinations, compared with greedy decoding and search methods with other\nvisual reward signals. 
Furthermore, we find that self-training the model with\nthe VisVM-guided captions improves the VLM's performance across a wide range of\nmultimodal benchmarks, indicating the potential for developing self-improving\nVLMs. Our value model and code are available at\nhttps://github.com/si0wang/VisVM.\n","authors":["Xiyao Wang","Zhengyuan Yang","Linjie Li","Hongjin Lu","Yuancheng Xu","Chung-Ching Lin","Kevin Lin","Furong Huang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03704v2.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.05248v1","updated":"2024-12-06T18:27:15Z","published":"2024-12-06T18:27:15Z","title":"Enhancing FKG.in: automating Indian food composition analysis","summary":" This paper presents a novel approach to compute food composition data for\nIndian recipes using a knowledge graph for Indian food (FKG.in) and LLMs. The\nprimary focus is to provide a broad overview of an automated food composition\nanalysis workflow and describe its core functionalities: nutrition data\naggregation, food composition analysis, and LLM-augmented information\nresolution. This workflow aims to complement FKG.in and iteratively supplement\nfood composition data from verified knowledge bases. Additionally, this paper\nhighlights the challenges of representing Indian food and accessing food\ncomposition data digitally. It also reviews three key sources of food\ncomposition data: the Indian Food Composition Tables, the Indian Nutrient\nDatabank, and the Nutritionix API. Furthermore, it briefly outlines how users\ncan interact with the workflow to obtain diet-based health recommendations and\ndetailed food composition information for numerous recipes. We then explore the\ncomplex challenges of analyzing Indian recipe information across dimensions\nsuch as structure, multilingualism, and uncertainty as well as present our\nongoing work on LLM-based solutions to address these issues. The methods\nproposed in this workshop paper for AI-driven knowledge curation and\ninformation resolution are application-agnostic, generalizable, and replicable\nfor any domain.\n","authors":["Saransh Kumar Gupta","Lipika Dey","Partha Pratim Das","Geeta Trilok-Kumar","Ramesh Jain"],"pdf_url":"https://arxiv.org/pdf/2412.05248v1.pdf","comment":"15 pages, 3 figures, 30 references, International Conference on\n Pattern Recognition 2024 - Multimedia Assisted Dietary Management Workshop"},{"id":"http://arxiv.org/abs/2412.05206v1","updated":"2024-12-06T17:35:52Z","published":"2024-12-06T17:35:52Z","title":"ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented\n Argumentation with LLM Judges","summary":" Computational argumentation, which involves generating answers or summaries\nfor controversial topics like abortion bans and vaccination, has become\nincreasingly important in today's polarized environment. Sophisticated LLM\ncapabilities offer the potential to provide nuanced, evidence-based answers to\nsuch questions through Retrieval-Augmented Argumentation (RAArg), leveraging\nreal-world evidence for high-quality, grounded arguments. However, evaluating\nRAArg remains challenging, as human evaluation is costly and difficult for\ncomplex, lengthy answers on complicated topics. 
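As a rough illustration of the kind of inference-time search that a value model such as VisVM (above) can guide, one can imagine a sentence-level candidate-selection loop like the following. Here `vlm.sample_sentence` and `value_model.score` are hypothetical interfaces introduced only for the sketch, not the released API, and the actual search procedure may differ.

```python
# Hedged sketch of value-guided, sentence-by-sentence decoding.
def value_guided_sentence_step(vlm, value_model, image, prefix, n_candidates=4):
    candidates = [vlm.sample_sentence(image, prefix) for _ in range(n_candidates)]
    scores = [value_model.score(image, prefix + c) for c in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return prefix + candidates[best]
```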
At the same time, re-using\nexisting argumentation datasets is no longer sufficient, as they lack long,\ncomplex arguments and realistic evidence from potentially misleading sources,\nlimiting holistic evaluation of retrieval effectiveness and argument quality.\nTo address these gaps, we investigate automated evaluation methods using\nmultiple fine-grained LLM judges, providing better and more interpretable\nassessments than traditional single-score metrics and even previously reported\nhuman crowdsourcing. To validate the proposed techniques, we introduce ConQRet,\na new benchmark featuring long and complex human-authored arguments on debated\ntopics, grounded in real-world websites, allowing an exhaustive evaluation\nacross retrieval effectiveness, argument quality, and groundedness. We validate\nour LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed\nLLM Judges and the ConQRet benchmark can enable rapid progress in computational\nargumentation and can be naturally extended to other complex\nretrieval-augmented generation tasks.\n","authors":["Kaustubh D. Dhole","Kai Shu","Eugene Agichtein"],"pdf_url":"https://arxiv.org/pdf/2412.05206v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.09439v2","updated":"2024-12-06T12:09:15Z","published":"2024-08-18T11:07:38Z","title":"Towards Boosting LLMs-driven Relevance Modeling with Progressive\n Retrieved Behavior-augmented Prompting","summary":" Relevance modeling is a critical component for enhancing user experience in\nsearch engines, with the primary objective of identifying items that align with\nusers' queries. Traditional models only rely on the semantic congruence between\nqueries and items to ascertain relevance. However, this approach represents\nmerely one aspect of the relevance judgement, and is insufficient in isolation.\nEven powerful Large Language Models (LLMs) still cannot accurately judge the\nrelevance of a query and an item from a semantic perspective. To augment\nLLMs-driven relevance modeling, this study proposes leveraging user\ninteractions recorded in search logs to yield insights into users' implicit\nsearch intentions. The challenge lies in the effective prompting of LLMs to\ncapture dynamic search intentions, which poses several obstacles in real-world\nrelevance scenarios, i.e., the absence of domain-specific knowledge, the\ninadequacy of an isolated prompt, and the prohibitive costs associated with\ndeploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved\nBehavior-augmented Prompting framework for integrating search scenario-oriented\nknowledge with LLMs effectively. Specifically, we perform the user-driven\nbehavior neighbors retrieval from the daily search logs to obtain\ndomain-specific knowledge in time, retrieving candidates that users consider to\nmeet their expectations. Then, we guide LLMs for relevance modeling by\nemploying advanced prompting techniques that progressively improve the outputs\nof the LLMs, followed by a progressive aggregation with comprehensive\nconsideration of diverse aspects. For online serving, we have developed an\nindustrial application framework tailored for the deployment of LLMs in\nrelevance modeling. 
Experiments on real-world industry data and online A/B\ntesting demonstrate our proposal achieves promising performance.\n","authors":["Zeyuan Chen","Haiyan Wu","Kaixin Wu","Wei Chen","Mingjie Zhong","Jia Xu","Zhongyi Liu","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2408.09439v2.pdf","comment":"Accepted By COLING 2025"},{"id":"http://arxiv.org/abs/2412.04846v1","updated":"2024-12-06T08:33:49Z","published":"2024-12-06T08:33:49Z","title":"eXpath: Explaining Knowledge Graph Link Prediction with Ontological\n Closed Path Rules","summary":" Link prediction (LP) is crucial for Knowledge Graphs (KG) completion but\ncommonly suffers from interpretability issues. While several methods have been\nproposed to explain embedding-based LP models, they are generally limited to\nlocal explanations on KG and are deficient in providing human interpretable\nsemantics. Based on real-world observations of the characteristics of KGs from\nmultiple domains, we propose to explain LP models in KG with path-based\nexplanations. An integrated framework, namely eXpath, is introduced which\nincorporates the concept of relation path with ontological closed path rules to\nenhance both the efficiency and effectiveness of LP interpretation. Notably,\nthe eXpath explanations can be fused with other single-link explanation\napproaches to achieve a better overall solution. Extensive experiments across\nbenchmark datasets and LP models demonstrate that introducing eXpath can boost\nthe quality of resulting explanations by about 20% on two key metrics and\nreduce the required explanation time by 61.4%, in comparison to the best\nexisting method. Case studies further highlight eXpath's ability to provide\nmore semantically meaningful explanations through path-based evidence.\n","authors":["Ye Sun","Lei Shi","Yongxin Tong"],"pdf_url":"https://arxiv.org/pdf/2412.04846v1.pdf","comment":"13 pages, 5 figures. Submitted to PVLDB volumn 18 on 20241201"},{"id":"http://arxiv.org/abs/2401.13509v2","updated":"2024-12-06T05:54:55Z","published":"2024-01-24T15:06:44Z","title":"TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient\n and Effective Retrieval","summary":" This paper considers Pseudo-Relevance Feedback (PRF) methods for dense\nretrievers in a resource constrained environment such as that of cheap cloud\ninstances or embedded systems (e.g., smartphones and smartwatches), where\nmemory and CPU are limited and GPUs are not present. For this, we propose a\ntransformer-based PRF method (TPRF), which has a much smaller memory footprint\nand faster inference time compared to other deep language models that employ\nPRF mechanisms, with a marginal effectiveness loss. TPRF learns how to\neffectively combine the relevance feedback signals from dense passage\nrepresentations. Specifically, TPRF provides a mechanism for modelling\nrelationships and weights between the query and the relevance feedback signals.\nThe method is agnostic to the specific dense representation used and thus can\nbe generally applied to any dense retriever.\n","authors":["Hang Li","Chuting Yu","Ahmed Mourad","Bevan Koopman","Guido Zuccon"],"pdf_url":"https://arxiv.org/pdf/2401.13509v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.17740v3","updated":"2024-12-06T05:26:40Z","published":"2024-03-26T14:29:34Z","title":"All-in-One: Heterogeneous Interaction Modeling for Cold-Start Rating\n Prediction","summary":" Cold-start rating prediction is a fundamental problem in recommender systems\nthat has been extensively studied. 
Many methods have been proposed that exploit\nexplicit relations among existing data, such as collaborative filtering, social\nrecommendations and heterogeneous information networks, to alleviate the data\ninsufficiency issue for cold-start users and items. However, the explicit\nrelations constructed based on data between different roles may be unreliable\nand irrelevant, which limits the performance ceiling of the specific\nrecommendation task. Motivated by this, in this paper, we propose a flexible\nframework dubbed heterogeneous interaction rating network (HIRE). HIRE does not\nsolely rely on the pre-defined interaction pattern or the manually constructed\nheterogeneous information network. Instead, we devise a Heterogeneous\nInteraction Module (HIM) to jointly model the heterogeneous interactions and\ndirectly infer the important interactions via the observed data. In the\nexperiments, we evaluate our model under three cold-start settings on three\nreal-world datasets. The experimental results show that HIRE outperforms other\nbaselines by a large margin. Furthermore, we visualize the inferred\ninteractions of HIRE to confirm the contribution of our model.\n","authors":["Shuheng Fang","Kangfei Zhao","Yu Rong","Zhixun Li","Jeffrey Xu Yu"],"pdf_url":"https://arxiv.org/pdf/2403.17740v3.pdf","comment":"14 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.04746v1","updated":"2024-12-06T03:18:18Z","published":"2024-12-06T03:18:18Z","title":"Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval\n with Semantic Guidance","summary":" Modern music retrieval systems often rely on fixed representations of user\npreferences, limiting their ability to capture users' diverse and uncertain\nretrieval needs. To address this limitation, we introduce Diff4Steer, a novel\ngenerative retrieval framework that employs lightweight diffusion models to\nsynthesize diverse seed embeddings from user queries that represent potential\ndirections for music exploration. Unlike deterministic methods that map a user\nquery to a single point in embedding space, Diff4Steer provides a statistical\nprior on the target modality (audio) for retrieval, effectively capturing the\nuncertainty and multi-faceted nature of user preferences. Furthermore,\nDiff4Steer can be steered by image or text inputs, enabling more flexible and\ncontrollable music discovery combined with nearest neighbor search. Our\nframework outperforms deterministic regression methods and LLM-based generative\nretrieval baselines in terms of retrieval and ranking metrics, demonstrating its\neffectiveness in capturing user preferences, leading to more diverse and\nrelevant recommendations. Listening examples are available at\ntinyurl.com/diff4steer.\n","authors":["Xuchan Bao","Judith Yue Li","Zhong Yi Wan","Kun Su","Timo Denk","Joonseok Lee","Dima Kuzmin","Fei Sha"],"pdf_url":"https://arxiv.org/pdf/2412.04746v1.pdf","comment":"NeurIPS 2024 Creative AI Track"}],"Machine Learning":[{"id":"http://arxiv.org/abs/2412.05280v1","updated":"2024-12-06T18:59:56Z","published":"2024-12-06T18:59:56Z","title":"Stag-1: Towards Realistic 4D Driving Simulation with Video Generation\n Model","summary":" 4D driving simulation is essential for developing realistic autonomous\ndriving simulators. Despite advancements in existing methods for generating\ndriving scenes, significant challenges remain in view transformation and\nspatial-temporal dynamic modeling. 
To address these limitations, we propose a\nSpatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct\nreal-world scenes and design a controllable generative network to achieve 4D\nsimulation. Stag-1 constructs continuous 4D point cloud scenes using\nsurround-view data from autonomous vehicles. It decouples spatial-temporal\nrelationships and produces coherent keyframe videos. Additionally, Stag-1\nleverages video generation models to obtain photo-realistic and controllable 4D\ndriving simulation videos from any perspective. To expand the range of view\ngeneration, we train vehicle motion videos based on decomposed camera poses,\nenhancing modeling capabilities for distant scenes. Furthermore, we reconstruct\nvehicle camera trajectories to integrate 3D points across consecutive views,\nenabling comprehensive scene understanding along the temporal dimension.\nFollowing extensive multi-level scene training, Stag-1 can simulate from any\ndesired viewpoint and achieve a deep understanding of scene evolution under\nstatic spatial-temporal conditions. Compared to existing methods, our approach\nshows promising performance in multi-view scene consistency, background\ncoherence, and accuracy, and contributes to the ongoing advancements in\nrealistic autonomous driving simulation. Code: https://github.com/wzzheng/Stag.\n","authors":["Lening Wang","Wenzhao Zheng","Dalong Du","Yunpeng Zhang","Yilong Ren","Han Jiang","Zhiyong Cui","Haiyang Yu","Jie Zhou","Jiwen Lu","Shanghang Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.05280v1.pdf","comment":"Code is available at: https://github.com/wzzheng/Stag"},{"id":"http://arxiv.org/abs/2412.05276v1","updated":"2024-12-06T18:59:51Z","published":"2024-12-06T18:59:51Z","title":"Sparse autoencoders reveal selective remapping of visual concepts during\n adaptation","summary":" Adapting foundation models for specific purposes has become a standard\napproach to build machine learning systems for downstream applications. Yet, it\nis an open question which mechanisms take place during adaptation. Here we\ndevelop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named\nPatchSAE, to extract interpretable concepts at granular levels (e.g. shape,\ncolor, or semantics of an object) and their patch-wise spatial attributions. We\nexplore how these concepts influence the model output in downstream image\nclassification tasks and investigate how recent state-of-the-art prompt-based\nadaptation techniques change the association of model inputs to these concepts.\nWhile activations of concepts slightly change between adapted and non-adapted\nmodels, we find that the majority of gains on common adaptation tasks can be\nexplained with the existing concepts already present in the non-adapted\nfoundation model. 
This work provides a concrete framework to train and use SAEs\nfor Vision Transformers and provides insights into explaining adaptation\nmechanisms.\n","authors":["Hyesu Lim","Jinho Choi","Jaegul Choo","Steffen Schneider"],"pdf_url":"https://arxiv.org/pdf/2412.05276v1.pdf","comment":"A demo is available at github.com/dynamical-inference/patchsae"},{"id":"http://arxiv.org/abs/2406.06818v5","updated":"2024-12-06T18:56:05Z","published":"2024-06-10T22:01:34Z","title":"Conformal Prediction for Class-wise Coverage via Augmented Label Rank\n Calibration","summary":" Conformal prediction (CP) is an emerging uncertainty quantification framework\nthat allows us to construct a prediction set to cover the true label with a\npre-specified marginal or conditional probability. Although the valid coverage\nguarantee has been extensively studied for classification problems, CP often\nproduces large prediction sets which may not be practically useful. This issue\nis exacerbated for the setting of class-conditional coverage on imbalanced\nclassification tasks with many and/or imbalanced classes. This paper proposes\nthe Rank Calibrated Class-conditional CP (RC3P) algorithm to reduce the\nprediction set sizes to achieve class-conditional coverage, where the valid\ncoverage holds for each class. In contrast to the standard class-conditional CP\n(CCP) method that uniformly thresholds the class-wise conformity score for each\nclass, the augmented label rank calibration step allows RC3P to selectively\niterate this class-wise thresholding subroutine only for a subset of classes\nwhose class-wise top-k error is small. We prove that agnostic to the classifier\nand data distribution, RC3P achieves class-wise coverage. We also show that\nRC3P reduces the size of prediction sets compared to the CCP method.\nComprehensive experiments on multiple real-world datasets demonstrate that RC3P\nachieves class-wise coverage and 26.25% reduction in prediction set sizes on\naverage.\n","authors":["Yuanjie Shi","Subhankar Ghosh","Taha Belkhouja","Janardhan Rao Doppa","Yan Yan"],"pdf_url":"https://arxiv.org/pdf/2406.06818v5.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05270v1","updated":"2024-12-06T18:55:34Z","published":"2024-12-06T18:55:34Z","title":"APOLLO: SGD-like Memory, AdamW-level Performance","summary":" Large language models (LLMs) are notoriously memory-intensive during\ntraining, particularly with the popular AdamW optimizer. This memory burden\nnecessitates using more or higher-end GPUs or reducing batch sizes, limiting\ntraining scalability and throughput. To address this, various memory-efficient\noptimizers have been proposed to reduce optimizer memory usage. However, they\nface critical challenges: (i) reliance on costly SVD operations; (ii)\nsignificant performance trade-offs compared to AdamW; and (iii) still\nsubstantial optimizer memory overhead to maintain competitive performance.\n In this work, we identify that AdamW's learning rate adaptation rule can be\neffectively coarsened as a structured learning rate update. Based on this\ninsight, we propose Approximated Gradient Scaling for Memory-Efficient LLM\nOptimization (APOLLO), which approximates learning rate scaling using an\nauxiliary low-rank optimizer state based on pure random projection. This\nstructured learning rate update rule makes APOLLO highly tolerant to further\nmemory reductions while delivering comparable pre-training performance. 
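The class-conditional conformal prediction (CCP) setup that RC3P (above) improves upon can be summarized with per-class quantile thresholds. The sketch below shows only the standard CCP baseline; the 1 − softmax conformity score is an assumption, and RC3P's additional top-k label-rank calibration is not reproduced here.

```python
# Standard class-conditional conformal prediction (CCP) baseline sketch.
import numpy as np

def classwise_thresholds(cal_probs, cal_labels, alpha=0.1):
    """Per-class conformal thresholds on the score s = 1 - p_true."""
    thresholds = {}
    for c in np.unique(cal_labels):
        scores = 1.0 - cal_probs[cal_labels == c, c]
        n = len(scores)
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        thresholds[c] = np.quantile(scores, level)
    return thresholds

def prediction_set(probs, thresholds):
    return [c for c, t in thresholds.items() if 1.0 - probs[c] <= t]
```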
Even\nits rank-1 variant, APOLLO-Mini, achieves superior pre-training performance\ncompared to AdamW with SGD-level memory costs.\n Extensive experiments demonstrate that the APOLLO series performs on-par with\nor better than AdamW, while achieving greater memory savings by nearly\neliminating the optimization states of AdamW. These savings provide significant\nsystem-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB\nsetup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model\nScalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without\nsystem-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training\nLLaMA-7B on a single GPU using less than 12 GB of memory with weight\nquantization.\n","authors":["Hanqing Zhu","Zhenyu Zhang","Wenyan Cong","Xi Liu","Sem Park","Vikas Chandra","Bo Long","David Z. Pan","Zhangyang Wang","Jinwon Lee"],"pdf_url":"https://arxiv.org/pdf/2412.05270v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2412.05269v1","updated":"2024-12-06T18:55:19Z","published":"2024-12-06T18:55:19Z","title":"Chimera: Accurate retrosynthesis prediction by ensembling models with\n diverse inductive biases","summary":" Planning and conducting chemical syntheses remains a major bottleneck in the\ndiscovery of functional small molecules, and prevents fully leveraging\ngenerative AI for molecular inverse design. While early work has shown that\nML-based retrosynthesis models can predict reasonable routes, their low\naccuracy for less frequent, yet important reactions has been pointed out. As\nmulti-step search algorithms are limited to reactions suggested by the\nunderlying model, the applicability of those tools is inherently constrained by\nthe accuracy of retrosynthesis prediction. Inspired by how chemists use\ndifferent strategies to ideate reactions, we propose Chimera: a framework for\nbuilding highly accurate reaction models that combine predictions from diverse\nsources with complementary inductive biases using a learning-based ensembling\nstrategy. We instantiate the framework with two newly developed models, which\nalready by themselves achieve state of the art in their categories. Through\nexperiments across several orders of magnitude in data scale and time-splits,\nwe show Chimera outperforms all major models by a large margin, owing both to\nthe good individual performance of its constituents, but also to the\nscalability of our ensembling strategy. Moreover, we find that PhD-level\norganic chemists prefer predictions from Chimera over baselines in terms of\nquality. Finally, we transfer the largest-scale checkpoint to an internal\ndataset from a major pharmaceutical company, showing robust generalization\nunder distribution shift. 
With the new dimension that our framework unlocks, we\nanticipate further acceleration in the development of even more accurate\nmodels.\n","authors":["Krzysztof Maziarz","Guoqing Liu","Hubert Misztela","Aleksei Kornev","Piotr Gaiński","Holger Hoefling","Mike Fortunato","Rishi Gupta","Marwin Segler"],"pdf_url":"https://arxiv.org/pdf/2412.05269v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05265v1","updated":"2024-12-06T18:53:49Z","published":"2024-12-06T18:53:49Z","title":"Reinforcement Learning: An Overview","summary":" This manuscript gives a big-picture, up-to-date overview of the field of\n(deep) reinforcement learning and sequential decision making, covering\nvalue-based RL, policy-gradient methods, model-based methods, and various other\ntopics (including a very brief discussion of RL+LLMs).\n","authors":["Kevin Murphy"],"pdf_url":"https://arxiv.org/pdf/2412.05265v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.17647v2","updated":"2024-12-06T18:52:25Z","published":"2024-10-23T08:04:12Z","title":"Entity-based Reinforcement Learning for Autonomous Cyber Defence","summary":" A significant challenge for autonomous cyber defence is ensuring a defensive\nagent's ability to generalise across diverse network topologies and\nconfigurations. This capability is necessary for agents to remain effective\nwhen deployed in dynamically changing environments, such as an enterprise\nnetwork where devices may frequently join and leave. Standard approaches to\ndeep reinforcement learning, where policies are parameterised using a\nfixed-input multi-layer perceptron (MLP) expect fixed-size observation and\naction spaces. In autonomous cyber defence, this makes it hard to develop\nagents that generalise to environments with network topologies different from\nthose trained on, as the number of nodes affects the natural size of the\nobservation and action spaces. To overcome this limitation, we reframe the\nproblem of autonomous network defence using entity-based reinforcement\nlearning, where the observation and action space of an agent are decomposed\ninto a collection of discrete entities. This framework enables the use of\npolicy parameterisations specialised in compositional generalisation. We train\na Transformer-based policy on the Yawning Titan cyber-security simulation\nenvironment and test its generalisation capabilities across various network\ntopologies. We demonstrate that this approach significantly outperforms an\nMLP-based policy when training across fixed-size networks of varying\ntopologies, and matches performance when training on a single network. We also\ndemonstrate the potential for zero-shot generalisation to networks of a\ndifferent size to those seen in training. 
These findings highlight the\npotential for entity-based reinforcement learning to advance the field of\nautonomous cyber defence by providing more generalisable policies capable of\nhandling variations in real-world network environments.\n","authors":["Isaac Symes Thompson","Alberto Caron","Chris Hicks","Vasilios Mavroudis"],"pdf_url":"https://arxiv.org/pdf/2410.17647v2.pdf","comment":"Material also appearing in the proceedings of the 1st International\n Workshop on Autonomous Cybersecurity at ACM CCS 2024"},{"id":"http://arxiv.org/abs/2406.15881v2","updated":"2024-12-06T18:41:56Z","published":"2024-06-22T16:05:34Z","title":"Fast Tree-Field Integrators: From Low Displacement Rank to Topological\n Transformers","summary":" We present a new class of fast polylog-linear algorithms based on the theory\nof structured matrices (in particular low displacement rank) for integrating\ntensor fields defined on weighted trees. Several applications of the resulting\nfast tree-field integrators (FTFIs) are presented, including (a) approximation\nof graph metrics with tree metrics, (b) graph classification, (c) modeling on\nmeshes, and finally (d) Topological Transformers (TTs) (Choromanski et al.,\n2022) for images. For Topological Transformers, we propose new relative\nposition encoding (RPE) masking mechanisms with as few as three extra learnable\nparameters per Transformer layer, leading to 1.0-1.5%+ accuracy gains.\nImportantly, most of FTFIs are exact methods, thus numerically equivalent to\ntheir brute-force counterparts. When applied to graphs with thousands of nodes,\nthose exact algorithms provide 5.7-13x speedups. We also provide an extensive\ntheoretical analysis of our methods.\n","authors":["Krzysztof Choromanski","Arijit Sehanobish","Somnath Basu Roy Chowdhury","Han Lin","Avinava Dubey","Tamas Sarlos","Snigdha Chaturvedi"],"pdf_url":"https://arxiv.org/pdf/2406.15881v2.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.05256v1","updated":"2024-12-06T18:41:39Z","published":"2024-12-06T18:41:39Z","title":"Extrapolated Urban View Synthesis Benchmark","summary":" Photorealistic simulators are essential for the training and evaluation of\nvision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis\n(NVS), a crucial capability that generates diverse unseen viewpoints to\naccommodate the broad and continuous pose distribution of AVs. Recent advances\nin radiance fields, such as 3D Gaussian Splatting, achieve photorealistic\nrendering at real-time speeds and have been widely used in modeling large-scale\ndriving scenes. However, their performance is commonly evaluated using an\ninterpolated setup with highly correlated training and test views. In contrast,\nextrapolation, where test views largely deviate from training views, remains\nunderexplored, limiting progress in generalizable simulation technology. To\naddress this gap, we leverage publicly available AV datasets with multiple\ntraversals, multiple vehicles, and multiple cameras to build the first\nExtrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct\nquantitative and qualitative evaluations of state-of-the-art Gaussian Splatting\nmethods across different difficulty levels. Our results show that Gaussian\nSplatting is prone to overfitting to training views. Besides, incorporating\ndiffusion priors and improving geometry cannot fundamentally improve NVS under\nlarge view changes, highlighting the need for more robust approaches and\nlarge-scale training. 
We have released our data to help advance self-driving\nand urban robotics simulation technology.\n","authors":["Xiangyu Han","Zhen Jia","Boyi Li","Yan Wang","Boris Ivanovic","Yurong You","Lingjie Liu","Yue Wang","Marco Pavone","Chen Feng","Yiming Li"],"pdf_url":"https://arxiv.org/pdf/2412.05256v1.pdf","comment":"Project page: https://ai4ce.github.io/EUVS-Benchmark/"},{"id":"http://arxiv.org/abs/2412.05252v1","updated":"2024-12-06T18:32:54Z","published":"2024-12-06T18:32:54Z","title":"From classical techniques to convolution-based models: A review of\n object detection algorithms","summary":" Object detection is a fundamental task in computer vision and image\nunderstanding, with the goal of identifying and localizing objects of interest\nwithin an image while assigning them corresponding class labels. Traditional\nmethods, which relied on handcrafted features and shallow models, struggled\nwith complex visual data and showed limited performance. These methods combined\nlow-level features with contextual information and lacked the ability to\ncapture high-level semantics. Deep learning, especially Convolutional Neural\nNetworks (CNNs), addressed these limitations by automatically learning rich,\nhierarchical features directly from data. These features include both semantic\nand high-level representations essential for accurate object detection. This\npaper reviews object detection frameworks, starting with classical computer\nvision methods. We categorize object detection approaches into two groups: (1)\nclassical computer vision techniques and (2) CNN-based detectors. We compare\nmajor CNN models, discussing their strengths and limitations. In conclusion,\nthis review highlights the significant advancements in object detection through\ndeep learning and identifies key areas for further research to improve\nperformance.\n","authors":["Fnu Neha","Deepshikha Bhati","Deepak Kumar Shukla","Md Amiruzzaman"],"pdf_url":"https://arxiv.org/pdf/2412.05252v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05251v1","updated":"2024-12-06T18:31:51Z","published":"2024-12-06T18:31:51Z","title":"Uncertainty Quantification for Transformer Models for Dark-Pattern\n Detection","summary":" The opaque nature of transformer-based models, particularly in applications\nsusceptible to unethical practices such as dark-patterns in user interfaces,\nrequires models that integrate uncertainty quantification to enhance trust in\npredictions. This study focuses on dark-pattern detection, deceptive design\nchoices that manipulate user decisions, undermining autonomy and consent. We\npropose a differential fine-tuning approach implemented at the final\nclassification head via uncertainty quantification with transformer-based\npre-trained models. Employing a dense neural network (DNN) head architecture as\na baseline, we examine two methods capable of quantifying uncertainty:\nSpectral-normalized Neural Gaussian Processes (SNGPs) and Bayesian Neural\nNetworks (BNNs). These methods are evaluated on a set of open-source\nfoundational models across multiple dimensions: model performance, variance in\ncertainty of predictions and environmental impact during training and inference\nphases. Results demonstrate that integrating uncertainty quantification\nmaintains performance while providing insights into challenging instances\nwithin the models. Moreover, the study reveals that the environmental impact\ndoes not uniformly increase with the incorporation of uncertainty\nquantification techniques. 
The study's findings demonstrate that uncertainty\nquantification enhances transparency and provides measurable confidence in\npredictions, improving the explainability and clarity of black-box models. This\nfacilitates informed decision-making and mitigates the influence of\ndark-patterns on user interfaces. These results highlight the importance of\nincorporating uncertainty quantification techniques in developing machine\nlearning models, particularly in domains where interpretability and\ntrustworthiness are critical.\n","authors":["Javier Muñoz","Álvaro Huertas-García","Carlos Martí-González","Enrique De Miguel Ambite"],"pdf_url":"https://arxiv.org/pdf/2412.05251v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.03459v3","updated":"2024-12-06T18:31:30Z","published":"2024-08-06T22:11:00Z","title":"On the Generalization of Preference Learning with DPO","summary":" Large language models (LLMs) have demonstrated remarkable capabilities but\noften struggle to align with human preferences, leading to harmful or\nundesirable outputs. Preference learning, which trains models to distinguish\nbetween preferred and non-preferred responses based on human feedback, has\nbecome a crucial component for ensuring that LLMs align with human values.\nDespite the widespread adoption in real-world systems, a thorough theoretical\nunderstanding of the generalization guarantees for these models remain lacking.\nThis paper bridges that gap by introducing a new theoretical framework to\nanalyze the generalization guarantees of models trained with direct preference\noptimization (DPO). While existing generalization theory often focuses on\noverparameterized models achieving near-optimal loss or models independent of\nthe training process, our framework rigorously assesses how well models\ngeneralize after a finite number of gradient steps, reflecting real-world LLM\ntraining practices. By analyzing the reward margin associated with each sample\nand its trajectory throughout training, we can effectively bound the\ngeneralization error. We derive learning guarantees showing that, under\nspecific conditions, models trained with DPO can correctly discern preferred\nresponses on unseen data with high probability. These insights are empirically\nvalidated on contemporary LLMs, underscoring the practical relevance of our\ntheoretical findings.\n","authors":["Shawn Im","Yixuan Li"],"pdf_url":"https://arxiv.org/pdf/2408.03459v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.01317v3","updated":"2024-12-06T18:25:17Z","published":"2024-06-03T13:29:36Z","title":"The Intelligible and Effective Graph Neural Additive Networks","summary":" Graph Neural Networks (GNNs) have emerged as the predominant approach for\nlearning over graph-structured data. However, most GNNs operate as black-box\nmodels and require post-hoc explanations, which may not suffice in high-stakes\nscenarios where transparency is crucial. In this paper, we present a GNN that\nis interpretable by design. Our model, Graph Neural Additive Network (GNAN), is\na novel extension of the interpretable class of Generalized Additive Models,\nand can be visualized and fully understood by humans. GNAN is designed to be\nfully interpretable, offering both global and local explanations at the feature\nand graph levels through direct visualization of the model. These\nvisualizations describe exactly how the model uses the relationships between\nthe target variable, the features, and the graph. 
We demonstrate the\nintelligibility of GNANs in a series of examples on different tasks and\ndatasets. In addition, we show that the accuracy of GNAN is on par with\nblack-box GNNs, making it suitable for critical applications where transparency\nis essential, alongside high accuracy.\n","authors":["Maya Bechler-Speicher","Amir Globerson","Ran Gilad-Bachrach"],"pdf_url":"https://arxiv.org/pdf/2406.01317v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.11112v4","updated":"2024-12-06T18:23:05Z","published":"2024-10-14T21:43:48Z","title":"Differentiable Weightless Neural Networks","summary":" We introduce the Differentiable Weightless Neural Network (DWN), a model\nbased on interconnected lookup tables. Training of DWNs is enabled by a novel\nExtended Finite Difference technique for approximate differentiation of binary\nvalues. We propose Learnable Mapping, Learnable Reduction, and Spectral\nRegularization to further improve the accuracy and efficiency of these models.\nWe evaluate DWNs in three edge computing contexts: (1) an FPGA-based hardware\naccelerator, where they demonstrate superior latency, throughput, energy\nefficiency, and model area compared to state-of-the-art solutions, (2) a\nlow-power microcontroller, where they achieve preferable accuracy to XGBoost\nwhile subject to stringent memory constraints, and (3) ultra-low-cost chips,\nwhere they consistently outperform small models in both accuracy and projected\nhardware area. DWNs also compare favorably against leading approaches for\ntabular datasets, with higher average rank. Overall, our work positions DWNs as\na pioneering solution for edge-compatible high-throughput neural networks.\n","authors":["Alan T. L. Bacellar","Zachary Susskind","Mauricio Breternitz Jr.","Eugene John","Lizy K. John","Priscila M. V. Lima","Felipe M. G. França"],"pdf_url":"https://arxiv.org/pdf/2410.11112v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05244v1","updated":"2024-12-06T18:22:59Z","published":"2024-12-06T18:22:59Z","title":"Enhancing Foundation Models for Time Series Forecasting via\n Wavelet-based Tokenization","summary":" How to best develop foundational models for time series forecasting remains\nan important open question. Tokenization is a crucial consideration in this\neffort: what is an effective discrete vocabulary for a real-valued sequential\ninput? To address this question, we develop WaveToken, a wavelet-based\ntokenizer that allows models to learn complex representations directly in the\nspace of time-localized frequencies. Our method first scales and decomposes the\ninput time series, then thresholds and quantizes the wavelet coefficients, and\nfinally pre-trains an autoregressive model to forecast coefficients for the\nforecast horizon. By decomposing coarse and fine structures in the inputs,\nwavelets provide an eloquent and compact language for time series forecasting\nthat simplifies learning. Empirical results on a comprehensive benchmark,\nincluding 42 datasets for both in-domain and zero-shot settings, show that\nWaveToken: i) provides better accuracy than recently proposed foundation models\nfor forecasting while using a much smaller vocabulary (1024 tokens), and\nperforms on par or better than modern deep learning models trained specifically\non each dataset; and ii) exhibits superior generalization capabilities,\nachieving the best average rank across all datasets for three complementary\nmetrics. 
In addition, we show that our method can easily capture complex\ntemporal patterns of practical relevance that are challenging for other recent\npre-trained models, including trends, sparse spikes, and non-stationary time\nseries with varying frequencies evolving over time.\n","authors":["Luca Masserano","Abdul Fatir Ansari","Boran Han","Xiyuan Zhang","Christos Faloutsos","Michael W. Mahoney","Andrew Gordon Wilson","Youngsuk Park","Syama Rangapuram","Danielle C. Maddix","Yuyang Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05244v1.pdf","comment":"25 pages, 15 figures"},{"id":"http://arxiv.org/abs/2412.05243v1","updated":"2024-12-06T18:22:47Z","published":"2024-12-06T18:22:47Z","title":"CompCap: Improving Multimodal Large Language Models with Composite\n Captions","summary":" How well can Multimodal Large Language Models (MLLMs) understand composite\nimages? Composite images (CIs) are synthetic visuals created by merging\nmultiple visual elements, such as charts, posters, or screenshots, rather than\nbeing captured directly by a camera. While CIs are prevalent in real-world\napplications, recent MLLM developments have primarily focused on interpreting\nnatural images (NIs). Our research reveals that current MLLMs face significant\nchallenges in accurately understanding CIs, often struggling to extract\ninformation or perform complex reasoning based on these images. We find that\nexisting training data for CIs are mostly formatted for question-answer tasks\n(e.g., in datasets like ChartQA and ScienceQA), while high-quality\nimage-caption datasets, critical for robust vision-language alignment, are only\navailable for NIs. To bridge this gap, we introduce Composite Captions\n(CompCap), a flexible framework that leverages Large Language Models (LLMs) and\nautomation tools to synthesize CIs with accurate and detailed captions. Using\nCompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs\nacross six CI types. We validate the effectiveness of CompCap-118K by\nsupervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and\nLLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K\nsignificantly enhances MLLMs' understanding of CIs, yielding average gains of\n1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively.\n","authors":["Xiaohui Chen","Satya Narayan Shukla","Mahmoud Azab","Aashu Singh","Qifan Wang","David Yang","ShengYun Peng","Hanchao Yu","Shen Yan","Xuewen Zhang","Baosheng He"],"pdf_url":"https://arxiv.org/pdf/2412.05243v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.14471v2","updated":"2024-12-06T18:22:32Z","published":"2024-08-26T17:59:01Z","title":"A Practitioner's Guide to Continual Multimodal Pretraining","summary":" Multimodal foundation models serve numerous applications at the intersection\nof vision and language. Still, despite being pretrained on extensive data, they\nbecome outdated over time. To keep models updated, research into continual\npretraining mainly explores scenarios with either (1) infrequent,\nindiscriminate updates on large-scale new data, or (2) frequent, sample-level\nupdates. However, practical model deployment often operates in the gap between\nthese two limit cases, as real-world applications often demand adaptation to\nspecific subdomains, tasks or concepts -- spread over the entire, varying life\ncycle of a model. 
In this work, we complement current perspectives on continual\npretraining through a research test bed as well as provide comprehensive\nguidance for effective continual model updates in such scenarios. We first\nintroduce FoMo-in-Flux, a continual multimodal pretraining benchmark with\nrealistic compute constraints and practical deployment requirements,\nconstructed over 63 datasets with diverse visual and semantic coverage. Using\nFoMo-in-Flux, we explore the complex landscape of practical continual\npretraining through multiple perspectives: (1) A data-centric investigation of\ndata mixtures and stream orderings that emulate real-world deployment\nsituations, (2) a method-centric investigation ranging from simple fine-tuning\nand traditional continual learning strategies to parameter-efficient updates\nand model merging, (3) meta learning rate schedules and mechanistic design\nchoices, and (4) the influence of model and compute scaling. Together, our\ninsights provide a practitioner's guide to continual multimodal pretraining for\nreal-world deployment. Our benchmark and code is here:\nhttps://github.com/ExplainableML/fomo_in_flux.\n","authors":["Karsten Roth","Vishaal Udandarao","Sebastian Dziadzio","Ameya Prabhu","Mehdi Cherti","Oriol Vinyals","Olivier Hénaff","Samuel Albanie","Matthias Bethge","Zeynep Akata"],"pdf_url":"https://arxiv.org/pdf/2408.14471v2.pdf","comment":"Technical Report. 52 pages. Shorter version published at the NeurIPS\n 2024 Dataset & Benchmarks track"},{"id":"http://arxiv.org/abs/2412.05233v1","updated":"2024-12-06T18:04:33Z","published":"2024-12-06T18:04:33Z","title":"Physics-informed reduced order model with conditional neural fields","summary":" This study presents the conditional neural fields for reduced-order modeling\n(CNF-ROM) framework to approximate solutions of parametrized partial\ndifferential equations (PDEs). The approach combines a parametric neural ODE\n(PNODE) for modeling latent dynamics over time with a decoder that reconstructs\nPDE solutions from the corresponding latent states. We introduce a\nphysics-informed learning objective for CNF-ROM, which includes two key\ncomponents. First, the framework uses coordinate-based neural networks to\ncalculate and minimize PDE residuals by computing spatial derivatives via\nautomatic differentiation and applying the chain rule for time derivatives.\nSecond, exact initial and boundary conditions (IC/BC) are imposed using\napproximate distance functions (ADFs) [Sukumar and Srivastava, CMAME, 2022].\nHowever, ADFs introduce a trade-off as their second- or higher-order\nderivatives become unstable at the joining points of boundaries. To address\nthis, we introduce an auxiliary network inspired by [Gladstone et al., NeurIPS\nML4PS workshop, 2022]. Our method is validated through parameter extrapolation\nand interpolation, temporal extrapolation, and comparisons with analytical\nsolutions.\n","authors":["Minji Kim","Tianshu Wen","Kookjin Lee","Youngsoo Choi"],"pdf_url":"https://arxiv.org/pdf/2412.05233v1.pdf","comment":"7 pages, 2 figures, NeurIPS 2024 Workshop on Machine Learning and the\n Physical Sciences"},{"id":"http://arxiv.org/abs/2412.05218v1","updated":"2024-12-06T17:48:43Z","published":"2024-12-06T17:48:43Z","title":"Transformers Meet Relational Databases","summary":" Transformer models have continuously expanded into all machine learning\ndomains convertible to the underlying sequence-to-sequence representation,\nincluding tabular data. 
However, while ubiquitous, this representation\nrestricts their extension to the more general case of relational databases. In\nthis paper, we introduce a modular neural message-passing scheme that closely\nadheres to the formal relational model, enabling direct end-to-end learning of\ntabular Transformers from database storage systems. We address the challenges\nof appropriate learning data representation and loading, which are critical in\nthe database setting, and compare our approach against a number of\nrepresentative models from various related fields across a significantly wide\nrange of datasets. Our results demonstrate a superior performance of this newly\nproposed class of neural architectures.\n","authors":["Jakub Peleška","Gustav Šír"],"pdf_url":"https://arxiv.org/pdf/2412.05218v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05216v1","updated":"2024-12-06T17:48:06Z","published":"2024-12-06T17:48:06Z","title":"ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And\n Segmentation Of GI Bleeding","summary":" This study presents an integrated deep learning model for automatic detection\nand classification of Gastrointestinal bleeding in the frames extracted from\nWireless Capsule Endoscopy (WCE) videos. The dataset has been released as part\nof Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model\nattained the highest performance among 75 teams that took part in this\ncompetition. It aims to efficiently utilizes CNN based model i.e. DenseNet and\nUNet to detect and segment bleeding and non-bleeding areas in the real-world\ncomplex dataset. The model achieves an impressive overall accuracy of 80% which\nwould surely help a skilled doctor to carry out further diagnostics.\n","authors":["Ayushman Singh","Sharad Prakash","Aniket Das","Nidhi Kushwaha"],"pdf_url":"https://arxiv.org/pdf/2412.05216v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.04922v2","updated":"2024-12-06T17:38:56Z","published":"2024-02-07T14:47:13Z","title":"Voronoi Candidates for Bayesian Optimization","summary":" Bayesian optimization (BO) offers an elegant approach for efficiently\noptimizing black-box functions. However, acquisition criteria demand their own\nchallenging inner-optimization, which can induce significant overhead. Many\npractical BO methods, particularly in high dimension, eschew a formal,\ncontinuous optimization of the acquisition function and instead search\ndiscretely over a finite set of space-filling candidates. Here, we propose to\nuse candidates which lie on the boundary of the Voronoi tessellation of the\ncurrent design points, so they are equidistant to two or more of them. We\ndiscuss strategies for efficient implementation by directly sampling the\nVoronoi boundary without explicitly generating the tessellation, thus\naccommodating large designs in high dimension. On a battery of test problems\noptimized via Gaussian processes with expected improvement, our proposed\napproach significantly improves the execution time of a multi-start continuous\nsearch without a loss in accuracy.\n","authors":["Nathan Wycoff","John W. Smith","Annie S. Booth","Robert B. 
Gramacy"],"pdf_url":"https://arxiv.org/pdf/2402.04922v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05204v1","updated":"2024-12-06T17:33:43Z","published":"2024-12-06T17:33:43Z","title":"Global Optimization with A Power-Transformed Objective and Gaussian\n Smoothing","summary":" We propose a novel method that solves global optimization problems in two\nsteps: (1) perform a (exponential) power-$N$ transformation to the\nnot-necessarily differentiable objective function $f$ to obtain $f_N$, and (2)\noptimize the Gaussian-smoothed $f_N$ with stochastic approximations. Under mild\nconditions on $f$, for any $\\delta>0$, we prove that with a sufficiently large\npower $N_\\delta$, this method converges to a solution in the\n$\\delta$-neighborhood of $f$'s global maximum point. The convergence rate is\n$O(d^2\\sigma^4\\varepsilon^{-2})$, which is faster than both the standard and\nsingle-loop homotopy methods. Extensive experiments show that our method\nrequires significantly fewer iterations than other compared algorithms to\nproduce a high-quality solution.\n","authors":["Chen Xu"],"pdf_url":"https://arxiv.org/pdf/2412.05204v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11527v3","updated":"2024-12-06T17:31:39Z","published":"2024-08-21T11:06:02Z","title":"The Vizier Gaussian Process Bandit Algorithm","summary":" Google Vizier has performed millions of optimizations and accelerated\nnumerous research and production systems at Google, demonstrating the success\nof Bayesian optimization as a large-scale service. Over multiple years, its\nalgorithm has been improved considerably, through the collective experiences of\nnumerous research efforts and user feedback. In this technical report, we\ndiscuss the implementation details and design choices of the current default\nalgorithm provided by Open Source Vizier. Our experiments on standardized\nbenchmarks reveal its robustness and versatility against well-established\nindustry baselines on multiple practical modes.\n","authors":["Xingyou Song","Qiuyi Zhang","Chansoo Lee","Emily Fertig","Tzu-Kuo Huang","Lior Belenki","Greg Kochanski","Setareh Ariafar","Srinivas Vasudevan","Sagi Perel","Daniel Golovin"],"pdf_url":"https://arxiv.org/pdf/2408.11527v3.pdf","comment":"Google DeepMind Technical Report. Code can be found in\n https://github.com/google/vizier"},{"id":"http://arxiv.org/abs/2312.01201v5","updated":"2024-12-06T17:16:54Z","published":"2023-12-02T18:42:52Z","title":"PAC Privacy Preserving Diffusion Models","summary":" Data privacy protection is garnering increased attention among researchers.\nDiffusion models (DMs), particularly with strict differential privacy, can\npotentially produce images with both high privacy and visual quality. However,\nchallenges arise such as in ensuring robust protection in privatizing specific\ndata attributes, areas where current models often fall short. To address these\nchallenges, we introduce the PAC Privacy Preserving Diffusion Model, a model\nleverages diffusion principles and ensure Probably Approximately Correct (PAC)\nprivacy. We enhance privacy protection by integrating a private classifier\nguidance into the Langevin Sampling Process. Additionally, recognizing the gap\nin measuring the privacy of models, we have developed a novel metric to gauge\nprivacy levels. 
Our model, assessed with this new metric and supported by\nGaussian matrix computations for the PAC bound, has shown superior performance\nin privacy protection over existing leading private generative models according\nto benchmark tests.\n","authors":["Qipan Xu","Youlong Ding","Xinxi Zhang","Jie Gao","Hao Wang"],"pdf_url":"https://arxiv.org/pdf/2312.01201v5.pdf","comment":"arXiv admin note: text overlap with arXiv:2210.03458 by other authors"},{"id":"http://arxiv.org/abs/2412.05186v1","updated":"2024-12-06T17:05:34Z","published":"2024-12-06T17:05:34Z","title":"One-shot Federated Learning via Synthetic Distiller-Distillate\n Communication","summary":" One-shot Federated learning (FL) is a powerful technology facilitating\ncollaborative training of machine learning models in a single round of\ncommunication. While its superiority lies in communication efficiency and\nprivacy preservation compared to iterative FL, one-shot FL often compromises\nmodel performance. Prior research has primarily focused on employing data-free\nknowledge distillation to optimize data generators and ensemble models for\nbetter aggregating local knowledge into the server model. However, these\nmethods typically struggle with data heterogeneity, where inconsistent local\ndata distributions can cause teachers to provide misleading knowledge.\nAdditionally, they may encounter scalability issues with complex datasets due\nto inherent two-step information loss: first, during local training (from data\nto model), and second, when transferring knowledge to the server model (from\nmodel to inversed data). In this paper, we propose FedSD2C, a novel and\npractical one-shot FL framework designed to address these challenges. FedSD2C\nintroduces a distiller to synthesize informative distillates directly from\nlocal data to reduce information loss and proposes sharing synthetic\ndistillates instead of inconsistent local models to tackle data heterogeneity.\nOur empirical results demonstrate that FedSD2C consistently outperforms other\none-shot FL methods with more complex and real datasets, achieving up to 2.6\nthe performance of the best baseline. Code: https://github.com/Carkham/FedSD2C\n","authors":["Junyuan Zhang","Songhua Liu","Xinchao Wang"],"pdf_url":"https://arxiv.org/pdf/2412.05186v1.pdf","comment":"Accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.05185v1","updated":"2024-12-06T17:04:42Z","published":"2024-12-06T17:04:42Z","title":"LinVT: Empower Your Image-level Large Language Model to Understand\n Videos","summary":" Large Language Models (LLMs) have been widely used in various tasks,\nmotivating us to develop an LLM-based assistant for videos. Instead of training\nfrom scratch, we propose a module to transform arbitrary well-trained\nimage-based LLMs into video-LLMs (after being trained on video data). To better\nadapt image-LLMs for processing videos, we introduce two design principles:\nlinear transformation to preserve the original visual-language alignment and\nrepresentative information condensation from redundant video content. Guided by\nthese principles, we propose a plug-and-play Linear Video Tokenizer(LinVT),\nwhich enables existing image-LLMs to understand videos. We benchmark LinVT with\nsix recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL,\nshowcasing the high compatibility of LinVT. 
LinVT-based LLMs achieve\nstate-of-the-art performance across various video benchmarks, illustrating the\neffectiveness of LinVT in multi-modal video understanding.\n","authors":["Lishuai Gao","Yujie Zhong","Yingsen Zeng","Haoxian Tan","Dengjie Li","Zheng Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.05185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05183v1","updated":"2024-12-06T17:04:09Z","published":"2024-12-06T17:04:09Z","title":"Privacy Drift: Evolving Privacy Concerns in Incremental Learning","summary":" In the evolving landscape of machine learning (ML), Federated Learning (FL)\npresents a paradigm shift towards decentralized model training while preserving\nuser data privacy. This paper introduces the concept of ``privacy drift\", an\ninnovative framework that parallels the well-known phenomenon of concept drift.\nWhile concept drift addresses the variability in model accuracy over time due\nto changes in the data, privacy drift encapsulates the variation in the leakage\nof private information as models undergo incremental training. By defining and\nexamining privacy drift, this study aims to unveil the nuanced relationship\nbetween the evolution of model performance and the integrity of data privacy.\nThrough rigorous experimentation, we investigate the dynamics of privacy drift\nin FL systems, focusing on how model updates and data distribution shifts\ninfluence the susceptibility of models to privacy attacks, such as membership\ninference attacks (MIA). Our results highlight a complex interplay between\nmodel accuracy and privacy safeguards, revealing that enhancements in model\nperformance can lead to increased privacy risks. We provide empirical evidence\nfrom experiments on customized datasets derived from CIFAR-100 (Canadian\nInstitute for Advanced Research, 100 classes), showcasing the impact of data\nand concept drift on privacy. This work lays the groundwork for future research\non privacy-aware machine learning, aiming to achieve a delicate balance between\nmodel accuracy and data privacy in decentralized environments.\n","authors":["Sayyed Farid Ahamed","Soumya Banerjee","Sandip Roy","Aayush Kapoor","Marc Vucovich","Kevin Choi","Abdul Rahman","Edward Bowen","Sachin Shetty"],"pdf_url":"https://arxiv.org/pdf/2412.05183v1.pdf","comment":"6 pages, 7 figures, Accepted in IEEE ICNC 25"},{"id":"http://arxiv.org/abs/2410.21265v2","updated":"2024-12-06T17:02:28Z","published":"2024-10-28T17:57:31Z","title":"Modular Duality in Deep Learning","summary":" An old idea in optimization theory says that since the gradient is a dual\nvector it may not be subtracted from the weights without first being mapped to\nthe primal space where the weights reside. We take this idea seriously in this\npaper and construct such a duality map for general neural networks. Our map,\nwhich we call modular dualization, forms a unifying theoretical basis for\ntraining algorithms that are a) fast and b) scalable. Modular dualization\ninvolves first assigning operator norms to layers based on the semantics of\neach layer, and then using these layerwise norms to recursively induce a\nduality map on the weight space of the full neural architecture. We conclude by\nderiving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers\n-- the latter two methods are based on a rectangular Newton-Schulz iteration\n(Kovarik, 1970; Bj\\\"orck & Bowie, 1971). A variant of our methods was used to\nset speed records for training NanoGPT. 
Overall, we hope that our theory of\nmodular duality will yield a next generation of fast and scalable optimizers\nfor general neural architectures.\n","authors":["Jeremy Bernstein","Laker Newhouse"],"pdf_url":"https://arxiv.org/pdf/2410.21265v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18076v2","updated":"2024-12-06T16:57:15Z","published":"2024-10-23T17:58:45Z","title":"Leveraging Skills from Unlabeled Prior Data for Efficient Online\n Exploration","summary":" Unsupervised pretraining has been transformative in many supervised domains.\nHowever, applying such ideas to reinforcement learning (RL) presents a unique\nchallenge in that fine-tuning does not involve mimicking task-specific data,\nbut rather exploring and locating the solution through iterative\nself-improvement. In this work, we study how unlabeled prior trajectory data\ncan be leveraged to learn efficient exploration strategies. While prior data\ncan be used to pretrain a set of low-level skills, or as additional off-policy\ndata for online RL, it has been unclear how to combine these ideas effectively\nfor online exploration. Our method SUPE (Skills from Unlabeled Prior data for\nExploration) demonstrates that a careful combination of these ideas compounds\ntheir benefits. Our method first extracts low-level skills using a variational\nautoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an\noptimistic reward model, transforming prior data into high-level, task-relevant\nexamples. Finally, SUPE uses these transformed examples as additional\noff-policy data for online RL to learn a high-level policy that composes\npretrained low-level skills to explore efficiently. We empirically show that\nSUPE reliably outperforms prior strategies, successfully solving a suite of\nlong-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe.\n","authors":["Max Wilcoxson","Qiyang Li","Kevin Frans","Sergey Levine"],"pdf_url":"https://arxiv.org/pdf/2410.18076v2.pdf","comment":"32 pages, 19 figures"},{"id":"http://arxiv.org/abs/2202.05656v2","updated":"2024-12-06T16:56:46Z","published":"2022-02-11T14:55:56Z","title":"Evaluation of post-hoc interpretability methods in time-series\n classification","summary":" Post-hoc interpretability methods are critical tools to explain\nneural-network results. Several post-hoc methods have emerged in recent years,\nbut when applied to a given task, they produce different results, raising the\nquestion of which method is the most suitable to provide correct post-hoc\ninterpretability. To understand the performance of each method, quantitative\nevaluation of interpretability methods is essential. However, currently\navailable frameworks have several drawbacks which hinders the adoption of\npost-hoc interpretability methods, especially in high-risk sectors. In this\nwork, we propose a framework with quantitative metrics to assess the\nperformance of existing post-hoc interpretability methods in particular in time\nseries classification. We show that several drawbacks identified in the\nliterature are addressed, namely dependence on human judgement, retraining, and\nshift in the data distribution when occluding samples. We additionally design a\nsynthetic dataset with known discriminative features and tunable complexity.\nThe proposed methodology and quantitative metrics can be used to understand the\nreliability of interpretability methods results obtained in practical\napplications. 
In turn, they can be embedded within operational workflows in\ncritical fields that require accurate interpretability results for e.g.,\nregulatory policies.\n","authors":["Hugues Turbé","Mina Bjelogrlic","Christian Lovis","Gianmarco Mengaldo"],"pdf_url":"https://arxiv.org/pdf/2202.05656v2.pdf","comment":"New version to match published version in Nature Machine Intelligence"},{"id":"http://arxiv.org/abs/2412.05175v1","updated":"2024-12-06T16:46:48Z","published":"2024-12-06T16:46:48Z","title":"Variational Encoder-Decoders for Learning Latent Representations of\n Physical Systems","summary":" We present a deep-learning Variational Encoder-Decoder (VED) framework for\nlearning data-driven low-dimensional representations of the relationship\nbetween high-dimensional parameters of a physical system and the system's\nhigh-dimensional observable response. The framework consists of two deep\nlearning-based probabilistic transformations: An encoder mapping parameters to\nlatent codes and a decoder mapping latent codes to the observable response. The\nhyperparameters of these transformations are identified by maximizing a\nvariational lower bound on the log-conditional distribution of the observable\nresponse given parameters. To promote the disentanglement of latent codes, we\nequip this variational loss with a penalty on the off-diagonal entries of the\naggregate distribution covariance of codes. This regularization penalty\nencourages the pushforward of a standard Gaussian distribution of latent codes\nto approximate the marginal distribution of the observable response.\n Using the proposed framework we successfully model the hydraulic pressure\nresponse at observation wells of a groundwater flow model as a function of its\ndiscrete log-hydraulic transmissivity field. Compared to the canonical\ncorrelation analysis encoding, the VED model achieves a lower-dimensional\nlatent representation, with as low as $r = 50$ latent dimensions without a\nsignificant loss of reconstruction accuracy. We explore the impact of\nregularization on model performance, finding that KL-divergence and covariance\nregularization improve feature disentanglement in latent space while\nmaintaining reconstruction accuracy. Furthermore, we evaluate the generative\ncapabilities of the regularized model by decoding random Gaussian noise,\nrevealing that tuning both $\\beta$ and $\\lambda$ parameters enhances the\nquality of the generated observable response data.\n","authors":["Subashree Venkatasubramanian","David A. Barajas-Solano"],"pdf_url":"https://arxiv.org/pdf/2412.05175v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05169v1","updated":"2024-12-06T16:41:44Z","published":"2024-12-06T16:41:44Z","title":"Towards Understanding the Role of Sharpness-Aware Minimization\n Algorithms for Out-of-Distribution Generalization","summary":" Recently, sharpness-aware minimization (SAM) has emerged as a promising\nmethod to improve generalization by minimizing sharpness, which is known to\ncorrelate well with generalization ability. Since the original proposal of SAM,\nmany variants of SAM have been proposed to improve its accuracy and efficiency,\nbut comparisons have mainly been restricted to the i.i.d. setting. In this\npaper we study SAM for out-of-distribution (OOD) generalization. 
First, we\nperform a comprehensive comparison of eight SAM variants on zero-shot OOD\ngeneralization, finding that the original SAM outperforms the Adam baseline by\n$4.76\\%$ and the strongest SAM variants outperform the Adam baseline by\n$8.01\\%$ on average. We then provide an OOD generalization bound in terms of\nsharpness for this setting. Next, we extend our study of SAM to the related\nsetting of gradual domain adaptation (GDA), another form of OOD generalization\nwhere intermediate domains are constructed between the source and target\ndomains, and iterative self-training is done on intermediate domains, to\nimprove the overall target domain error. In this setting, our experimental\nresults demonstrate that the original SAM outperforms the baseline of Adam on\neach of the experimental datasets by $0.82\\%$ on average and the strongest SAM\nvariants outperform Adam by $1.52\\%$ on average. We then provide a\ngeneralization bound for SAM in the GDA setting. Asymptotically, this\ngeneralization bound is no better than the one for self-training in the\nliterature of GDA. This highlights a further disconnection between the\ntheoretical justification for SAM versus its empirical performance, with recent\nwork finding that low sharpness alone does not account for all of SAM's\ngeneralization benefits. For future work, we provide several potential avenues\nfor obtaining a tighter analysis for SAM in the OOD setting.\n","authors":["Samuel Schapiro","Han Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.05169v1.pdf","comment":"25 pages"},{"id":"http://arxiv.org/abs/2412.05164v1","updated":"2024-12-06T16:29:53Z","published":"2024-12-06T16:29:53Z","title":"A Differentially Private Kaplan-Meier Estimator for Privacy-Preserving\n Survival Analysis","summary":" This paper presents a differentially private approach to Kaplan-Meier\nestimation that achieves accurate survival probability estimates while\nsafeguarding individual privacy. The Kaplan-Meier estimator is widely used in\nsurvival analysis to estimate survival functions over time, yet applying it to\nsensitive datasets, such as clinical records, risks revealing private\ninformation. To address this, we introduce a novel algorithm that applies\ntime-indexed Laplace noise, dynamic clipping, and smoothing to produce a\nprivacy-preserving survival curve while maintaining the cumulative structure of\nthe Kaplan-Meier estimator. By scaling noise over time, the algorithm accounts\nfor decreasing sensitivity as fewer individuals remain at risk, while dynamic\nclipping and smoothing prevent extreme values and reduce fluctuations,\npreserving the natural shape of the survival curve.\n Our results, evaluated on the NCCTG lung cancer dataset, show that the\nproposed method effectively lowers root mean squared error (RMSE) and enhances\naccuracy across privacy budgets ($\\epsilon$). At $\\epsilon = 10$, the algorithm\nachieves an RMSE as low as 0.04, closely approximating non-private estimates.\nAdditionally, membership inference attacks reveal that higher $\\epsilon$ values\n(e.g., $\\epsilon \\geq 6$) significantly reduce influential points, particularly\nat higher thresholds, lowering susceptibility to inference attacks. 
These\nfindings confirm that our approach balances privacy and utility, advancing\nprivacy-preserving survival analysis.\n","authors":["Narasimha Raghavan Veeraragavan","Sai Praneeth Karimireddy","Jan Franz Nygård"],"pdf_url":"https://arxiv.org/pdf/2412.05164v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.12537v2","updated":"2024-12-06T16:22:21Z","published":"2024-11-19T14:35:38Z","title":"Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues","summary":" Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and\nDeltaNet have emerged as efficient alternatives to Transformers in large\nlanguage modeling, offering linear scaling with sequence length and improved\ntraining efficiency. However, LRNNs struggle to perform state-tracking which\nmay impair performance in tasks such as code evaluation or tracking a chess\ngame. Even parity, the simplest state-tracking task, which non-linear RNNs like\nLSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et\nal. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity\nstems from restricting the value range of their diagonal state-transition\nmatrices to $[0, 1]$ and that incorporating negative values can resolve this\nissue. We extend this result to non-diagonal LRNNs, which have recently shown\npromise in models such as DeltaNet. We prove that finite precision LRNNs with\nstate-transition matrices having only positive eigenvalues cannot solve parity,\nwhile complex eigenvalues are needed to count modulo $3$. Notably, we also\nprove that LRNNs can learn any regular language when their state-transition\nmatrices are products of identity minus vector outer product matrices, each\nwith eigenvalues in the range $[-1, 1]$. Our empirical results confirm that\nextending the eigenvalue range of models like Mamba and DeltaNet to include\nnegative values not only enables them to solve parity but consistently improves\ntheir performance on state-tracking tasks. Furthermore, pre-training LRNNs with\nan extended eigenvalue range for language modeling achieves comparable\nperformance and stability while showing promise on code and math data. Our work\nenhances the expressivity of modern LRNNs, broadening their applicability\nwithout changing the cost of training or inference.\n","authors":["Riccardo Grazzi","Julien Siems","Jörg K. H. Franke","Arber Zela","Frank Hutter","Massimiliano Pontil"],"pdf_url":"https://arxiv.org/pdf/2411.12537v2.pdf","comment":"Main changes: Correction to Theorem 1 and 2 (we excluded from the\n only if condition complex eigenvalues with modulus strictly less than one).\n Correction to point 3 of Proposition 3"},{"id":"http://arxiv.org/abs/2412.05153v1","updated":"2024-12-06T16:10:40Z","published":"2024-12-06T16:10:40Z","title":"A text-to-tabular approach to generate synthetic patient data using LLMs","summary":" Access to large-scale high-quality healthcare databases is key to accelerate\nmedical research and make insightful discoveries about diseases. However,\naccess to such data is often limited by patient privacy concerns, data sharing\nrestrictions and high costs. To overcome these limitations, synthetic patient\ndata has emerged as an alternative. However, synthetic data generation (SDG)\nmethods typically rely on machine learning (ML) models trained on original\ndata, leading back to the data scarcity problem. 
We propose an approach to\ngenerate synthetic tabular patient data that does not require access to the\noriginal data, but only a description of the desired database. We leverage\nprior medical knowledge and in-context learning capabilities of large language\nmodels (LLMs) to generate realistic patient data, even in a low-resource\nsetting. We quantitatively evaluate our approach against state-of-the-art SDG\nmodels, using fidelity, privacy, and utility metrics. Our results show that\nwhile LLMs may not match the performance of state-of-the-art models trained on\nthe original data, they effectively generate realistic patient data with\nwell-preserved clinical correlations. An ablation study highlights key elements\nof our prompt contributing to high-quality synthetic patient data generation.\nThis approach, which is easy to use and does not require original data or\nadvanced ML skills, is particularly valuable for quickly generating\ncustom-designed patient data, supporting project implementation and providing\neducational resources.\n","authors":["Margaux Tornqvist","Jean-Daniel Zucker","Tristan Fauvel","Nicolas Lambert","Mathilde Berthelot","Antoine Movschin"],"pdf_url":"https://arxiv.org/pdf/2412.05153v1.pdf","comment":"12 pages, 2 figures, 3 tables"},{"id":"http://arxiv.org/abs/2412.05152v1","updated":"2024-12-06T16:10:13Z","published":"2024-12-06T16:10:13Z","title":"Navigating Shortcuts, Spurious Correlations, and Confounders: From\n Origins via Detection to Mitigation","summary":" Shortcuts, also described as Clever Hans behavior, spurious correlations, or\nconfounders, present a significant challenge in machine learning and AI,\ncritically affecting model generalization and robustness. Research in this\narea, however, remains fragmented across various terminologies, hindering the\nprogress of the field as a whole. Consequently, we introduce a unifying\ntaxonomy of shortcut learning by providing a formal definition of shortcuts and\nbridging the diverse terms used in the literature. In doing so, we further\nestablish important connections between shortcuts and related fields, including\nbias, causality, and security, where parallels exist but are rarely discussed.\nOur taxonomy organizes existing approaches for shortcut detection and\nmitigation, providing a comprehensive overview of the current state of the\nfield and revealing underexplored areas and open challenges. Moreover, we\ncompile and classify datasets tailored to study shortcut learning. Altogether,\nthis work provides a holistic perspective to deepen understanding and drive the\ndevelopment of more effective strategies for addressing shortcuts in machine\nlearning.\n","authors":["David Steinmann","Felix Divo","Maurice Kraus","Antonia Wüst","Lukas Struppek","Felix Friedrich","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2412.05152v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05148v1","updated":"2024-12-06T16:04:56Z","published":"2024-12-06T16:04:56Z","title":"LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style\n Conditioned Image Generation","summary":" Recent advancements in image generation models have enabled personalized\nimage creation with both user-defined subjects (content) and styles. Prior\nworks achieved personalization by merging corresponding low-rank adaptation\nparameters (LoRAs) through optimization-based methods, which are\ncomputationally demanding and unsuitable for real-time use on\nresource-constrained devices like smartphones. 
To address this, we introduce\nLoRA.rar, a method that not only improves image quality but also achieves a\nremarkable speedup of over $4000\\times$ in the merging process. LoRA.rar\npre-trains a hypernetwork on a diverse set of content-style LoRA pairs,\nlearning an efficient merging strategy that generalizes to new, unseen\ncontent-style pairs, enabling fast, high-quality personalization. Moreover, we\nidentify limitations in existing evaluation metrics for content-style quality\nand propose a new protocol using multimodal large language models (MLLM) for\nmore accurate assessment. Our method significantly outperforms the current\nstate of the art in both content and style fidelity, as validated by MLLM\nassessments and human evaluations.\n","authors":["Donald Shenaj","Ondrej Bohdal","Mete Ozay","Pietro Zanuttigh","Umberto Michieli"],"pdf_url":"https://arxiv.org/pdf/2412.05148v1.pdf","comment":"17 pages, 20 figures"},{"id":"http://arxiv.org/abs/2304.12906v3","updated":"2024-12-06T16:02:25Z","published":"2023-04-25T15:21:12Z","title":"The Score-Difference Flow for Implicit Generative Modeling","summary":" Implicit generative modeling (IGM) aims to produce samples of synthetic data\nmatching the characteristics of a target data distribution. Recent work (e.g.\nscore-matching networks, diffusion models) has approached the IGM problem from\nthe perspective of pushing synthetic source data toward the target distribution\nvia dynamical perturbations or flows in the ambient space. In this direction,\nwe present the score difference (SD) between arbitrary target and source\ndistributions as a flow that optimally reduces the Kullback-Leibler divergence\nbetween them. We apply the SD flow to convenient proxy distributions, which are\naligned if and only if the original distributions are aligned. We demonstrate\nthe formal equivalence of this formulation to denoising diffusion models under\ncertain conditions. We also show that the training of generative adversarial\nnetworks includes a hidden data-optimization sub-problem, which induces the SD\nflow under certain choices of loss function when the discriminator is optimal.\nAs a result, the SD flow provides a theoretical link between model classes that\nindividually address the three challenges of the \"generative modeling trilemma\"\n-- high sample quality, mode coverage, and fast sampling -- thereby setting the\nstage for a unified approach.\n","authors":["Romann M. Weber"],"pdf_url":"https://arxiv.org/pdf/2304.12906v3.pdf","comment":"25 pages, 5 figures, 4 tables. Updated, lightly revised version of a\n paper originally published in Transactions on Machine Learning Research\n (TMLR)"},{"id":"http://arxiv.org/abs/2412.05145v1","updated":"2024-12-06T16:01:30Z","published":"2024-12-06T16:01:30Z","title":"Explingo: Explaining AI Predictions using Large Language Models","summary":" Explanations of machine learning (ML) model predictions generated by\nExplainable AI (XAI) techniques such as SHAP are essential for people using ML\noutputs for decision-making. We explore the potential of Large Language Models\n(LLMs) to transform these explanations into human-readable, narrative formats\nthat align with natural communication. We address two key research questions:\n(1) Can LLMs reliably transform traditional explanations into high-quality\nnarratives? and (2) How can we effectively evaluate the quality of narrative\nexplanations? To answer these questions, we introduce Explingo, which consists\nof two LLM-based subsystems, a Narrator and Grader. 
The Narrator takes in ML\nexplanations and transforms them into natural-language descriptions. The Grader\nscores these narratives on a set of metrics including accuracy, completeness,\nfluency, and conciseness.\n Our experiments demonstrate that LLMs can generate high-quality narratives\nthat achieve high scores across all metrics, particularly when guided by a\nsmall number of human-labeled and bootstrapped examples. We also identified\nareas that remain challenging, in particular for effectively scoring narratives\nin complex domains. The findings from this work have been integrated into an\nopen-source tool that makes narrative explanations available for further\napplications.\n","authors":["Alexandra Zytek","Sara Pido","Sarah Alnegheimish","Laure Berti-Equille","Kalyan Veeramachaneni"],"pdf_url":"https://arxiv.org/pdf/2412.05145v1.pdf","comment":"To be presented in the 2024 IEEE International Conference on Big Data\n (IEEE BigData)"},{"id":"http://arxiv.org/abs/2412.05144v1","updated":"2024-12-06T16:00:50Z","published":"2024-12-06T16:00:50Z","title":"Effective Rank and the Staircase Phenomenon: New Insights into Neural\n Network Training Dynamics","summary":" In recent years, deep learning, powered by neural networks, has achieved\nwidespread success in solving high-dimensional problems, particularly those\nwith low-dimensional feature structures. This success stems from their ability\nto identify and learn low dimensional features tailored to the problems.\nUnderstanding how neural networks extract such features during training\ndynamics remains a fundamental question in deep learning theory. In this work,\nwe propose a novel perspective by interpreting the neurons in the last hidden\nlayer of a neural network as basis functions that represent essential features.\nTo explore the linear independence of these basis functions throughout the deep\nlearning dynamics, we introduce the concept of 'effective rank'. Our extensive\nnumerical experiments reveal a notable phenomenon: the effective rank increases\nprogressively during the learning process, exhibiting a staircase-like pattern,\nwhile the loss function concurrently decreases as the effective rank rises. We\nrefer to this observation as the 'staircase phenomenon'. Specifically, for deep\nneural networks, we rigorously prove the negative correlation between the loss\nfunction and effective rank, demonstrating that the lower bound of the loss\nfunction decreases with increasing effective rank. Therefore, to achieve a\nrapid descent of the loss function, it is critical to promote the swift growth\nof effective rank. Ultimately, we evaluate existing advanced learning\nmethodologies and find that these approaches can quickly achieve a higher\neffective rank, thereby avoiding redundant staircase processes and accelerating\nthe rapid decline of the loss function.\n","authors":["Yang Jiang","Yuxiang Zhao","Quanhui Zhu"],"pdf_url":"https://arxiv.org/pdf/2412.05144v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05135v1","updated":"2024-12-06T15:51:04Z","published":"2024-12-06T15:51:04Z","title":"The Polynomial Stein Discrepancy for Assessing Moment Convergence","summary":" We propose a novel method for measuring the discrepancy between a set of\nsamples and a desired posterior distribution for Bayesian inference. Classical\nmethods for assessing sample quality like the effective sample size are not\nappropriate for scalable Bayesian sampling algorithms, such as stochastic\ngradient Langevin dynamics, that are asymptotically biased. 
Instead, the gold\nstandard is to use the kernel Stein Discrepancy (KSD), which is itself not\nscalable given its quadratic cost in the number of samples. The KSD and its\nfaster extensions also typically suffer from the curse-of-dimensionality and\ncan require extensive tuning. To address these limitations, we develop the\npolynomial Stein discrepancy (PSD) and an associated goodness-of-fit test.\nWhile the new test is not fully convergence-determining, we prove that it\ndetects differences in the first r moments in the Bernstein-von Mises limit. We\nempirically show that the test has higher power than its competitors in several\nexamples, and at a lower computational cost. Finally, we demonstrate that the\nPSD can assist practitioners to select hyper-parameters of Bayesian sampling\nalgorithms more efficiently than competitors.\n","authors":["Narayan Srinivasan","Matthew Sutton","Christopher Drovandi","Leah F South"],"pdf_url":"https://arxiv.org/pdf/2412.05135v1.pdf","comment":"17 Pages, 14 Figs"},{"id":"http://arxiv.org/abs/2412.05134v1","updated":"2024-12-06T15:47:53Z","published":"2024-12-06T15:47:53Z","title":"How to Squeeze An Explanation Out of Your Model","summary":" Deep learning models are widely used nowadays for their reliability in\nperforming various tasks. However, they do not typically provide the reasoning\nbehind their decision, which is a significant drawback, particularly for more\nsensitive areas such as biometrics, security and healthcare. The most commonly\nused approaches to provide interpretability create visual attention heatmaps of\nregions of interest on an image based on models gradient backpropagation.\nAlthough this is a viable approach, current methods are targeted toward image\nsettings and default/standard deep learning models, meaning that they require\nsignificant adaptations to work on video/multi-modal settings and custom\narchitectures. This paper proposes an approach for interpretability that is\nmodel-agnostic, based on a novel use of the Squeeze and Excitation (SE) block\nthat creates visual attention heatmaps. By including an SE block prior to the\nclassification layer of any model, we are able to retrieve the most influential\nfeatures via SE vector manipulation, one of the key components of the SE block.\nOur results show that this new SE-based interpretability can be applied to\nvarious models in image and video/multi-modal settings, namely biometrics of\nfacial features with CelebA and behavioral biometrics using Active Speaker\nDetection datasets. Furthermore, our proposal does not compromise model\nperformance toward the original task, and has competitive results with current\ninterpretability approaches in state-of-the-art object datasets, highlighting\nits robustness to perform in varying data aside from the biometric context.\n","authors":["Tiago Roxo","Joana C. Costa","Pedro R. M. 
Inácio","Hugo Proença"],"pdf_url":"https://arxiv.org/pdf/2412.05134v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05133v1","updated":"2024-12-06T15:44:59Z","published":"2024-12-06T15:44:59Z","title":"Learning Hidden Physics and System Parameters with Deep Operator\n Networks","summary":" Big data is transforming scientific progress by enabling the discovery of\nnovel models, enhancing existing frameworks, and facilitating precise\nuncertainty quantification, while advancements in scientific machine learning\ncomplement this by providing powerful tools to solve inverse problems to\nidentify the complex systems where traditional methods falter due to sparse or\nnoisy data. We introduce two innovative neural operator frameworks tailored for\ndiscovering hidden physics and identifying unknown system parameters from\nsparse measurements. The first framework integrates a popular neural operator,\nDeepONet, and a physics-informed neural network to capture the relationship\nbetween sparse data and the underlying physics, enabling the accurate discovery\nof a family of governing equations. The second framework focuses on system\nparameter identification, leveraging a DeepONet pre-trained on sparse sensor\nmeasurements to initialize a physics-constrained inverse model. Both frameworks\nexcel in handling limited data and preserving physical consistency.\nBenchmarking on the Burgers' equation and reaction-diffusion system\ndemonstrates state-of-the-art performance, achieving average $L_2$ errors of\n$\\mathcal{O}(10^{-2})$ for hidden physics discovery and absolute errors of\n$\\mathcal{O}(10^{-3})$ for parameter identification. These results underscore\nthe frameworks' robustness, efficiency, and potential for solving complex\nscientific problems with minimal observational data.\n","authors":["Vijay Kag","Dibakar Roy Sarkar","Birupaksha Pal","Somdatta Goswami"],"pdf_url":"https://arxiv.org/pdf/2412.05133v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.10793v2","updated":"2024-12-06T15:44:46Z","published":"2024-02-16T16:20:11Z","title":"An end-to-end attention-based approach for learning on graphs","summary":" There has been a recent surge in transformer-based architectures for learning\non graphs, mainly motivated by attention as an effective learning mechanism and\nthe desire to supersede handcrafted operators characteristic of message passing\nschemes. However, concerns over their empirical effectiveness, scalability, and\ncomplexity of the pre-processing steps have been raised, especially in relation\nto much simpler graph neural networks that typically perform on par with them\nacross a wide range of benchmarks. To tackle these shortcomings, we consider\ngraphs as sets of edges and propose a purely attention-based approach\nconsisting of an encoder and an attention pooling mechanism. The encoder\nvertically interleaves masked and vanilla self-attention modules to learn an\neffective representations of edges, while allowing for tackling possible\nmisspecifications in input graphs. Despite its simplicity, the approach\noutperforms fine-tuned message passing baselines and recently proposed\ntransformer-based methods on more than 70 node and graph-level tasks, including\nchallenging long-range benchmarks. Moreover, we demonstrate state-of-the-art\nperformance across different tasks, ranging from molecular to vision graphs,\nand heterophilous node classification. 
The approach also outperforms graph\nneural networks and transformers in transfer learning settings, and scales much\nbetter than alternatives with a similar performance level or expressive power.\n","authors":["David Buterez","Jon Paul Janet","Dino Oglic","Pietro Lio"],"pdf_url":"https://arxiv.org/pdf/2402.10793v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04384v2","updated":"2024-12-06T15:43:40Z","published":"2024-12-05T17:59:58Z","title":"GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D\n Occupancy Prediction","summary":" 3D semantic occupancy prediction is an important task for robust\nvision-centric autonomous driving, which predicts fine-grained geometry and\nsemantics of the surrounding scene. Most existing methods leverage dense\ngrid-based scene representations, overlooking the spatial sparsity of the\ndriving scenes. Although 3D semantic Gaussian serves as an object-centric\nsparse alternative, most of the Gaussians still describe the empty region with\nlow efficiency. To address this, we propose a probabilistic Gaussian\nsuperposition model which interprets each Gaussian as a probability\ndistribution of its neighborhood being occupied and conforms to probabilistic\nmultiplication to derive the overall geometry. Furthermore, we adopt the exact\nGaussian mixture model for semantics calculation to avoid unnecessary\noverlapping of Gaussians. To effectively initialize Gaussians in non-empty\nregion, we design a distribution-based initialization module which learns the\npixel-aligned occupancy distribution instead of the depth of surfaces. We\nconduct extensive experiments on nuScenes and KITTI-360 datasets and our\nGaussianFormer-2 achieves state-of-the-art performance with high efficiency.\nCode: https://github.com/huang-yh/GaussianFormer.\n","authors":["Yuanhui Huang","Amonnut Thammatadatrakoon","Wenzhao Zheng","Yunpeng Zhang","Dalong Du","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04384v2.pdf","comment":"Code is available at: https://github.com/huang-yh/GaussianFormer"},{"id":"http://arxiv.org/abs/2412.04380v2","updated":"2024-12-06T15:43:38Z","published":"2024-12-05T17:57:09Z","title":"EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online\n Scene Understanding","summary":" 3D occupancy prediction provides a comprehensive description of the\nsurrounding scenes and has become an essential task for 3D perception. Most\nexisting methods focus on offline perception from one or a few views and cannot\nbe applied to embodied agents which demands to gradually perceive the scene\nthrough progressive embodied exploration. In this paper, we formulate an\nembodied 3D occupancy prediction task to target this practical scenario and\npropose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize\nthe global scene with uniform 3D semantic Gaussians and progressively update\nlocal regions observed by the embodied agent. For each update, we extract\nsemantic and structural features from the observed image and efficiently\nincorporate them via deformable cross-attention to refine the regional\nGaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global\n3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown\n(i.e., uniformly distributed) environment and maintains an explicit global\nmemory of it with 3D Gaussians. It gradually gains knowledge through the local\nrefinement of regional Gaussians, which is consistent with how humans\nunderstand new scenes through embodied exploration. 
We reorganize an\nEmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the\nevaluation of the embodied 3D occupancy prediction task. Experiments\ndemonstrate that our EmbodiedOcc outperforms existing local prediction methods\nand accomplishes the embodied occupancy prediction with high accuracy and\nstrong expandability. Code: https://github.com/YkiWu/EmbodiedOcc.\n","authors":["Yuqi Wu","Wenzhao Zheng","Sicheng Zuo","Yuanhui Huang","Jie Zhou","Jiwen Lu"],"pdf_url":"https://arxiv.org/pdf/2412.04380v2.pdf","comment":"Code: https://github.com/YkiWu/EmbodiedOcc"},{"id":"http://arxiv.org/abs/2405.04517v2","updated":"2024-12-06T15:42:07Z","published":"2024-05-07T17:50:21Z","title":"xLSTM: Extended Long Short-Term Memory","summary":" In the 1990s, the constant error carousel and gating were introduced as the\ncentral ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have\nstood the test of time and contributed to numerous deep learning success\nstories, in particular they constituted the first Large Language Models (LLMs).\nHowever, the advent of the Transformer technology with parallelizable\nself-attention at its core marked the dawn of a new era, outpacing LSTMs at\nscale. We now raise a simple question: How far do we get in language modeling\nwhen scaling LSTMs to billions of parameters, leveraging the latest techniques\nfrom modern LLMs, but mitigating known limitations of LSTMs? Firstly, we\nintroduce exponential gating with appropriate normalization and stabilization\ntechniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM\nwith a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that\nis fully parallelizable with a matrix memory and a covariance update rule.\nIntegrating these LSTM extensions into residual block backbones yields xLSTM\nblocks that are then residually stacked into xLSTM architectures. Exponential\ngating and modified memory structures boost xLSTM capabilities to perform\nfavorably when compared to state-of-the-art Transformers and State Space\nModels, both in performance and scaling.\n","authors":["Maximilian Beck","Korbinian Pöppel","Markus Spanring","Andreas Auer","Oleksandra Prudnikova","Michael Kopp","Günter Klambauer","Johannes Brandstetter","Sepp Hochreiter"],"pdf_url":"https://arxiv.org/pdf/2405.04517v2.pdf","comment":"Code available at https://github.com/NX-AI/xlstm"},{"id":"http://arxiv.org/abs/2208.01631v2","updated":"2024-12-06T15:41:07Z","published":"2022-08-02T17:58:52Z","title":"Stochastic Primal-Dual Three Operator Splitting Algorithm with Extension\n to Equivariant Regularization-by-Denoising","summary":" In this work we propose a stochastic primal-dual three-operator splitting\nalgorithm (TOS-SPDHG) for solving a class of convex three-composite\noptimization problems. Our proposed scheme is a direct three-operator splitting\nextension of the SPDHG algorithm [Chambolle et al. 2018]. 
We provide\ntheoretical convergence analysis showing ergodic $O(1/K)$ convergence rate, and\ndemonstrate the effectiveness of our approach in imaging inverse problems.\nMoreover, we further propose TOS-SPDHG-RED and TOS-SPDHG-eRED which utilizes\nthe regularization-by-denoising (RED) framework to leverage pretrained deep\ndenoising networks as priors.\n","authors":["Junqi Tang","Matthias Ehrhardt","Carola-Bibiane Schönlieb"],"pdf_url":"https://arxiv.org/pdf/2208.01631v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05132v1","updated":"2024-12-06T15:38:58Z","published":"2024-12-06T15:38:58Z","title":"Dirac-Equation Signal Processing: Physics Boosts Topological Machine\n Learning","summary":" Topological signals are variables or features associated with both nodes and\nedges of a network. Recently, in the context of Topological Machine Learning,\ngreat attention has been devoted to signal processing of such topological\nsignals. Most of the previous topological signal processing algorithms treat\nnode and edge signals separately and work under the hypothesis that the true\nsignal is smooth and/or well approximated by a harmonic eigenvector of the\nHodge-Laplacian, which may be violated in practice. Here we propose\nDirac-equation signal processing, a framework for efficiently reconstructing\ntrue signals on nodes and edges, also if they are not smooth or harmonic, by\nprocessing them jointly. The proposed physics-inspired algorithm is based on\nthe spectral properties of the topological Dirac operator. It leverages the\nmathematical structure of the topological Dirac equation to boost the\nperformance of the signal processing algorithm. We discuss how the relativistic\ndispersion relation obeyed by the topological Dirac equation can be used to\nassess the quality of the signal reconstruction. Finally, we demonstrate the\nimproved performance of the algorithm with respect to previous algorithms.\nSpecifically, we show that Dirac-equation signal processing can also be used\nefficiently if the true signal is a non-trivial linear combination of more than\none eigenstate of the Dirac equation, as it generally occurs for real signals.\n","authors":["Runyue Wang","Yu Tian","Pietro Liò","Ginestra Bianconi"],"pdf_url":"https://arxiv.org/pdf/2412.05132v1.pdf","comment":"(14 pages, 7 figures)"},{"id":"http://arxiv.org/abs/2412.05126v1","updated":"2024-12-06T15:34:58Z","published":"2024-12-06T15:34:58Z","title":"Robust Computation with Intrinsic Heterogeneity","summary":" Intrinsic within-type neuronal heterogeneity is a ubiquitous feature of\nbiological systems, with well-documented computational advantages. Recent works\nin machine learning have incorporated such diversities by optimizing neuronal\nparameters alongside synaptic connections and demonstrated state-of-the-art\nperformance across common benchmarks. However, this performance gain comes at\nthe cost of significantly higher computational costs, imposed by a larger\nparameter space. Furthermore, it is unclear how the neuronal parameters,\nconstrained by the biophysics of their surroundings, are globally orchestrated\nto minimize top-down errors. To address these challenges, we postulate that\nneurons are intrinsically diverse, and investigate the computational\ncapabilities of such heterogeneous neuronal parameters. Our results show that\nintrinsic heterogeneity, viewed as a fixed quenched disorder, often\nsubstantially improves performance across hundreds of temporal tasks. 
Notably,\nsmaller but heterogeneous networks outperform larger homogeneous networks,\ndespite consuming less data. We elucidate the underlying mechanisms driving\nthis performance boost and illustrate its applicability to both rate and\nspiking dynamics. Moreover, our findings demonstrate that heterogeneous\nnetworks are highly resilient to severe alterations in their recurrent synaptic\nhyperparameters, and even recurrent connections removal does not compromise\nperformance. The remarkable effectiveness of heterogeneous networks with small\nsizes and relaxed connectivity is particularly relevant for the neuromorphic\ncommunity, which faces challenges due to device-to-device variability.\nFurthermore, understanding the mechanism of robust computation with\nheterogeneity also benefits neuroscientists and machine learners.\n","authors":["Arash Golmohammadi","Christian Tetzlaff"],"pdf_url":"https://arxiv.org/pdf/2412.05126v1.pdf","comment":"29 pages, 15 figures"},{"id":"http://arxiv.org/abs/2411.19908v2","updated":"2024-12-06T15:34:08Z","published":"2024-11-29T18:12:50Z","title":"Another look at inference after prediction","summary":" Prediction-based (PB) inference is increasingly used in applications where\nthe outcome of interest is difficult to obtain, but its predictors are readily\navailable. Unlike traditional inference, PB inference performs statistical\ninference using a partially observed outcome and a set of covariates by\nleveraging a prediction of the outcome generated from a machine learning (ML)\nmodel. Motwani and Witten (2023) recently revisited two innovative PB inference\napproaches for ordinary least squares. They found that the method proposed by\nWang et al. (2020) yields a consistent estimator for the association of\ninterest when the ML model perfectly captures the underlying regression\nfunction. Conversely, the prediction-powered inference (PPI) method proposed by\nAngelopoulos et al. (2023) yields valid inference regardless of the model's\naccuracy. In this paper, we study the statistical efficiency of the PPI\nestimator. Our analysis reveals that a more efficient estimator, proposed 25\nyears ago by Chen and Chen (2000), can be obtained by simply adding a weight to\nthe PPI estimator. We also contextualize PB inference with methods from the\neconomics and statistics literature dating back to the 1960s. Our extensive\ntheoretical and numerical analyses indicate that the Chen and Chen (CC)\nestimator offers a balance between robustness to ML model specification and\nstatistical efficiency, making it the preferred choice for use in practice.\n","authors":["Jessica Gronsbell","Jianhui Gao","Yaqi Shi","Zachary R. McCaw","David Cheng"],"pdf_url":"https://arxiv.org/pdf/2411.19908v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18857v2","updated":"2024-12-06T15:20:28Z","published":"2024-10-24T15:42:25Z","title":"Probabilistic Language-Image Pre-Training","summary":" Vision-language models (VLMs) embed aligned image-text pairs into a joint\nspace but often rely on deterministic embeddings, assuming a one-to-one\ncorrespondence between images and texts. This oversimplifies real-world\nrelationships, which are inherently many-to-many, with multiple captions\ndescribing a single image and vice versa. 
We introduce Probabilistic\nLanguage-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained\non a billion-scale image-text dataset using only probabilistic objectives,\nachieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot\naccuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an\n\"uncertainty token\" without extra parameters. We also introduce a novel\ninclusion loss that enforces distributional inclusion relationships between\nimage-text pairs and between original and masked inputs. Experiments\ndemonstrate that, by leveraging uncertainty estimates, ProLIP benefits\ndownstream tasks and aligns with intuitive notions of uncertainty, e.g.,\nshorter texts being more uncertain and more general inputs including specific\nones. Utilizing text uncertainties, we further improve ImageNet accuracy from\n74.6% to 75.8% (under a few-shot setting), supporting the practical advantages\nof our probabilistic approach. The code is available at\nhttps://github.com/naver-ai/prolip\n","authors":["Sanghyuk Chun","Wonjae Kim","Song Park","Sangdoo Yun"],"pdf_url":"https://arxiv.org/pdf/2410.18857v2.pdf","comment":"Code: https://github.com/naver-ai/prolip HuggingFace Hub:\n https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291\n 31 pages, 4.29 MB"},{"id":"http://arxiv.org/abs/2412.05117v1","updated":"2024-12-06T15:19:10Z","published":"2024-12-06T15:19:10Z","title":"Transformers Can Navigate Mazes With Multi-Step Prediction","summary":" Despite their remarkable success in language modeling, transformers trained\nto predict the next token in a sequence struggle with long-term planning. This\nlimitation is particularly evident in tasks requiring foresight to plan\nmultiple steps ahead such as maze navigation. The standard next single token\nprediction objective, however, offers no explicit mechanism to predict multiple\nsteps ahead - or revisit the path taken so far. Consequently, in this work we\nstudy whether explicitly predicting multiple steps ahead (and backwards) can\nimprove transformers' maze navigation. We train parameter-matched transformers\nfrom scratch, under identical settings, to navigate mazes of varying types and\nsizes with standard next token prediction and MLM-U, an objective explicitly\npredicting multiple steps ahead and backwards. We find that MLM-U considerably\nimproves transformers' ability to navigate mazes compared to standard next\ntoken prediction across maze types and complexities. We also find MLM-U\ntraining is 4x more sample efficient and converges 2x faster in terms of GPU\ntraining hours relative to next token training. Finally, for more complex mazes\nwe find MLM-U benefits from scaling to larger transformers. Remarkably, we find\ntransformers trained with MLM-U outperform larger transformers trained with\nnext token prediction using additional supervision from A* search traces. We\nhope these findings underscore the promise of learning objectives to advance\ntransformers' capacity for long-term planning.\n","authors":["Niklas Nolte","Ouail Kitouni","Adina Williams","Mike Rabbat","Mark Ibrahim"],"pdf_url":"https://arxiv.org/pdf/2412.05117v1.pdf","comment":"20 pages, 15 figures"},{"id":"http://arxiv.org/abs/2412.05109v1","updated":"2024-12-06T15:10:04Z","published":"2024-12-06T15:10:04Z","title":"Generating Rectifiable Measures through Neural Networks","summary":" We derive universal approximation results for the class of (countably)\n$m$-rectifiable measures. 
Specifically, we prove that $m$-rectifiable measures\ncan be approximated as push-forwards of the one-dimensional Lebesgue measure on\n$[0,1]$ using ReLU neural networks with arbitrarily small approximation error\nin terms of Wasserstein distance. What is more, the weights in the networks\nunder consideration are quantized and bounded and the number of ReLU neural\nnetworks required to achieve an approximation error of $\\varepsilon$ is no\nlarger than $2^{b(\\varepsilon)}$ with\n$b(\\varepsilon)=\\mathcal{O}(\\varepsilon^{-m}\\log^2(\\varepsilon))$. This result\nimproves Lemma IX.4 in Perekrestenko et al. as it shows that the rate at which\n$b(\\varepsilon)$ tends to infinity as $\\varepsilon$ tends to zero equals the\nrectifiability parameter $m$, which can be much smaller than the ambient\ndimension. We extend this result to countably $m$-rectifiable measures and show\nthat this rate still equals the rectifiability parameter $m$ provided that,\namong other technical assumptions, the measure decays exponentially on the\nindividual components of the countably $m$-rectifiable support set.\n","authors":["Erwin Riegler","Alex Bühler","Yang Pan","Helmut Bölcskei"],"pdf_url":"https://arxiv.org/pdf/2412.05109v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.15109v3","updated":"2024-12-06T15:06:19Z","published":"2024-02-23T05:44:15Z","title":"Remaining-data-free Machine Unlearning by Suppressing Sample\n Contribution","summary":" Machine unlearning (MU) is to forget data from a well-trained model, which is\npractically important due to the ``right to be forgotten''. The unlearned model\nshould approach the retrained model, where the forgetting data are not involved\nin the training process and hence do not contribute to the retrained model.\nConsidering the forgetting data's absence during retraining, we think\nunlearning should withdraw their contribution from the pre-trained model. The\nchallenge is that when tracing the learning process is impractical, how to\nquantify and detach sample's contribution to the dynamic learning process using\nonly the pre-trained model. We first theoretically discover that sample's\ncontribution during the process will reflect in the learned model's sensitivity\nto it. We then practically design a novel method, namely MU-Mis (Machine\nUnlearning by Minimizing input sensitivity), to suppress the contribution of\nthe forgetting data. Experimental results demonstrate that MU-Mis can unlearn\neffectively and efficiently without utilizing the remaining data. It is the\nfirst time that a remaining-data-free method can outperform state-of-the-art\n(SoTA) unlearning methods that utilize the remaining data.\n","authors":["Xinwen Cheng","Zhehao Huang","Wenxin Zhou","Zhengbao He","Ruikai Yang","Yingwen Wu","Xiaolin Huang"],"pdf_url":"https://arxiv.org/pdf/2402.15109v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05103v1","updated":"2024-12-06T15:01:19Z","published":"2024-12-06T15:01:19Z","title":"Integrating Semantic Communication and Human Decision-Making into an\n End-to-End Sensing-Decision Framework","summary":" As early as 1949, Weaver defined communication in a very broad sense to\ninclude all procedures by which one mind or technical system can influence\nanother, thus establishing the idea of semantic communication. 
With the recent\nsuccess of machine learning in expert assistance systems where sensed\ninformation is wirelessly provided to a human to assist task execution, the\nneed to design effective and efficient communications has become increasingly\napparent. In particular, semantic communication aims to convey the meaning\nbehind the sensed information relevant for Human Decision-Making (HDM).\nRegarding the interplay between semantic communication and HDM, many questions\nremain, such as how to model the entire end-to-end sensing-decision-making\nprocess, how to design semantic communication for the HDM and which information\nshould be provided to the HDM. To address these questions, we propose to\nintegrate semantic communication and HDM into one probabilistic end-to-end\nsensing-decision framework that bridges communications and psychology. In our\ninterdisciplinary framework, we model the human through a HDM process, allowing\nus to explore how feature extraction from semantic communication can best\nsupport human decision-making. In this sense, our study provides new insights\nfor the design/interaction of semantic communication with models of HDM. Our\ninitial analysis shows how semantic communication can balance the level of\ndetail with human cognitive capabilities while demanding less bandwidth, power,\nand latency.\n","authors":["Edgar Beck","Hsuan-Yu Lin","Patrick Rückert","Yongping Bao","Bettina von Helversen","Sebastian Fehrler","Kirsten Tracht","Armin Dekorsy"],"pdf_url":"https://arxiv.org/pdf/2412.05103v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.12841v2","updated":"2024-12-06T14:57:59Z","published":"2024-06-18T17:57:11Z","title":"Demystifying Higher-Order Graph Neural Networks","summary":" Higher-order graph neural networks (HOGNNs) and the related architectures\nfrom Topological Deep Learning are an important class of GNN models that\nharness polyadic relations between vertices beyond plain edges. They have been\nused to eliminate issues such as over-smoothing or over-squashing, to\nsignificantly enhance the accuracy of GNN predictions, to improve the\nexpressiveness of GNN architectures, and for numerous other goals. A plethora\nof HOGNN models have been introduced, and they come with diverse neural\narchitectures, and even with different notions of what the \"higher-order\"\nmeans. This richness makes it very challenging to appropriately analyze and\ncompare HOGNN models, and to decide in what scenario to use specific ones. To\nalleviate this, we first design an in-depth taxonomy and a blueprint for\nHOGNNs. This facilitates designing models that maximize performance. Then, we\nuse our taxonomy to analyze and compare the available HOGNN models. The\noutcomes of our analysis are synthesized in a set of insights that help to\nselect the most beneficial GNN model in a given scenario, and a comprehensive\nlist of challenges and opportunities for further research into more powerful\nHOGNNs.\n","authors":["Maciej Besta","Florian Scheidl","Lukas Gianinazzi","Grzegorz Kwasniewski","Shachar Klaiman","Jürgen Müller","Torsten Hoefler"],"pdf_url":"https://arxiv.org/pdf/2406.12841v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.18156v2","updated":"2024-12-06T14:54:53Z","published":"2024-10-23T09:17:31Z","title":"Dreaming Learning","summary":" Incorporating novelties into deep learning systems remains a challenging\nproblem. 
Introducing new information to a machine learning system can interfere\nwith previously stored data and potentially alter the global model paradigm,\nespecially when dealing with non-stationary sources. In such cases, traditional\napproaches based on validation error minimization offer limited advantages. To\naddress this, we propose a training algorithm inspired by Stuart Kauffman's\nnotion of the Adjacent Possible. This novel training methodology explores new\ndata spaces during the learning phase. It predisposes the neural network to\nsmoothly accept and integrate data sequences with different statistical\ncharacteristics than expected. The maximum distance compatible with such\ninclusion depends on a specific parameter: the sampling temperature used in the\nexplorative phase of the present method. This algorithm, called Dreaming\nLearning, anticipates potential regime shifts over time, enhancing the neural\nnetwork's responsiveness to non-stationary events that alter statistical\nproperties. To assess the advantages of this approach, we apply this\nmethodology to unexpected statistical changes in Markov chains and\nnon-stationary dynamics in textual sequences. We demonstrated its ability to\nimprove the auto-correlation of generated textual sequences by $\\sim 29\\%$ and\nenhance the velocity of loss convergence by $\\sim 100\\%$ in the case of a\nparadigm shift in Markov chains.\n","authors":["Alessandro Londei","Matteo Benati","Denise Lanzieri","Vittorio Loreto"],"pdf_url":"https://arxiv.org/pdf/2410.18156v2.pdf","comment":"Accepted at the NeurIPS 2024 workshop on Intrinsically Motivated\n Open-ended Learning"},{"id":"http://arxiv.org/abs/2407.10921v5","updated":"2024-12-06T14:51:41Z","published":"2024-07-15T17:22:16Z","title":"Leveraging Bi-Focal Perspectives and Granular Feature Integration for\n Accurate Reliable Early Alzheimer's Detection","summary":" Alzheimer's disease (AD) is the most common neurodegeneration, annually\ndiagnosed in millions of patients. The present medicine scenario still finds\nchallenges in the exact diagnosis and classification of AD through neuroimaging\ndata. Traditional CNNs can extract a good amount of low-level information in an\nimage but fail to extract high-level minuscule particles, which is a\nsignificant challenge in detecting AD from MRI scans. To overcome this, we\npropose a novel Granular Feature Integration method to combine information\nextraction at different scales combined with an efficient information flow,\nenabling the model to capture both broad and fine-grained features\nsimultaneously. We also propose a Bi-Focal Perspective mechanism to highlight\nthe subtle neurofibrillary tangles and amyloid plaques in the MRI scans,\nensuring that critical pathological markers are accurately identified. Our\nmodel achieved an F1-Score of 99.31%, precision of 99.24%, and recall of\n99.51%. 
These scores prove that our model is significantly better than the\nstate-of-the-art (SOTA) CNNs in existence.\n","authors":["Pandiyaraju V","Shravan Venkatraman","Abeshek A","Pavan Kumar S","Aravintakshan S A"],"pdf_url":"https://arxiv.org/pdf/2407.10921v5.pdf","comment":"14 pages, 12 figures, 6 tables"},{"id":"http://arxiv.org/abs/2406.07057v2","updated":"2024-12-06T14:21:06Z","published":"2024-06-11T08:38:13Z","title":"MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal\n Large Language Models","summary":" Despite the superior capabilities of Multimodal Large Language Models (MLLMs)\nacross diverse tasks, they still face significant trustworthiness challenges.\nYet, current literature on the assessment of trustworthy MLLMs remains limited,\nlacking a holistic evaluation to offer thorough insights into future\nimprovements. In this work, we establish MultiTrust, the first comprehensive\nand unified benchmark on the trustworthiness of MLLMs across five primary\naspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark\nemploys a rigorous evaluation strategy that addresses both multimodal risks and\ncross-modal impacts, encompassing 32 diverse tasks with self-curated datasets.\nExtensive experiments with 21 modern MLLMs reveal some previously unexplored\ntrustworthiness issues and risks, highlighting the complexities introduced by\nthe multimodality and underscoring the necessity for advanced methodologies to\nenhance their reliability. For instance, typical proprietary models still\nstruggle with the perception of visually confusing images and are vulnerable to\nmultimodal jailbreaking and adversarial attacks; MLLMs are more inclined to\ndisclose privacy in text and reveal ideological and cultural biases even when\npaired with irrelevant images in inference, indicating that the multimodality\namplifies the internal risks from base LLMs. Additionally, we release a\nscalable toolbox for standardized trustworthiness research, aiming to\nfacilitate future advancements in this important field. Code and resources are\npublicly available at: https://multi-trust.github.io/.\n","authors":["Yichi Zhang","Yao Huang","Yitong Sun","Chang Liu","Zhe Zhao","Zhengwei Fang","Yifan Wang","Huanran Chen","Xiao Yang","Xingxing Wei","Hang Su","Yinpeng Dong","Jun Zhu"],"pdf_url":"https://arxiv.org/pdf/2406.07057v2.pdf","comment":"100 pages, 84 figures, 33 tables"},{"id":"http://arxiv.org/abs/2407.04513v2","updated":"2024-12-06T14:20:26Z","published":"2024-07-05T13:54:15Z","title":"LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing\n Layer Execution Order","summary":" Due to their architecture and how they are trained, artificial neural\nnetworks are typically not robust toward pruning or shuffling layers at test\ntime. However, such properties would be desirable for different applications,\nsuch as distributed neural network architectures where the order of execution\ncannot be guaranteed or parts of the network can fail during inference. In this\nwork, we address these issues through a number of training approaches for\nvision transformers whose most important component is randomizing the execution\norder of attention modules at training time. With our proposed approaches,\nvision transformers are capable to adapt to arbitrary layer execution orders at\ntest time assuming one tolerates a reduction (about 20\\%) in accuracy at the\nsame model size. 
We analyse the feature representations of our trained models\nas well as how each layer contributes to the models prediction based on its\nposition during inference. Our analysis shows that layers learn to contribute\ndifferently based on their position in the network. Finally, we layer-prune our\nmodels at test time and find that their performance declines gracefully. Code\navailable at https://github.com/matfrei/layershuffle.\n","authors":["Matthias Freiberger","Peter Kun","Anders Sundnes Løvlie","Sebastian Risi"],"pdf_url":"https://arxiv.org/pdf/2407.04513v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.20325v2","updated":"2024-12-06T14:09:22Z","published":"2024-09-30T14:26:12Z","title":"Old Optimizer, New Norm: An Anthology","summary":" Deep learning optimizers are often motivated through a mix of convex and\napproximate second-order theory. We select three such methods -- Adam, Shampoo\nand Prodigy -- and argue that each method can instead be understood as a\nsquarely first-order method without convexity assumptions. In fact, after\nswitching off exponential moving averages, each method is equivalent to\nsteepest descent under a particular norm. By generalizing this observation, we\nchart a new design space for training algorithms. Different operator norms\nshould be assigned to different tensors based on the role that the tensor plays\nwithin the network. For example, while linear and embedding layers may have the\nsame weight space of $\\mathbb{R}^{m\\times n}$, these layers play different\nroles and should be assigned different norms. We hope that this idea of\ncarefully metrizing the neural architecture might lead to more stable, scalable\nand indeed faster training.\n","authors":["Jeremy Bernstein","Laker Newhouse"],"pdf_url":"https://arxiv.org/pdf/2409.20325v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.00499v2","updated":"2024-12-06T14:02:59Z","published":"2024-11-01T10:25:25Z","title":"Cross-modal semantic segmentation for indoor environmental perception\n using single-chip millimeter-wave radar raw data","summary":" In the context of firefighting and rescue operations, a cross-modal semantic\nsegmentation model based on a single-chip millimeter-wave (mmWave) radar for\nindoor environmental perception is proposed and discussed. To efficiently\nobtain high-quality labels, an automatic label generation method utilizing\nLiDAR point clouds and occupancy grid maps is introduced. The proposed\nsegmentation model is based on U-Net. A spatial attention module is\nincorporated, which enhanced the performance of the mode. The results\ndemonstrate that cross-modal semantic segmentation provides a more intuitive\nand accurate representation of indoor environments. Unlike traditional methods,\nthe model's segmentation performance is minimally affected by azimuth. Although\nperformance declines with increasing distance, this can be mitigated by a\nwell-designed model. 
Additionally, it was found that using raw ADC data as\ninput is ineffective; compared to RA tensors, RD tensors are more suitable for\nthe proposed model.\n","authors":["Hairuo Hu","Haiyong Cong","Zhuyu Shao","Yubo Bi","Jinghao Liu"],"pdf_url":"https://arxiv.org/pdf/2411.00499v2.pdf","comment":"5291 words, 17 pages, 11 figures"},{"id":"http://arxiv.org/abs/2412.05043v1","updated":"2024-12-06T13:49:10Z","published":"2024-12-06T13:49:10Z","title":"ReF-LDM: A Latent Diffusion Model for Reference-based Face Image\n Restoration","summary":" While recent works on blind face image restoration have successfully produced\nimpressive high-quality (HQ) images with abundant details from low-quality (LQ)\ninput images, the generated content may not accurately reflect the real\nappearance of a person. To address this problem, incorporating well-shot\npersonal images as additional reference inputs could be a promising strategy.\nInspired by the recent success of the Latent Diffusion Model (LDM), we propose\nReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned\non one LQ image and multiple HQ reference images. Our model integrates an\neffective and efficient mechanism, CacheKV, to leverage the reference images\nduring the generation process. Additionally, we design a timestep-scaled\nidentity loss, enabling our LDM-based model to focus on learning the\ndiscriminating features of human faces. Lastly, we construct FFHQ-Ref, a\ndataset consisting of 20,405 high-quality (HQ) face images with corresponding\nreference images, which can serve as both training and evaluation data for\nreference-based face restoration models.\n","authors":["Chi-Wei Hsiao","Yu-Lun Liu","Cheng-Kun Yang","Sheng-Po Kuo","Kevin Jou","Chia-Ping Chen"],"pdf_url":"https://arxiv.org/pdf/2412.05043v1.pdf","comment":"NeurIPS 2024, project page\n https://chiweihsiao.github.io/refldm.github.io/"},{"id":"http://arxiv.org/abs/2310.00327v3","updated":"2024-12-06T13:48:43Z","published":"2023-09-30T10:06:05Z","title":"Memorization With Neural Nets: Going Beyond the Worst Case","summary":" In practice, deep neural networks are often able to easily interpolate their\ntraining data. To understand this phenomenon, many works have aimed to quantify\nthe memorization capacity of a neural network architecture: the largest number\nof points such that the architecture can interpolate any placement of these\npoints with any assignment of labels. For real-world data, however, one\nintuitively expects the presence of a benign structure so that interpolation\nalready occurs at a smaller network size than suggested by memorization\ncapacity. In this paper, we investigate interpolation by adopting an\ninstance-specific viewpoint. We introduce a simple randomized algorithm that,\ngiven a fixed finite data set with two classes, with high probability\nconstructs an interpolating three-layer neural network in polynomial time. The\nrequired number of parameters is linked to geometric properties of the two\nclasses and their mutual arrangement. As a result, we obtain guarantees that\nare independent of the number of samples and hence move beyond worst-case\nmemorization capacity bounds. 
We verify our theoretical result with numerical\nexperiments and additionally investigate the effectiveness of the algorithm on\nMNIST and CIFAR-10.\n","authors":["Sjoerd Dirksen","Patrick Finke","Martin Genzel"],"pdf_url":"https://arxiv.org/pdf/2310.00327v3.pdf","comment":"The current version of the manuscript has been accepted to Journal of\n Machine Learning Research"},{"id":"http://arxiv.org/abs/2411.18506v3","updated":"2024-12-06T13:35:45Z","published":"2024-11-27T16:48:24Z","title":"LLM-ABBA: Understanding time series via symbolic approximation","summary":" The success of large language models (LLMs) for time series has been\ndemonstrated in previous work. Utilizing a symbolic time series representation,\none can efficiently bridge the gap between LLMs and time series. However, the\nremaining challenge is to exploit the semantic information hidden in time\nseries by using symbols or existing tokens of LLMs, while aligning the\nembedding space of LLMs according to the hidden information of time series. The\nsymbolic time series approximation (STSA) method called adaptive Brownian\nbridge-based symbolic aggregation (ABBA) shows outstanding efficacy in\npreserving salient time series features by modeling time series patterns in\nterms of amplitude and period while using existing tokens of LLMs.\n In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA\ninto large language models for various downstream time series tasks. By\nsymbolizing time series, LLM-ABBA compares favorably to the recent\nstate-of-the-art (SOTA) in UCR and three medical time series classification\ntasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to\n\\kc{avoid obvious drifting} during prediction tasks by significantly mitigating\nthe effects of cumulative error arising from misused symbols during the\ntransition from symbols to numerical values. In time series regression tasks,\nLLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER)\nbenchmarks. LLM-ABBA also shows competitive prediction capability compared to\nrecent SOTA time series prediction results. We believe this framework can also\nseamlessly extend to other time series tasks.\n","authors":["Erin Carson","Xinye Chen","Cheng Kang"],"pdf_url":"https://arxiv.org/pdf/2411.18506v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05029v1","updated":"2024-12-06T13:25:39Z","published":"2024-12-06T13:25:39Z","title":"Mixed Blessing: Class-Wise Embedding guided Instance-Dependent Partial\n Label Learning","summary":" In partial label learning (PLL), every sample is associated with a candidate\nlabel set comprising the ground-truth label and several noisy labels. The\nconventional PLL assumes the noisy labels are randomly generated\n(instance-independent), while in practical scenarios, the noisy labels are\nalways instance-dependent and are highly related to the sample features,\nleading to the instance-dependent partial label learning (IDPLL) problem.\nInstance-dependent noisy label is a double-edged sword. On one side, it may\npromote model training as the noisy labels can depict the sample to some\nextent. On the other side, it brings high label ambiguity as the noisy labels\nare quite undistinguishable from the ground-truth label. 
To leverage the\nnuances of IDPLL effectively, for the first time we create class-wise\nembeddings for each sample, which allow us to explore the relationship of\ninstance-dependent noisy labels, i.e., the class-wise embeddings in the\ncandidate label set should have high similarity, while the class-wise\nembeddings between the candidate label set and the non-candidate label set\nshould have high dissimilarity. Moreover, to reduce the high label ambiguity,\nwe introduce the concept of class prototypes containing global feature\ninformation to disambiguate the candidate label set. Extensive experimental\ncomparisons with twelve methods on six benchmark data sets, including four\nfine-grained data sets, demonstrate the effectiveness of the proposed method.\nThe code implementation is publicly available at\nhttps://github.com/Yangfc-ML/CEL.\n","authors":["Fuchao Yang","Jianhong Cheng","Hui Liu","Yongqiang Dong","Yuheng Jia","Junhui Hou"],"pdf_url":"https://arxiv.org/pdf/2412.05029v1.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2410.13166v3","updated":"2024-12-06T13:22:11Z","published":"2024-10-17T02:47:10Z","title":"An Evolved Universal Transformer Memory","summary":" Prior methods propose to offset the escalating costs of modern foundation\nmodels by dropping specific parts of their contexts with hand-designed rules,\nwhile attempting to preserve their original performance. We overcome this\ntrade-off with Neural Attention Memory Models (NAMMs), introducing a learned\nnetwork for memory management that improves both the performance and efficiency\nof transformers. We evolve NAMMs atop pre-trained transformers to provide\ndifferent latent contexts focusing on the most relevant information for\nindividual layers and attention heads. NAMMs are universally applicable to any\nmodel using self-attention as they condition exclusively on the values in the\nproduced attention matrices. Learning NAMMs on a small set of problems, we\nachieve substantial performance improvements across multiple long-context\nbenchmarks while cutting the model's input contexts up to a fraction of the\noriginal sizes. We show the generality of our conditioning enables zero-shot\ntransfer of NAMMs trained only on language to entirely new transformer\narchitectures even across input modalities, with their benefits carrying over\nto vision and reinforcement learning.\n","authors":["Edoardo Cetin","Qi Sun","Tianyu Zhao","Yujin Tang"],"pdf_url":"https://arxiv.org/pdf/2410.13166v3.pdf","comment":"Preprint, under submission. Source code is available at\n https://github.com/SakanaAI/evo-memory"},{"id":"http://arxiv.org/abs/2311.15603v2","updated":"2024-12-06T13:11:19Z","published":"2023-11-27T07:53:44Z","title":"QuickDrop: Efficient Federated Unlearning by Integrated Dataset\n Distillation","summary":" Federated Unlearning (FU) aims to delete specific training data from an ML\nmodel trained using Federated Learning (FL). We introduce QuickDrop, an\nefficient and original FU method that utilizes dataset distillation (DD) to\naccelerate unlearning and drastically reduces computational overhead compared\nto existing approaches. In QuickDrop, each client uses DD to generate a compact\ndataset representative of the original training dataset, called a distilled\ndataset, and uses this compact dataset during unlearning. 
To unlearn specific\nknowledge from the global model, QuickDrop has clients execute Stochastic\nGradient Ascent with samples from the distilled datasets, thus significantly\nreducing computational overhead compared to conventional FU methods. We further\nincrease the efficiency of QuickDrop by ingeniously integrating DD into the FL\ntraining process. By reusing the gradient updates produced during FL training\nfor DD, the overhead of creating distilled datasets becomes close to\nnegligible. Evaluations on three standard datasets show that, with comparable\naccuracy guarantees, QuickDrop reduces the duration of unlearning by 463.8x\ncompared to model retraining from scratch and 65.1x compared to existing FU\napproaches. We also demonstrate the scalability of QuickDrop with 100 clients\nand show its effectiveness while handling multiple unlearning operations.\n","authors":["Akash Dhasade","Yaohong Ding","Song Guo","Anne-marie Kermarrec","Martijn De Vos","Leijie Wu"],"pdf_url":"https://arxiv.org/pdf/2311.15603v2.pdf","comment":"Accepted by Middleware 2024"},{"id":"http://arxiv.org/abs/2412.05010v1","updated":"2024-12-06T13:03:22Z","published":"2024-12-06T13:03:22Z","title":"Backdooring Outlier Detection Methods: A Novel Attack Approach","summary":" There have been several efforts in backdoor attacks, but these have primarily\nfocused on the closed-set performance of classifiers (i.e., classification).\nThis has left a gap in addressing the threat to classifiers' open-set\nperformance, referred to as outlier detection in the literature. Reliable\noutlier detection is crucial for deploying classifiers in critical real-world\napplications such as autonomous driving and medical image analysis. First, we\nshow that existing backdoor attacks fall short in affecting the open-set\nperformance of classifiers, as they have been specifically designed to confuse\nintra-closed-set decision boundaries. In contrast, an effective backdoor attack\nfor outlier detection needs to confuse the decision boundary between the closed\nand open sets. Motivated by this, in this study, we propose BATOD, a novel\nBackdoor Attack targeting the Outlier Detection task. Specifically, we design\ntwo categories of triggers to shift inlier samples to outliers and vice versa.\nWe evaluate BATOD using various real-world datasets and demonstrate its\nsuperior ability to degrade the open-set performance of classifiers compared to\nprevious attacks, both before and after applying defenses.\n","authors":["ZeinabSadat Taghavi","Hossein Mirzaei"],"pdf_url":"https://arxiv.org/pdf/2412.05010v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05004v1","updated":"2024-12-06T12:59:03Z","published":"2024-12-06T12:59:03Z","title":"Prompt Transfer for Dual-Aspect Cross Domain Cognitive Diagnosis","summary":" Cognitive Diagnosis (CD) aims to evaluate students' cognitive states based on\ntheir interaction data, enabling downstream applications such as exercise\nrecommendation and personalized learning guidance. However, existing methods\noften struggle with accuracy drops in cross-domain cognitive diagnosis (CDCD),\na practical yet challenging task. While some efforts have explored\nexercise-aspect CDCD, such as crosssubject scenarios, they fail to address the\nbroader dual-aspect nature of CDCD, encompassing both student- and\nexerciseaspect variations. This diversity creates significant challenges in\ndeveloping a scenario-agnostic framework. 
To address these gaps, we propose\nPromptCD, a simple yet effective framework that leverages soft prompt transfer\nfor cognitive diagnosis. PromptCD is designed to adapt seamlessly across\ndiverse CDCD scenarios, introducing PromptCD-S for student-aspect CDCD and\nPromptCD-E for exercise-aspect CDCD. Extensive experiments on real-world\ndatasets demonstrate the robustness and effectiveness of PromptCD, consistently\nachieving superior performance across various CDCD scenarios. Our work offers a\nunified and generalizable approach to CDCD, advancing both theoretical and\npractical understanding in this critical domain. The implementation of our\nframework is publicly available at\nhttps://github.com/Publisher-PromptCD/PromptCD.\n","authors":["Fei Liu","Yizhong Zhang","Shuochen Liu","Shengwei Ji","Kui Yu","Le Wu"],"pdf_url":"https://arxiv.org/pdf/2412.05004v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05000v1","updated":"2024-12-06T12:52:24Z","published":"2024-12-06T12:52:24Z","title":"Noise Matters: Diffusion Model-based Urban Mobility Generation with\n Collaborative Noise Priors","summary":" With global urbanization, the focus on sustainable cities has largely grown,\ndriving research into equity, resilience, and urban planning, which often\nrelies on mobility data. The rise of web-based apps and mobile devices has\nprovided valuable user data for mobility-related research. However, real-world\nmobility data is costly and raises privacy concerns. To protect privacy while\nretaining key features of real-world movement, the demand for synthetic data\nhas steadily increased. Recent advances in diffusion models have shown great\npotential for mobility trajectory generation due to their ability to model\nrandomness and uncertainty. However, existing approaches often directly apply\nidentically distributed (i.i.d.) noise sampling from image generation\ntechniques, which fail to account for the spatiotemporal correlations and\nsocial interactions that shape urban mobility patterns. In this paper, we\npropose CoDiffMob, a diffusion method for urban mobility generation with\ncollaborative noise priors, we emphasize the critical role of noise in\ndiffusion models for generating mobility data. By leveraging both individual\nmovement characteristics and population-wide dynamics, we construct novel\ncollaborative noise priors that provide richer and more informative guidance\nthroughout the generation process. Extensive experiments demonstrate the\nsuperiority of our method, with generated data accurately capturing both\nindividual preferences and collective patterns, achieving an improvement of\nover 32\\%. Furthermore, it can effectively replace web-derived mobility data to\nbetter support downstream applications, while safeguarding user privacy and\nfostering a more secure and ethical web. 
This highlights its tremendous\npotential for applications in sustainable city-related research.\n","authors":["Yuheng Zhang","Yuan Yuan","Jingtao Ding","Jian Yuan","Yong Li"],"pdf_url":"https://arxiv.org/pdf/2412.05000v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.05250v3","updated":"2024-12-06T12:40:53Z","published":"2024-06-07T20:22:36Z","title":"LLM-Enhanced Bayesian Optimization for Efficient Analog Layout\n Constraint Generation","summary":" Analog layout synthesis faces significant challenges due to its dependence on\nmanual processes, considerable time requirements, and performance instability.\nCurrent Bayesian Optimization (BO)-based techniques for analog layout\nsynthesis, despite their potential for automation, suffer from slow convergence\nand extensive data needs, limiting their practical application. This paper\npresents the \\texttt{LLANA} framework, a novel approach that leverages Large\nLanguage Models (LLMs) to enhance BO by exploiting the few-shot learning\nabilities of LLMs for more efficient generation of analog design-dependent\nparameter constraints. Experimental results demonstrate that \\texttt{LLANA} not\nonly achieves performance comparable to state-of-the-art (SOTA) BO methods but\nalso enables a more effective exploration of the analog circuit design space,\nthanks to LLM's superior contextual understanding and learning efficiency. The\ncode is available at https://github.com/dekura/LLANA.\n","authors":["Guojin Chen","Keren Zhu","Seunggeun Kim","Hanqing Zhu","Yao Lai","Bei Yu","David Z. Pan"],"pdf_url":"https://arxiv.org/pdf/2406.05250v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.02976v2","updated":"2024-12-06T12:39:00Z","published":"2024-09-04T13:59:38Z","title":"Hallucination Detection in LLMs: Fast and Memory-Efficient Fine-Tuned\n Models","summary":" Uncertainty estimation is a necessary component when implementing AI in\nhigh-risk settings, such as autonomous cars, medicine, or insurances. Large\nLanguage Models (LLMs) have seen a surge in popularity in recent years, but\nthey are subject to hallucinations, which may cause serious harm in high-risk\nsettings. Despite their success, LLMs are expensive to train and run: they need\na large amount of computations and memory, preventing the use of ensembling\nmethods in practice. In this work, we present a novel method that allows for\nfast and memory-friendly training of LLM ensembles. We show that the resulting\nensembles can detect hallucinations and are a viable approach in practice as\nonly one GPU is needed for training and inference.\n","authors":["Gabriel Y. Arteaga","Thomas B. Schön","Nicolas Pielawski"],"pdf_url":"https://arxiv.org/pdf/2409.02976v2.pdf","comment":"6 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.04986v1","updated":"2024-12-06T12:15:11Z","published":"2024-12-06T12:15:11Z","title":"Power Plant Detection for Energy Estimation using GIS with Remote\n Sensing, CNN & Vision Transformers","summary":" In this research, we propose a hybrid model for power plant detection to\nassist energy estimation applications, by pipelining GIS (Geographical\nInformation Systems) having Remote Sensing capabilities with CNN (Convolutional\nNeural Networks) and ViT (Vision Transformers). Our proposed approach enables\nreal-time analysis with multiple data types on a common map via the GIS,\nentails feature-extraction abilities due to the CNN, and captures long-range\ndependencies through the ViT. 
This hybrid approach is found to enhance\nclassification, thus helping in the monitoring and operational management of\npower plants; hence assisting energy estimation and sustainable energy planning\nin the future. It exemplifies adequate deployment of machine learning methods\nin conjunction with domain-specific approaches to enhance performance.\n","authors":["Blessing Austin-Gabriel","Cristian Noriega Monsalve","Aparna S. Varde"],"pdf_url":"https://arxiv.org/pdf/2412.04986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04984v1","updated":"2024-12-06T12:09:50Z","published":"2024-12-06T12:09:50Z","title":"Frontier Models are Capable of In-context Scheming","summary":" Frontier models are increasingly trained and deployed as autonomous agent.\nOne safety concern is that AI agents might covertly pursue misaligned goals,\nhiding their true capabilities and objectives - also known as scheming. We\nstudy whether models have the capability to scheme in pursuit of a goal that we\nprovide in-context and instruct the model to strongly follow. We evaluate\nfrontier models on a suite of six agentic evaluations where models are\ninstructed to pursue goals and are placed in environments that incentivize\nscheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini\n1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities.\nThey recognize scheming as a viable strategy and readily engage in such\nbehavior. For example, models strategically introduce subtle mistakes into\ntheir responses, attempt to disable their oversight mechanisms, and even\nexfiltrate what they believe to be their model weights to external servers.\nAdditionally, this deceptive behavior proves persistent. When o1 has engaged in\nscheming, it maintains its deception in over 85% of follow-up questions and\noften remains deceptive in multi-turn interrogations. Analysis of the models'\nchains-of-thought reveals that models explicitly reason about these deceptive\nstrategies, providing evidence that the scheming behavior is not accidental.\nSurprisingly, we also find rare instances where models engage in scheming when\nonly given a goal, without being strongly nudged to pursue it. We observe cases\nwhere Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit\nof being helpful, a goal that was acquired during training rather than\nin-context. Our findings demonstrate that frontier models now possess\ncapabilities for basic in-context scheming, making the potential of AI agents\nto engage in scheming behavior a concrete rather than theoretical concern.\n","authors":["Alexander Meinke","Bronson Schoen","Jérémy Scheurer","Mikita Balesni","Rusheb Shah","Marius Hobbhahn"],"pdf_url":"https://arxiv.org/pdf/2412.04984v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04981v1","updated":"2024-12-06T12:01:04Z","published":"2024-12-06T12:01:04Z","title":"Causal discovery with endogenous context variables","summary":" Causal systems often exhibit variations of the underlying causal mechanisms\nbetween the variables of the system. Often, these changes are driven by\ndifferent environments or internal states in which the system operates, and we\nrefer to context variables as those variables that indicate this change in\ncausal mechanisms. 
An example is the causal relations in soil\nmoisture-temperature interactions and their dependence on soil moisture\nregimes: Dry soil triggers a dependence of soil moisture on latent heat, while\nenvironments with wet soil do not feature such a feedback, making it a\ncontext-specific property. Crucially, a regime or context variable such as soil\nmoisture need not be exogenous and can be influenced by the dynamical system\nvariables - precipitation can make a dry soil wet - leading to joint systems\nwith endogenous context variables. In this work, we investigate the assumptions\nfor constraint-based causal discovery of context-specific information in\nsystems with endogenous context variables. We show that naive approaches such\nas learning different regime graphs on masked data, or pooling all data, can\nlead to uninformative results. We propose an adaptive constraint-based\ndiscovery algorithm and give a detailed discussion on the connection to\nstructural causal models, including sufficiency assumptions, which allow us to\nprove the soundness of our algorithm and to interpret the results causally.\nNumerical experiments demonstrate the performance of the proposed method over\nalternative baselines, but they also unveil current limitations of our method.\n","authors":["Wiebke Günther","Oana-Iuliana Popescu","Martin Rabel","Urmi Ninad","Andreas Gerhardus","Jakob Runge"],"pdf_url":"https://arxiv.org/pdf/2412.04981v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04974v1","updated":"2024-12-06T11:48:49Z","published":"2024-12-06T11:48:49Z","title":"Putting the Iterative Training of Decision Trees to the Test on a\n Real-World Robotic Task","summary":" In previous research, we developed methods to train decision trees (DT) as\nagents for reinforcement learning tasks, based on deep reinforcement learning\n(DRL) networks. The samples from which the DTs are built use the environment's\nstate as features and the corresponding action as the label. To solve the\nnontrivial task of selecting samples, which on one hand reflect the DRL agent's\ncapabilities of choosing the right action but on the other hand also cover\nenough state space to generalize well, we developed an algorithm to iteratively\ntrain DTs.\n In this short paper, we apply this algorithm to a real-world implementation\nof a robotic task for the first time. Real-world tasks pose additional\nchallenges compared to simulations, such as noise and delays. The task consists\nof a physical pendulum attached to a cart, which moves on a linear track. By\nmovements to the left and to the right, the pendulum is to be swung into the\nupright position and balanced in the unstable equilibrium. Our results\ndemonstrate the applicability of the algorithm to real-world tasks by\ngenerating a DT whose performance matches the performance of the DRL agent,\nwhile consisting of fewer parameters. This research could be a starting point\nfor distilling DTs from DRL agents to obtain transparent, lightweight models\nfor real-world reinforcement learning tasks.\n","authors":["Raphael C. Engelhardt","Marcel J. Meinen","Moritz Lange","Laurenz Wiskott","Wolfgang Konen"],"pdf_url":"https://arxiv.org/pdf/2412.04974v1.pdf","comment":"5 pages, 4 figures"},{"id":"http://arxiv.org/abs/2407.02031v2","updated":"2024-12-06T11:47:06Z","published":"2024-07-02T07:59:08Z","title":"SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules","summary":" Text-to-image (T2I) generation using diffusion models has become a\nblockbuster service in today's AI cloud. 
A production T2I service typically\ninvolves a serving workflow where a base diffusion model is augmented with\nvarious \"add-on\" modules, notably ControlNet and LoRA, to enhance image\ngeneration control. Compared to serving the base model alone, these add-on\nmodules introduce significant loading and computational overhead, resulting in\nincreased latency. In this paper, we present SwiftDiffusion, a system that\nefficiently serves a T2I workflow through a holistic approach. SwiftDiffusion\ndecouples ControlNet from the base model and deploys it as a separate,\nindependently scaled service on dedicated GPUs, enabling ControlNet caching,\nparallelization, and sharing. To mitigate the high loading overhead of LoRA\nserving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL)\ntechnique, allowing LoRA loading to overlap with the initial base model\nexecution by up to k steps without compromising image quality. Furthermore,\nSwiftDiffusion optimizes base model execution with a novel latent parallelism\ntechnique. Collectively, these designs enable SwiftDiffusion to outperform the\nstate-of-the-art T2I serving systems, achieving up to 7.8x latency reduction\nand 1.6x throughput improvement in serving SDXL models on H800 GPUs, without\nsacrificing image quality.\n","authors":["Suyi Li","Lingyun Yang","Xiaoxiao Jiang","Hanfeng Lu","Dakai An","Zhipeng Di","Weiyi Lu","Jiawei Chen","Kan Liu","Yinghao Yu","Tao Lan","Guodong Yang","Lin Qu","Liping Zhang","Wei Wang"],"pdf_url":"https://arxiv.org/pdf/2407.02031v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.16664v3","updated":"2024-12-06T11:29:27Z","published":"2023-08-31T12:12:56Z","title":"What can we learn from quantum convolutional neural networks?","summary":" Quantum machine learning (QML) shows promise for analyzing quantum data. A\nnotable example is the use of quantum convolutional neural networks (QCNNs),\nimplemented as specific types of quantum circuits, to recognize phases of\nmatter. In this approach, ground states of many-body Hamiltonians are prepared\nto form a quantum dataset and classified in a supervised manner using only a\nfew labeled examples. However, this type of dataset and model differs\nfundamentally from typical QML paradigms based on feature maps and\nparameterized circuits. In this study, we demonstrate how models utilizing\nquantum data can be interpreted through hidden feature maps, where physical\nfeatures are implicitly embedded via ground-state feature maps. By analyzing\nselected examples previously explored with QCNNs, we show that high performance\nin quantum phase recognition comes from generating a highly effective basis set\nwith sharp features at critical points. The learning process adapts the\nmeasurement to create sharp decision boundaries. Our analysis highlights\nimproved generalization when working with quantum data, particularly in the\nlimited-shots regime. Furthermore, translating these insights into the domain\nof quantum scientific machine learning, we demonstrate that ground-state\nfeature maps can be applied to fluid dynamics problems, expressing shock wave\nsolutions with good generalization and proven trainability.\n","authors":["Chukwudubem Umeano","Annie E. 
Elfving","Oleksandr Kyriienko"],"pdf_url":"https://arxiv.org/pdf/2308.16664v3.pdf","comment":"15 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.04954v1","updated":"2024-12-06T11:14:03Z","published":"2024-12-06T11:14:03Z","title":"Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for\n Radiology Report Generation","summary":" We introduce a radiology-focused visual language model designed to generate\nradiology reports from chest X-rays. Building on previous findings that large\nlanguage models (LLMs) can acquire multimodal capabilities when aligned with\npretrained vision encoders, we demonstrate similar potential with chest X-ray\nimages. This integration enhances the ability of model to understand and\ndescribe chest X-ray images. Our model combines an image encoder with a\nfine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate\ndifferent sections of a radiology report with notable accuracy. The training\nprocess involves a two-stage approach: (i) initial alignment of chest X-ray\nfeatures with the LLM (ii) followed by fine-tuning for radiology report\ngeneration.\n","authors":["Xi Zhang","Zaiqiao Meng","Jake Lever","Edmond S. L. Ho"],"pdf_url":"https://arxiv.org/pdf/2412.04954v1.pdf","comment":"Accepted by BioNLP@ACL 2024"},{"id":"http://arxiv.org/abs/2406.12945v3","updated":"2024-12-06T11:13:18Z","published":"2024-06-18T07:27:38Z","title":"Under the Hood of Tabular Data Generation Models: Benchmarks with\n Extensive Tuning","summary":" The ability to train generative models that produce realistic, safe and\nuseful tabular data is essential for data privacy, imputation, oversampling,\nexplainability or simulation. However, generating tabular data is not\nstraightforward due to its heterogeneity, non-smooth distributions, complex\ndependencies and imbalanced categorical features. Although diverse methods have\nbeen proposed in the literature, there is a need for a unified evaluation,\nunder the same conditions, on a variety of datasets. This study addresses this\nneed by fully considering the optimization of: hyperparameters, feature\nencodings, and architectures. We investigate the impact of dataset-specific\ntuning on five recent model families for tabular data generation through an\nextensive benchmark on 16 datasets. These datasets vary in terms of size (an\naverage of 80,000 rows), data types, and domains. We also propose a reduced\nsearch space for each model that allows for quick optimization, achieving\nnearly equivalent performance at a significantly lower cost. Our benchmark\ndemonstrates that, for most models, large-scale dataset-specific tuning\nsubstantially improves performance compared to the original configurations.\nFurthermore, we confirm that diffusion-based models generally outperform other\nmodels on tabular data. However, this advantage is not significant when the\nentire tuning and training process is restricted to the same GPU budget.\n","authors":["G. Charbel N. Kindji","Lina Maria Rojas-Barahona","Elisa Fromont","Tanguy Urvoy"],"pdf_url":"https://arxiv.org/pdf/2406.12945v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04950v1","updated":"2024-12-06T11:08:47Z","published":"2024-12-06T11:08:47Z","title":"Bed-Attached Vibration Sensor System: A Machine Learning Approach for\n Fall Detection in Nursing Homes","summary":" The increasing shortage of nursing staff and the acute risk of falls in\nnursing homes pose significant challenges for the healthcare system. 
This study\npresents the development of an automated fall detection system integrated into\ncare beds, aimed at enhancing patient safety without compromising privacy\nthrough wearables or video monitoring. Mechanical vibrations transmitted\nthrough the bed frame are processed using a short-time Fourier transform,\nenabling robust classification of distinct human fall patterns with a\nconvolutional neural network. Challenges pertaining to the quantity and\ndiversity of the data are addressed, proposing the generation of additional\ndata with a specific emphasis on enhancing variation. While the model shows\npromising results in distinguishing fall events from noise using lab data,\nfurther testing in real-world environments is recommended for validation and\nimprovement. Despite limited available data, the proposed system shows the\npotential for an accurate and rapid response to falls, mitigating health\nimplications, and addressing the needs of an aging population. This case study\nwas performed as part of the ZIM Project. Further research on sensors enhanced\nby artificial intelligence will be continued in the ShapeFuture Project.\n","authors":["Thomas Bartz-Beielstein","Axel Wellendorf","Noah Pütz","Jens Brandt","Alexander Hinterleitner","Richard Schulz","Richard Scholz","Olaf Mersmann","Robin Knabe"],"pdf_url":"https://arxiv.org/pdf/2412.04950v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.14949v3","updated":"2024-12-06T11:00:14Z","published":"2024-10-19T02:36:11Z","title":"2-Rectifications are Enough for Straight Flows: A Theoretical Insight\n into Wasserstein Convergence","summary":" Diffusion models have emerged as a powerful tool for image generation and\ndenoising. Typically, generative models learn a trajectory between the starting\nnoise distribution and the target data distribution. Recently Liu et al.\n(2023b) designed a novel alternative generative model Rectified Flow (RF),\nwhich aims to learn straight flow trajectories from noise to data using a\nsequence of convex optimization problems with close ties to optimal transport.\nIf the trajectory is curved, one must use many Euler discretization steps or\nnovel strategies, such as exponential integrators, to achieve a satisfactory\ngeneration quality. In contrast, RF has been shown to theoretically straighten\nthe trajectory through successive rectifications, reducing the number of\nfunction evaluations (NFEs) while sampling. It has also been shown empirically\nthat RF may improve the straightness in two rectifications if one can solve the\nunderlying optimization problem within a sufficiently small error. In this\npaper, we make two key theoretical contributions: 1) we provide the first\ntheoretical analysis of the Wasserstein distance between the sampling\ndistribution of RF and the target distribution. Our error rate is characterized\nby the number of discretization steps and a \\textit{new formulation of\nstraightness} stronger than that in the original work. 2) under a mild\nregularity assumption, we show that for a rectified flow from a Gaussian to any\ngeneral target distribution with finite first moment (e.g. mixture of\nGaussians), two rectifications are sufficient to achieve a straight flow, which\nis in line with the previous empirical findings. 
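The bed-attached fall-detection abstract above describes a pipeline of a short-time Fourier transform over bed-frame vibrations followed by a convolutional classifier. The following is a minimal, hedged sketch of that kind of pipeline; the sampling rate, window sizes, network shape, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: STFT spectrogram of a bed-frame vibration signal fed to a
# small CNN classifier (fall vs. noise). All hyperparameters are assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import stft

FS = 1000  # assumed sampling rate of the vibration sensor in Hz

def vibration_to_spectrogram(signal: np.ndarray) -> torch.Tensor:
    """Compute a log-magnitude STFT spectrogram as a 1 x F x T tensor."""
    _, _, zxx = stft(signal, fs=FS, nperseg=256, noverlap=128)
    spec = np.log1p(np.abs(zxx)).astype(np.float32)
    return torch.from_numpy(spec).unsqueeze(0)  # add channel dimension

class FallDetectorCNN(nn.Module):
    """Minimal CNN that classifies a spectrogram as 'fall' vs. 'noise'."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(16 * 4 * 4, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

# Toy usage on a synthetic 2-second vibration snippet.
snippet = np.random.randn(2 * FS)
logits = FallDetectorCNN()(vibration_to_spectrogram(snippet).unsqueeze(0))
print(logits.shape)  # torch.Size([1, 2])
```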
Additionally, we also present\nempirical results on both simulated and real datasets to validate our\ntheoretical findings.\n","authors":["Saptarshi Roy","Vansh Bansal","Purnamrita Sarkar","Alessandro Rinaldo"],"pdf_url":"https://arxiv.org/pdf/2410.14949v3.pdf","comment":"28 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.04936v1","updated":"2024-12-06T10:44:20Z","published":"2024-12-06T10:44:20Z","title":"Probing the contents of semantic representations from text, behavior,\n and brain data using the psychNorms metabase","summary":" Semantic representations are integral to natural language processing,\npsycholinguistics, and artificial intelligence. Although often derived from\ninternet text, recent years have seen a rise in the popularity of\nbehavior-based (e.g., free associations) and brain-based (e.g., fMRI)\nrepresentations, which promise improvements in our ability to measure and model\nhuman representations. We carry out the first systematic evaluation of the\nsimilarities and differences between semantic representations derived from\ntext, behavior, and brain data. Using representational similarity analysis, we\nshow that word vectors derived from behavior and brain data encode information\nthat differs from their text-derived cousins. Furthermore, drawing on our\npsychNorms metabase, alongside an interpretability method that we call\nrepresentational content analysis, we find that, in particular, behavior\nrepresentations capture unique variance on certain affective, agentic, and\nsocio-moral dimensions. We thus establish behavior as an important complement\nto text for capturing human representations and behavior. These results are\nbroadly relevant to research aimed at learning human-aligned semantic\nrepresentations, including work on evaluating and aligning large language\nmodels.\n","authors":["Zak Hussain","Rui Mata","Ben R. Newell","Dirk U. Wulff"],"pdf_url":"https://arxiv.org/pdf/2412.04936v1.pdf","comment":"13 pages, 5 figures, 2 tables"},{"id":"http://arxiv.org/abs/2412.02865v3","updated":"2024-12-06T10:38:02Z","published":"2024-12-03T22:00:12Z","title":"Memory-efficient Continual Learning with Neural Collapse Contrastive","summary":" Contrastive learning has significantly improved representation quality,\nenhancing knowledge transfer across tasks in continual learning (CL). However,\ncatastrophic forgetting remains a key challenge, as contrastive based methods\nprimarily focus on \"soft relationships\" or \"softness\" between samples, which\nshift with changing data distributions and lead to representation overlap\nacross tasks. Recently, the newly identified Neural Collapse phenomenon has\nshown promise in CL by focusing on \"hard relationships\" or \"hardness\" between\nsamples and fixed prototypes. However, this approach overlooks \"softness\",\ncrucial for capturing intra-class variability, and this rigid focus can also\npull old class representations toward current ones, increasing forgetting.\nBuilding on these insights, we propose Focal Neural Collapse Contrastive\n(FNC^2), a novel representation learning loss that effectively balances both\nsoft and hard relationships. Additionally, we introduce the Hardness-Softness\nDistillation (HSD) loss to progressively preserve the knowledge gained from\nthese relationships across tasks. Our method outperforms state-of-the-art\napproaches, particularly in minimizing memory reliance. 
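The Rectified Flow paper above analyzes how many rectifications are needed for straight flows. As background, here is a minimal sketch of the generic RF training recipe it builds on: a velocity field is regressed onto x1 - x0 along the straight interpolation between noise and data. The toy data, network, and step counts are assumptions, not the paper's experiments.

```python
# Toy Rectified Flow objective: regress v(x_t, t) onto (x1 - x0) along the straight
# interpolation x_t = (1 - t) x0 + t x1; sample by integrating dx/dt = v(x, t).
import torch
import torch.nn as nn

dim = 2
velocity = nn.Sequential(nn.Linear(dim + 1, 64), nn.Tanh(), nn.Linear(64, dim))
opt = torch.optim.Adam(velocity.parameters(), lr=1e-3)

def sample_data(n: int) -> torch.Tensor:
    """Toy target distribution: a mixture of two Gaussians."""
    centers = torch.tensor([[2.0, 2.0], [-2.0, -2.0]])
    idx = torch.randint(0, 2, (n,))
    return centers[idx] + 0.3 * torch.randn(n, dim)

for step in range(1000):
    x1 = sample_data(256)              # data samples
    x0 = torch.randn_like(x1)          # Gaussian noise samples
    t = torch.rand(x1.size(0), 1)      # uniform times in [0, 1]
    xt = (1 - t) * x0 + t * x1         # straight-line interpolation
    pred = velocity(torch.cat([xt, t], dim=1))
    loss = ((pred - (x1 - x0)) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Sampling with a few Euler steps; straighter flows need fewer steps, which is
# the motivation for rectification discussed in the paper.
x = torch.randn(512, dim)
for k in range(10):
    t = torch.full((x.size(0), 1), k / 10)
    x = x + 0.1 * velocity(torch.cat([x, t], dim=1))
```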
Remarkably, even\nwithout the use of memory, our approach rivals rehearsal-based methods,\noffering a compelling solution for data privacy concerns.\n","authors":["Trung-Anh Dang","Vincent Nguyen","Ngoc-Son Vu","Christel Vrain"],"pdf_url":"https://arxiv.org/pdf/2412.02865v3.pdf","comment":"Accepted at WACV 2025"},{"id":"http://arxiv.org/abs/2412.04930v1","updated":"2024-12-06T10:35:45Z","published":"2024-12-06T10:35:45Z","title":"Video Decomposition Prior: A Methodology to Decompose Videos into Layers","summary":" In the evolving landscape of video enhancement and editing methodologies, a\nmajority of deep learning techniques often rely on extensive datasets of\nobserved input and ground truth sequence pairs for optimal performance. Such\nreliance often falters when acquiring data becomes challenging, especially in\ntasks like video dehazing and relighting, where replicating identical motions\nand camera angles in both corrupted and ground truth sequences is complicated.\nMoreover, these conventional methodologies perform best when the test\ndistribution closely mirrors the training distribution. Recognizing these\nchallenges, this paper introduces a novel video decomposition prior\n`\\texttt{VDP}' framework which derives inspiration from professional video\nediting practices. Our methodology does not mandate task-specific external data\ncorpus collection; instead, it pivots to utilizing the motion and appearance of the\ninput video. The \\texttt{VDP} framework decomposes a video sequence into a set of\nmultiple RGB layers and associated opacity levels. These layers are then\nmanipulated individually to obtain the desired results. We address tasks such\nas video object segmentation, dehazing, and relighting. Moreover, we introduce\na novel logarithmic video decomposition formulation for video relighting tasks,\nsetting a new benchmark over the existing methodologies. We observe the\nproperty of relighting emerge as we optimize for our novel relighting\ndecomposition formulation. We evaluate our approach on standard video datasets\nlike DAVIS, REVIDE, \\& SDSD and show qualitative results on a diverse array of\ninternet videos. Project Page -\nhttps://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video\nresults.\n","authors":["Gaurav Shrivastava","Ser-Nam Lim","Abhinav Shrivastava"],"pdf_url":"https://arxiv.org/pdf/2412.04930v1.pdf","comment":"Project Page -\n https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video\n results. Extended version of ICLR publication"},{"id":"http://arxiv.org/abs/2412.04929v1","updated":"2024-12-06T10:34:50Z","published":"2024-12-06T10:34:50Z","title":"Continuous Video Process: Modeling Videos as Continuous\n Multi-Dimensional Processes for Video Prediction","summary":" Diffusion models have made significant strides in image generation, mastering\ntasks such as unconditional image synthesis, text-image translation, and\nimage-to-image conversions. However, their capability falls short in the realm\nof video prediction, mainly because they treat videos as a collection of\nindependent images, relying on external constraints such as temporal attention\nmechanisms to enforce temporal coherence. In our paper, we introduce a novel\nmodel class that treats video as a continuous multi-dimensional process rather\nthan a series of discrete frames. We also report a 75\\% reduction in the sampling\nsteps required to sample a new frame, thus making our framework more efficient\nduring inference. 
Through extensive experimentation, we establish\nstate-of-the-art performance in video prediction, validated on benchmark\ndatasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project\npage https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results.\n","authors":["Gaurav Shrivastava","Abhinav Shrivastava"],"pdf_url":"https://arxiv.org/pdf/2412.04929v1.pdf","comment":"Navigate to the project page\n https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results.\n Extended version of published CVPR paper"},{"id":"http://arxiv.org/abs/2412.04914v1","updated":"2024-12-06T10:10:47Z","published":"2024-12-06T10:10:47Z","title":"Achieving Group Fairness through Independence in Predictive Process\n Monitoring","summary":" Predictive process monitoring focuses on forecasting future states of ongoing\nprocess executions, such as predicting the outcome of a particular case. In\nrecent years, the application of machine learning models in this domain has\ngarnered significant scientific attention. When using historical execution\ndata, which may contain biases or exhibit unfair behavior, these biases may be\nencoded into the trained models. Consequently, when such models are deployed to\nmake decisions or guide interventions for new cases, they risk perpetuating\nthis unwanted behavior. This work addresses group fairness in predictive\nprocess monitoring by investigating independence, i.e., ensuring predictions are\nunaffected by sensitive group membership. We explore independence through\nmetrics for demographic parity such as $\\Delta$DP, as well as recently\nintroduced, threshold-independent distribution-based alternatives.\nAdditionally, we propose a composite loss function consisting of binary\ncross-entropy and a distribution-based loss (Wasserstein) to train models that\nbalance predictive performance and fairness, and allow for customizable\ntrade-offs. The effectiveness of both the fairness metrics and the composite\nloss function is validated through a controlled experimental setup.\n","authors":["Jari Peeperkorn","Simon De Vos"],"pdf_url":"https://arxiv.org/pdf/2412.04914v1.pdf","comment":"Preprint"},{"id":"http://arxiv.org/abs/2412.04910v1","updated":"2024-12-06T10:05:10Z","published":"2024-12-06T10:05:10Z","title":"Learning High-Degree Parities: The Crucial Role of the Initialization","summary":" Parities have become a standard benchmark for evaluating learning algorithms.\nRecent works show that regular neural networks trained by gradient descent can\nefficiently learn degree $k$ parities on uniform inputs for constant $k$, but\nfail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient\ndimension). However, the case where $k=d-O_d(1)$ (almost-full parities),\nincluding the degree $d$ parity (the full parity), has remained unsettled. This\npaper shows that for gradient descent on regular neural networks, learnability\ndepends on the initial weight distribution. On one hand, the discrete\nRademacher initialization enables efficient learning of almost-full parities,\nwhile on the other hand, its Gaussian perturbation with large enough constant\nstandard deviation $\\sigma$ prevents it. The positive result for almost-full\nparities is shown to hold up to $\\sigma=O(d^{-1})$, pointing to questions about\na sharper threshold phenomenon. 
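The group-fairness abstract above proposes combining binary cross-entropy with a distribution-based (Wasserstein) term under a tunable trade-off. Below is a hedged sketch of one such composite loss: the 1D Wasserstein-1 distance between the predicted score distributions of two sensitive groups is approximated by quantile matching. The quantile grid, weighting scheme, and function names are assumptions for illustration, not the authors' code.

```python
# Composite objective sketch: BCE for accuracy + W1 distance between per-group
# prediction distributions for independence, weighted by lambda_fair.
import torch
import torch.nn.functional as F

def wasserstein_1d(scores_a: torch.Tensor, scores_b: torch.Tensor, n_q: int = 64) -> torch.Tensor:
    """Approximate W1 between two 1D empirical distributions via quantile matching."""
    qs = torch.linspace(0.0, 1.0, n_q, device=scores_a.device)
    return (torch.quantile(scores_a, qs) - torch.quantile(scores_b, qs)).abs().mean()

def composite_loss(logits, labels, group, lambda_fair: float = 1.0):
    """Binary cross-entropy plus a distribution-based fairness penalty."""
    probs = torch.sigmoid(logits)
    bce = F.binary_cross_entropy(probs, labels.float())
    fair = wasserstein_1d(probs[group == 0], probs[group == 1])
    return bce + lambda_fair * fair

# Toy usage: random scores, labels, and a binary sensitive attribute.
logits = torch.randn(128, requires_grad=True)
labels = torch.randint(0, 2, (128,))
group = torch.randint(0, 2, (128,))
loss = composite_loss(logits, labels, group, lambda_fair=0.5)
loss.backward()
```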
Unlike statistical query (SQ) learning, where a\nsingleton function class like the full parity is trivially learnable, our\nnegative result applies to a fixed function and relies on an initial gradient\nalignment measure of potential broader relevance to neural networks learning.\n","authors":["Emmanuel Abbe","Elisabetta Cornacchia","Jan Hązła","Donald Kougang-Yombi"],"pdf_url":"https://arxiv.org/pdf/2412.04910v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04905v1","updated":"2024-12-06T10:01:38Z","published":"2024-12-06T10:01:38Z","title":"DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling","summary":" Large language models (LLMs) have made dialogue one of the central modes of\nhuman-machine interaction, leading to the accumulation of vast amounts of\nconversation logs and increasing demand for dialogue generation. A\nconversational life-cycle spans from the Prelude through the Interlocution to\nthe Epilogue, encompassing various elements. Despite the existence of numerous\ndialogue-related studies, there is a lack of benchmarks that encompass\ncomprehensive dialogue elements, hindering precise modeling and systematic\nevaluation. To bridge this gap, we introduce an innovative research task\n$\\textbf{D}$ialogue $\\textbf{E}$lement $\\textbf{MO}$deling, including\n$\\textit{Element Awareness}$ and $\\textit{Dialogue Agent Interaction}$, and\npropose a novel benchmark, $\\textbf{DEMO}$, designed for a comprehensive\ndialogue modeling and assessment. Inspired by imitation learning, we further\nbuild the agent which possesses the adept ability to model dialogue elements\nbased on the DEMO benchmark. Extensive experiments indicate that existing LLMs\nstill exhibit considerable potential for enhancement, and our DEMO agent has\nsuperior performance in both in-domain and out-of-domain tasks.\n","authors":["Minzheng Wang","Xinghua Zhang","Kun Chen","Nan Xu","Haiyang Yu","Fei Huang","Wenji Mao","Yongbin Li"],"pdf_url":"https://arxiv.org/pdf/2412.04905v1.pdf","comment":"We release the code and data at https://github.com/MozerWang/DEMO"},{"id":"http://arxiv.org/abs/2412.04903v1","updated":"2024-12-06T09:59:47Z","published":"2024-12-06T09:59:47Z","title":"EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation","summary":" Multimodal large language models (MLLMs) have achieved remarkable progress on\nvarious visual question answering and reasoning tasks leveraging instruction\nfine-tuning specific datasets. They can also learn from preference data\nannotated by human to enhance their reasoning ability and mitigate\nhallucinations. Most of preference data is generated from the model itself.\nHowever, existing methods require high-quality critical labels, which are\ncostly and rely on human or proprietary models like GPT-4V. In this work, we\npropose Enhancing Alignment in MLLMs via Critical Observation (EACO), which\naligns MLLMs by self-generated preference data using only 5k images\neconomically. Our approach begins with collecting and refining a Scoring\nEvaluation Instruction-tuning dataset to train a critical evaluation model,\ntermed the Critic. This Critic observes model responses across multiple\ndimensions, selecting preferred and non-preferred outputs for refined Direct\nPreference Optimization (DPO) tuning. To further enhance model performance, we\nemploy an additional supervised fine-tuning stage after preference tuning. 
EACO\nreduces the overall hallucinations by 65.6% on HallusionBench and improves the\nreasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement\nover LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also\nshows the potential critical ability in open-source MLLMs, demonstrating that\nEACO is a viable path to boost the competence of MLLMs.\n","authors":["Yongxin Wang","Meng Cao","Haokun Lin","Mingfei Han","Liang Ma","Jin Jiang","Yuhao Cheng","Xiaodan Liang"],"pdf_url":"https://arxiv.org/pdf/2412.04903v1.pdf","comment":"19 pages"},{"id":"http://arxiv.org/abs/2412.04898v1","updated":"2024-12-06T09:56:49Z","published":"2024-12-06T09:56:49Z","title":"Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised\n Pretraining with Pseudo-Label Refinement","summary":" Deep learning models rely heavily on large volumes of labeled data to achieve\nhigh performance. However, real-world datasets often contain noisy labels due\nto human error, ambiguity, or resource constraints during the annotation\nprocess. Instance-dependent label noise (IDN), where the probability of a label\nbeing corrupted depends on the input features, poses a significant challenge\nbecause it is more prevalent and harder to address than instance-independent\nnoise. In this paper, we propose a novel hybrid framework that combines\nself-supervised learning using SimCLR with iterative pseudo-label refinement to\nmitigate the effects of IDN. The self-supervised pre-training phase enables the\nmodel to learn robust feature representations without relying on potentially\nnoisy labels, establishing a noise-agnostic foundation. Subsequently, we employ\nan iterative training process with pseudo-label refinement, where confidently\npredicted samples are identified through a multistage approach and their labels\nare updated to improve label quality progressively. We evaluate our method on\nthe CIFAR-10 and CIFAR-100 datasets augmented with synthetic instance-dependent\nnoise at varying noise levels. Experimental results demonstrate that our\napproach significantly outperforms several state-of-the-art methods,\nparticularly under high noise conditions, achieving notable improvements in\nclassification accuracy and robustness. Our findings suggest that integrating\nself-supervised learning with iterative pseudo-label refinement offers an\neffective strategy for training deep neural networks on noisy datasets\nafflicted by instance-dependent label noise.\n","authors":["Gouranga Bala","Anuj Gupta","Subrat Kumar Behera","Amit Sethi"],"pdf_url":"https://arxiv.org/pdf/2412.04898v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2310.12000v4","updated":"2024-12-06T09:51:23Z","published":"2023-10-18T14:31:16Z","title":"Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian\n Process Models","summary":" Latent Gaussian process (GP) models are flexible probabilistic non-parametric\nfunction models. Vecchia approximations are accurate approximations for GPs to\novercome computational bottlenecks for large data, and the Laplace\napproximation is a fast method with asymptotic convergence guarantees to\napproximate marginal likelihoods and posterior predictive distributions for\nnon-Gaussian likelihoods. Unfortunately, the computational complexity of\ncombined Vecchia-Laplace approximations grows faster than linearly in the\nsample size when used in combination with direct solver methods such as the\nCholesky decomposition. 
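The EACO abstract above relies on Direct Preference Optimization over critic-selected preferred and non-preferred responses. For context, here is a minimal sketch of the standard DPO objective that such preference tuning builds on; the inputs are assumed to be pre-computed sequence log-probabilities, and nothing here is taken from the EACO implementation.

```python
# Standard DPO objective on policy/reference log-probability margins for a
# preferred (w) and non-preferred (l) response. beta is the usual temperature.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """-log sigmoid(beta * [(logpi_w - logref_w) - (logpi_l - logref_l)])."""
    margin_w = policy_logp_w - ref_logp_w   # advantage of preferred response
    margin_l = policy_logp_l - ref_logp_l   # advantage of non-preferred response
    return -F.logsigmoid(beta * (margin_w - margin_l)).mean()

# Toy usage with fabricated per-example sequence log-probabilities.
policy_w = torch.tensor([-12.3, -15.0], requires_grad=True)
policy_l = torch.tensor([-11.0, -14.2], requires_grad=True)
ref_w = torch.tensor([-12.5, -15.4])
ref_l = torch.tensor([-10.8, -14.0])
loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
loss.backward()
print(float(loss))
```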
Computations with Vecchia-Laplace approximations can\nthus become prohibitively slow precisely when the approximations are usually\nthe most accurate, i.e., on large data sets. In this article, we present\niterative methods to overcome this drawback. Among other things, we introduce\nand analyze several preconditioners, derive new convergence results, and\npropose novel methods for accurately approximating predictive variances. We\nanalyze our proposed methods theoretically and in experiments with simulated\nand real-world data. In particular, we obtain a speed-up of an order of\nmagnitude compared to Cholesky-based calculations and a threefold increase in\nprediction accuracy in terms of the continuous ranked probability score\ncompared to a state-of-the-art method on a large satellite data set. All\nmethods are implemented in a free C++ software library with high-level Python\nand R packages.\n","authors":["Pascal Kündig","Fabio Sigrist"],"pdf_url":"https://arxiv.org/pdf/2310.12000v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04884v1","updated":"2024-12-06T09:26:22Z","published":"2024-12-06T09:26:22Z","title":"AI-Driven Non-Invasive Detection and Staging of Steatosis in Fatty Liver\n Disease Using a Novel Cascade Model and Information Fusion Techniques","summary":" Non-alcoholic fatty liver disease (NAFLD) is one of the most widespread liver\ndisorders on a global scale, posing a significant threat of progressing to more\nsevere conditions like nonalcoholic steatohepatitis (NASH), liver fibrosis,\ncirrhosis, and hepatocellular carcinoma. Diagnosing and staging NAFLD presents\nchallenges due to its non-specific symptoms and the invasive nature of liver\nbiopsies. Our research introduces a novel artificial intelligence cascade model\nemploying ensemble learning and feature fusion techniques. We developed a\nnon-invasive, robust, and reliable diagnostic artificial intelligence tool that\nutilizes anthropometric and laboratory parameters, facilitating early detection\nand intervention in NAFLD progression. Our novel artificial intelligence\nachieved an 86% accuracy rate for the NASH steatosis staging task (non-NASH,\nsteatosis grade 1, steatosis grade 2, and steatosis grade 3) and an impressive\n96% AUC-ROC for distinguishing between NASH (steatosis grade 1, grade 2, and\ngrade3) and non-NASH cases, outperforming current state-of-the-art models. This\nnotable improvement in diagnostic performance underscores the potential\napplication of artificial intelligence in the early diagnosis and treatment of\nNAFLD, leading to better patient outcomes and a reduced healthcare burden\nassociated with advanced liver disease.\n","authors":["Niloufar Delfan","Pardis Ketabi Moghadam","Mohammad Khoshnevisan","Mehdi Hosseini Chagahi","Behzad Hatami","Melika Asgharzadeh","Mohammadreza Zali","Behzad Moshiri","Amin Momeni Moghaddam","Mohammad Amin Khalafi","Khosrow Dehnad"],"pdf_url":"https://arxiv.org/pdf/2412.04884v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04882v1","updated":"2024-12-06T09:25:00Z","published":"2024-12-06T09:25:00Z","title":"Nonmyopic Global Optimisation via Approximate Dynamic Programming","summary":" Unconstrained global optimisation aims to optimise expensive-to-evaluate\nblack-box functions without gradient information. Bayesian optimisation, one of\nthe most well-known techniques, typically employs Gaussian processes as\nsurrogate models, leveraging their probabilistic nature to balance exploration\nand exploitation. 
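The Vecchia-Laplace abstract above replaces Cholesky-based solves with preconditioned iterative methods. As a generic illustration of the kind of solver involved, here is a plain preconditioned conjugate gradient routine with a simple Jacobi (diagonal) preconditioner; the paper develops its own, more effective preconditioners, so this is only a stand-in sketch.

```python
# Generic preconditioned conjugate gradient (PCG) for SPD systems A x = b.
import numpy as np

def pcg(A, b, M_inv_diag, tol=1e-8, max_iter=1000):
    """Solve A x = b with a diagonal preconditioner M^{-1} = diag(M_inv_diag)."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv_diag * r
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv_diag * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy usage on a random SPD system.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 50))
A = B @ B.T + 50 * np.eye(50)
b = rng.standard_normal(50)
x = pcg(A, b, M_inv_diag=1.0 / np.diag(A))
print(np.linalg.norm(A @ x - b))  # should be tiny
```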
However, Gaussian processes become computationally\nprohibitive in high-dimensional spaces. Recent alternatives, based on inverse\ndistance weighting (IDW) and radial basis functions (RBFs), offer competitive,\ncomputationally lighter solutions. Despite their efficiency, both traditional\nglobal and Bayesian optimisation strategies suffer from the myopic nature of\ntheir acquisition functions, which focus solely on immediate improvement\nneglecting future implications of the sequential decision making process.\nNonmyopic acquisition functions devised for the Bayesian setting have shown\npromise in improving long-term performance. Yet, their use in deterministic\nstrategies with IDW and RBF remains unexplored. In this work, we introduce\nnovel nonmyopic acquisition strategies tailored to IDW- and RBF-based global\noptimisation. Specifically, we develop dynamic programming-based paradigms,\nincluding rollout and multi-step scenario-based optimisation schemes, to enable\nlookahead acquisition. These methods optimise a sequence of query points over a\nhorizon (instead of only at the next step) by predicting the evolution of the\nsurrogate model, inherently managing the exploration-exploitation trade-off in\na systematic way via optimisation techniques. The proposed approach represents\na significant advance in extending nonmyopic acquisition principles, previously\nconfined to Bayesian optimisation, to the deterministic framework. Empirical\nresults on synthetic and hyperparameter tuning benchmark problems demonstrate\nthat these nonmyopic methods outperform conventional myopic approaches.\n","authors":["Filippo Airaldi","Bart De Schutter","Azita Dabiri"],"pdf_url":"https://arxiv.org/pdf/2412.04882v1.pdf","comment":"31 pages, 4 figures, 2 tables, submitted to Springer Computational\n Optimization and Applications"},{"id":"http://arxiv.org/abs/2409.09304v2","updated":"2024-12-06T09:00:26Z","published":"2024-09-14T04:54:31Z","title":"Consistent Spectral Clustering in Hyperbolic Spaces","summary":" Clustering, as an unsupervised technique, plays a pivotal role in various\ndata analysis applications. Among clustering algorithms, Spectral Clustering on\nEuclidean Spaces has been extensively studied. However, with the rapid\nevolution of data complexity, Euclidean Space is proving to be inefficient for\nrepresenting and learning algorithms. Although Deep Neural Networks on\nhyperbolic spaces have gained recent traction, clustering algorithms or\nnon-deep machine learning models on non-Euclidean Spaces remain underexplored.\nIn this paper, we propose a spectral clustering algorithm on Hyperbolic Spaces\nto address this gap. Hyperbolic Spaces offer advantages in representing complex\ndata structures like hierarchical and tree-like structures, which cannot be\nembedded efficiently in Euclidean Spaces. Our proposed algorithm replaces the\nEuclidean Similarity Matrix with an appropriate Hyperbolic Similarity Matrix,\ndemonstrating improved efficiency compared to clustering in Euclidean Spaces.\nOur contributions include the development of the spectral clustering algorithm\non Hyperbolic Spaces and the proof of its weak consistency. We show that our\nalgorithm converges at least as fast as Spectral Clustering on Euclidean\nSpaces. 
To illustrate the efficacy of our approach, we present experimental\nresults on the Wisconsin Breast Cancer Dataset, highlighting the superior\nperformance of Hyperbolic Spectral Clustering over its Euclidean counterpart.\nThis work opens up avenues for utilizing non-Euclidean Spaces in clustering\nalgorithms, offering new perspectives for handling complex data structures and\nimproving clustering efficiency.\n","authors":["Sagar Ghosh","Swagatam Das"],"pdf_url":"https://arxiv.org/pdf/2409.09304v2.pdf","comment":"Currently under review"},{"id":"http://arxiv.org/abs/2412.04861v1","updated":"2024-12-06T08:53:31Z","published":"2024-12-06T08:53:31Z","title":"MSECG: Incorporating Mamba for Robust and Efficient ECG Super-Resolution","summary":" Electrocardiogram (ECG) signals play a crucial role in diagnosing\ncardiovascular diseases. To reduce power consumption in wearable or portable\ndevices used for long-term ECG monitoring, super-resolution (SR) techniques\nhave been developed, enabling these devices to collect and transmit signals at\na lower sampling rate. In this study, we propose MSECG, a compact neural\nnetwork model designed for ECG SR. MSECG combines the strength of the recurrent\nMamba model with convolutional layers to capture both local and global\ndependencies in ECG waveforms, allowing for the effective reconstruction of\nhigh-resolution signals. We also assess the model's performance in real-world\nnoisy conditions by utilizing ECG data from the PTB-XL database and noise data\nfrom the MIT-BIH Noise Stress Test Database. Experimental results show that\nMSECG outperforms two contemporary ECG SR models under both clean and noisy\nconditions while using fewer parameters, offering a more powerful and robust\nsolution for long-term ECG monitoring applications.\n","authors":["Jie Lin","I Chiu","Kuan-Chen Wang","Kai-Chun Liu","Hsin-Min Wang","Ping-Cheng Yeh","Yu Tsao"],"pdf_url":"https://arxiv.org/pdf/2412.04861v1.pdf","comment":"5 pages, 3 figures"},{"id":"http://arxiv.org/abs/2411.09175v2","updated":"2024-12-06T08:41:19Z","published":"2024-11-14T04:26:47Z","title":"Hybrid deep additive neural networks","summary":" Traditional neural networks (multi-layer perceptrons) have become an\nimportant tool in data science due to their success across a wide range of\ntasks. However, their performance is sometimes unsatisfactory, and they often\nrequire a large number of parameters, primarily due to their reliance on the\nlinear combination structure. Meanwhile, additive regression has been a popular\nalternative to linear regression in statistics. In this work, we introduce\nnovel deep neural networks that incorporate the idea of additive regression.\nOur neural networks share architectural similarities with Kolmogorov-Arnold\nnetworks but are based on simpler yet flexible activation and basis functions.\nAdditionally, we introduce several hybrid neural networks that combine this\narchitecture with that of traditional neural networks. We derive their\nuniversal approximation properties and demonstrate their effectiveness through\nsimulation studies and a real-data application. 
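The hyperbolic spectral clustering abstract above replaces the Euclidean similarity matrix with a hyperbolic one. The following is a hedged sketch of that core idea: pairwise Poincaré-ball distances define a Gaussian affinity matrix, which is then fed to off-the-shelf spectral clustering. The kernel width, point generation, and use of scikit-learn are assumptions for illustration only.

```python
# Spectral clustering on a similarity matrix built from Poincare-ball distances.
import numpy as np
from sklearn.cluster import SpectralClustering

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Geodesic distance between two points inside the unit Poincare ball."""
    diff = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return np.arccosh(1.0 + 2.0 * diff / denom)

def hyperbolic_affinity(X: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Gaussian kernel on pairwise hyperbolic distances."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = poincare_distance(X[i], X[j])
    return np.exp(-(D ** 2) / (2.0 * sigma ** 2))

# Toy usage: points sampled inside the unit ball, grouped into two clusters.
rng = np.random.default_rng(0)
X = 0.8 * rng.uniform(-1, 1, size=(60, 2)) / np.sqrt(2)  # stays inside the ball
labels = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0) \
    .fit_predict(hyperbolic_affinity(X))
print(labels[:10])
```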
The numerical results indicate\nthat our neural networks generally achieve better performance than traditional\nneural networks while using fewer parameters.\n","authors":["Gyu Min Kim","Jeong Min Jeon"],"pdf_url":"https://arxiv.org/pdf/2411.09175v2.pdf","comment":"30 pages, 10 figures"},{"id":"http://arxiv.org/abs/2404.11869v4","updated":"2024-12-06T08:37:55Z","published":"2024-04-18T03:03:37Z","title":"An Efficient Loop and Clique Coarsening Algorithm for Graph\n Classification","summary":" Graph Transformers (GTs) have made remarkable achievements in graph-level\ntasks. However, most existing works regard graph structures as a form of\nguidance or bias for enhancing node representations, which focuses on\nnode-central perspectives and lacks explicit representations of edges and\nstructures. One natural question arises as to whether we can leverage a\nhypernode to represent some structures. Through experimental analysis, we\nexplore the feasibility of this assumption. Based on our findings, we propose\nan efficient Loop and Clique Coarsening algorithm with linear complexity for\nGraph Classification (LCC4GC) on GT architecture. Specifically, we build three\nunique views, original, coarsening, and conversion, to learn a thorough\nstructural representation. We compress loops and cliques via hierarchical\nheuristic graph coarsening and restrict them with well-designed constraints,\nwhich builds the coarsening view to learn high-level interactions between\nstructures. We also introduce line graphs for edge embeddings and switch to\nedge-central perspective to alleviate the impact of coarsening reduction.\nExperiments on eight real-world datasets demonstrate the improvements of LCC4GC\nover 31 baselines from various architectures.\n","authors":["Xiaorui Qi","Qijie Bai","Yanlong Wen","Haiwei Zhang","Xiaojie Yuan"],"pdf_url":"https://arxiv.org/pdf/2404.11869v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04847v1","updated":"2024-12-06T08:35:33Z","published":"2024-12-06T08:35:33Z","title":"MTSpark: Enabling Multi-Task Learning with Spiking Neural Networks for\n Generalist Agents","summary":" Currently, state-of-the-art RL methods excel in single-task settings, but\nthey still struggle to generalize across multiple tasks due to catastrophic\nforgetting challenges, where previously learned tasks are forgotten as new\ntasks are introduced. This multi-task learning capability is significantly\nimportant for generalist agents, where adaptation features are highly required\n(e.g., autonomous robots). On the other hand, Spiking Neural Networks (SNNs)\nhave emerged as alternative energy-efficient neural network algorithms due to\ntheir sparse spike-based operations. Toward this, we propose MTSpark, a novel\nmethodology to enable multi-task RL using spiking networks. Specifically,\nMTSpark develops a Deep Spiking Q-Network (DSQN) with active dendrites and\ndueling structure by leveraging task-specific context signals. Specifically,\neach neuron computes task-dependent activations that dynamically modulate\ninputs, forming specialized sub-networks for each task. Moreover, this\nbioplausible network model also benefits from SNNs, enhancing energy efficiency\nand making the model suitable for hardware implementation. Experimental results\nshow that, our MTSpark effectively learns multiple tasks with higher\nperformance compared to the state-of-the-art. 
Specifically, MTSpark\nsuccessfully achieves high score in three Atari games (i.e., Pong: -5.4,\nBreakout: 0.6, and Enduro: 371.2), reaching human-level performance (i.e.,\nPong: -3, Breakout: 31, and Enduro: 368), where state-of-the-art struggle to\nachieve. In addition, our MTSpark also shows better accuracy in image\nclassification tasks than the state-of-the-art. These results highlight the\npotential of our MTSpark methodology to develop generalist agents that can\nlearn multiple tasks by leveraging both RL and SNN concepts.\n","authors":["Avaneesh Devkota","Rachmad Vidya Wicaksana Putra","Muhammad Shafique"],"pdf_url":"https://arxiv.org/pdf/2412.04847v1.pdf","comment":"9 pages, 10 figures, 5 tables"},{"id":"http://arxiv.org/abs/2407.00641v2","updated":"2024-12-06T08:35:27Z","published":"2024-06-30T09:51:58Z","title":"NeuroNAS: A Framework for Energy-Efficient Neuromorphic\n Compute-in-Memory Systems using Hardware-Aware Spiking Neural Architecture\n Search","summary":" Spiking Neural Networks (SNNs) have demonstrated capabilities for solving\ndiverse machine learning tasks with ultra-low power/energy consumption. To\nmaximize the performance and efficiency of SNN inference, the Compute-in-Memory\n(CIM) hardware accelerators with emerging device technologies (e.g., RRAM) have\nbeen employed. However, SNN architectures are typically developed without\nconsidering constraints from the application and the underlying CIM hardware,\nthereby hindering SNNs from reaching their full potential in accuracy and\nefficiency. To address this, we propose NeuroNAS, a novel framework for\ndeveloping energy-efficient neuromorphic CIM systems using a hardware-aware\nspiking neural architecture search (NAS), i.e., by quickly finding an SNN\narchitecture that offers high accuracy under the given constraints (e.g.,\nmemory, area, latency, and energy consumption). NeuroNAS employs the following\nkey steps: (1) optimizing SNN operations to enable efficient NAS, (2) employing\nquantization to minimize the memory footprint, (3) developing an SNN\narchitecture that facilitates an effective learning, and (4) devising a\nsystematic hardware-aware search algorithm to meet the constraints. Compared to\nthe state-of-the-art, NeuroNAS with 8bit weight precision quickly finds SNNs\nthat maintain high accuracy by up to 6.6x search time speed-ups, while\nachieving up to 92% area savings, 1.2x latency speed-ups, 84% energy savings\nacross CIFAR-10, CIFAR-100, and TinyImageNet-200 datasets; while the\nstate-of-the-art fail to meet all constraints at once. In this manner, NeuroNAS\nenables efficient design automation in developing energy-efficient neuromorphic\nCIM systems for diverse ML-based applications.\n","authors":["Rachmad Vidya Wicaksana Putra","Muhammad Shafique"],"pdf_url":"https://arxiv.org/pdf/2407.00641v2.pdf","comment":"7 pages, 13 figures, 1 table"},{"id":"http://arxiv.org/abs/2412.04846v1","updated":"2024-12-06T08:33:49Z","published":"2024-12-06T08:33:49Z","title":"eXpath: Explaining Knowledge Graph Link Prediction with Ontological\n Closed Path Rules","summary":" Link prediction (LP) is crucial for Knowledge Graphs (KG) completion but\ncommonly suffers from interpretability issues. While several methods have been\nproposed to explain embedding-based LP models, they are generally limited to\nlocal explanations on KG and are deficient in providing human interpretable\nsemantics. 
Based on real-world observations of the characteristics of KGs from\nmultiple domains, we propose to explain LP models in KG with path-based\nexplanations. An integrated framework, namely eXpath, is introduced which\nincorporates the concept of relation path with ontological closed path rules to\nenhance both the efficiency and effectiveness of LP interpretation. Notably,\nthe eXpath explanations can be fused with other single-link explanation\napproaches to achieve a better overall solution. Extensive experiments across\nbenchmark datasets and LP models demonstrate that introducing eXpath can boost\nthe quality of resulting explanations by about 20% on two key metrics and\nreduce the required explanation time by 61.4%, in comparison to the best\nexisting method. Case studies further highlight eXpath's ability to provide\nmore semantically meaningful explanations through path-based evidence.\n","authors":["Ye Sun","Lei Shi","Yongxin Tong"],"pdf_url":"https://arxiv.org/pdf/2412.04846v1.pdf","comment":"13 pages, 5 figures. Submitted to PVLDB volumn 18 on 20241201"},{"id":"http://arxiv.org/abs/2412.04845v1","updated":"2024-12-06T08:30:01Z","published":"2024-12-06T08:30:01Z","title":"Using Machine Learning to Discover Parsimonious and\n Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff\n Dynamics","summary":" Despite the excellent real-world predictive performance of modern machine\nlearning (ML) methods, many scientists remain hesitant to discard traditional\nphysical-conceptual (PC) approaches due mainly to their relative\ninterpretability, which contributes to credibility during decision-making. In\nthis context, a currently underexplored aspect of ML is how to develop\nminimally-optimal representations that can facilitate better insight regarding\nsystem functioning. Regardless of how this is achieved, it is arguably true\nthat parsimonious representations better support the advancement of scientific\nunderstanding. Our own view is that ML-based modeling of geoscientific systems\nshould be based in the use of computational units that are fundamentally\ninterpretable by design.\n This paper continues our exploration of how the strengths of ML can be\nexploited in the service of better understanding via scientific investigation.\nHere, we use the Mass Conserving Perceptron (MCP) as the fundamental\ncomputational unit in a generic network architecture consisting of nodes\narranged in series and parallel to explore several generic and important issues\nrelated to the use of observational data for constructing input-state-output\nmodels of dynamical systems. In the context of lumped catchment modeling, we\nshow that physical interpretability and excellent predictive performance can\nboth be achieved using a relatively parsimonious distributed-state\nmultiple-flow-path network with context-dependent gating and information\nsharing across the nodes, suggesting that MCP-based modeling can play a\nsignificant role in application of ML to geoscientific investigation.\n","authors":["Yuan-Heng Wang","Hoshin V. 
Gupta"],"pdf_url":"https://arxiv.org/pdf/2412.04845v1.pdf","comment":"73 Pages, 4 Tables, 13 Figures, 11 Tables and 11 Figures in\n Supplementary Materials"},{"id":"http://arxiv.org/abs/2409.19976v3","updated":"2024-12-06T08:20:51Z","published":"2024-09-30T06:04:04Z","title":"Learning Partial Differential Equations with Deep Parallel Neural\n Operator","summary":" In recent years, Solving partial differential equations has shifted the focus\nof traditional neural network studies from finite-dimensional Euclidean spaces\nto generalized functional spaces in research. A novel methodology is to learn\nan operator as a means of approximating the mapping between outputs. Currently,\nresearchers have proposed a variety of operator architectures. Nevertheless,\nthe majority of these architectures adopt an iterative update architecture,\nwhereby a single operator is learned from the same function space. In practical\nphysical science problems, the numerical solutions of partial differential\nequations are complex, and a serial single operator is unable to accurately\napproximate the intricate mapping between input and output. So, We propose a\ndeep parallel operator model (DPNO) for efficiently and accurately solving\npartial differential equations. DPNO employs convolutional neural networks to\nextract local features and map data into distinct latent spaces. Designing a\nparallel block of double Fourier neural operators to solve the iterative error\nproblem. DPNO approximates complex mappings between inputs and outputs by\nlearning multiple operators in different potential spaces in parallel blocks.\nDPNO achieved the best performance on five of them, with an average improvement\nof 10.5\\%, and ranked second on one dataset.\n","authors":["Qinglong Ma","Peizhi Zhao","Sen Wang","Tao Song"],"pdf_url":"https://arxiv.org/pdf/2409.19976v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.17341v2","updated":"2024-12-06T08:07:23Z","published":"2024-06-25T07:54:32Z","title":"Generative Modelling of Structurally Constrained Graphs","summary":" Graph diffusion models have emerged as state-of-the-art techniques in graph\ngeneration; yet, integrating domain knowledge into these models remains\nchallenging. Domain knowledge is particularly important in real-world\nscenarios, where invalid generated graphs hinder deployment in practical\napplications. Unconstrained and conditioned graph diffusion models fail to\nguarantee such domain-specific structural properties. We present ConStruct, a\nnovel framework that enables graph diffusion models to incorporate hard\nconstraints on specific properties, such as planarity or acyclicity. Our\napproach ensures that the sampled graphs remain within the domain of graphs\nthat satisfy the specified property throughout the entire trajectory in both\nthe forward and reverse processes. This is achieved by introducing an\nedge-absorbing noise model and a new projector operator. ConStruct demonstrates\nversatility across several structural and edge-deletion invariant constraints\nand achieves state-of-the-art performance for both synthetic benchmarks and\nattributed real-world datasets. 
For example, by incorporating planarity\nconstraints in digital pathology graph datasets, the proposed method\noutperforms existing baselines, improving data validity by up to 71.1\npercentage points.\n","authors":["Manuel Madeira","Clement Vignac","Dorina Thanou","Pascal Frossard"],"pdf_url":"https://arxiv.org/pdf/2406.17341v2.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04835v1","updated":"2024-12-06T08:04:02Z","published":"2024-12-06T08:04:02Z","title":"Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards\n for Visuomotor Robot Policy Alignment","summary":" Visuomotor robot policies, increasingly pre-trained on large-scale datasets,\npromise significant advancements across robotics domains. However, aligning\nthese policies with end-user preferences remains a challenge, particularly when\nthe preferences are hard to specify. While reinforcement learning from human\nfeedback (RLHF) has become the predominant mechanism for alignment in\nnon-embodied domains like large language models, it has not seen the same\nsuccess in aligning visuomotor policies due to the prohibitive amount of human\nfeedback required to learn visual reward functions. To address this limitation,\nwe propose Representation-Aligned Preference-based Learning (RAPL), an\nobservation-only method for learning visual rewards from significantly less\nhuman preference feedback. Unlike traditional RLHF, RAPL focuses human feedback\non fine-tuning pre-trained vision encoders to align with the end-user's visual\nrepresentation and then constructs a dense visual reward via feature matching\nin this aligned representation space. We first validate RAPL through simulation\nexperiments in the X-Magical benchmark and Franka Panda robotic manipulation,\ndemonstrating that it can learn rewards aligned with human preferences, more\nefficiently uses preference data, and generalizes across robot embodiments.\nFinally, our hardware experiments align pre-trained Diffusion Policies for\nthree object manipulation tasks. We find that RAPL can fine-tune these policies\nwith 5x less real human preference data, taking the first step towards\nminimizing human feedback while maximizing visuomotor robot policy alignment.\n","authors":["Ran Tian","Yilin Wu","Chenfeng Xu","Masayoshi Tomizuka","Jitendra Malik","Andrea Bajcsy"],"pdf_url":"https://arxiv.org/pdf/2412.04835v1.pdf","comment":"Submitted to IJRR, this paper is an extended journal version of the\n conference paper arXiv:2310.07932 with new results and discussion. arXiv\n admin note: substantial text overlap with arXiv:2310.07932"},{"id":"http://arxiv.org/abs/2412.04833v1","updated":"2024-12-06T07:56:25Z","published":"2024-12-06T07:56:25Z","title":"Wavelet Diffusion Neural Operator","summary":" Simulating and controlling physical systems described by partial differential\nequations (PDEs) are crucial tasks across science and engineering. Recently,\ndiffusion generative models have emerged as a competitive class of methods for\nthese tasks due to their ability to capture long-term dependencies and model\nhigh-dimensional states. However, diffusion models typically struggle with\nhandling system states with abrupt changes and generalizing to higher\nresolutions. In this work, we propose Wavelet Diffusion Neural Operator (WDNO),\na novel PDE simulation and control framework that enhances the handling of\nthese complexities. WDNO comprises two key innovations. 
Firstly, WDNO performs\ndiffusion-based generative modeling in the wavelet domain for the entire\ntrajectory to handle abrupt changes and long-term dependencies effectively.\nSecondly, to address the issue of poor generalization across different\nresolutions, which is one of the fundamental tasks in modeling physical\nsystems, we introduce multi-resolution training. We validate WDNO on five\nphysical systems, including 1D advection equation, three challenging physical\nsystems with abrupt changes (1D Burgers' equation, 1D compressible\nNavier-Stokes equation and 2D incompressible fluid), and a real-world dataset\nERA5, which demonstrates superior performance on both simulation and control\ntasks over state-of-the-art methods, with significant improvements in long-term\nand detail prediction accuracy. Remarkably, in the challenging context of the\n2D high-dimensional and indirect control task aimed at reducing smoke leakage,\nWDNO reduces the leakage by 33.2% compared to the second-best baseline.\n","authors":["Peiyan Hu","Rui Wang","Xiang Zheng","Tao Zhang","Haodong Feng","Ruiqi Feng","Long Wei","Yue Wang","Zhi-Ming Ma","Tailin Wu"],"pdf_url":"https://arxiv.org/pdf/2412.04833v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04832v1","updated":"2024-12-06T07:56:14Z","published":"2024-12-06T07:56:14Z","title":"WRF-GS: Wireless Radiation Field Reconstruction with 3D Gaussian\n Splatting","summary":" Wireless channel modeling plays a pivotal role in designing, analyzing, and\noptimizing wireless communication systems. Nevertheless, developing an\neffective channel modeling approach has been a longstanding challenge. This\nissue has been escalated due to the denser network deployment, larger antenna\narrays, and wider bandwidth in 5G and beyond networks. To address this\nchallenge, we put forth WRF-GS, a novel framework for channel modeling based on\nwireless radiation field (WRF) reconstruction using 3D Gaussian splatting.\nWRF-GS employs 3D Gaussian primitives and neural networks to capture the\ninteractions between the environment and radio signals, enabling efficient WRF\nreconstruction and visualization of the propagation characteristics. The\nreconstructed WRF can then be used to synthesize the spatial spectrum for\ncomprehensive wireless channel characterization. Notably, with a small number\nof measurements, WRF-GS can synthesize new spatial spectra within milliseconds\nfor a given scene, thereby enabling latency-sensitive applications.\nExperimental results demonstrate that WRF-GS outperforms existing methods for\nspatial spectrum synthesis, such as ray tracing and other deep-learning\napproaches. Moreover, WRF-GS achieves superior performance in the channel state\ninformation prediction task, surpassing existing methods by a significant\nmargin of more than 2.43 dB.\n","authors":["Chaozheng Wen","Jingwen Tong","Yingdong Hu","Zehong Lin","Jun Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.04832v1.pdf","comment":"accepted to the IEEE International Conference on Computer\n Communications (INFOCOM 2025)"},{"id":"http://arxiv.org/abs/2412.04821v1","updated":"2024-12-06T07:29:34Z","published":"2024-12-06T07:29:34Z","title":"CCS: Continuous Learning for Customized Incremental Wireless Sensing\n Services","summary":" Wireless sensing has made significant progress in tasks ranging from action\nrecognition, vital sign estimation, pose estimation, etc. 
After over a decade\nof work, wireless sensing currently stands at a tipping point, transitioning\nfrom proof-of-concept systems to large-scale deployment. We envision a\nfuture service scenario where wireless sensing service providers distribute\nsensing models to users. During usage, users might request new sensing\ncapabilities. For example, if someone is away from home on a business trip or\nvacation for an extended period, they may want a new sensing capability that\ncan detect falls in elderly parents or grandparents and promptly alert them. In\nthis paper, we propose CCS (continuous customized service), enabling model\nupdates on users' local computing resources without data transmission to the\nservice providers. To address the issue of catastrophic forgetting in model\nupdates, where updating model parameters to implement new capabilities leads to\nthe loss of existing capabilities, we design knowledge distillation and weight\nalignment modules. These modules enable the sensing model to acquire new\ncapabilities while retaining the existing ones. We conducted extensive\nexperiments on the large-scale XRF55 dataset across Wi-Fi, millimeter-wave\nradar, and RFID modalities to simulate scenarios where four users sequentially\nintroduced new customized demands. The results affirm that CCS excels in\ncontinuous model services across all the above wireless modalities,\nsignificantly outperforming existing approaches like OneFi.\n","authors":["Qunhang Fu","Fei Wang","Mengdie Zhu","Han Ding","Jinsong Han","Tony Xiao Han"],"pdf_url":"https://arxiv.org/pdf/2412.04821v1.pdf","comment":"9 pages, 8 figures"},{"id":"http://arxiv.org/abs/2411.19772v2","updated":"2024-12-06T07:24:10Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. 
Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v2.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2404.06448v2","updated":"2024-12-06T07:10:30Z","published":"2024-04-09T16:50:30Z","title":"Automated Federated Pipeline for Parameter-Efficient Fine-Tuning of\n Large Language Models","summary":" Recently, there has been a surge in the development of advanced intelligent\ngenerative content (AIGC), especially large language models (LLMs). However,\nfor many downstream tasks, it is necessary to fine-tune LLMs using private\ndata. While federated learning offers a promising privacy-preserving solution\nto LLM fine-tuning, the substantial size of an LLM, combined with high\ncomputational and communication demands, makes it hard to apply to downstream\ntasks. More importantly, private edge servers often possess varying computing\nand network resources in real-world scenarios, introducing additional\ncomplexities to LLM fine-tuning. To tackle these problems, we design and\nimplement an automated federated pipeline, named FedPipe, to fine-tune LLMs\nwith minimal training cost but without adding any inference latency. FedPipe\nfirstly identifies the weights to be fine-tuned based on their contributions to\nthe LLM training. It then configures a low-rank adapter for each selected\nweight to train local low-rank adapters on an edge server, and aggregate local\nadapters of all edge servers to fine-tune the whole LLM. Finally, it\nappropriately quantizes the parameters of LLM to reduce memory space according\nto the requirements of edge servers. Extensive experiments demonstrate that\nFedPipe expedites the model training and achieves higher accuracy than\nstate-of-the-art benchmarks.\n","authors":["Zihan Fang","Zheng Lin","Zhe Chen","Xianhao Chen","Yue Gao","Yuguang Fang"],"pdf_url":"https://arxiv.org/pdf/2404.06448v2.pdf","comment":"15 pages, 16 figures"},{"id":"http://arxiv.org/abs/2410.03795v2","updated":"2024-12-06T06:59:09Z","published":"2024-10-04T02:50:58Z","title":"Deep Learning and Machine Learning: Advancing Big Data Analytics and\n Management with Design Patterns","summary":" This book, Design Patterns in Machine Learning and Deep Learning: Advancing\nBig Data Analytics Management, presents a comprehensive study of essential\ndesign patterns tailored for large-scale machine learning and deep learning\napplications. The book explores the application of classical software\nengineering patterns, Creational, Structural, Behavioral, and Concurrency\nPatterns, to optimize the development, maintenance, and scalability of big data\nanalytics systems. Through practical examples and detailed Python\nimplementations, it bridges the gap between traditional object-oriented design\npatterns and the unique demands of modern data analytics environments. Key\ndesign patterns such as Singleton, Factory, Observer, and Strategy are analyzed\nfor their impact on model management, deployment strategies, and team\ncollaboration, providing invaluable insights into the engineering of efficient,\nreusable, and flexible systems. 
This volume is an essential resource for\ndevelopers, researchers, and engineers aiming to enhance their technical\nexpertise in both machine learning and software design.\n","authors":["Keyu Chen","Ziqian Bi","Tianyang Wang","Yizhu Wen","Pohsun Feng","Qian Niu","Junyu Liu","Benji Peng","Sen Zhang","Ming Li","Xuanhe Pan","Jiawei Xu","Jinlang Wang","Ming Liu"],"pdf_url":"https://arxiv.org/pdf/2410.03795v2.pdf","comment":"138pages"},{"id":"http://arxiv.org/abs/2408.09181v2","updated":"2024-12-06T06:41:47Z","published":"2024-08-17T12:11:22Z","title":"PADetBench: Towards Benchmarking Physical Attacks against Object\n Detection","summary":" Physical attacks against object detection have gained increasing attention\ndue to their significant practical implications. However, conducting physical\nexperiments is extremely time-consuming and labor-intensive. Moreover, physical\ndynamics and cross-domain transformation are challenging to strictly regulate\nin the real world, leading to unaligned evaluation and comparison, severely\nhindering the development of physically robust models. To accommodate these\nchallenges, we explore utilizing realistic simulation to thoroughly and\nrigorously benchmark physical attacks with fairness under controlled physical\ndynamics and cross-domain transformation. This resolves the problem of\ncapturing identical adversarial images that cannot be achieved in the real\nworld. Our benchmark includes 20 physical attack methods, 48 object detectors,\ncomprehensive physical dynamics, and evaluation metrics. We also provide\nend-to-end pipelines for dataset generation, detection, evaluation, and further\nanalysis. In addition, we perform 8064 groups of evaluation based on our\nbenchmark, which includes both overall evaluation and further detailed ablation\nstudies for controlled physical dynamics. Through these experiments, we provide\nin-depth analyses of physical attack performance and physical adversarial\nrobustness, draw valuable observations, and discuss potential directions for\nfuture research.\n Codebase: https://github.com/JiaweiLian/Benchmarking_Physical_Attack\n","authors":["Jiawei Lian","Jianhong Pan","Lefan Wang","Yi Wang","Lap-Pui Chau","Shaohui Mei"],"pdf_url":"https://arxiv.org/pdf/2408.09181v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04806v1","updated":"2024-12-06T06:32:47Z","published":"2024-12-06T06:32:47Z","title":"Rethinking Time Series Forecasting with LLMs via Nearest Neighbor\n Contrastive Learning","summary":" Adapting Large Language Models (LLMs) that are extensively trained on\nabundant text data, and customizing the input prompt to enable time series\nforecasting has received considerable attention. While recent work has shown\ngreat potential for adapting the learned prior of LLMs, the formulation of the\nprompt to finetune LLMs remains challenging as prompt should be aligned with\ntime series data. Additionally, current approaches do not effectively leverage\nword token embeddings which embody the rich representation space learned by\nLLMs. This emphasizes the need for a robust approach to formulate the prompt\nwhich utilizes the word token embeddings while effectively representing the\ncharacteristics of the time series. To address these challenges, we propose\nNNCL-TLLM: Nearest Neighbor Contrastive Learning for Time series forecasting\nvia LLMs. 
First, we generate time series compatible text prototypes such that\neach text prototype represents both word token embeddings in its neighborhood\nand time series characteristics via end-to-end finetuning. Next, we draw\ninspiration from Nearest Neighbor Contrastive Learning to formulate the prompt\nwhile obtaining the top-$k$ nearest neighbor time series compatible text\nprototypes. We then fine-tune the layer normalization and positional embeddings\nof the LLM, keeping the other layers intact, reducing the trainable parameters\nand decreasing the computational cost. Our comprehensive experiments\ndemonstrate that NNCL-TLLM outperforms in few-shot forecasting while achieving\ncompetitive or superior performance over the state-of-the-art methods in\nlong-term and short-term forecasting tasks.\n","authors":["Jayanie Bogahawatte","Sachith Seneviratne","Maneesha Perera","Saman Halgamuge"],"pdf_url":"https://arxiv.org/pdf/2412.04806v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.16852v2","updated":"2024-12-06T06:07:45Z","published":"2024-05-27T05:55:22Z","title":"EM Distillation for One-step Diffusion Models","summary":" While diffusion models can learn complex distributions, sampling requires a\ncomputationally expensive iterative process. Existing distillation methods\nenable efficient sampling, but have notable limitations, such as performance\ndegradation with very few sampling steps, reliance on training data access, or\nmode-seeking optimization that may fail to capture the full distribution. We\npropose EM Distillation (EMD), a maximum likelihood-based approach that\ndistills a diffusion model to a one-step generator model with minimal loss of\nperceptual quality. Our approach is derived through the lens of\nExpectation-Maximization (EM), where the generator parameters are updated using\nsamples from the joint distribution of the diffusion teacher prior and inferred\ngenerator latents. We develop a reparametrized sampling scheme and a noise\ncancellation technique that together stabilizes the distillation process. We\nfurther reveal an interesting connection of our method with existing methods\nthat minimize mode-seeking KL. EMD outperforms existing one-step generative\nmethods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares\nfavorably with prior work on distilling text-to-image diffusion models.\n","authors":["Sirui Xie","Zhisheng Xiao","Diederik P Kingma","Tingbo Hou","Ying Nian Wu","Kevin Patrick Murphy","Tim Salimans","Ben Poole","Ruiqi Gao"],"pdf_url":"https://arxiv.org/pdf/2405.16852v2.pdf","comment":"NeurIPS 2024"},{"id":"http://arxiv.org/abs/2411.14166v2","updated":"2024-12-06T05:45:50Z","published":"2024-11-21T14:23:06Z","title":"SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized\n Bilevel Optimization","summary":" This paper studies decentralized bilevel optimization, in which multiple\nagents collaborate to solve problems involving nested optimization structures\nwith neighborhood communications. Most existing literature primarily utilizes\ngradient tracking to mitigate the influence of data heterogeneity, without\nexploring other well-known heterogeneity-correction techniques such as EXTRA or\nExact Diffusion. Additionally, these studies often employ identical\ndecentralized strategies for both upper- and lower-level problems, neglecting\nto leverage distinct mechanisms across different levels. 
To address these\nlimitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual\nAlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the\nflexibility to incorporate various heterogeneity-correction strategies into the\nalgorithm. Moreover, SPARKLE allows for different strategies to solve upper-\nand lower-level problems. We present a unified convergence analysis for\nSPARKLE, applicable to all its variants, with state-of-the-art convergence\nrates compared to existing decentralized bilevel algorithms. Our results\nfurther reveal that EXTRA and Exact Diffusion are more suitable for\ndecentralized bilevel optimization, and using mixed strategies in bilevel\nalgorithms brings more benefits than relying solely on gradient tracking.\n","authors":["Shuchen Zhu","Boao Kong","Songtao Lu","Xinmeng Huang","Kun Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.14166v2.pdf","comment":"73 pages, the Thirty-Eighth Annual Conference on Neural Information\n Processing Systems (2024)"},{"id":"http://arxiv.org/abs/2412.04787v1","updated":"2024-12-06T05:41:11Z","published":"2024-12-06T05:41:11Z","title":"Direct Quantized Training of Language Models with Stochastic Rounding","summary":" Although recent quantized Large Language Models (LLMs), such as BitNet, have\npaved the way for significant reduction in memory usage during deployment with\nbinary or ternary weights, training these models still demands substantial\nmemory footprints. This is partly because high-precision (i.e., unquantized)\nweight matrices required for straight-through estimation must be maintained\nthroughout the whole training process. To address this, we explore the\npotential of directly updating the quantized low-precision weight matrices\nwithout relying on the straight-through estimator during backpropagation,\nthereby saving memory usage during training. Specifically, we employ a\nstochastic rounding technique to minimize information loss caused by the use of\nlow-bit weights throughout training. Experimental results on our\nLLaMA-structured models indicate that (1) training with only low-precision\nweights is feasible even when they are constrained to ternary values, (2)\nextending the bit width to 8 bits results in only a 5% loss degradation\ncompared to BitNet b1.58 while offering the potential for reduced memory usage\nduring training, and (3) our models can also perform inference using ternary\nweights, showcasing their flexibility in deployment.\n","authors":["Kaiyan Zhao","Tsuguchika Tabaru","Kenichi Kobayashi","Takumi Honda","Masafumi Yamazaki","Yoshimasa Tsuruoka"],"pdf_url":"https://arxiv.org/pdf/2412.04787v1.pdf","comment":"work in progress"},{"id":"http://arxiv.org/abs/2306.09363v2","updated":"2024-12-06T05:35:09Z","published":"2023-06-14T05:46:52Z","title":"A Simple Data Augmentation for Feature Distribution Skewed Federated\n Learning","summary":" Federated Learning (FL) facilitates collaborative learning among multiple\nclients in a distributed manner and ensures the security of privacy. However,\nits performance inevitably degrades with non-Independent and Identically\nDistributed (non-IID) data. In this paper, we focus on the feature distribution\nskewed FL scenario, a common non-IID situation in real-world applications where\ndata from different clients exhibit varying underlying distributions. This\nvariation leads to feature shift, which is a key issue of this scenario. 
While\nprevious works have made notable progress, few pay attention to the data\nitself, i.e., the root of this issue. The primary goal of this paper is to\nmitigate feature shift from the perspective of data. To this end, we propose a\nsimple yet remarkably effective input-level data augmentation method, namely\nFedRDN, which randomly injects the statistical information of the local\ndistribution from the entire federation into the client's data. This is\nbeneficial to improve the generalization of local feature representations,\nthereby mitigating feature shift. Moreover, our FedRDN is a plug-and-play\ncomponent, which can be seamlessly integrated into the data augmentation flow\nwith only a few lines of code. Extensive experiments on several datasets show\nthat the performance of various representative FL methods can be further\nimproved by integrating our FedRDN, demonstrating its effectiveness, strong\ncompatibility and generalizability. Code will be released.\n","authors":["Yunlu Yan","Huazhu Fu","Yuexiang Li","Jinheng Xie","Jun Ma","Guang Yang","Lei Zhu"],"pdf_url":"https://arxiv.org/pdf/2306.09363v2.pdf","comment":"11 pages, 3 figures"},{"id":"http://arxiv.org/abs/2412.04786v1","updated":"2024-12-06T05:31:42Z","published":"2024-12-06T05:31:42Z","title":"Slicing Vision Transformer for Flexible Inference","summary":" The Vision Transformer (ViT) is known for its scalability. In this work, we\naim to scale down a ViT to fit in an environment with dynamically changing\nresource constraints. We observe that smaller ViTs are intrinsically the\nsub-networks of a larger ViT with different widths. Thus, we propose a general\nframework, named Scala, to enable a single network to represent multiple\nsmaller ViTs with flexible inference capability, which aligns with the inherent\ndesign of ViT to vary in width. Concretely, Scala activates several subnets\nduring training, introduces Isolated Activation to disentangle the smallest\nsub-network from other subnets, and leverages Scale Coordination to ensure each\nsub-network receives simplified, steady, and accurate learning objectives.\nComprehensive empirical validations on different tasks demonstrate that with\nonly one-shot training, Scala learns slimmable representation without modifying\nthe original ViT structure and matches the performance of Separate Training.\nCompared with the prior art, Scala achieves an average improvement of 1.6% on\nImageNet-1K with fewer parameters.\n","authors":["Yitian Zhang","Huseyin Coskun","Xu Ma","Huan Wang","Ke Ma","Xi Chen","Derek Hao Hu","Yun Fu"],"pdf_url":"https://arxiv.org/pdf/2412.04786v1.pdf","comment":"Accepted by NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04785v1","updated":"2024-12-06T05:31:08Z","published":"2024-12-06T05:31:08Z","title":"Differentially Private Random Feature Model","summary":" Designing privacy-preserving machine learning algorithms has received great\nattention in recent years, especially in the setting when the data contains\nsensitive information. Differential privacy (DP) is a widely used mechanism for\ndata analysis with privacy guarantees. In this paper, we produce a\ndifferentially private random feature model. Random features, which were\nproposed to approximate large-scale kernel machines, have been used to study\nprivacy-preserving kernel machines as well. 
We consider the over-parametrized\nregime (more features than samples) where the non-private random feature model\nis learned via solving the min-norm interpolation problem, and then we apply\noutput perturbation techniques to produce a private model. We show that our\nmethod preserves privacy and derive a generalization error bound for the\nmethod. To the best of our knowledge, we are the first to consider\nprivacy-preserving random feature models in the over-parametrized regime and\nprovide theoretical guarantees. We empirically compare our method with other\nprivacy-preserving learning methods in the literature as well. Our results show\nthat our approach is superior to the other methods in terms of generalization\nperformance on synthetic data and benchmark data sets. Additionally, it was\nrecently observed that DP mechanisms may exhibit and exacerbate disparate\nimpact, which means that the outcomes of DP learning algorithms vary\nsignificantly among different groups. We show that both theoretically and\nempirically, random features have the potential to reduce disparate impact, and\nhence achieve better fairness.\n","authors":["Chunyang Liao","Deanna Needell","Alexander Xue"],"pdf_url":"https://arxiv.org/pdf/2412.04785v1.pdf","comment":"Submitted to an IEEE journal"},{"id":"http://arxiv.org/abs/2412.04784v1","updated":"2024-12-06T05:30:41Z","published":"2024-12-06T05:30:41Z","title":"NLP-ADBench: NLP Anomaly Detection Benchmark","summary":" Anomaly detection (AD) is a critical machine learning task with diverse\napplications in web systems, including fraud detection, content moderation, and\nuser behavior analysis. Despite its significance, AD in natural language\nprocessing (NLP) remains underexplored, limiting advancements in detecting\nanomalies in text data such as harmful content, phishing attempts, or spam\nreviews. In this paper, we introduce NLP-ADBench, the most comprehensive\nbenchmark for NLP anomaly detection (NLP-AD), comprising eight curated datasets\nand evaluations of nineteen state-of-the-art algorithms. These include three\nend-to-end methods and sixteen two-step algorithms that apply traditional\nanomaly detection techniques to language embeddings generated by\nbert-base-uncased and OpenAI's text-embedding-3-large models.\n Our results reveal critical insights and future directions for NLP-AD.\nNotably, no single model excels across all datasets, highlighting the need for\nautomated model selection. Moreover, two-step methods leveraging\ntransformer-based embeddings consistently outperform specialized end-to-end\napproaches, with OpenAI embeddings demonstrating superior performance over BERT\nembeddings. By releasing NLP-ADBench at\nhttps://github.com/USC-FORTIS/NLP-ADBench, we provide a standardized framework\nfor evaluating NLP-AD methods, fostering the development of innovative\napproaches. 
This work fills a crucial gap in the field and establishes a\nfoundation for advancing NLP anomaly detection, particularly in the context of\nimproving the safety and reliability of web-based systems.\n","authors":["Yuangang Li","Jiaqi Li","Zhuo Xiao","Tiankai Yang","Yi Nian","Xiyang Hu","Yue Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.04784v1.pdf","comment":"The project is available at https://github.com/USC-FORTIS/NLP-ADBench"},{"id":"http://arxiv.org/abs/2412.01770v2","updated":"2024-12-06T05:23:30Z","published":"2024-12-02T18:12:02Z","title":"Robot Learning with Super-Linear Scaling","summary":" Scaling robot learning requires data collection pipelines that scale\nfavorably with human effort. In this work, we propose Crowdsourcing and\nAmortizing Human Effort for Real-to-Sim-to-Real(CASHER), a pipeline for scaling\nup data collection and learning in simulation where the performance scales\nsuperlinearly with human effort. The key idea is to crowdsource digital twins\nof real-world scenes using 3D reconstruction and collect large-scale data in\nsimulation, rather than the real-world. Data collection in simulation is\ninitially driven by RL, bootstrapped with human demonstrations. As the training\nof a generalist policy progresses across environments, its generalization\ncapabilities can be used to replace human effort with model generated\ndemonstrations. This results in a pipeline where behavioral data is collected\nin simulation with continually reducing human effort. We show that CASHER\ndemonstrates zero-shot and few-shot scaling laws on three real-world tasks\nacross diverse scenarios. We show that CASHER enables fine-tuning of\npre-trained policies to a target scenario using a video scan without any\nadditional human effort. See our project website:\nhttps://casher-robot-learning.github.io/CASHER/\n","authors":["Marcel Torne","Arhan Jain","Jiayi Yuan","Vidaaranya Macha","Lars Ankile","Anthony Simeonov","Pulkit Agrawal","Abhishek Gupta"],"pdf_url":"https://arxiv.org/pdf/2412.01770v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04781v1","updated":"2024-12-06T05:18:58Z","published":"2024-12-06T05:18:58Z","title":"DPGIIL: Dirichlet Process-Deep Generative Model-Integrated Incremental\n Learning for Clustering in Transmissibility-based Online Structural Anomaly\n Detection","summary":" Clustering based on vibration responses, such as transmissibility functions\n(TFs), is promising in structural anomaly detection, but most existing\napproaches struggle with determining the optimal cluster number and handling\nhigh-dimensional streaming data, while their shallow structures also make them\nsensitive to manually-engineered feature quality. To bridge this gap, this work\nproposes the Dirichlet process-deep generative model-integrated incremental\nlearning (DPGIIL) for clustering by combining the advantages of deep generative\nmodels (DGMs) in representation learning and the Dirichlet process mixture\nmodel (DPMM) in identifying distinct patterns in observed data. By introducing\na DPMM prior into the latent space of DGMs, DPGIIL automatically captures\ndissimilarities in extracted latent representations, enabling both generative\nmodeling and clustering. 
Within the context of variational Bayesian inference,\na lower bound on the log marginal likelihood of DPGIIL, tighter than the\nevidence lower bound given sufficient training data, is derived analytically,\nwhich enables the joint optimization of DGM and DPMM parameters, thereby\nallowing the DPMM to regularize the DGM's feature extraction process.\nAdditionally, a greedy split-merge scheme-based coordinate ascent variational\ninference method is devised to accelerate the optimization. The summary\nstatistics of the DPMM, along with the network parameters, are used to retain\ninformation about previous data for incremental learning. Notably, this study\nuses variational autoencoder (VAE) within DPGIIL as an illustrative example,\nwhile this framework is adaptable to other DGMs. Two case studies show that the\nproposed method outperforms some state-of-the-art approaches in structural\nanomaly detection and clustering, while also dynamically generating new\nclusters to indicate the emergence of new structural conditions for online\nmonitoring.\n","authors":["Lin-Feng Mei","Wang-Ji Yan"],"pdf_url":"https://arxiv.org/pdf/2412.04781v1.pdf","comment":"48 pages,9 figures,6 tables,submitted to Advanced Engineering\n Informatics"},{"id":"http://arxiv.org/abs/2406.14096v3","updated":"2024-12-06T05:18:37Z","published":"2024-06-20T08:22:07Z","title":"Graph Neural Networks for Job Shop Scheduling Problems: A Survey","summary":" Job shop scheduling problems (JSSPs) represent a critical and challenging\nclass of combinatorial optimization problems. Recent years have witnessed a\nrapid increase in the application of graph neural networks (GNNs) to solve\nJSSPs, albeit lacking a systematic survey of the relevant literature. This\npaper aims to thoroughly review prevailing GNN methods for different types of\nJSSPs and the closely related flow-shop scheduling problems (FSPs), especially\nthose leveraging deep reinforcement learning (DRL). We begin by presenting the\ngraph representations of various JSSPs, followed by an introduction to the most\ncommonly used GNN architectures. We then review current GNN-based methods for\neach problem type, highlighting key technical elements such as graph\nrepresentations, GNN architectures, GNN tasks, and training algorithms.\nFinally, we summarize and analyze the advantages and limitations of GNNs in\nsolving JSSPs and provide potential future research opportunities. We hope this\nsurvey can motivate and inspire innovative approaches for more powerful\nGNN-based approaches in tackling JSSPs and other scheduling problems.\n","authors":["Igor G. Smit","Jianan Zhou","Robbert Reijnen","Yaoxin Wu","Jian Chen","Cong Zhang","Zaharah Bukhsh","Yingqian Zhang","Wim Nuijten"],"pdf_url":"https://arxiv.org/pdf/2406.14096v3.pdf","comment":"Accepted by Computers & Operations Research"},{"id":"http://arxiv.org/abs/2411.18055v2","updated":"2024-12-06T05:18:13Z","published":"2024-11-27T04:58:10Z","title":"FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision\n Quantized DNNs--Down to 2 Bits!","summary":" A widely-used technique in designing energy-efficient deep neural network\n(DNN) accelerators is quantization. Recent progress in this direction has\nreduced the bitwidths used in DNN down to 2. Meanwhile, many prior works apply\napproximate multipliers (AppMuls) in designing DNN accelerators to lower their\nenergy consumption. 
Unfortunately, these works still assume a bitwidth much\nlarger than 2, which falls far behind the state-of-the-art in quantization area\nand even challenges the meaningfulness of applying AppMuls in DNN accelerators,\nsince a high-bitwidth AppMul consumes much more energy than a low-bitwidth\nexact multiplier! Thus, an important problem to study is: Can approximate\nmultipliers be effectively applied to quantized DNN models with very low\nbitwidths? In this work, we give an affirmative answer to this question and\npresent a systematic solution that achieves the answer: FAMES, a fast\napproximate multiplier substitution method for mixed-precision DNNs. Our\nexperiments demonstrate an average 28.67% energy reduction on state-of-the-art\nmixed-precision quantized models with bitwidths as low as 2 bits and accuracy\nlosses kept under 1%. Additionally, our approach is up to 300x faster than\nprevious genetic algorithm-based methods.\n","authors":["Yi Ren","Ruge Xu","Xinfei Guo","Weikang Qian"],"pdf_url":"https://arxiv.org/pdf/2411.18055v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.19582v2","updated":"2024-12-06T05:14:15Z","published":"2024-04-30T14:19:06Z","title":"URVFL: Undetectable Data Reconstruction Attack on Vertical Federated\n Learning","summary":" Launching effective malicious attacks in VFL presents unique challenges: 1)\nFirstly, given the distributed nature of clients' data features and models,\neach client rigorously guards its privacy and prohibits direct querying,\ncomplicating any attempts to steal data; 2) Existing malicious attacks alter\nthe underlying VFL training task, and are hence easily detected by comparing\nthe received gradients with the ones received in honest training. To overcome\nthese challenges, we develop URVFL, a novel attack strategy that evades current\ndetection mechanisms. The key idea is to integrate a discriminator with\nauxiliary classifier that takes a full advantage of the label information and\ngenerates malicious gradients to the victim clients: on one hand, label\ninformation helps to better characterize embeddings of samples from distinct\nclasses, yielding an improved reconstruction performance; on the other hand,\ncomputing malicious gradients with label information better mimics the honest\ntraining, making the malicious gradients indistinguishable from the honest\nones, and the attack much more stealthy. Our comprehensive experiments\ndemonstrate that URVFL significantly outperforms existing attacks, and\nsuccessfully circumvents SOTA detection methods for malicious attacks.\nAdditional ablation studies and evaluations on defenses further underscore the\nrobustness and effectiveness of URVFL. Our code will be available at\nhttps://github.com/duanyiyao/URVFL.\n","authors":["Duanyi Yao","Songze Li","Xueluan Gong","Sizai Hou","Gaoning Pan"],"pdf_url":"https://arxiv.org/pdf/2404.19582v2.pdf","comment":"Accepted by NDSS 2025"},{"id":"http://arxiv.org/abs/2412.04780v1","updated":"2024-12-06T05:03:10Z","published":"2024-12-06T05:03:10Z","title":"Anomaly Detection and Classification in Knowledge Graphs","summary":" Anomalies such as redundant, inconsistent, contradictory, and deficient\nvalues in a Knowledge Graph (KG) are unavoidable, as these graphs are often\ncurated manually, or extracted using machine learning and natural language\nprocessing techniques. Therefore, anomaly detection is a task that can enhance\nthe quality of KGs. 
In this paper, we propose SEKA (SEeking Knowledge graph\nAnomalies), an unsupervised approach for the detection of abnormal triples and\nentities in KGs. SEKA can help improve the correctness of a KG whilst retaining\nits coverage. We propose an adaption of the Path Rank Algorithm (PRA), named\nthe Corroborative Path Rank Algorithm (CPRA), which is an efficient adaptation\nof PRA that is customized to detect anomalies in KGs. Furthermore, we also\npresent TAXO (TAXOnomy of anomaly types in KGs), a taxonomy of possible anomaly\ntypes that can occur in a KG. This taxonomy provides a classification of the\nanomalies discovered by SEKA with an extensive discussion of possible data\nquality issues in a KG. We evaluate both approaches using the four real-world\nKGs YAGO-1, KBpedia, Wikidata, and DSKG to demonstrate the ability of SEKA and\nTAXO to outperform the baselines.\n","authors":["Asara Senaratne","Peter Christen","Pouya Omran","Graham Williams"],"pdf_url":"https://arxiv.org/pdf/2412.04780v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04778v1","updated":"2024-12-06T05:00:01Z","published":"2024-12-06T05:00:01Z","title":"IterNorm: Fast Iterative Normalization","summary":" Transformer-based large language models are a memory-bound model whose\noperation is based on a large amount of data that are marginally reused. Thus,\nthe data movement between a host and accelerator likely dictates the total\nwall-clock time. Layer normalization is one of the key workloads in the\ntransformer model, following each of multi-head attention and feed-forward\nnetwork blocks. To reduce data movement, layer normalization needs to be\nperformed on the same chip as the matrix-matrix multiplication engine. To this\nend, we introduce an iterative L2-normalization method for 1D input (IterNorm),\nensuring fast convergence to the steady-state solution within five iteration\nsteps and high precision, outperforming the fast inverse square root algorithm\nin six out of nine cases for FP32 and five out of nine for BFloat16 across the\nembedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the\nIterNorm macro normalizes $d$-dimensional vectors, where $64 \\leq d \\leq 1024$,\nwith a latency of 112-227 cycles at 100MHz/1.05V.\n","authors":["ChangMin Ye","Yonguk Sim","Youngchae Kim","SeongMin Jin","Doo Seok Jeong"],"pdf_url":"https://arxiv.org/pdf/2412.04778v1.pdf","comment":"Design, Automation & Test in Europe Conference 2025"},{"id":"http://arxiv.org/abs/2302.00098v2","updated":"2024-12-06T04:51:31Z","published":"2023-01-31T20:58:08Z","title":"Does Deep Active Learning Work in the Wild?","summary":" Deep active learning (DAL) methods have shown significant improvements in\nsample efficiency compared to simple random sampling. While these studies are\nvaluable, they nearly always assume that optimal DAL hyperparameter (HP)\nsettings are known in advance, or optimize the HPs through repeating DAL\nseveral times with different HP settings. Here, we argue that in real-world\nsettings, or in the wild, there is significant uncertainty regarding good HPs,\nand their optimization contradicts the premise of using DAL (i.e., we require\nlabeling efficiency). In this study, we evaluate the performance of eleven\nmodern DAL methods on eight benchmark problems as we vary a key HP shared by\nall methods: the pool ratio. Despite adjusting only one HP, our results\nindicate that eight of the eleven DAL methods sometimes underperform relative\nto simple random sampling and some frequently perform worse. 
Only three methods\nalways outperform random sampling (albeit narrowly), and we find that these\nmethods all utilize diversity to select samples - a relatively simple\ncriterion. Our findings reveal the limitations of existing DAL methods when\ndeployed in the wild, and present this as an important new open problem in the\nfield.\n","authors":["Simiao Ren","Saad Lahrichi","Yang Deng","Willie J. Padilla","Leslie Collins","Jordan Malof"],"pdf_url":"https://arxiv.org/pdf/2302.00098v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03716v2","updated":"2024-12-06T04:40:40Z","published":"2024-12-04T21:09:45Z","title":"A Water Efficiency Dataset for African Data Centers","summary":" AI computing and data centers consume a large amount of freshwater, both\ndirectly for cooling and indirectly for electricity generation. While most\nattention has been paid to developed countries such as the U.S., this paper\npresents the first-of-its-kind dataset that combines nation-level weather and\nelectricity generation data to estimate water usage efficiency for data centers\nin 41 African countries across five different climate regions. We also use our\ndataset to evaluate and estimate the water consumption of inference on two\nlarge language models (i.e., Llama-3-70B and GPT-4) in 11 selected African\ncountries. Our findings show that writing a 10-page report using Llama-3-70B\ncould consume about \\textbf{0.7 liters} of water, while the water consumption\nby GPT-4 for the same task may go up to about 60 liters. For writing a\nmedium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about\n\\textbf{0.13 liters} and 3 liters of water, respectively. Interestingly, given\nthe same AI model, 8 out of the 11 selected African countries consume less\nwater than the global average, mainly because of lower water intensities for\nelectricity generation. However, water consumption can be substantially higher\nin some African countries with a steppe climate than the U.S. and global\naverages, prompting more attention when deploying AI computing in these\ncountries. Our dataset is publicly available on\n\\href{https://huggingface.co/datasets/masterlion/WaterEfficientDatasetForAfricanCountries/tree/main}{Hugging\nFace}.\n","authors":["Noah Shumba","Opelo Tshekiso","Pengfei Li","Giulia Fanti","Shaolei Ren"],"pdf_url":"https://arxiv.org/pdf/2412.03716v2.pdf","comment":"Accepted by NeurIPS 2024 Workshop on Tackling Climate Change with\n Machine Learning"},{"id":"http://arxiv.org/abs/2412.04775v1","updated":"2024-12-06T04:38:43Z","published":"2024-12-06T04:38:43Z","title":"A Temporally Correlated Latent Exploration for Reinforcement Learning","summary":" Efficient exploration remains one of the longstanding problems of deep\nreinforcement learning. Instead of depending solely on extrinsic rewards from\nthe environments, existing methods use intrinsic rewards to enhance\nexploration. However, we demonstrate that these methods are vulnerable to Noisy\nTV and stochasticity. To tackle this problem, we propose Temporally Correlated\nLatent Exploration (TeCLE), which is a novel intrinsic reward formulation that\nemploys an action-conditioned latent space and temporal correlation. The\naction-conditioned latent space estimates the probability distribution of\nstates, thereby avoiding the assignment of excessive intrinsic rewards to\nunpredictable states and effectively addressing both problems. 
Whereas previous\nworks inject temporal correlation for action selection, the proposed method\ninjects it for intrinsic reward computation. We find that the injected temporal\ncorrelation determines the exploratory behaviors of agents. Various experiments\nshow that the environment where the agent performs well depends on the amount\nof temporal correlation. To the best of our knowledge, the proposed TeCLE is\nthe first approach to consider the action conditioned latent space and temporal\ncorrelation for curiosity-driven exploration. We prove that the proposed TeCLE\ncan be robust to the Noisy TV and stochasticity in benchmark environments,\nincluding Minigrid and Stochastic Atari.\n","authors":["SuMin Oh","WanSoo Kim","HyunJin Kim"],"pdf_url":"https://arxiv.org/pdf/2412.04775v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2311.09614v4","updated":"2024-12-06T04:35:45Z","published":"2023-11-16T06:58:46Z","title":"Comprehensive framework for evaluation of deep neural networks in\n detection and quantification of lymphoma from PET/CT images: clinical\n insights, pitfalls, and observer agreement analyses","summary":" This study addresses critical gaps in automated lymphoma segmentation from\nPET/CT images, focusing on issues often overlooked in existing literature.\nWhile deep learning has been applied for lymphoma lesion segmentation, few\nstudies incorporate out-of-distribution testing, raising concerns about model\ngeneralizability across diverse imaging conditions and patient populations. We\nhighlight the need to compare model performance with expert human annotators,\nincluding intra- and inter-observer variability, to understand task difficulty\nbetter. Most approaches focus on overall segmentation accuracy but overlook\nlesion-specific measures important for precise lesion detection and disease\nquantification. To address these gaps, we propose a clinically relevant\nframework for evaluating deep segmentation networks. Using this lesion\nmeasure-specific evaluation, we assess the performance of four deep networks\n(ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from\nmulti-institutional datasets, covering various lymphoma subtypes and lesion\ncharacteristics. Beyond standard metrics like the Dice similarity coefficient,\nwe evaluate clinical lesion measures and their prediction errors. We also\nintroduce detection criteria for lesion localization and propose a new\ndetection Criterion 3 based on metabolic characteristics. We show that networks\nperform better on large, intense lesions with higher metabolic activity.\nFinally, we compare network performance to physicians via intra- and\ninter-observer variability analyses, demonstrating that network errors closely\nresemble those made by experts, i.e., the small and faint lesions remain\nchallenging for both humans and networks. This study aims to improve automated\nlesion segmentation's clinical relevance, supporting better treatment decisions\nfor lymphoma patients. The code is available at:\nhttps://github.com/microsoft/lymphoma-segmentation-dnn.\n","authors":["Shadab Ahamed","Yixi Xu","Sara Kurkowska","Claire Gowdy","Joo H. O","Ingrid Bloise","Don Wilson","Patrick Martineau","François Bénard","Fereshteh Yousefirizi","Rahul Dodhia","Juan M. Lavista","William B. Weeks","Carlos F. 
Uribe","Arman Rahmim"],"pdf_url":"https://arxiv.org/pdf/2311.09614v4.pdf","comment":"32 pages, 15 figures, 5 tables"},{"id":"http://arxiv.org/abs/2412.04767v1","updated":"2024-12-06T04:23:05Z","published":"2024-12-06T04:23:05Z","title":"Towards counterfactual fairness thorough auxiliary variables","summary":" The challenge of balancing fairness and predictive accuracy in machine\nlearning models, especially when sensitive attributes such as race, gender, or\nage are considered, has motivated substantial research in recent years.\nCounterfactual fairness ensures that predictions remain consistent across\ncounterfactual variations of sensitive attributes, which is a crucial concept\nin addressing societal biases. However, existing counterfactual fairness\napproaches usually overlook intrinsic information about sensitive features,\nlimiting their ability to achieve fairness while simultaneously maintaining\nperformance. To tackle this challenge, we introduce EXOgenous Causal reasoning\n(EXOC), a novel causal reasoning framework motivated by exogenous variables. It\nleverages auxiliary variables to uncover intrinsic properties that give rise to\nsensitive attributes. Our framework explicitly defines an auxiliary node and a\ncontrol node that contribute to counterfactual fairness and control the\ninformation flow within the model. Our evaluation, conducted on synthetic and\nreal-world datasets, validates EXOC's superiority, showing that it outperforms\nstate-of-the-art approaches in achieving counterfactual fairness.\n","authors":["Bowei Tian","Ziyao Wang","Shwai He","Wanghao Ye","Guoheng Sun","Yucong Dai","Yongkai Wu","Ang Li"],"pdf_url":"https://arxiv.org/pdf/2412.04767v1.pdf","comment":"arXiv admin note: text overlap with arXiv:2307.08232 by other authors"},{"id":"http://arxiv.org/abs/2412.04766v1","updated":"2024-12-06T04:18:49Z","published":"2024-12-06T04:18:49Z","title":"DAWN-SI: Data-Aware and Noise-Informed Stochastic Interpolation for\n Solving Inverse Problems","summary":" Inverse problems, which involve estimating parameters from incomplete or\nnoisy observations, arise in various fields such as medical imaging,\ngeophysics, and signal processing. These problems are often ill-posed,\nrequiring regularization techniques to stabilize the solution. In this work, we\nemploy $\\textit{Stochastic Interpolation}$ (SI), a generative framework that\nintegrates both deterministic and stochastic processes to map a simple\nreference distribution, such as a Gaussian, to the target distribution. Our\nmethod $\\textbf{DAWN-SI}$: $\\textbf{D}$ata-$\\textbf{AW}$are and\n$\\textbf{N}$oise-informed $\\textbf{S}$tochastic $\\textbf{I}$nterpolation\nincorporates data and noise embedding, allowing the model to access\nrepresentations about the measured data explicitly and also account for noise\nin the observations, making it particularly robust in scenarios where data is\nnoisy or incomplete. By learning a time-dependent velocity field, SI not only\nprovides accurate solutions but also enables uncertainty quantification by\ngenerating multiple plausible outcomes. Unlike pre-trained diffusion models,\nwhich may struggle in highly ill-posed settings, our approach is trained\nspecifically for each inverse problem and adapts to varying noise levels. 
We\nvalidate the effectiveness and robustness of our method through extensive\nnumerical experiments on tasks such as image deblurring and tomography.\n","authors":["Shadab Ahamed","Eldad Haber"],"pdf_url":"https://arxiv.org/pdf/2412.04766v1.pdf","comment":"20 pages, 11 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.04764v1","updated":"2024-12-06T04:16:35Z","published":"2024-12-06T04:16:35Z","title":"Short-term Streamflow and Flood Forecasting based on Graph Convolutional\n Recurrent Neural Network and Residual Error Learning","summary":" Accurate short-term streamflow and flood forecasting are critical for\nmitigating river flood impacts, especially given the increasing climate\nvariability. Machine learning-based streamflow forecasting relies on large\nstreamflow datasets derived from rating curves. Uncertainties in rating curve\nmodeling could introduce errors to the streamflow data and affect the\nforecasting accuracy. This study proposes a streamflow forecasting method that\naddresses these data errors, enhancing the accuracy of river flood forecasting\nand flood modeling, thereby reducing flood-related risk. A convolutional\nrecurrent neural network is used to capture spatiotemporal patterns, coupled\nwith residual error learning and forecasting. The neural network outperforms\ncommonly used forecasting models over 1-6 hours of forecasting horizons, and\nthe residual error learners can further correct the residual errors. This\nprovides a more reliable tool for river flood forecasting and climate\nadaptation in this critical 1-6 hour time window for flood risk mitigation\nefforts.\n","authors":["Xiyu Pan","Neda Mohammadi","John E. Taylor"],"pdf_url":"https://arxiv.org/pdf/2412.04764v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03962v2","updated":"2024-12-06T04:11:24Z","published":"2024-12-05T08:26:13Z","title":"Local Curvature Smoothing with Stein's Identity for Efficient Score\n Matching","summary":" The training of score-based diffusion models (SDMs) is based on score\nmatching. The challenge of score matching is that it includes a computationally\nexpensive Jacobian trace. While several methods have been proposed to avoid\nthis computation, each has drawbacks, such as instability during training and\napproximating the learning as learning a denoising vector field rather than a\ntrue score. We propose a novel score matching variant, local curvature\nsmoothing with Stein's identity (LCSS). The LCSS bypasses the Jacobian trace by\napplying Stein's identity, enabling regularization effectiveness and efficient\ncomputation. We show that LCSS surpasses existing methods in sample generation\nperformance and matches the performance of denoising score matching, widely\nadopted by most SDMs, in evaluations such as FID, Inception score, and bits per\ndimension. Furthermore, we show that LCSS enables realistic image generation\neven at a high resolution of $1024 \\times 1024$.\n","authors":["Genki Osada","Makoto Shing","Takashi Nishide"],"pdf_url":"https://arxiv.org/pdf/2412.03962v2.pdf","comment":"Accepted at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2405.01124v4","updated":"2024-12-06T03:53:11Z","published":"2024-05-02T09:38:07Z","title":"Investigating Self-Supervised Image Denoising with Denaturation","summary":" Self-supervised learning for image denoising problems in the presence of\ndenaturation for noisy data is a crucial approach in machine learning. However,\ntheoretical understanding of the performance of the approach that uses\ndenatured data is lacking. 
To provide better understanding of the approach, in\nthis paper, we analyze a self-supervised denoising algorithm that uses\ndenatured data in depth through theoretical analysis and numerical experiments.\nThrough the theoretical analysis, we discuss that the algorithm finds desired\nsolutions to the optimization problem with the population risk, while the\nguarantee for the empirical risk depends on the hardness of the denoising task\nin terms of denaturation levels. We also conduct several experiments to\ninvestigate the performance of an extended algorithm in practice. The results\nindicate that the algorithm training with denatured images works, and the\nempirical performance aligns with the theoretical results. These results\nsuggest several insights for further improvement of self-supervised image\ndenoising that uses denatured data in future directions.\n","authors":["Hiroki Waida","Kimihiro Yamazaki","Atsushi Tokuhisa","Mutsuyo Wada","Yuichiro Wada"],"pdf_url":"https://arxiv.org/pdf/2405.01124v4.pdf","comment":"The PDF v3 has a wrong license, while v4 has a correct license"},{"id":"http://arxiv.org/abs/2312.09193v3","updated":"2024-12-06T03:52:24Z","published":"2023-12-14T18:14:11Z","title":"Fast Sampling via Discrete Non-Markov Diffusion Models with\n Predetermined Transition Time","summary":" Discrete diffusion models have emerged as powerful tools for high-quality\ndata generation. Despite their success in discrete spaces, such as text\ngeneration tasks, the acceleration of discrete diffusion models remains\nunder-explored. In this paper, we propose discrete non-Markov diffusion models\n(DNDM), which naturally induce the predetermined transition time set. This\nenables a training-free sampling algorithm that significantly reduces the\nnumber of function evaluations (i.e., calls to the neural network), making the\nsampling process much faster. Furthermore, we study the transition from finite\nto infinite step sampling, offering new insights into bridging the gap between\ndiscrete and continuous-time processes for discrete diffusion models. Extensive\nexperiments on natural language generation and machine translation tasks\ndemonstrate the superior performance of our method in terms of both generation\nspeed and sample quality compared to existing methods for discrete diffusion\nmodels.\n","authors":["Zixiang Chen","Huizhuo Yuan","Yongqian Li","Yiwen Kou","Junkai Zhang","Quanquan Gu"],"pdf_url":"https://arxiv.org/pdf/2312.09193v3.pdf","comment":"36 pages, 5 figures, 13 tables. In NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04758v1","updated":"2024-12-06T03:48:47Z","published":"2024-12-06T03:48:47Z","title":"Measuring Goal-Directedness","summary":" We define maximum entropy goal-directedness (MEG), a formal measure of\ngoal-directedness in causal models and Markov decision processes, and give\nalgorithms for computing it. Measuring goal-directedness is important, as it is\na critical element of many concerns about harm from AI. It is also of\nphilosophical interest, as goal-directedness is a key aspect of agency. MEG is\nbased on an adaptation of the maximum causal entropy framework used in inverse\nreinforcement learning. It can measure goal-directedness with respect to a\nknown utility function, a hypothesis class of utility functions, or a set of\nrandom variables. 
We prove that MEG satisfies several desiderata and\ndemonstrate our algorithms with small-scale experiments.\n","authors":["Matt MacDermott","James Fox","Francesco Belardinelli","Tom Everitt"],"pdf_url":"https://arxiv.org/pdf/2412.04758v1.pdf","comment":"Accepted to the 38th Conference on Neural Information Processing\n Systems (NeurIPS 2024)"},{"id":"http://arxiv.org/abs/2412.04757v1","updated":"2024-12-06T03:46:06Z","published":"2024-12-06T03:46:06Z","title":"Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free\n Dynamic Triangular Attention Pattern","summary":" The quadratic computational complexity of the attention mechanism in current\nLarge Language Models (LLMs) renders inference with long contexts prohibitively\nexpensive. To address this challenge, various approaches aim to retain critical\nportions of the context to optimally approximate Full Attention (FA) through\nKey-Value (KV) compression or Sparse Attention (SA), enabling the processing of\nvirtually unlimited text lengths in a streaming manner. However, these methods\nstruggle to achieve performance levels comparable to FA, particularly in\nretrieval tasks. In this paper, our analysis of attention head patterns reveals\nthat LLMs' attention distributions show strong local correlations, naturally\nreflecting a chunking mechanism for input context. We propose Ltri-LLM\nframework, which divides KVs into spans, stores them in an offline index, and\nretrieves the relevant KVs into memory for various queries. Experimental\nresults on popular long text benchmarks show that Ltri-LLM can achieve\nperformance close to FA while maintaining efficient, streaming-based inference.\n","authors":["Hongyin Tang","Di Xiu","Lanrui Wang","Xiurui Geng","Jingang Wang","Xunliang Cai"],"pdf_url":"https://arxiv.org/pdf/2412.04757v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04755v1","updated":"2024-12-06T03:40:21Z","published":"2024-12-06T03:40:21Z","title":"Latent Space Characterization of Autoencoder Variants","summary":" Understanding the latent spaces learned by deep learning models is crucial in\nexploring how they represent and generate complex data. Autoencoders (AEs) have\nplayed a key role in the area of representation learning, with numerous\nregularization techniques and training principles developed not only to enhance\ntheir ability to learn compact and robust representations, but also to reveal\nhow different architectures influence the structure and smoothness of the\nlower-dimensional non-linear manifold. We strive to characterize the structure\nof the latent spaces learned by different autoencoders including convolutional\nautoencoders (CAEs), denoising autoencoders (DAEs), and variational\nautoencoders (VAEs) and how they change with the perturbations in the input. By\ncharacterizing the matrix manifolds corresponding to the latent spaces, we\nprovide an explanation for the well-known observation that the latent spaces of\nCAE and DAE form non-smooth manifolds, while that of VAE forms a smooth\nmanifold. We also map the points of the matrix manifold to a Hilbert space\nusing distance preserving transforms and provide an alternate view in terms of\nthe subspaces generated in the Hilbert space as a function of the distortion in\nthe input. 
The results show that the latent manifolds of CAE and DAE are\nstratified with each stratum being a smooth product manifold, while the\nmanifold of VAE is a smooth product manifold of two symmetric positive definite\nmatrices and a symmetric positive semi-definite matrix.\n","authors":["Anika Shrivastava","Renu Rameshan","Samar Agnihotri"],"pdf_url":"https://arxiv.org/pdf/2412.04755v1.pdf","comment":"8 pages, 6 figures, and 1 table"},{"id":"http://arxiv.org/abs/2412.04752v1","updated":"2024-12-06T03:33:31Z","published":"2024-12-06T03:33:31Z","title":"GABAR: Graph Attention-Based Action Ranking for Relational Policy\n Learning","summary":" We propose a novel approach to learn relational policies for classical\nplanning based on learning to rank actions. We introduce a new graph\nrepresentation that explicitly captures action information and propose a Graph\nNeural Network architecture augmented with Gated Recurrent Units (GRUs) to\nlearn action rankings. Our model is trained on small problem instances and\ngeneralizes to significantly larger instances where traditional planning\nbecomes computationally expensive. Experimental results across standard\nplanning benchmarks demonstrate that our action-ranking approach achieves\ngeneralization to significantly larger problems than those used in training.\n","authors":["Rajesh Mangannavar","Stefan Lee","Alan Fern","Prasad Tadepalli"],"pdf_url":"https://arxiv.org/pdf/2412.04752v1.pdf","comment":"6 Pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.04749v1","updated":"2024-12-06T03:25:01Z","published":"2024-12-06T03:25:01Z","title":"Machine learning algorithms to predict the risk of rupture of\n intracranial aneurysms: a systematic review","summary":" Purpose: Subarachnoid haemorrhage is a potentially fatal consequence of\nintracranial aneurysm rupture, however, it is difficult to predict if aneurysms\nwill rupture. Prophylactic treatment of an intracranial aneurysm also involves\nrisk, hence identifying rupture-prone aneurysms is of substantial clinical\nimportance. This systematic review aims to evaluate the performance of machine\nlearning algorithms for predicting intracranial aneurysm rupture risk.\n Methods: MEDLINE, Embase, Cochrane Library and Web of Science were searched\nuntil December 2023. Studies incorporating any machine learning algorithm to\npredict the risk of rupture of an intracranial aneurysm were included. Risk of\nbias was assessed using the Prediction Model Risk of Bias Assessment Tool\n(PROBAST). PROSPERO registration: CRD42023452509. Results: Out of 10,307\nrecords screened, 20 studies met the eligibility criteria for this review\nincorporating a total of 20,286 aneurysm cases. The machine learning models\ngave a 0.66-0.90 range for performance accuracy. The models were compared to\ncurrent clinical standards in six studies and gave mixed results. Most studies\nposed high or unclear risks of bias and concerns for applicability, limiting\nthe inferences that can be drawn from them. There was insufficient homogenous\ndata for a meta-analysis.\n Conclusions: Machine learning can be applied to predict the risk of rupture\nfor intracranial aneurysms. However, the evidence does not comprehensively\ndemonstrate superiority to existing practice, limiting its role as a clinical\nadjunct. 
Further prospective multicentre studies of recent machine learning\ntools are needed to prove clinical validation before they are implemented in\nthe clinic.\n","authors":["Karan Daga","Siddharth Agarwal","Zaeem Moti","Matthew BK Lee","Munaib Din","David Wood","Marc Modat","Thomas C Booth"],"pdf_url":"https://arxiv.org/pdf/2412.04749v1.pdf","comment":"Clin Neuroradiol (2024)"},{"id":"http://arxiv.org/abs/2412.04738v1","updated":"2024-12-06T02:59:01Z","published":"2024-12-06T02:59:01Z","title":"DHIL-GT: Scalable Graph Transformer with Decoupled Hierarchy Labeling","summary":" Graph Transformer (GT) has recently emerged as a promising neural network\narchitecture for learning graph-structured data. However, its global attention\nmechanism with quadratic complexity concerning the graph scale prevents wider\napplication to large graphs. While current methods attempt to enhance GT\nscalability by altering model architecture or encoding hierarchical graph data,\nour analysis reveals that these models still suffer from the computational\nbottleneck related to graph-scale operations. In this work, we target the GT\nscalability issue and propose DHIL-GT, a scalable Graph Transformer that\nsimplifies network learning by fully decoupling the graph computation to a\nseparate stage in advance. DHIL-GT effectively retrieves hierarchical\ninformation by exploiting the graph labeling technique, as we show that the\ngraph label hierarchy is more informative than plain adjacency by offering\nglobal connections while promoting locality, and is particularly suitable for\nhandling complex graph patterns such as heterophily. We further design subgraph\nsampling and positional encoding schemes for precomputing model input on top of\ngraph labels in an end-to-end manner. The training stage thus favorably removes\ngraph-related computations, leading to ideal mini-batch capability and GPU\nutilization. Notably, the precomputation and training processes of DHIL-GT\nachieve complexities linear to the number of graph edges and nodes,\nrespectively. Extensive experiments demonstrate that DHIL-GT is efficient in\nterms of computational boost and mini-batch capability over existing scalable\nGraph Transformer designs on large-scale benchmarks, while achieving top-tier\neffectiveness on both homophilous and heterophilous graphs.\n","authors":["Ningyi Liao","Zihao Yu","Siqiang Luo"],"pdf_url":"https://arxiv.org/pdf/2412.04738v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.12376v2","updated":"2024-12-06T02:58:51Z","published":"2024-04-18T17:57:53Z","title":"Matching the Statistical Query Lower Bound for $k$-Sparse Parity\n Problems with Sign Stochastic Gradient Descent","summary":" The $k$-sparse parity problem is a classical problem in computational\ncomplexity and algorithmic theory, serving as a key benchmark for understanding\ncomputational classes. In this paper, we solve the $k$-sparse parity problem\nwith sign stochastic gradient descent, a variant of stochastic gradient descent\n(SGD) on two-layer fully-connected neural networks. We demonstrate that this\napproach can efficiently solve the $k$-sparse parity problem on a\n$d$-dimensional hypercube ($k\\leq O(\\sqrt{d})$) with a sample complexity of\n$\\tilde{O}(d^{k-1})$ using $2^{\\Theta(k)}$ neurons, matching the established\n$\\Omega(d^{k})$ lower bounds of Statistical Query (SQ) models. Our theoretical\nanalysis begins by constructing a good neural network capable of correctly\nsolving the $k$-parity problem. 
We then demonstrate how a trained neural\nnetwork with sign SGD can effectively approximate this good network, solving\nthe $k$-parity problem with small statistical errors. To the best of our\nknowledge, this is the first result that matches the SQ lower bound for solving\n$k$-sparse parity problem using gradient-based methods.\n","authors":["Yiwen Kou","Zixiang Chen","Quanquan Gu","Sham M. Kakade"],"pdf_url":"https://arxiv.org/pdf/2404.12376v2.pdf","comment":"37 pages, 7 figures, 3 tables. In NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04737v1","updated":"2024-12-06T02:54:45Z","published":"2024-12-06T02:54:45Z","title":"Generative Humanization for Therapeutic Antibodies","summary":" Antibody therapies have been employed to address some of today's most\nchallenging diseases, but must meet many criteria during drug development\nbefore reaching a patient. Humanization is a sequence optimization strategy\nthat addresses one critical risk called immunogenicity - a patient's immune\nresponse to the drug - by making an antibody more \"human-like\" in the absence\nof a predictive lab-based test for immunogenicity. However, existing\nhumanization strategies generally yield very few humanized candidates, which\nmay have degraded biophysical properties or decreased drug efficacy. Here, we\nre-frame humanization as a conditional generative modeling task, where\nhumanizing mutations are sampled from a language model trained on human\nantibody data. We describe a sampling process that incorporates models of\ntherapeutic attributes, such as antigen binding affinity, to obtain candidate\nsequences that have both reduced immunogenicity risk and maintained or improved\ntherapeutic properties, allowing this algorithm to be readily embedded into an\niterative antibody optimization campaign. We demonstrate in silico and in lab\nvalidation that in real therapeutic programs our generative humanization method\nproduces diverse sets of antibodies that are both (1) highly-human and (2) have\nfavorable therapeutic properties, such as improved binding to target antigens.\n","authors":["Cade Gordon","Aniruddh Raghu","Hunter Elliott","Peyton Greenside"],"pdf_url":"https://arxiv.org/pdf/2412.04737v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04733v1","updated":"2024-12-06T02:47:54Z","published":"2024-12-06T02:47:54Z","title":"An Experimental Evaluation of Imputation Models for Spatial-Temporal\n Traffic Data","summary":" Traffic data imputation is a critical preprocessing step in intelligent\ntransportation systems, enabling advanced transportation services. Despite\nsignificant advancements in this field, selecting the most suitable model for\npractical applications remains challenging due to three key issues: 1)\nincomprehensive consideration of missing patterns that describe how data loss\nalong spatial and temporal dimensions, 2) the lack of test on standardized\ndatasets, and 3) insufficient evaluations. To this end, we first propose\npractice-oriented taxonomies for missing patterns and imputation models,\nsystematically identifying all possible forms of real-world traffic data loss\nand analyzing the characteristics of existing models. Furthermore, we introduce\na unified benchmarking pipeline to comprehensively evaluate 10 representative\nmodels across various missing patterns and rates. 
This work aims to provide a\nholistic understanding of traffic data imputation research and serve as a\npractical guideline.\n","authors":["Shengnan Guo","Tonglong Wei","Yiheng Huang","Miaomiao Zhao","Ran Chen","Yan Lin","Youfang Lin","Huaiyu Wan"],"pdf_url":"https://arxiv.org/pdf/2412.04733v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2402.08918v3","updated":"2024-12-06T02:46:38Z","published":"2024-02-14T03:16:13Z","title":"SimMLP: Training MLPs on Graphs without Supervision","summary":" Graph Neural Networks (GNNs) have demonstrated their effectiveness in various\ngraph learning tasks, yet their reliance on neighborhood aggregation during\ninference poses challenges for deployment in latency-sensitive applications,\nsuch as real-time financial fraud detection. To address this limitation, recent\nstudies have proposed distilling knowledge from teacher GNNs into student\nMulti-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate\ninference. However, these approaches often inadequately explore structural\ninformation when inferring unseen nodes. To this end, we introduce SimMLP, a\nSelf-supervised framework for learning MLPs on graphs, designed to fully\nintegrate rich structural information into MLPs. Notably, SimMLP is the first\nMLP-learning method that can achieve equivalence to GNNs in the optimal case.\nThe key idea is to employ self-supervised learning to align the representations\nencoded by graph context-aware GNNs and neighborhood dependency-free MLPs,\nthereby fully integrating the structural information into MLPs. We provide a\ncomprehensive theoretical analysis, demonstrating the equivalence between\nSimMLP and GNNs based on mutual information and inductive bias, highlighting\nSimMLP's advanced structural learning capabilities. Additionally, we conduct\nextensive experiments on 20 benchmark datasets, covering node classification,\nlink prediction, and graph classification, to showcase SimMLP's superiority\nover state-of-the-art baselines, particularly in scenarios involving unseen\nnodes (e.g., inductive and cold-start node classification) where structural\ninsights are crucial. Our codes are available at:\nhttps://github.com/Zehong-Wang/SimMLP.\n","authors":["Zehong Wang","Zheyuan Zhang","Chuxu Zhang","Yanfang Ye"],"pdf_url":"https://arxiv.org/pdf/2402.08918v3.pdf","comment":"New Version: arXiv:2412.03864"},{"id":"http://arxiv.org/abs/2412.03704v2","updated":"2024-12-06T02:21:48Z","published":"2024-12-04T20:35:07Z","title":"Scaling Inference-Time Search with Vision Value Model for Improved\n Visual Comprehension","summary":" Despite significant advancements in vision-language models (VLMs), there\nlacks effective approaches to enhance response quality by scaling\ninference-time computation. This capability is known to be a core step towards\nthe self-improving models in recent large language model studies. In this\npaper, we present Vision Value Model (VisVM) that can guide VLM inference-time\nsearch to generate responses with better visual comprehension. Specifically,\nVisVM not only evaluates the generated sentence quality in the current search\nstep, but also anticipates the quality of subsequent sentences that may result\nfrom the current step, thus providing a long-term value. In this way, VisVM\nsteers VLMs away from generating sentences prone to hallucinations or\ninsufficient detail, thereby producing higher quality responses. 
Experimental\nresults demonstrate that VisVM-guided search significantly enhances VLMs'\nability to generate descriptive captions with richer visual details and fewer\nhallucinations, compared with greedy decoding and search methods with other\nvisual reward signals. Furthermore, we find that self-training the model with\nthe VisVM-guided captions improve VLM's performance across a wide range of\nmultimodal benchmarks, indicating the potential for developing self-improving\nVLMs. Our value model and code are available at\nhttps://github.com/si0wang/VisVM.\n","authors":["Xiyao Wang","Zhengyuan Yang","Linjie Li","Hongjin Lu","Yuancheng Xu","Chung-Ching Lin","Kevin Lin","Furong Huang","Lijuan Wang"],"pdf_url":"https://arxiv.org/pdf/2412.03704v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.05185v1","updated":"2024-12-06T17:04:42Z","published":"2024-12-06T17:04:42Z","title":"LinVT: Empower Your Image-level Large Language Model to Understand\n Videos","summary":" Large Language Models (LLMs) have been widely used in various tasks,\nmotivating us to develop an LLM-based assistant for videos. Instead of training\nfrom scratch, we propose a module to transform arbitrary well-trained\nimage-based LLMs into video-LLMs (after being trained on video data). To better\nadapt image-LLMs for processing videos, we introduce two design principles:\nlinear transformation to preserve the original visual-language alignment and\nrepresentative information condensation from redundant video content. Guided by\nthese principles, we propose a plug-and-play Linear Video Tokenizer(LinVT),\nwhich enables existing image-LLMs to understand videos. We benchmark LinVT with\nsix recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL,\nshowcasing the high compatibility of LinVT. LinVT-based LLMs achieve\nstate-of-the-art performance across various video benchmarks, illustrating the\neffectiveness of LinVT in multi-modal video understanding.\n","authors":["Lishuai Gao","Yujie Zhong","Yingsen Zeng","Haoxian Tan","Dengjie Li","Zheng Zhao"],"pdf_url":"https://arxiv.org/pdf/2412.05185v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.05035v1","updated":"2024-12-06T13:39:36Z","published":"2024-12-06T13:39:36Z","title":"SMIC: Semantic Multi-Item Compression based on CLIP dictionary","summary":" Semantic compression, a compression scheme where the distortion metric,\ntypically MSE, is replaced with semantic fidelity metrics, tends to become more\nand more popular. Most recent semantic compression schemes rely on the\nfoundation model CLIP. In this work, we extend such a scheme to image\ncollection compression, where inter-item redundancy is taken into account\nduring the coding phase. For that purpose, we first show that CLIP's latent\nspace allows for easy semantic additions and subtractions. From this property,\nwe define a dictionary-based multi-item codec that outperforms state-of-the-art\ngenerative codec in terms of compression rate, around $10^{-5}$ BPP per image,\nwhile not sacrificing semantic fidelity. 
We also show that the learned\ndictionary is of a semantic nature and works as a semantic projector for the\nsemantic content of images.\n","authors":["Tom Bachard","Thomas Maugey"],"pdf_url":"https://arxiv.org/pdf/2412.05035v1.pdf","comment":"12 pages, 14 figures, 3 tables, journal paper, preprint"},{"id":"http://arxiv.org/abs/2411.19772v2","updated":"2024-12-06T07:24:10Z","published":"2024-11-29T15:18:06Z","title":"LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware\n Omni-Modal Perception of Long Videos","summary":" Despite impressive advancements in video understanding, most efforts remain\nlimited to coarse-grained or visual-only video tasks. However, real-world\nvideos encompass omni-modal information (vision, audio, and speech) with a\nseries of events forming a cohesive storyline. The lack of multi-modal video\ndata with fine-grained event annotations and the high cost of manual labeling\nare major obstacles to comprehensive omni-modality video perception. To address\nthis gap, we propose an automatic pipeline consisting of high-quality\nmulti-modal video filtering, semantically coherent omni-modal event boundary\ndetection, and cross-modal correlation-aware event captioning. In this way, we\npresent LongVALE, the first-ever Vision-Audio-Language Event understanding\nbenchmark comprising 105K omni-modal events with precise temporal boundaries\nand detailed relation-aware captions within 8.4K high-quality long videos.\nFurther, we build a baseline that leverages LongVALE to enable video large\nlanguage models (LLMs) for omni-modality fine-grained temporal video\nunderstanding for the first time. Extensive experiments demonstrate the\neffectiveness and great potential of LongVALE in advancing comprehensive\nmulti-modal video understanding.\n","authors":["Tiantian Geng","Jinrui Zhang","Qingni Wang","Teng Wang","Jinming Duan","Feng Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.19772v2.pdf","comment":"18 pages, 15 figures"},{"id":"http://arxiv.org/abs/2412.04746v1","updated":"2024-12-06T03:18:18Z","published":"2024-12-06T03:18:18Z","title":"Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval\n with Semantic Guidance","summary":" Modern music retrieval systems often rely on fixed representations of user\npreferences, limiting their ability to capture users' diverse and uncertain\nretrieval needs. To address this limitation, we introduce Diff4Steer, a novel\ngenerative retrieval framework that employs lightweight diffusion models to\nsynthesize diverse seed embeddings from user queries that represent potential\ndirections for music exploration. Unlike deterministic methods that map user\nquery to a single point in embedding space, Diff4Steer provides a statistical\nprior on the target modality (audio) for retrieval, effectively capturing the\nuncertainty and multi-faceted nature of user preferences. Furthermore,\nDiff4Steer can be steered by image or text inputs, enabling more flexible and\ncontrollable music discovery combined with nearest neighbor search. Our\nframework outperforms deterministic regression methods and LLM-based generative\nretrieval baseline in terms of retrieval and ranking metrics, demonstrating its\neffectiveness in capturing user preferences, leading to more diverse and\nrelevant recommendations. 
Listening examples are available at\ntinyurl.com/diff4steer.\n","authors":["Xuchan Bao","Judith Yue Li","Zhong Yi Wan","Kun Su","Timo Denk","Joonseok Lee","Dima Kuzmin","Fei Sha"],"pdf_url":"https://arxiv.org/pdf/2412.04746v1.pdf","comment":"NeurIPS 2024 Creative AI Track"},{"id":"http://arxiv.org/abs/2411.12825v2","updated":"2024-12-06T01:32:53Z","published":"2024-11-19T19:22:24Z","title":"TopoCode: Topologically Informed Error Detection and Correction in\n Communication Systems","summary":" Traditional error detection and correction codes focus on bit-level fidelity,\nwhich is insufficient for emerging technologies like eXtended Reality (XR) and\nholographic communications requiring high-data-rate, low-latency systems.\nBit-level metrics cannot comprehensively evaluate Quality-of-Service (QoS) in\nthese scenarios. This letter proposes TopoCode which leverages Topological Data\nAnalysis (TDA) and persistent homology to encode topological information for\nmessage-level error detection and correction. It introduces minimal redundancy\nwhile enabling effective data reconstruction, especially in low Signal-to-Noise\nRatio (SNR) conditions. TopoCode offers a promising approach to meet the\ndemands of next-generation communication systems prioritizing semantic accuracy\nand message-level integrity.\n","authors":["Hongzhi Guo"],"pdf_url":"https://arxiv.org/pdf/2411.12825v2.pdf","comment":null}]},"2024-12-05T00:00:00Z":{"Computation and Language":[{"id":"http://arxiv.org/abs/2406.15736v2","updated":"2024-12-05T23:59:06Z","published":"2024-06-22T05:04:39Z","title":"Evaluating Large Vision-and-Language Models on Children's Mathematical\n Olympiads","summary":" Recent years have seen a significant progress in the general-purpose problem\nsolving abilities of large vision and language models (LVLMs), such as ChatGPT,\nGemini, etc.; some of these breakthroughs even seem to enable AI models to\noutperform human abilities in varied tasks that demand higher-order cognitive\nskills. Are the current large AI models indeed capable of generalized problem\nsolving as humans do? A systematic analysis of AI capabilities for joint vision\nand text reasoning, however, is missing in the current scientific literature.\nIn this paper, we make an effort towards filling this gap, by evaluating\nstate-of-the-art LVLMs on their mathematical and algorithmic reasoning\nabilities using visuo-linguistic problems from children's Olympiads.\nSpecifically, we consider problems from the Mathematical Kangaroo (MK)\nOlympiad, which is a popular international competition targeted at children\nfrom grades 1-12, that tests children's deeper mathematical abilities using\npuzzles that are appropriately gauged to their age and skills. Using the\npuzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840\nproblems from years 2020-2024. With our dataset, we analyze LVLMs power on\nmathematical reasoning; their responses on our puzzles offer a direct way to\ncompare against that of children. Our results show that modern LVLMs do\ndemonstrate increasingly powerful reasoning skills in solving problems for\nhigher grades, but lack the foundations to correctly answer problems designed\nfor younger children. 
Further analysis shows that there is no significant\ncorrelation between the reasoning capabilities of AI models and that of young\nchildren, and their capabilities appear to be based on a different type of\nreasoning than the cumulative knowledge that underlies children's mathematics\nand logic skills.\n","authors":["Anoop Cherian","Kuan-Chuan Peng","Suhas Lohit","Joanna Matthiesen","Kevin Smith","Joshua B. Tenenbaum"],"pdf_url":"https://arxiv.org/pdf/2406.15736v2.pdf","comment":"Accepted at NeurIPS 2024 (Datasets and Benchmarks Track)"},{"id":"http://arxiv.org/abs/2406.08391v2","updated":"2024-12-05T23:48:19Z","published":"2024-06-12T16:41:31Z","title":"Large Language Models Must Be Taught to Know What They Don't Know","summary":" When using large language models (LLMs) in high-stakes applications, we need\nto know when we can trust their predictions. Some works argue that prompting\nhigh-performance LLMs is sufficient to produce calibrated uncertainties, while\nothers introduce sampling methods that can be prohibitively expensive. In this\nwork, we first argue that prompting on its own is insufficient to achieve good\ncalibration and then show that fine-tuning on a small dataset of correct and\nincorrect answers can create an uncertainty estimate with good generalization\nand small computational overhead. We show that a thousand graded examples are\nsufficient to outperform baseline methods and that training through the\nfeatures of a model is necessary for good performance and tractable for large\nopen-source models when using LoRA. We also investigate the mechanisms that\nenable reliable LLM uncertainty estimation, finding that many models can be\nused as general-purpose uncertainty estimators, applicable not just to their\nown uncertainties but also the uncertainty of other models. Lastly, we show\nthat uncertainty estimates inform human use of LLMs in human-AI collaborative\nsettings through a user study.\n","authors":["Sanyam Kapoor","Nate Gruver","Manley Roberts","Katherine Collins","Arka Pal","Umang Bhatt","Adrian Weller","Samuel Dooley","Micah Goldblum","Andrew Gordon Wilson"],"pdf_url":"https://arxiv.org/pdf/2406.08391v2.pdf","comment":"NeurIPS 2024 Camera Ready"},{"id":"http://arxiv.org/abs/2411.13282v2","updated":"2024-12-05T22:42:50Z","published":"2024-11-20T12:49:42Z","title":"Combining Autoregressive and Autoencoder Language Models for Text\n Classification","summary":" This paper presents CAALM-TC (Combining Autoregressive and Autoencoder\nLanguage Models for Text Classification), a novel method that enhances text\nclassification by integrating autoregressive and autoencoder language models.\nAutoregressive large language models such as Open AI's GPT, Meta's Llama or\nMicrosoft's Phi offer promising prospects for content analysis practitioners,\nbut they generally underperform supervised BERT based models for text\nclassification. CAALM leverages autoregressive models to generate contextual\ninformation based on input texts, which is then combined with the original text\nand fed into an autoencoder model for classification. This hybrid approach\ncapitalizes on the extensive contextual knowledge of autoregressive models and\nthe efficient classification capabilities of autoencoders. Experimental results\non four benchmark datasets demonstrate that CAALM consistently outperforms\nexisting methods, particularly in tasks with smaller datasets and more abstract\nclassification objectives. 
The findings indicate that CAALM offers a scalable\nand effective solution for automated content analysis in social science\nresearch that minimizes sample size requirements.\n","authors":["João Gonçalves"],"pdf_url":"https://arxiv.org/pdf/2411.13282v2.pdf","comment":"There is an error in the figure in page 7, where the formula and\n representation for an autoencoder based classifier are inconsistent and may\n mislead readers"},{"id":"http://arxiv.org/abs/2312.11556v3","updated":"2024-12-05T22:32:50Z","published":"2023-12-17T08:07:32Z","title":"StarVector: Generating Scalable Vector Graphics Code from Images and\n Text","summary":" Scalable Vector Graphics (SVGs) are vital for modern image rendering due to\ntheir scalability and versatility. Previous SVG generation methods have focused\non curve-based vectorization, lacking semantic understanding, often producing\nartifacts, and struggling with SVG primitives beyond path curves. To address\nthese issues, we introduce StarVector, a multimodal large language model for\nSVG generation. It performs image vectorization by understanding image\nsemantics and using SVG primitives for compact, precise outputs. Unlike\ntraditional methods, StarVector works directly in the SVG code space,\nleveraging visual understanding to apply accurate SVG primitives. To train\nStarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables\ngeneralization across vectorization tasks and precise use of primitives like\nellipses, polygons, and text. We address challenges in SVG evaluation, showing\nthat pixel-based metrics like MSE fail to capture the unique qualities of\nvector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3\ntasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this\nsetup, StarVector achieves state-of-the-art performance, producing more compact\nand semantically rich SVGs.\n","authors":["Juan A. Rodriguez","Abhay Puri","Shubham Agarwal","Issam H. Laradji","Pau Rodriguez","Sai Rajeswar","David Vazquez","Christopher Pal","Marco Pedersoli"],"pdf_url":"https://arxiv.org/pdf/2312.11556v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.15275v4","updated":"2024-12-05T21:50:25Z","published":"2024-06-21T16:10:05Z","title":"How language models extrapolate outside the training data: A case study\n in Textualized Gridworld","summary":" Language models' ability to extrapolate learned behaviors to novel, more\ncomplex environments beyond their training scope is highly unknown. This study\nintroduces a path planning task in a textualized Gridworld to probe language\nmodels' extrapolation capabilities. We show that conventional approaches,\nincluding next token prediction and Chain of Thought (CoT) finetuning, fail to\nextrapolate in larger, unseen environments. Inspired by human cognition and\ndual process theory, we propose cognitive maps for path planning, a novel CoT\nframework that simulates humanlike mental representations. Our experiments show\nthat cognitive maps not only enhance extrapolation to unseen environments but\nalso exhibit humanlike characteristics through structured mental simulation and\nrapid adaptation. Our finding that these cognitive maps require specialized\ntraining schemes and cannot be induced through simple prompting opens up\nimportant questions about developing general-purpose cognitive maps in language\nmodels. 
Our comparison with exploration-based methods further illuminates the\ncomplementary strengths of offline planning and online exploration.\n","authors":["Doyoung Kim","Jongwon Lee","Jinho Park","Minjoon Seo"],"pdf_url":"https://arxiv.org/pdf/2406.15275v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04628v1","updated":"2024-12-05T21:50:22Z","published":"2024-12-05T21:50:22Z","title":"SWEPO: Simultaneous Weighted Preference Optimization for Group\n Contrastive Alignment","summary":" We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel\nextension of Direct Preference Optimization (DPO) designed to accommodate\nmultiple dynamically chosen positive and negative responses for each query.\nSWEPO employs a weighted group contrastive loss, assigning weights to responses\nbased on their deviation from the mean reward score. This approach effectively\nprioritizes responses that are significantly better or worse than the average,\nenhancing optimization. Our theoretical analysis demonstrates that\nsimultaneously considering multiple preferences reduces alignment bias,\nresulting in more robust alignment. Additionally, we provide insights into the\ntraining dynamics of our loss function and a related function, InfoNCA.\nEmpirical validation on the UltraFeedback dataset establishes SWEPO as\nstate-of-the-art, with superior performance in downstream evaluations using the\nAlpacaEval dataset.\n","authors":["Taneesh Gupta","Rahul Madhavan","Xuchao Zhang","Chetan Bansal","Saravan Rajmohan"],"pdf_url":"https://arxiv.org/pdf/2412.04628v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04626v1","updated":"2024-12-05T21:41:20Z","published":"2024-12-05T21:41:20Z","title":"BigDocs: An Open and Permissively-Licensed Dataset for Training\n Multimodal Models on Document and Code Tasks","summary":" Multimodal AI has the potential to significantly enhance\ndocument-understanding tasks, such as processing receipts, understanding\nworkflows, extracting data from documents, and summarizing reports. Code\ngeneration tasks that require long-structured outputs can also be enhanced by\nmultimodality. Despite this, their use in commercial applications is often\nlimited due to limited access to training data and restrictive licensing, which\nhinders open access. To address these limitations, we introduce BigDocs-7.5M, a\nhigh-quality, open-access dataset comprising 7.5 million multimodal documents\nacross 30 tasks. We use an efficient data curation process to ensure our data\nis high-quality and license-permissive. Our process emphasizes accountability,\nresponsibility, and transparency through filtering rules, traceable metadata,\nand careful content analysis. Additionally, we introduce BigDocs-Bench, a\nbenchmark suite with 10 novel tasks where we create datasets that reflect\nreal-world use cases involving reasoning over Graphical User Interfaces (GUI)\nand code generation from images. Our experiments show that training with\nBigDocs-Bench improves average performance up to 25.8% over closed-source\nGPT-4o in document reasoning and structured output tasks such as\nScreenshot2HTML or Image2Latex generation. Finally, human evaluations showed a\npreference for outputs from models trained on BigDocs over GPT-4o. This\nsuggests that BigDocs can help both academics and the open-source community\nutilize and improve AI tools to enhance multimodal capabilities and document\nreasoning. 
The project is hosted at https://bigdocs.github.io .\n","authors":["Juan Rodriguez","Xiangru Jian","Siba Smarak Panigrahi","Tianyu Zhang","Aarash Feizi","Abhay Puri","Akshay Kalkunte","François Savard","Ahmed Masry","Shravan Nayak","Rabiul Awal","Mahsa Massoud","Amirhossein Abaskohi","Zichao Li","Suyuchen Wang","Pierre-André Noël","Mats Leon Richter","Saverio Vadacchino","Shubbam Agarwal","Sanket Biswas","Sara Shanian","Ying Zhang","Noah Bolger","Kurt MacDonald","Simon Fauvel","Sathwik Tejaswi","Srinivas Sunkara","Joao Monteiro","Krishnamurthy DJ Dvijotham","Torsten Scholak","Nicolas Chapados","Sepideh Kharagani","Sean Hughes","M. Özsu","Siva Reddy","Marco Pedersoli","Yoshua Bengio","Christopher Pal","Issam Laradji","Spandanna Gella","Perouz Taslakian","David Vazquez","Sai Rajeswar"],"pdf_url":"https://arxiv.org/pdf/2412.04626v1.pdf","comment":"The project is hosted at https://bigdocs.github.io"},{"id":"http://arxiv.org/abs/2412.04619v1","updated":"2024-12-05T21:12:37Z","published":"2024-12-05T21:12:37Z","title":"Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization","summary":" Neural networks often favor shortcut heuristics based on surface-level\npatterns. As one example, language models (LMs) behave like n-gram models early\nin training. However, to correctly apply grammatical rules, LMs must rely on\nhierarchical syntactic representations instead of n-grams. In this work, we use\ncases studies of English grammar to explore how latent structure in training\ndata drives models toward improved out-of-distribution (OOD) generalization.We\nthen investigate how data composition can lead to inconsistent OOD behavior\nacross random seeds and to unstable training dynamics. Our results show that\nmodels stabilize in their OOD behavior only when they fully commit to either a\nsurface-level linear rule or a hierarchical rule. The hierarchical rule,\nfurthermore, is induced by grammatically complex sequences with deep embedding\nstructures, whereas the linear rule is induced by simpler sequences. When the\ndata contains a mix of simple and complex examples, potential rules compete;\neach independent training run either stabilizes by committing to a single rule\nor remains unstable in its OOD behavior. These conditions lead `stable seeds'\nto cluster around simple rules, forming bimodal performance distributions\nacross seeds. We also identify an exception to the relationship between\nstability and generalization: models which memorize patterns from low-diversity\ntraining data can overfit stably, with different rules for memorized and\nunmemorized patterns. Our findings emphasize the critical role of training data\nin shaping generalization patterns and how competition between data subsets\ncontributes to inconsistent generalization outcomes across random seeds. Code\nis available at https://github.com/sunnytqin/concept_comp.git.\n","authors":["Tian Qin","Naomi Saphra","David Alvarez-Melis"],"pdf_url":"https://arxiv.org/pdf/2412.04619v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04614v1","updated":"2024-12-05T21:00:46Z","published":"2024-12-05T21:00:46Z","title":"Extractive Structures Learned in Pretraining Enable Generalization on\n Finetuned Facts","summary":" Pretrained language models (LMs) can generalize to implications of facts that\nthey are finetuned on. For example, if finetuned on ``John Doe lives in Tokyo,\"\nLMs can correctly answer ``What language do the people in John Doe's city\nspeak?'' with ``Japanese''. 
However, little is known about the mechanisms that\nenable this generalization or how they are learned during pretraining. We\nintroduce extractive structures as a framework for describing how components in\nLMs (e.g., MLPs or attention heads) coordinate to enable this generalization.\nThe structures consist of informative components that store training facts as\nweight changes, and upstream and downstream extractive components that query\nand process the stored information to produce the correct implication. We\nhypothesize that extractive structures are learned during pretraining when\nencountering implications of previously known facts. This yields two\npredictions: a data ordering effect where extractive structures can be learned\nonly if facts precede their implications, and a weight grafting effect where\nextractive structures can be transferred to predict counterfactual\nimplications. We empirically demonstrate these phenomena in the OLMo-7b, Llama\n3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results\nalso indicate that fact learning can occur at both early and late layers, which\nlead to different forms of generalization.\n","authors":["Jiahai Feng","Stuart Russell","Jacob Steinhardt"],"pdf_url":"https://arxiv.org/pdf/2412.04614v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04606v1","updated":"2024-12-05T20:43:39Z","published":"2024-12-05T20:43:39Z","title":"Semantic Consistency-Based Uncertainty Quantification for Factuality in\n Radiology Report Generation","summary":" Radiology report generation (RRG) has shown great potential in assisting\nradiologists by automating the labor-intensive task of report writing. While\nrecent advancements have improved the quality and coherence of generated\nreports, ensuring their factual correctness remains a critical challenge.\nAlthough generative medical Vision Large Language Models (VLLMs) have been\nproposed to address this issue, these models are prone to hallucinations and\ncan produce inaccurate diagnostic information. To address these concerns, we\nintroduce a novel Semantic Consistency-Based Uncertainty Quantification\nframework that provides both report-level and sentence-level uncertainties.\nUnlike existing approaches, our method does not require modifications to the\nunderlying model or access to its inner state, such as output token logits,\nthus serving as a plug-and-play module that can be seamlessly integrated with\nstate-of-the-art models. Extensive experiments demonstrate the efficacy of our\nmethod in detecting hallucinations and enhancing the factual accuracy of\nautomatically generated radiology reports. By abstaining from high-uncertainty\nreports, our approach improves factuality scores by $10$%, achieved by\nrejecting $20$% of reports using the Radialog model on the MIMIC-CXR dataset.\nFurthermore, sentence-level uncertainty flags the lowest-precision sentence in\neach report with an $82.9$% success rate.\n","authors":["Chenyu Wang","Weichao Zhou","Shantanu Ghosh","Kayhan Batmanghelich","Wenchao Li"],"pdf_url":"https://arxiv.org/pdf/2412.04606v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2403.02472v7","updated":"2024-12-05T20:43:34Z","published":"2024-03-04T20:34:58Z","title":"OffensiveLang: A Community Based Implicit Offensive Language Dataset","summary":" The widespread presence of hateful languages on social media has resulted in\nadverse effects on societal well-being. As a result, addressing this issue with\nhigh priority has become very important. 
Hate speech or offensive languages\nexist in both explicit and implicit forms, with the latter being more\nchallenging to detect. Current research in this domain encounters several\nchallenges. Firstly, the existing datasets primarily rely on the collection of\ntexts containing explicit offensive keywords, making it challenging to capture\nimplicitly offensive contents that are devoid of these keywords. Secondly,\ncommon methodologies tend to focus solely on textual analysis, neglecting the\nvaluable insights that community information can provide. In this research\npaper, we introduce a novel dataset OffensiveLang, a community based implicit\noffensive language dataset generated by ChatGPT 3.5 containing data for 38\ndifferent target groups. Despite limitations in generating offensive texts\nusing ChatGPT due to ethical constraints, we present a prompt-based approach\nthat effectively generates implicit offensive languages. To ensure data\nquality, we evaluate the dataset with human. Additionally, we employ a\nprompt-based zero-shot method with ChatGPT and compare the detection results\nbetween human annotation and ChatGPT annotation. We utilize existing\nstate-of-the-art models to see how effective they are in detecting such\nlanguages. The dataset is available here:\nhttps://github.com/AmitDasRup123/OffensiveLang\n","authors":["Amit Das","Mostafa Rahgouy","Dongji Feng","Zheng Zhang","Tathagata Bhattacharya","Nilanjana Raychawdhary","Fatemeh Jamshidi","Vinija Jain","Aman Chadha","Mary Sandage","Lauramarie Pope","Gerry Dozier","Cheryl Seals"],"pdf_url":"https://arxiv.org/pdf/2403.02472v7.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.17613v2","updated":"2024-12-05T20:41:27Z","published":"2024-05-27T19:22:41Z","title":"Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal\n Learning","summary":" Supervised multi-modal learning involves mapping multiple modalities to a\ntarget label. Previous studies in this field have concentrated on capturing in\nisolation either the inter-modality dependencies (the relationships between\ndifferent modalities and the label) or the intra-modality dependencies (the\nrelationships within a single modality and the label). We argue that these\nconventional approaches that rely solely on either inter- or intra-modality\ndependencies may not be optimal in general. We view the multi-modal learning\nproblem from the lens of generative models where we consider the target as a\nsource of multiple modalities and the interaction between them. Towards that\nend, we propose inter- & intra-modality modeling (I2M2) framework, which\ncaptures and integrates both the inter- and intra-modality dependencies,\nleading to more accurate predictions. We evaluate our approach using real-world\nhealthcare and vision-and-language datasets with state-of-the-art models,\ndemonstrating superior performance over traditional methods focusing only on\none type of modality dependency.\n","authors":["Divyam Madaan","Taro Makino","Sumit Chopra","Kyunghyun Cho"],"pdf_url":"https://arxiv.org/pdf/2405.17613v2.pdf","comment":"Accepted to NeurIPS 2024. Code available at\n https://github.com/divyam3897/I2M2"},{"id":"http://arxiv.org/abs/2412.04602v1","updated":"2024-12-05T20:32:23Z","published":"2024-12-05T20:32:23Z","title":"Formulation of probability theory problem with subtle condition","summary":" Problems in probability theory prove to be one of the most challenging for\nstudents. 
Here, we formulate and discuss four related problems in probability\ntheory that proved difficult for first to fourth-year undergraduate students\nwhose first language was not English. These examples emphasize how crucial it\nis to understand the conditions and requirements of the problems precisely\nbefore starting to solve them. We discuss the solutions to those problems in\ndetail, complement them with numerical estimations, and link the conditions in\nthe problems to the logical statements in Python programming language. We also\ntested two widely used chatbots (GPT-4o and Claude 3.5 Sonnet) by checking\ntheir responses to these problems.\n","authors":["Rafayel Petrosyan"],"pdf_url":"https://arxiv.org/pdf/2412.04602v1.pdf","comment":"7 pages"},{"id":"http://arxiv.org/abs/2408.10411v2","updated":"2024-12-05T20:28:35Z","published":"2024-08-19T20:50:41Z","title":"Resolving Lexical Bias in Edit Scoping with Projector Editor Networks","summary":" Weight-preserving model editing techniques heavily rely on the scoping\nmechanism that decides when to apply an edit to the base model. These scoping\nmechanisms utilize distance functions in the representation space to ascertain\nthe scope of the edit. In this work, we show that distance-based scoping\nfunctions grapple with lexical biases leading to issues such as misfires with\nirrelevant prompts that share similar lexical characteristics. To address this\nproblem, we introduce, Projector Editor Networks for Model Editing (PENME),is a\nmodel editing approach that employs a compact adapter with a projection network\ntrained via a contrastive learning objective. We demonstrate the efficacy of\nPENME in achieving superior results while being compute efficient and flexible\nto adapt across model architectures.\n","authors":["Hammad Rizwan","Domenic Rosati","Ga Wu","Hassan Sajjad"],"pdf_url":"https://arxiv.org/pdf/2408.10411v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.14774v2","updated":"2024-12-05T19:50:22Z","published":"2024-06-20T22:56:31Z","title":"Evaluating Numerical Reasoning in Text-to-Image Models","summary":" Text-to-image generative models are capable of producing high-quality images\nthat often faithfully depict concepts described using natural language. In this\nwork, we comprehensively evaluate a range of text-to-image models on numerical\nreasoning tasks of varying difficulty, and show that even the most advanced\nmodels have only rudimentary numerical skills. Specifically, their ability to\ncorrectly generate an exact number of objects in an image is limited to small\nnumbers, it is highly dependent on the context the number term appears in, and\nit deteriorates quickly with each successive number. We also demonstrate that\nmodels have poor understanding of linguistic quantifiers (such as \"a few\" or\n\"as many as\"), the concept of zero, and struggle with more advanced concepts\nsuch as partial quantities and fractional representations. 
We bundle prompts,\ngenerated images and human annotations into GeckoNum, a novel benchmark for\nevaluation of numerical reasoning.\n","authors":["Ivana Kajić","Olivia Wiles","Isabela Albuquerque","Matthias Bauer","Su Wang","Jordi Pont-Tuset","Aida Nematzadeh"],"pdf_url":"https://arxiv.org/pdf/2406.14774v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04576v1","updated":"2024-12-05T19:46:53Z","published":"2024-12-05T19:46:53Z","title":"Show, Don't Tell: Uncovering Implicit Character Portrayal using LLMs","summary":" Tools for analyzing character portrayal in fiction are valuable for writers\nand literary scholars in developing and interpreting compelling stories.\nExisting tools, such as visualization tools for analyzing fictional characters,\nprimarily rely on explicit textual indicators of character attributes. However,\nportrayal is often implicit, revealed through actions and behaviors rather than\nexplicit statements. We address this gap by leveraging large language models\n(LLMs) to uncover implicit character portrayals. We start by generating a\ndataset for this task with greater cross-topic similarity, lexical diversity,\nand narrative lengths than existing narrative text corpora such as TinyStories\nand WritingPrompts. We then introduce LIIPA (LLMs for Inferring Implicit\nPortrayal for Character Analysis), a framework for prompting LLMs to uncover\ncharacter portrayals. LIIPA can be configured to use various types of\nintermediate computation (character attribute word lists, chain-of-thought) to\ninfer how fictional characters are portrayed in the source text. We find that\nLIIPA outperforms existing approaches, and is more robust to increasing\ncharacter counts (number of unique persons depicted) due to its ability to\nutilize full narrative context. Lastly, we investigate the sensitivity of\nportrayal estimates to character demographics, identifying a fairness-accuracy\ntradeoff among methods in our LIIPA framework -- a phenomenon familiar within\nthe algorithmic fairness literature. Despite this tradeoff, all LIIPA variants\nconsistently outperform non-LLM baselines in both fairness and accuracy. Our\nwork demonstrates the potential benefits of using LLMs to analyze complex\ncharacters and to better understand how implicit portrayal biases may manifest\nin narrative texts.\n","authors":["Brandon Jaipersaud","Zining Zhu","Frank Rudzicz","Elliot Creager"],"pdf_url":"https://arxiv.org/pdf/2412.04576v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.08926v3","updated":"2024-12-05T19:46:36Z","published":"2024-08-15T17:23:10Z","title":"Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks\n of Language Models","summary":" Language Model (LM) agents for cybersecurity that are capable of autonomously\nidentifying vulnerabilities and executing exploits have potential to cause\nreal-world impact. Policymakers, model providers, and researchers in the AI and\ncybersecurity communities are interested in quantifying the capabilities of\nsuch agents to help mitigate cyberrisk and investigate opportunities for\npenetration testing. Toward that end, we introduce Cybench, a framework for\nspecifying cybersecurity tasks and evaluating agents on those tasks. We include\n40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF\ncompetitions, chosen to be recent, meaningful, and spanning a wide range of\ndifficulties. 
Each task includes its own description, starter files, and is\ninitialized in an environment where an agent can execute commands and observe\noutputs. Since many tasks are beyond the capabilities of existing LM agents, we\nintroduce subtasks for each task, which break down a task into intermediary\nsteps for a more detailed evaluation. To evaluate agent capabilities, we\nconstruct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI\no1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini\n1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top performing\nmodels (GPT-4o and Claude 3.5 Sonnet), we further investigate performance\nacross 4 agent scaffolds (structed bash, action-only, pseudoterminal, and web\nsearch). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o,\nOpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that\ntook human teams up to 11 minutes to solve. In comparison, the most difficult\ntask took human teams 24 hours and 54 minutes to solve. All code and data are\npublicly available at https://cybench.github.io.\n","authors":["Andy K. Zhang","Neil Perry","Riya Dulepet","Joey Ji","Celeste Menders","Justin W. Lin","Eliot Jones","Gashon Hussein","Samantha Liu","Donovan Jasper","Pura Peetathawatchai","Ari Glenn","Vikram Sivashankar","Daniel Zamoshchin","Leo Glikbarg","Derek Askaryar","Mike Yang","Teddy Zhang","Rishi Alluri","Nathan Tran","Rinnara Sangpisit","Polycarpos Yiorkadjis","Kenny Osele","Gautham Raghupathi","Dan Boneh","Daniel E. Ho","Percy Liang"],"pdf_url":"https://arxiv.org/pdf/2408.08926v3.pdf","comment":"151 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.04573v1","updated":"2024-12-05T19:35:41Z","published":"2024-12-05T19:35:41Z","title":"Give me Some Hard Questions: Synthetic Data Generation for Clinical QA","summary":" Clinical Question Answering (QA) systems enable doctors to quickly access\npatient information from electronic health records (EHRs). However, training\nthese systems requires significant annotated data, which is limited due to the\nexpertise needed and the privacy concerns associated with clinical data. This\npaper explores generating Clinical QA data using large language models (LLMs)\nin a zero-shot setting. We find that naive prompting often results in easy\nquestions that do not reflect the complexity of clinical scenarios. To address\nthis, we propose two prompting strategies: 1) instructing the model to generate\nquestions that do not overlap with the input context, and 2) summarizing the\ninput record using a predefined schema to scaffold question generation.\nExperiments on two Clinical QA datasets demonstrate that our method generates\nmore challenging questions, significantly improving fine-tuning performance\nover baselines. We compare synthetic and gold data and find a gap between their\ntraining efficacy resulting from the quality of synthetically generated\nanswers.\n","authors":["Fan Bai","Keith Harrigian","Joel Stremmel","Hamid Hassanzadeh","Ardavan Saeedi","Mark Dredze"],"pdf_url":"https://arxiv.org/pdf/2412.04573v1.pdf","comment":"Accepted to ML4H 2024 Findings"},{"id":"http://arxiv.org/abs/2410.12119v3","updated":"2024-12-05T19:22:33Z","published":"2024-10-15T23:34:22Z","title":"Scaling Laws for Post Training Quantized Large Language Models","summary":" Generalization abilities of well-trained large language models (LLMs) are\nknown to scale predictably as a function of model size. 
In contrast to the\nexistence of practical scaling laws governing pre-training, the quality of LLMs\nafter post-training compression remains highly unpredictable, often requiring\ncase-by-case validation in practice. In this work, we attempted to close this\ngap for post-training weight quantization of LLMs by conducting a systematic\nempirical study on multiple LLM families quantized to numerous low-precision\ntensor data types using popular weight quantization techniques. We identified\nkey scaling factors pertaining to characteristics of the local loss landscape,\nbased on which the performance of quantized LLMs can be reasonably well\npredicted by a statistical model.\n","authors":["Zifei Xu","Alexander Lan","Wanzin Yazar","Tristan Webb","Sayeh Sharify","Xin Wang"],"pdf_url":"https://arxiv.org/pdf/2410.12119v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04467v1","updated":"2024-12-05T18:59:53Z","published":"2024-12-05T18:59:53Z","title":"VisionZip: Longer is Better but Not Necessary in Vision Language Models","summary":" Recent advancements in vision-language models have enhanced performance by\nincreasing the length of visual tokens, making them much longer than text\ntokens and significantly raising computational costs. However, we observe that\nthe visual tokens generated by popular vision encoders, such as CLIP and\nSigLIP, contain significant redundancy. To address this, we introduce\nVisionZip, a simple yet effective method that selects a set of informative\ntokens for input to the language model, reducing visual token redundancy and\nimproving efficiency while maintaining model performance. The proposed\nVisionZip can be widely applied to image and video understanding tasks and is\nwell-suited for multi-turn dialogues in real-world scenarios, where previous\nmethods tend to underperform. Experimental results show that VisionZip\noutperforms the previous state-of-the-art method by at least 5% performance\ngains across nearly all settings. Moreover, our method significantly enhances\nmodel inference speed, improving the prefilling time by 8x and enabling the\nLLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while\nachieving better results. Furthermore, we analyze the causes of this redundancy\nand encourage the community to focus on extracting better visual features\nrather than merely increasing token length. Our code is available at\nhttps://github.com/dvlab-research/VisionZip .\n","authors":["Senqiao Yang","Yukang Chen","Zhuotao Tian","Chengyao Wang","Jingyao Li","Bei Yu","Jiaya Jia"],"pdf_url":"https://arxiv.org/pdf/2412.04467v1.pdf","comment":"2 columns, 28 pages, 15 figures, 18 tables"},{"id":"http://arxiv.org/abs/2412.04454v1","updated":"2024-12-05T18:58:26Z","published":"2024-12-05T18:58:26Z","title":"Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction","summary":" Graphical User Interfaces (GUIs) are critical to human-computer interaction,\nyet automating GUI tasks remains challenging due to the complexity and\nvariability of visual environments. Existing approaches often rely on textual\nrepresentations of GUIs, which introduce limitations in generalization,\nefficiency, and scalability. In this paper, we introduce Aguvis, a unified pure\nvision-based framework for autonomous GUI agents that operates across various\nplatforms. Our approach leverages image-based observations, and grounding\ninstructions in natural language to visual elements, and employs a consistent\naction space to ensure cross-platform generalization. 
To address the\nlimitations of previous work, we integrate explicit planning and reasoning\nwithin the model, enhancing its ability to autonomously navigate and interact\nwith complex digital environments. We construct a large-scale dataset of GUI\nagent trajectories, incorporating multimodal reasoning and grounding, and\nemploy a two-stage training pipeline that first focuses on general GUI\ngrounding, followed by planning and reasoning. Through comprehensive\nexperiments, we demonstrate that Aguvis surpasses previous state-of-the-art\nmethods in both offline and real-world online scenarios, achieving, to our\nknowledge, the first fully autonomous pure vision GUI agent capable of\nperforming tasks independently without collaboration with external\nclosed-source models. We open-sourced all datasets, models, and training\nrecipes to facilitate future research at https://aguvis-project.github.io/.\n","authors":["Yiheng Xu","Zekun Wang","Junli Wang","Dunjie Lu","Tianbao Xie","Amrita Saha","Doyen Sahoo","Tao Yu","Caiming Xiong"],"pdf_url":"https://arxiv.org/pdf/2412.04454v1.pdf","comment":"https://aguvis-project.github.io/"},{"id":"http://arxiv.org/abs/2412.04449v1","updated":"2024-12-05T18:58:03Z","published":"2024-12-05T18:58:03Z","title":"p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay","summary":" Despite the remarkable performance of multimodal large language models\n(MLLMs) across diverse tasks, the substantial training and inference costs\nimpede their advancement. The majority of computation stems from the\noverwhelming volume of vision tokens processed by the transformer decoder. In\nthis paper, we propose to build efficient MLLMs by leveraging the\nMixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects\nessential vision tokens to process while skipping redundant ones. However,\nintegrating MoD into MLLMs is non-trivial. To address the challenges of\ntraining and inference stability as well as limited training data, we adapt the\nMoD module with two novel designs: tanh-gated weight normalization (TanhNorm)\nand symmetric token reweighting (STRing). Moreover, we observe that vision\ntokens exhibit higher redundancy in deeper layer and thus design a progressive\nratio decay (PRD) strategy, which gradually reduces the token retention ratio\nlayer by layer, employing a shifted cosine schedule. This crucial design fully\nunleashes the potential of MoD, significantly boosting the efficiency and\nperformance of our models. To validate the effectiveness of our approach, we\nconduct extensive experiments with two baseline models across 14 benchmarks.\nOur model, p-MoD, matches or even surpasses the performance of the baseline\nmodels, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and\n77.7% GPU hours during training.\n","authors":["Jun Zhang","Desen Meng","Ji Qi","Zhenpeng Huang","Tao Wu","Limin Wang"],"pdf_url":"https://arxiv.org/pdf/2412.04449v1.pdf","comment":"Technical Report; Code released at https://github.com/MCG-NJU/p-MoD"},{"id":"http://arxiv.org/abs/2411.04986v2","updated":"2024-12-05T18:57:50Z","published":"2024-11-07T18:55:09Z","title":"The Semantic Hub Hypothesis: Language Models Share Semantic\n Representations Across Languages and Modalities","summary":" Modern language models can process inputs across diverse languages and\nmodalities. 
We hypothesize that models acquire this capability through learning\na shared representation space across heterogeneous data types (e.g., different\nlanguages and modalities), which places semantically similar inputs near one\nanother, even if they are from different modalities/languages. We term this the\nsemantic hub hypothesis, following the hub-and-spoke model from neuroscience\n(Patterson et al., 2007) which posits that semantic knowledge in the human\nbrain is organized through a transmodal semantic \"hub\" which integrates\ninformation from various modality-specific \"spokes\" regions. We first show that\nmodel representations for semantically equivalent inputs in different languages\nare similar in the intermediate layers, and that this space can be interpreted\nusing the model's dominant pretraining language via the logit lens. This\ntendency extends to other data types, including arithmetic expressions, code,\nand visual/audio inputs. Interventions in the shared representation space in\none data type also predictably affect model outputs in other data types,\nsuggesting that this shared representations space is not simply a vestigial\nbyproduct of large-scale training on broad data, but something that is actively\nutilized by the model during input processing.\n","authors":["Zhaofeng Wu","Xinyan Velocity Yu","Dani Yogatama","Jiasen Lu","Yoon Kim"],"pdf_url":"https://arxiv.org/pdf/2411.04986v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04445v1","updated":"2024-12-05T18:57:04Z","published":"2024-12-05T18:57:04Z","title":"Moto: Latent Motion Token as the Bridging Language for Robot\n Manipulation","summary":" Recent developments in Large Language Models pre-trained on extensive corpora\nhave shown significant success in various natural language processing tasks\nwith minimal fine-tuning. This success offers new promise for robotics, which\nhas long been constrained by the high cost of action-labeled data. We ask:\ngiven the abundant video data containing interaction-related knowledge\navailable as a rich \"corpus\", can a similar generative pre-training approach be\neffectively applied to enhance robot learning? The key challenge is to identify\nan effective representation for autoregressive pre-training that benefits robot\nmanipulation tasks. Inspired by the way humans learn new skills through\nobserving dynamic environments, we propose that effective robotic learning\nshould emphasize motion-related knowledge, which is closely tied to low-level\nactions and is hardware-agnostic, facilitating the transfer of learned motions\nto actual robot actions. To this end, we introduce Moto, which converts video\ncontent into latent Motion Token sequences by a Latent Motion Tokenizer,\nlearning a bridging \"language\" of motion from videos in an unsupervised manner.\nWe pre-train Moto-GPT through motion token autoregression, enabling it to\ncapture diverse visual motion knowledge. After pre-training, Moto-GPT\ndemonstrates the promising ability to produce semantically interpretable motion\ntokens, predict plausible motion trajectories, and assess trajectory\nrationality through output likelihood. To transfer learned motion priors to\nreal robot actions, we implement a co-fine-tuning strategy that seamlessly\nbridges latent motion token prediction and real robot control. 
Extensive\nexperiments show that the fine-tuned Moto-GPT exhibits superior robustness and\nefficiency on robot manipulation benchmarks, underscoring its effectiveness in\ntransferring knowledge from video data to downstream visual manipulation tasks.\n","authors":["Yi Chen","Yuying Ge","Yizhuo Li","Yixiao Ge","Mingyu Ding","Ying Shan","Xihui Liu"],"pdf_url":"https://arxiv.org/pdf/2412.04445v1.pdf","comment":"Project released at: https://chenyi99.github.io/moto/"},{"id":"http://arxiv.org/abs/2412.04425v1","updated":"2024-12-05T18:51:10Z","published":"2024-12-05T18:51:10Z","title":"CA-SSLR: Condition-Aware Self-Supervised Learning Representation for\n Generalized Speech Processing","summary":" We introduce Condition-Aware Self-Supervised Learning Representation\n(CA-SSLR), a generalist conditioning model broadly applicable to various\nspeech-processing tasks. Compared to standard fine-tuning methods that optimize\nfor downstream models, CA-SSLR integrates language and speaker embeddings from\nearlier layers, making the SSL model aware of the current language and speaker\ncontext. This approach reduces the reliance on input audio features while\npreserving the integrity of the base SSLR. CA-SSLR improves the model's\ncapabilities and demonstrates its generality on unseen tasks with minimal\ntask-specific tuning. Our method employs linear modulation to dynamically\nadjust internal representations, enabling fine-grained adaptability without\nsignificantly altering the original model behavior. Experiments show that\nCA-SSLR reduces the number of trainable parameters, mitigates overfitting, and\nexcels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a\n10% relative reduction in LID errors, a 37% improvement in ASR CER on the\nML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating\nits effectiveness.\n","authors":["Yen-Ju Lu","Jing Liu","Thomas Thebaud","Laureano Moro-Velazquez","Ariya Rastrow","Najim Dehak","Jesus Villalba"],"pdf_url":"https://arxiv.org/pdf/2412.04425v1.pdf","comment":"38th Conference on Neural Information Processing Systems (NeurIPS\n 2024)"},{"id":"http://arxiv.org/abs/2403.07384v2","updated":"2024-12-05T18:47:47Z","published":"2024-03-12T07:45:33Z","title":"SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large\n Language Models by Summarizing Training Trajectories of Small Models","summary":" Despite the effectiveness of data selection for large language models (LLMs)\nduring pretraining and instruction fine-tuning phases, improving data\nefficiency in supervised fine-tuning (SFT) for specialized domains poses\nsignificant challenges due to the complexity of fine-tuning data. To bridge\nthis gap, we introduce an effective and scalable data selection method for SFT,\nSmallToLarge (S2L), which leverages training trajectories from small models to\nguide the data selection for larger models. We demonstrate through extensive\nexperiments that S2L significantly improves data efficiency in SFT for\nmathematical problem-solving, reducing the training data to just 11% of the\noriginal MathInstruct dataset (Yue et al., 2023) to match full dataset\nperformance while outperforming state-of-the-art data selection algorithms by\nan average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably,\nselecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most\nchallenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et\nal., 2023b) by 16.6%. 
In clinical text summarization on the MIMIC-III dataset\n(Johnson et al., 2016), S2L again outperforms training on the full dataset\nusing only 50% of the data. Notably, S2L can perform data selection using a\nreference model 40x smaller than the target model, proportionally reducing the\ncost of data selection.\n","authors":["Yu Yang","Siddhartha Mishra","Jeffrey N Chiang","Baharan Mirzasoleiman"],"pdf_url":"https://arxiv.org/pdf/2403.07384v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04537v1","updated":"2024-12-05T18:43:11Z","published":"2024-12-05T18:43:11Z","title":"Understanding Hidden Computations in Chain-of-Thought Reasoning","summary":" Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning\nabilities of large language models. However, recent studies have shown that\nmodels can still perform complex reasoning tasks even when the CoT is replaced\nwith filler(hidden) characters (e.g., \"...\"), leaving open questions about how\nmodels internally process and represent reasoning steps. In this paper, we\ninvestigate methods to decode these hidden characters in transformer models\ntrained with filler CoT sequences. By analyzing layer-wise representations\nusing the logit lens method and examining token rankings, we demonstrate that\nthe hidden characters can be recovered without loss of performance. Our\nfindings provide insights into the internal mechanisms of transformer models\nand open avenues for improving interpretability and transparency in language\nmodel reasoning.\n","authors":["Aryasomayajula Ram Bharadwaj"],"pdf_url":"https://arxiv.org/pdf/2412.04537v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2409.12924v3","updated":"2024-12-05T18:35:26Z","published":"2024-09-04T03:17:19Z","title":"WaveletGPT: Wavelets Meet Large Language Models","summary":" Large Language Models (LLMs) have ushered in a new wave of artificial\nintelligence advancements impacting every scientific field and discipline. They\nare trained on a simple objective: to predict the next token given the previous\ncontext. We live in a world where most of the data around us, e.g., text,\naudio, and music, has a multi-scale structure associated with it. This paper\ninfuses LLMs with traditional signal processing ideas, namely wavelets, during\npre-training to take advantage of the structure. Without adding \\textbf{any\nextra parameters} to a GPT-style LLM architecture, we achieve the same\npre-training performance almost twice as fast in text, raw audio, and symbolic\nmusic. This is achieved by imposing a structure on intermediate embeddings.\nWhen trained for the same number of training steps, we achieve significant\ngains in performance, which is comparable to pre-training a larger neural\narchitecture. Our architecture allows every next token prediction access to\nintermediate embeddings at different temporal resolutions in every Transformer\ndecoder block. This work will hopefully pave the way for incorporating\nmulti-rate signal processing ideas into traditional LLM pre-training. 
Further,\nwe showcase pushing model performance by improving internal structure instead\nof just going after scale.\n","authors":["Prateek Verma"],"pdf_url":"https://arxiv.org/pdf/2409.12924v3.pdf","comment":"16 pages, 4 figures"},{"id":"http://arxiv.org/abs/2412.04403v1","updated":"2024-12-05T18:21:49Z","published":"2024-12-05T18:21:49Z","title":"Establishing Task Scaling Laws via Compute-Efficient Model Ladders","summary":" We develop task scaling laws and model ladders to predict the individual task\nperformance of pretrained language models (LMs) in the overtrained setting.\nStandard power laws for language modeling loss cannot accurately model task\nperformance. Therefore, we leverage a two-step prediction approach: first use\nmodel and data size to predict a task-specific loss, and then use this task\nloss to predict task performance. We train a set of small-scale \"ladder\"\nmodels, collect data points to fit the parameterized functions of the two\nprediction steps, and make predictions for two target models: a 7B model\ntrained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder\nmodels only costs 1% of the compute used for the target models. On four\nmultiple-choice tasks written in ranked classification format, we can predict\nthe accuracy of both target models within 2 points of absolute error. We have\nhigher prediction error on four other tasks (average absolute error 6.9) and\nfind that these are often tasks with higher variance in task metrics. We also\nfind that using less compute to train fewer ladder models tends to deteriorate\npredictions. Finally, we empirically show that our design choices and the\ntwo-step approach lead to superior performance in establishing scaling laws.\n","authors":["Akshita Bhagia","Jiacheng Liu","Alexander Wettig","David Heineman","Oyvind Tafjord","Ananya Harsh Jha","Luca Soldaini","Noah A. Smith","Dirk Groeneveld","Pang Wei Koh","Jesse Dodge","Hannaneh Hajishirzi"],"pdf_url":"https://arxiv.org/pdf/2412.04403v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02819v2","updated":"2024-12-05T17:51:20Z","published":"2024-12-03T20:35:57Z","title":"CNNSum: Exploring Long-Context Summarization with Large Language Models\n in Chinese Novels","summary":" Large Language Models (LLMs) have been well-researched in many long-context\ntasks. However, due to high annotation costs, high-quality long-context summary\ndatasets for training or evaluation are scarce, limiting further research. In\nthis work, we introduce CNNSum, a new multi-scale Chinese long-context novel\nsummarization benchmark, including four subsets, length covering\n16k\\textasciitilde128k, 695 samples in total, the annotations are human-driven.\nWe evaluate commercial and open-source models on CNNSum and conduct a detailed\nanalysis. Based on the observations, we further conduct fine-tuning exploration\nwith short-context summary data. In our study: (1) GPT-4o underperformed, due\nto excessive subjective commentary. (2) Currently, long-context summarization\nmainly relies on memory ability, small LLMs with stable longer context lengths\nare the most cost-effective. Using long data concatenated from short-context\nsummaries makes a significant improvement. (3) Prompt templates may cause a\nlarge performance gap but can be mitigated through fine-tuning. (4) Fine-tuned\nChat or Instruction versions may harm the Base model and further fine-tuning\ncannot bridge the performance gap. 
(5) while models with RoPE base scaling exhibit\nstrong extrapolation potential, their performance may vary significantly when\ncombined with other interpolation methods and need careful selection. (6)\nCNNSum provides more reliable and insightful evaluation results than other\nbenchmarks. We release CNNSum to advance research in this field.\n","authors":["Lingxiao Wei","He Yan","Xiangju Lu","Junmin Zhu","Jun Wang","Wei Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.02819v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.02589v2","updated":"2024-12-05T17:41:48Z","published":"2024-11-04T20:29:35Z","title":"Context-Informed Machine Translation of Manga using Multimodal Large\n Language Models","summary":" Due to the significant time and effort required for handcrafting\ntranslations, most manga never leave the domestic Japanese market. Automatic\nmanga translation is a promising potential solution. However, it is a budding\nand underdeveloped field and presents complexities even greater than those\nfound in standard translation due to the need to effectively incorporate visual\nelements into the translation process to resolve ambiguities. In this work, we\ninvestigate to what extent multimodal large language models (LLMs) can provide\neffective manga translation, thereby assisting manga authors and publishers in\nreaching wider audiences. Specifically, we propose a methodology that leverages\nthe vision component of multimodal LLMs to improve translation quality and\nevaluate the impact of translation unit size, context length, and propose a\ntoken efficient approach for manga translation. Moreover, we introduce a new\nevaluation dataset -- the first parallel Japanese-Polish manga translation\ndataset -- as part of a benchmark to be used in future research. Finally, we\ncontribute an open-source software suite, enabling others to benchmark LLMs for\nmanga translation. Our findings demonstrate that our proposed methods achieve\nstate-of-the-art results for Japanese-English translation and set a new\nstandard for Japanese-Polish.\n","authors":["Philip Lippmann","Konrad Skublicki","Joshua Tanner","Shonosuke Ishiwatari","Jie Yang"],"pdf_url":"https://arxiv.org/pdf/2411.02589v2.pdf","comment":"COLING 2025"},{"id":"http://arxiv.org/abs/2412.04351v1","updated":"2024-12-05T17:10:19Z","published":"2024-12-05T17:10:19Z","title":"BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages","summary":" This paper focuses on developing translation models and related applications\nfor 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj,\nBodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada,\nKangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili,\nMalayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi,\nSanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu,\nTelugu, and Urdu. Achieving this requires parallel and other types of corpora\nfor all 36 * 36 language pairs, addressing challenges like script variations,\nphonetic differences, and syntactic diversity. 
For instance, languages like\nKashmiri and Sindhi, which use multiple scripts, demand script normalization\nfor alignment, while low-resource languages such as Khasi and Santali require\nsynthetic data augmentation to ensure sufficient coverage and quality.\n To address these challenges, this work proposes strategies for corpus\ncreation by leveraging existing resources, developing parallel datasets,\ngenerating domain-specific corpora, and utilizing synthetic data techniques.\nAdditionally, it evaluates machine translation across various dimensions,\nincluding standard and discourse-level translation, domain-specific\ntranslation, reference-based and reference-free evaluation, error analysis, and\nautomatic post-editing. By integrating these elements, the study establishes a\ncomprehensive framework to improve machine translation quality and enable\nbetter cross-lingual communication in India's linguistically diverse ecosystem.\n","authors":["Vandan Mujadia","Dipti Misra Sharma"],"pdf_url":"https://arxiv.org/pdf/2412.04351v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04342v1","updated":"2024-12-05T17:00:32Z","published":"2024-12-05T17:00:32Z","title":"Retrieval-Augmented Machine Translation with Unstructured Knowledge","summary":" Retrieval-augmented generation (RAG) introduces additional information to\nenhance large language models (LLMs). In machine translation (MT), previous\nwork typically retrieves in-context examples from paired MT corpora, or\ndomain-specific knowledge from knowledge graphs, to enhance models' MT ability.\nHowever, a large amount of world knowledge is organized in unstructured\ndocuments, and might not be fully paired across different languages. In this\npaper, we study retrieval-augmented MT using unstructured documents.\nSpecifically, we build RAGtrans, the first benchmark to train and evaluate\nLLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples\ncollected via GPT-4o and human translators. Besides, documents from different\nlanguages are also provided to supply the knowledge to these samples. Based on\nRAGtrans, we further propose a multi-task training method to teach LLMs how to\nuse information from multilingual documents during their translation. The\nmethod uses existing multilingual corpora to create auxiliary training\nobjectives without additional labeling requirements. Extensive experiments show\nthat the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores.\n","authors":["Jiaan Wang","Fandong Meng","Yingxue Zhang","Jie Zhou"],"pdf_url":"https://arxiv.org/pdf/2412.04342v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04318v1","updated":"2024-12-05T16:34:20Z","published":"2024-12-05T16:34:20Z","title":"The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for\n Open-Ended Text Generation","summary":" This paper introduces the counter-intuitive generalization results of\noverfitting pre-trained large language models (LLMs) on very small datasets. In\nthe setting of open-ended text generation, it is well-documented that LLMs tend\nto generate repetitive and dull sequences, a phenomenon that is especially\napparent when generating using greedy decoding. This issue persists even with\nstate-of-the-art LLMs containing billions of parameters, trained via next-token\nprediction on large datasets. 
We find that by further fine-tuning these models\nto achieve a near-zero training loss on a small set of samples -- a process we\nrefer to as hyperfitting -- the long-sequence generative capabilities are\ngreatly enhanced. Greedy decoding with these Hyperfitted models even outperform\nTop-P sampling over long-sequences, both in terms of diversity and human\npreferences. This phenomenon extends to LLMs of various sizes, different\ndomains, and even autoregressive image generation. We further find this\nphenomena to be distinctly different from that of Grokking and double descent.\nSurprisingly, our experiments indicate that hyperfitted models rarely fall into\nrepeating sequences they were trained on, and even explicitly blocking these\nsequences results in high-quality output. All hyperfitted models produce\nextremely low-entropy predictions, often allocating nearly all probability to a\nsingle token.\n","authors":["Fredrik Carlsson","Fangyu Liu","Daniel Ward","Murathan Kurfali","Joakim Nivre"],"pdf_url":"https://arxiv.org/pdf/2412.04318v1.pdf","comment":"Under review at ICLR"},{"id":"http://arxiv.org/abs/2412.04305v1","updated":"2024-12-05T16:26:31Z","published":"2024-12-05T16:26:31Z","title":"ALMA: Alignment with Minimal Annotation","summary":" Recent approaches to large language model (LLM) alignment typically require\nmillions of human annotations or rely on external aligned models for synthetic\ndata generation. This paper introduces ALMA: Alignment with Minimal Annotation,\ndemonstrating that effective alignment can be achieved using only 9,000 labeled\nexamples -- less than 1% of conventional approaches. ALMA generates large\namounts of high-quality synthetic alignment data through new techniques:\ndiverse prompt synthesis via few-shot learning, diverse response generation\nwith multiple model checkpoints, and judge (reward model) enhancement through\nscore aggregation and self-distillation. Using only a pretrained Llama3 base\nmodel, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves\nperformance close to Llama3-Instruct across diverse alignment benchmarks (e.g.,\n0.1% difference on AlpacaEval 2.0 score). These results are achieved with a\nmulti-round, self-bootstrapped data synthesis and training recipe that\ncontinues to improve for 10 rounds, surpassing the typical 3-round ceiling of\nprevious methods. These results suggest that base models already possess\nsufficient knowledge for effective alignment, and that synthetic data\ngeneration methods can expose it.\n","authors":["Michihiro Yasunaga","Leonid Shamis","Chunting Zhou","Andrew Cohen","Jason Weston","Luke Zettlemoyer","Marjan Ghazvininejad"],"pdf_url":"https://arxiv.org/pdf/2412.04305v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.15796v5","updated":"2024-12-05T16:13:09Z","published":"2024-06-22T09:40:07Z","title":"Unveiling Entity-Level Unlearning for Large Language Models: A\n Comprehensive Analysis","summary":" Large language model unlearning has garnered increasing attention due to its\npotential to address security and privacy concerns, leading to extensive\nresearch in the field. However, much of this research has concentrated on\ninstance-level unlearning, specifically targeting the removal of predefined\ninstances containing sensitive content. This focus has left a significant gap\nin the exploration of full entity-level unlearning, which is critical in\nreal-world scenarios such as copyright protection. 
To this end, we propose a\nnovel task of Entity-level unlearning, which aims to erase entity-related\nknowledge from the target model completely. To thoroughly investigate this\ntask, we systematically evaluate trending unlearning algorithms, revealing that\ncurrent methods struggle to achieve effective entity-level unlearning. Then, we\nfurther explore the factors that influence the performance of the unlearning\nalgorithms, identifying that knowledge coverage and the size of the forget set\nplay pivotal roles. Notably, our analysis also uncovers that entities\nintroduced through fine-tuning are more vulnerable to unlearning than\npre-trained entities. These findings collectively offer valuable insights for\nadvancing entity-level unlearning for LLMs.\n","authors":["Weitao Ma","Xiaocheng Feng","Weihong Zhong","Lei Huang","Yangfan Ye","Xiachong Feng","Bing Qin"],"pdf_url":"https://arxiv.org/pdf/2406.15796v5.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2412.04291v1","updated":"2024-12-05T16:12:06Z","published":"2024-12-05T16:12:06Z","title":"Evolutionary Pre-Prompt Optimization for Mathematical Reasoning","summary":" Recent advancements have highlighted that large language models (LLMs), when\ngiven a small set of task-specific examples, demonstrate remarkable\nproficiency, a capability that extends to complex reasoning tasks. In\nparticular, the combination of few-shot learning with the chain-of-thought\n(CoT) approach has been pivotal in steering models towards more logically\nconsistent conclusions. This paper explores the optimization of example\nselection for designing effective CoT pre-prompts and shows that the choice of\nthe optimization algorithm, typically in favor of comparison-based methods such\nas evolutionary computation, significantly enhances efficacy and feasibility.\nSpecifically, thanks to a limited exploitative and overfitted optimization,\nEvolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the\nnaive few-shot approach exceeding 10 absolute points in exact match scores on\nbenchmark datasets such as GSM8k and MathQA. These gains are consistent across\nvarious contexts and are further amplified when integrated with\nself-consistency (SC)\n","authors":["Mathurin Videau","Alessandro Leite","Marc Schoenauer","Olivier Teytaud"],"pdf_url":"https://arxiv.org/pdf/2412.04291v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04277v1","updated":"2024-12-05T15:59:29Z","published":"2024-12-05T15:59:29Z","title":"Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic","summary":" Large Language Models (LLMs) have shown impressive results in multiple\ndomains of natural language processing (NLP) but are mainly focused on the\nEnglish language. Recently, more LLMs have incorporated a larger proportion of\nmultilingual text to represent low-resource languages. In Arabic NLP, several\nArabic-centric LLMs have shown remarkable results on multiple benchmarks in the\npast two years. However, most Arabic LLMs have more than 7 billion parameters,\nwhich increases their hardware requirements and inference latency, when\ncompared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base\nand chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable\nLM 1.6B chat model achieves impressive results on several benchmarks beating\nmultiple models with up to 8x the parameters. 
In addition, we show the benefit\nof mixing in synthetic instruction tuning data by augmenting our fine-tuning\ndata with a large synthetic dialogue dataset.\n","authors":["Zaid Alyafeai","Michael Pieler","Hannah Teufel","Jonathan Tow","Marco Bellagente","Duy Phung","Nikhil Pinnaparaju","Reshinth Adithyan","Paulo Rocha","Maksym Zhuravinskyi","Carlos Riquelme"],"pdf_url":"https://arxiv.org/pdf/2412.04277v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04266v1","updated":"2024-12-05T15:50:44Z","published":"2024-12-05T15:50:44Z","title":"Representation Purification for End-to-End Speech Translation","summary":" Speech-to-text translation (ST) is a cross-modal task that involves\nconverting spoken language into text in a different language. Previous research\nprimarily focused on enhancing speech translation by facilitating knowledge\ntransfer from machine translation, exploring various methods to bridge the gap\nbetween speech and text modalities. Despite substantial progress made, factors\nin speech that are not relevant to translation content, such as timbre and\nrhythm, often limit the efficiency of knowledge transfer. In this paper, we\nconceptualize speech representation as a combination of content-agnostic and\ncontent-relevant factors. We examine the impact of content-agnostic factors on\ntranslation performance through preliminary experiments and observe a\nsignificant performance deterioration when content-agnostic perturbations are\nintroduced to speech signals. To address this issue, we propose a\n\\textbf{S}peech \\textbf{R}epresentation \\textbf{P}urification with\n\\textbf{S}upervision \\textbf{E}nhancement (SRPSE) framework, which excludes the\ncontent-agnostic components within speech representations to mitigate their\nnegative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate\nthat SRPSE significantly improves translation performance across all\ntranslation directions in three settings and achieves preeminent performance\nunder a \\textit{transcript-free} setting.\n","authors":["Chengwei Zhang","Yue Zhou","Rui Zhao","Yidong Chen","Xiaodong Shi"],"pdf_url":"https://arxiv.org/pdf/2412.04266v1.pdf","comment":"Accepted by COLING 2025"},{"id":"http://arxiv.org/abs/2405.20331v2","updated":"2024-12-05T15:48:24Z","published":"2024-05-30T17:59:04Z","title":"CoSy: Evaluating Textual Explanations of Neurons","summary":" A crucial aspect of understanding the complex nature of Deep Neural Networks\n(DNNs) is the ability to explain learned concepts within their latent\nrepresentations. While methods exist to connect neurons to human-understandable\ntextual descriptions, evaluating the quality of these explanations is\nchallenging due to the lack of a unified quantitative approach. We introduce\nCoSy (Concept Synthesis), a novel, architecture-agnostic framework for\nevaluating textual explanations of latent neurons. Given textual explanations,\nour proposed framework uses a generative model conditioned on textual input to\ncreate data points representing the explanations. By comparing the neuron's\nresponse to these generated data points and control data points, we can\nestimate the quality of the explanation. We validate our framework through\nsanity checks and benchmark various neuron description methods for Computer\nVision tasks, revealing significant differences in quality.\n","authors":["Laura Kopf","Philine Lou Bommer","Anna Hedström","Sebastian Lapuschkin","Marina M. -C. 
Höhne","Kirill Bykov"],"pdf_url":"https://arxiv.org/pdf/2405.20331v2.pdf","comment":"10 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.04261v1","updated":"2024-12-05T15:41:06Z","published":"2024-12-05T15:41:06Z","title":"Aya Expanse: Combining Research Breakthroughs for a New Multilingual\n Frontier","summary":" We introduce the Aya Expanse model family, a new generation of 8B and 32B\nparameter multilingual language models, aiming to address the critical\nchallenge of developing highly performant multilingual models that match or\nsurpass the capabilities of monolingual models. By leveraging several years of\nresearch at Cohere For AI and Cohere, including advancements in data arbitrage,\nmultilingual preference training, and model merging, Aya Expanse sets a new\nstate-of-the-art in multilingual performance. Our evaluations on the\nArena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya\nExpanse 8B and 32B outperform leading open-weight models in their respective\nparameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to\na 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model\nwith twice as many parameters, achieving a 54.0% win-rate. In this short\ntechnical report, we present extended evaluation results for the Aya Expanse\nmodel family and release their open-weights, together with a new multilingual\nevaluation dataset m-ArenaHard.\n","authors":["John Dang","Shivalika Singh","Daniel D'souza","Arash Ahmadian","Alejandro Salamanca","Madeline Smith","Aidan Peppin","Sungjin Hong","Manoj Govindassamy","Terrence Zhao","Sandra Kublik","Meor Amer","Viraat Aryabumi","Jon Ander Campos","Yi-Chern Tan","Tom Kocmi","Florian Strub","Nathan Grinsztajn","Yannis Flet-Berliac","Acyr Locatelli","Hangyu Lin","Dwarak Talupuru","Bharat Venkitesh","David Cairuz","Bowen Yang","Tim Chung","Wei-Yin Ko","Sylvie Shang Shi","Amir Shukayev","Sammie Bae","Aleksandra Piktus","Roman Castagné","Felipe Cruz-Salinas","Eddie Kim","Lucas Crawhall-Stein","Adrien Morisot","Sudip Roy","Phil Blunsom","Ivan Zhang","Aidan Gomez","Nick Frosst","Marzieh Fadaee","Beyza Ermis","Ahmet Üstün","Sara Hooker"],"pdf_url":"https://arxiv.org/pdf/2412.04261v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.16778v2","updated":"2024-12-05T15:38:11Z","published":"2024-06-24T16:40:54Z","title":"Finding Transformer Circuits with Edge Pruning","summary":" The path to interpreting a language model often proceeds via analysis of\ncircuits -- sparse computational subgraphs of the model that capture specific\naspects of its behavior. Recent work has automated the task of discovering\ncircuits. Yet, these methods have practical limitations, as they rely either on\ninefficient search algorithms or inaccurate approximations. In this paper, we\nframe automated circuit discovery as an optimization problem and propose *Edge\nPruning* as an effective and scalable solution. Edge Pruning leverages\ngradient-based pruning techniques, but instead of removing neurons or\ncomponents, it prunes the \\emph{edges} between components. Our method finds\ncircuits in GPT-2 that use less than half the number of edges compared to\ncircuits found by previous methods while being equally faithful to the full\nmodel predictions on standard circuit-finding tasks. Edge Pruning is efficient\neven with as many as 100K examples, outperforming previous methods in speed and\nproducing substantially better circuits. It also perfectly recovers the\nground-truth circuits in two models compiled with Tracr. 
Thanks to its\nefficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale\nthat prior methods operate on. We use this setting for a case study comparing\nthe mechanisms behind instruction prompting and in-context learning. We find\ntwo circuits with more than 99.96% sparsity that match the performance of the\nfull model and reveal that the mechanisms in the two settings overlap\nsubstantially. Our case study shows that Edge Pruning is a practical and\nscalable tool for interpretability and sheds light on behaviors that only\nemerge in large models.\n","authors":["Adithya Bhaskar","Alexander Wettig","Dan Friedman","Danqi Chen"],"pdf_url":"https://arxiv.org/pdf/2406.16778v2.pdf","comment":"NeurIPS 2024 (Spotlight)"},{"id":"http://arxiv.org/abs/2412.04254v1","updated":"2024-12-05T15:34:02Z","published":"2024-12-05T15:34:02Z","title":"CLINICSUM: Utilizing Language Models for Generating Clinical Summaries\n from Patient-Doctor Conversations","summary":" This paper presents ClinicSum, a novel framework designed to automatically\ngenerate clinical summaries from patient-doctor conversations. It utilizes a\ntwo-module architecture: a retrieval-based filtering module that extracts\nSubjective, Objective, Assessment, and Plan (SOAP) information from\nconversation transcripts, and an inference module powered by fine-tuned\nPre-trained Language Models (PLMs), which leverage the extracted SOAP data to\ngenerate abstracted clinical summaries. To fine-tune the PLM, we created a\ntraining dataset consisting of 1,473 conversation-summary pairs by\nconsolidating two publicly available datasets, FigShare and MTS-Dialog, with\nground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's\neffectiveness is evaluated through both automatic metrics (e.g., ROUGE,\nBERTScore) and expert human assessments. Results show that ClinicSum\noutperforms state-of-the-art PLMs, demonstrating superior precision, recall,\nand F-1 scores in automatic evaluations and receiving high preference from SMEs\nin human assessment, making it a robust solution for automated clinical\nsummarization.\n","authors":["Subash Neupane","Himanshu Tripathi","Shaswata Mitra","Sean Bozorgzad","Sudip Mittal","Shahram Rahimi","Amin Amirlatifi"],"pdf_url":"https://arxiv.org/pdf/2412.04254v1.pdf","comment":"accepted at the 2024 IEEE International Conference on Big Data\n workshop Workshop on Big Data and AI for Healthcare"},{"id":"http://arxiv.org/abs/2410.14817v2","updated":"2024-12-05T15:20:28Z","published":"2024-10-18T18:37:27Z","title":"A Complexity-Based Theory of Compositionality","summary":" Compositionality is believed to be fundamental to intelligence. In humans, it\nunderlies the structure of thought, language, and higher-level reasoning. In\nAI, compositional representations can enable a powerful form of\nout-of-distribution generalization, in which a model systematically adapts to\nnovel combinations of known concepts. However, while we have strong intuitions\nabout what compositionality is, there currently exists no formal definition for\nit that is measurable and mathematical. Here, we propose such a definition,\nwhich we call representational compositionality, that accounts for and extends\nour intuitions about compositionality. The definition is conceptually simple,\nquantitative, grounded in algorithmic information theory, and applicable to any\nrepresentation. Intuitively, representational compositionality states that a\ncompositional representation satisfies three properties. 
First, it must be\nexpressive. Second, it must be possible to re-describe the representation as a\nfunction of discrete symbolic sequences with re-combinable parts, analogous to\nsentences in natural language. Third, the function that relates these symbolic\nsequences to the representation, analogous to semantics in natural language,\nmust be simple. Through experiments on both synthetic and real world data, we\nvalidate our definition of compositionality and show how it unifies disparate\nintuitions from across the literature in both AI and cognitive science. We also\nshow that representational compositionality, while theoretically intractable,\ncan be readily estimated using standard deep learning tools. Our definition has\nthe potential to inspire the design of novel, theoretically-driven models that\nbetter capture the mechanisms of compositional thought.\n","authors":["Eric Elmoznino","Thomas Jiralerspong","Yoshua Bengio","Guillaume Lajoie"],"pdf_url":"https://arxiv.org/pdf/2410.14817v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04236v1","updated":"2024-12-05T15:14:16Z","published":"2024-12-05T15:14:16Z","title":"A History of Philosophy in Colombia through Topic Modelling","summary":" Data-driven approaches to philosophy have emerged as a valuable tool for\nstudying the history of the discipline. However, most studies in this area have\nfocused on a limited number of journals from specific regions and subfields. We\nexpand the scope of this research by applying dynamic topic modelling\ntechniques to explore the history of philosophy in Colombia and Latin America.\nOur study examines the Colombian philosophy journal Ideas y Valores, founded in\n1951 and currently one of the most influential academic philosophy journals in\nthe region. By analyzing the evolution of topics across the journal's history,\nwe identify various trends and specific dynamics in philosophical discourse\nwithin the Colombian and Latin American context. Our findings reveal that the\nmost prominent topics are value theory (including ethics, political philosophy,\nand aesthetics), epistemology, and the philosophy of science. We also trace the\nevolution of articles focusing on the historical and interpretive aspects of\nphilosophical texts, and we note a notable emphasis on German philosophers such\nas Kant, Husserl, and Hegel on various topics throughout the journal's\nlifetime. Additionally, we investigate whether articles with a historical focus\nhave decreased over time due to editorial pressures. Our analysis suggests no\nsignificant decline in such articles. Finally, we propose ideas for extending\nthis research to other Latin American journals and suggest improvements for\nnatural language processing workflows in non-English languages.\n","authors":["Juan R. Loaiza","Miguel González-Duque"],"pdf_url":"https://arxiv.org/pdf/2412.04236v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04235v1","updated":"2024-12-05T15:11:12Z","published":"2024-12-05T15:11:12Z","title":"Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM\n Chatbots","summary":" I combine detection and mitigation techniques to addresses hallucinations in\nLarge Language Models (LLMs). Mitigation is achieved in a question-answering\nRetrieval-Augmented Generation (RAG) framework while detection is obtained by\nintroducing the Negative Missing Information Scoring System (NMISS), which\naccounts for contextual relevance in responses. 
While RAG mitigates\nhallucinations by grounding answers in external data, NMISS refines the\nevaluation by identifying cases where traditional metrics incorrectly flag\ncontextually accurate responses as hallucinations. I use Italian health news\narticles as context to evaluate LLM performance. Results show that Gemma2 and\nGPT-4 outperform the other models, with GPT-4 producing answers closely aligned\nwith reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral\nbenefit significantly from NMISS, highlighting their ability to provide richer\ncontextual information. This combined approach offers new insights into the\nreduction and more accurate assessment of hallucinations in LLMs, with\napplications in real-world healthcare tasks and other domains.\n","authors":["Maria Paola Priola"],"pdf_url":"https://arxiv.org/pdf/2412.04235v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.05357v2","updated":"2024-12-05T15:08:56Z","published":"2024-10-07T15:55:55Z","title":"Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild","summary":" As Large Language Models (LLMs) excel across tasks and specialized domains,\nscaling LLMs based on existing models has garnered significant attention, which\nfaces the challenge of decreasing performance when combining disparate models.\nVarious techniques have been proposed for the aggregation of pre-trained LLMs,\nincluding model merging, Mixture-of-Experts, and stacking. Despite their\nmerits, a comprehensive comparison and synergistic application of them to a\ndiverse model zoo is yet to be adequately addressed. In light of this research\ngap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,\nour work starts with a benchmarking of existing LLM scaling techniques,\nespecially selective merging, and variants of mixture. Utilizing the insights\nfrom the benchmark results, we formulate an optimal strategy for the selection\nand aggregation of a heterogeneous model zoo characterizing different\narchitectures and initialization.Our methodology involves the clustering of\nmergeable models and optimal merging strategy selection, and the integration of\nclusters through a model mixture. Finally, evidenced by our experiments on a\ndiverse Llama-2-based model zoo, Model-GLUE shows an average performance\nenhancement of 5.61%, achieved without additional training. Codes are available\nat: https://github.com/Model-GLUE/Model-GLUE.\n","authors":["Xinyu Zhao","Guoheng Sun","Ruisi Cai","Yukun Zhou","Pingzhi Li","Peihao Wang","Bowen Tan","Yexiao He","Li Chen","Yi Liang","Beidi Chen","Binhang Yuan","Hongyi Wang","Ang Li","Zhangyang Wang","Tianlong Chen"],"pdf_url":"https://arxiv.org/pdf/2410.05357v2.pdf","comment":"24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks\n Track"},{"id":"http://arxiv.org/abs/2410.03960v2","updated":"2024-12-05T14:56:56Z","published":"2024-10-04T22:45:26Z","title":"SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving\n Model Transformation","summary":" LLM inference for popular enterprise use cases, such as summarization, RAG,\nand code-generation, typically observes orders of magnitude longer prompt\nlengths than generation lengths. This characteristic leads to high cost of\nprefill and increased response latency. In this paper, we present SwiftKV, a\nnovel model transformation and distillation procedure specifically designed to\nreduce the time and cost of processing prompt tokens while preserving high\nquality of generated tokens. 
SwiftKV combines three key mechanisms: i)\nSingleInputKV, which prefills later layers' KV cache using a much earlier\nlayer's output, allowing prompt tokens to skip much of the model computation,\nii) AcrossKV, which merges the KV caches of neighboring layers to reduce the\nmemory footprint and support larger batch size for higher throughput, and iii)\na knowledge-preserving distillation procedure that can adapt existing LLMs for\nSwiftKV with minimal accuracy impact and low compute and data requirement. For\nLlama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50%\nand the memory requirement of the KV cache by 62.5% while incurring minimum\nquality degradation across a wide range of tasks. In the end-to-end inference\nserving using an optimized vLLM implementation, SwiftKV realizes up to 2x\nhigher aggregate throughput and 60% lower time per output token. It can achieve\na staggering 560 TFlops/GPU of normalized inference throughput, which\ntranslates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100\nGPUs. Our training, inference, and model implementations are open-sourced and\ncan be found through\nhttps://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb.\n","authors":["Aurick Qiao","Zhewei Yao","Samyam Rajbhandari","Yuxiong He"],"pdf_url":"https://arxiv.org/pdf/2410.03960v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02830v2","updated":"2024-12-05T14:51:35Z","published":"2024-12-03T20:52:35Z","title":"RARE: Retrieval-Augmented Reasoning Enhancement for Large Language\n Models","summary":" This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a\nversatile extension to the mutual reasoning framework (rStar), aimed at\nenhancing reasoning accuracy and factual integrity across large language models\n(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical\nreasoning. RARE incorporates two innovative actions within the Monte Carlo Tree\nSearch (MCTS) framework: A6, which generates search queries based on the\ninitial problem statement, performs information retrieval using those queries,\nand augments reasoning with the retrieved data to formulate the final answer;\nand A7, which leverages information retrieval specifically for generated\nsub-questions and re-answers these sub-questions with the relevant contextual\ninformation. Additionally, a Retrieval-Augmented Factuality Scorer is proposed\nto replace the original discriminator, prioritizing reasoning paths that meet\nhigh standards of factuality. Experimental results with LLaMA 3.1 show that\nRARE enables open-source LLMs to achieve competitive performance with top\nopen-source models like GPT-4 and GPT-4o. This research establishes RARE as a\nscalable solution for improving LLMs in domains where logical coherence and\nfactual integrity are critical.\n","authors":["Hieu Tran","Zonghai Yao","Junda Wang","Yifan Zhang","Zhichao Yang","Hong Yu"],"pdf_url":"https://arxiv.org/pdf/2412.02830v2.pdf","comment":"24 pages"},{"id":"http://arxiv.org/abs/2312.00326v4","updated":"2024-12-05T14:45:05Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. 
While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of\nsimple OM tools. Our framework is implemented in a proof-of-concept system.\nEvaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks\nover state-of-the-art OM systems show that our system can achieve results very\nclose to the long-standing best performance on simple OM tasks and can\nsignificantly improve the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v4.pdf","comment":"14 pages, 13 figures, 4 tables"},{"id":"http://arxiv.org/abs/2412.04205v1","updated":"2024-12-05T14:41:05Z","published":"2024-12-05T14:41:05Z","title":"A Context-aware Framework for Translation-mediated Conversations","summary":" Effective communication is fundamental to any interaction, yet challenges\narise when participants do not share a common language. Automatic translation\nsystems offer a powerful solution to bridge language barriers in such\nscenarios, but they introduce errors that can lead to misunderstandings and\nconversation breakdown. A key issue is that current systems fail to incorporate\nthe rich contextual information necessary to resolve ambiguities and omitted\ndetails, resulting in literal, inappropriate, or misaligned translations. In\nthis work, we present a framework to improve large language model-based\ntranslation systems by incorporating contextual information in bilingual\nconversational settings. During training, we leverage context-augmented\nparallel data, which allows the model to generate translations sensitive to\nconversational history. During inference, we perform quality-aware decoding\nwith context-aware metrics to select the optimal translation from a pool of\ncandidates. We validate both components of our framework on two task-oriented\ndomains: customer chat and user-assistant interaction. Across both settings,\nour framework consistently results in better translations than state-of-the-art\nsystems like GPT-4o and TowerInstruct, as measured by multiple automatic\ntranslation quality metrics on several language pairs. We also show that the\nresulting model leverages context in an intended and interpretable way,\nimproving consistency between the conveyed message and the generated\ntranslations.\n","authors":["José Pombal","Sweta Agrawal","Patrick Fernandes","Emmanouil Zaranis","André F. T. Martins"],"pdf_url":"https://arxiv.org/pdf/2412.04205v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04193v1","updated":"2024-12-05T14:33:00Z","published":"2024-12-05T14:33:00Z","title":"AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in\n Dialectal Arabic","summary":" Dialectal Arabic (DA) varieties are under-served by language technologies,\nparticularly large language models (LLMs). This trend threatens to exacerbate\nexisting social inequalities and limits language modeling applications, yet the\nresearch community lacks operationalized LLM performance measurements in DA. 
We\npresent a method that comprehensively evaluates LLM fidelity, understanding,\nquality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA\nvarieties across these four dimensions and provide best practice\nrecommendations. Our evaluation suggests that LLMs do not produce DA as well as\nthey understand it, but does not suggest deterioration in quality when they do.\nFurther analysis suggests that current post-training can degrade DA\ncapabilities, that few-shot examples can overcome this and other LLM\ndeficiencies, and that otherwise no measurable features of input text correlate\nwell with LLM DA performance.\n","authors":["Nathaniel R. Robinson","Shahd Abdelmoneim","Kelly Marchisio","Sebastian Ruder"],"pdf_url":"https://arxiv.org/pdf/2412.04193v1.pdf","comment":"Pre-print"},{"id":"http://arxiv.org/abs/2409.17146v2","updated":"2024-12-05T14:28:40Z","published":"2024-09-25T17:59:51Z","title":"Molmo and PixMo: Open Weights and Open Data for State-of-the-Art\n Vision-Language Models","summary":" Today's most advanced vision-language models (VLMs) remain proprietary. The\nstrongest open-weight models rely heavily on synthetic data from proprietary\nVLMs to achieve good performance, effectively distilling these closed VLMs into\nopen ones. As a result, the community has been missing foundational knowledge\nabout how to build performant VLMs from scratch. We present Molmo, a new family\nof VLMs that are state-of-the-art in their class of openness. Our key\ncontribution is a collection of new datasets called PixMo, including a dataset\nof highly detailed image captions for pre-training, a free-form image Q&A\ndataset for fine-tuning, and an innovative 2D pointing dataset, all collected\nwithout the use of external VLMs. The success of our approach relies on careful\nmodeling choices, a well-tuned training pipeline, and, most critically, the\nquality of our newly collected datasets. Our best-in-class 72B model not only\noutperforms others in the class of open weight and data models, but also\noutperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini\n1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and\non a large human evaluation. Our model weights, new datasets, and source code\nare available at https://molmo.allenai.org/blog.\n","authors":["Matt Deitke","Christopher Clark","Sangho Lee","Rohun Tripathi","Yue Yang","Jae Sung Park","Mohammadreza Salehi","Niklas Muennighoff","Kyle Lo","Luca Soldaini","Jiasen Lu","Taira Anderson","Erin Bransom","Kiana Ehsani","Huong Ngo","YenSung Chen","Ajay Patel","Mark Yatskar","Chris Callison-Burch","Andrew Head","Rose Hendrix","Favyen Bastani","Eli VanderBilt","Nathan Lambert","Yvonne Chou","Arnavi Chheda","Jenna Sparks","Sam Skjonsberg","Michael Schmitz","Aaron Sarnat","Byron Bischoff","Pete Walsh","Chris Newell","Piper Wolters","Tanmay Gupta","Kuo-Hao Zeng","Jon Borchardt","Dirk Groeneveld","Crystal Nam","Sophie Lebrecht","Caitlin Wittlif","Carissa Schoenick","Oscar Michel","Ranjay Krishna","Luca Weihs","Noah A. 
Smith","Hannaneh Hajishirzi","Ross Girshick","Ali Farhadi","Aniruddha Kembhavi"],"pdf_url":"https://arxiv.org/pdf/2409.17146v2.pdf","comment":"Updated with ablations and more technical details"},{"id":"http://arxiv.org/abs/2411.16105v2","updated":"2024-12-05T14:16:57Z","published":"2024-11-25T05:32:34Z","title":"Adaptive Circuit Behavior and Generalization in Mechanistic\n Interpretability","summary":" Mechanistic interpretability aims to understand the inner workings of large\nneural networks by identifying circuits, or minimal subgraphs within the model\nthat implement algorithms responsible for performing specific tasks. These\ncircuits are typically discovered and analyzed using a narrowly defined prompt\nformat. However, given the abilities of large language models (LLMs) to\ngeneralize across various prompt formats for the same task, it remains unclear\nhow well these circuits generalize. For instance, it is unclear whether the\nmodels generalization results from reusing the same circuit components, the\ncomponents behaving differently, or the use of entirely different components.\nIn this paper, we investigate the generality of the indirect object\nidentification (IOI) circuit in GPT-2 small, which is well-studied and believed\nto implement a simple, interpretable algorithm. We evaluate its performance on\nprompt variants that challenge the assumptions of this algorithm. Our findings\nreveal that the circuit generalizes surprisingly well, reusing all of its\ncomponents and mechanisms while only adding additional input edges. Notably,\nthe circuit generalizes even to prompt variants where the original algorithm\nshould fail; we discover a mechanism that explains this which we term S2\nHacking. Our findings indicate that circuits within LLMs may be more flexible\nand general than previously recognized, underscoring the importance of studying\ncircuit generalization to better understand the broader capabilities of these\nmodels.\n","authors":["Jatin Nainani","Sankaran Vaidyanathan","AJ Yeung","Kartik Gupta","David Jensen"],"pdf_url":"https://arxiv.org/pdf/2411.16105v2.pdf","comment":"10 pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.04144v1","updated":"2024-12-05T13:12:51Z","published":"2024-12-05T13:12:51Z","title":"If You Can't Use Them, Recycle Them: Optimizing Merging at Scale\n Mitigates Performance Tradeoffs","summary":" Model merging has shown great promise at combining expert models, but the\nbenefit of merging is unclear when merging ``generalist'' models trained on\nmany tasks. We explore merging in the context of large ($\\sim100$B) models, by\n\\textit{recycling} checkpoints that exhibit tradeoffs among different tasks.\nSuch checkpoints are often created in the process of developing a frontier\nmodel, and many suboptimal ones are usually discarded. Given a pool of model\ncheckpoints obtained from different training runs (e.g., different stages,\nobjectives, hyperparameters, and data mixtures), which naturally show tradeoffs\nacross different language capabilities (e.g., instruction following vs. code\ngeneration), we investigate whether merging can recycle such suboptimal models\ninto a Pareto-optimal one. Our optimization algorithm tunes the weight of each\ncheckpoint in a linear combination, resulting in a Pareto-optimal models that\noutperforms both individual models and merge-based baselines. 
Further analysis\nshows that good merges tend to include almost all checkpoints with\nnon-zero weights, indicating that even seemingly bad initial checkpoints can\ncontribute to good final merges.\n","authors":["Muhammad Khalifa","Yi-Chern Tan","Arash Ahmadian","Tom Hosking","Honglak Lee","Lu Wang","Ahmet Üstün","Tom Sherborne","Matthias Gallé"],"pdf_url":"https://arxiv.org/pdf/2412.04144v1.pdf","comment":"13 pages, 9 figures"},{"id":"http://arxiv.org/abs/2412.04141v1","updated":"2024-12-05T13:10:54Z","published":"2024-12-05T13:10:54Z","title":"Reducing Tool Hallucination via Reliability Alignment","summary":" Large Language Models (LLMs) have extended their capabilities beyond language\ngeneration to interact with external systems through tool calling, offering\npowerful potential for real-world applications. However, the phenomenon of tool\nhallucinations, which occur when models improperly select or misuse tools,\npresents critical challenges that can lead to flawed task execution and\nincreased operational costs. This paper investigates the concept of reliable\ntool calling and highlights the necessity of addressing tool hallucinations. We\nsystematically categorize tool hallucinations into two main types: tool\nselection hallucination and tool usage hallucination. To mitigate these issues,\nwe propose a reliability-focused alignment framework that enhances the model's\nability to accurately assess tool relevance and usage. By proposing a suite of\nevaluation metrics and evaluating on StableToolBench, we further demonstrate\nthe effectiveness of our framework in mitigating tool hallucination and\nimproving the overall system reliability of LLM tool calling.\n","authors":["Hongshen Xu","Su Zhu","Zihan Wang","Hang Zheng","Da Ma","Ruisheng Cao","Shuai Fan","Lu Chen","Kai Yu"],"pdf_url":"https://arxiv.org/pdf/2412.04141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04137v1","updated":"2024-12-05T13:04:10Z","published":"2024-12-05T13:04:10Z","title":"Text Change Detection in Multilingual Documents Using Image Comparison","summary":" Document comparison typically relies on optical character recognition (OCR)\nas its core technology. However, OCR requires the selection of appropriate\nlanguage models for each document and the performance of multilingual or hybrid\nmodels remains limited. To overcome these challenges, we propose text change\ndetection (TCD) using an image comparison model tailored for multilingual\ndocuments. Unlike OCR-based approaches, our method employs word-level text\nimage-to-image comparison to detect changes. Our model generates bidirectional\nchange segmentation maps between the source and target documents. To enhance\nperformance without requiring explicit text alignment or scaling preprocessing,\nwe employ correlations among multi-scale attention features. We also construct\na benchmark dataset comprising actual printed and scanned word pairs in various\nlanguages to evaluate our model. We validate our approach using our benchmark\ndataset and public benchmarks Distorted Document Images and the LRDE Document\nBinarization Dataset.
We compare our model against state-of-the-art semantic\nsegmentation and change detection models, as well as to conventional OCR-based\nmodels.\n","authors":["Doyoung Park","Naresh Reddy Yarram","Sunjin Kim","Minkyu Kim","Seongho Cho","Taehee Lee"],"pdf_url":"https://arxiv.org/pdf/2412.04137v1.pdf","comment":"15pages, 11figures 6tables, wacv2025 accepted"},{"id":"http://arxiv.org/abs/2411.03906v2","updated":"2024-12-05T12:56:40Z","published":"2024-11-06T13:37:28Z","title":"Lexicalization Is All You Need: Examining the Impact of Lexical\n Knowledge in a Compositional QALD System","summary":" In this paper, we examine the impact of lexicalization on Question Answering\nover Linked Data (QALD). It is well known that one of the key challenges in\ninterpreting natural language questions with respect to SPARQL lies in bridging\nthe lexical gap, that is mapping the words in the query to the correct\nvocabulary elements. We argue in this paper that lexicalization, that is\nexplicit knowledge about the potential interpretations of a word with respect\nto the given vocabulary, significantly eases the task and increases the\nperformance of QA systems. Towards this goal, we present a compositional QA\nsystem that can leverage explicit lexical knowledge in a compositional manner\nto infer the meaning of a question in terms of a SPARQL query. We show that\nsuch a system, given lexical knowledge, has a performance well beyond current\nQA systems, achieving up to a $35.8\\%$ increase in the micro $F_1$ score\ncompared to the best QA system on QALD-9. This shows the importance and\npotential of including explicit lexical knowledge. In contrast, we show that\nLLMs have limited abilities to exploit lexical knowledge, with only marginal\nimprovements compared to a version without lexical knowledge. This shows that\nLLMs have no ability to compositionally interpret a question on the basis of\nthe meaning of its parts, a key feature of compositional approaches. Taken\ntogether, our work shows new avenues for QALD research, emphasizing the\nimportance of lexicalization and compositionality.\n","authors":["David Maria Schmidt","Mohammad Fazleh Elahi","Philipp Cimiano"],"pdf_url":"https://arxiv.org/pdf/2411.03906v2.pdf","comment":"24th International Conference on Knowledge Engineering and Knowledge\n Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands"},{"id":"http://arxiv.org/abs/2412.04119v1","updated":"2024-12-05T12:37:27Z","published":"2024-12-05T12:37:27Z","title":"GRAF: Graph Retrieval Augmented by Facts for Legal Question Answering","summary":" Pre-trained Language Models (PLMs) have shown remarkable performances in\nrecent years, setting a new paradigm for NLP research and industry. The legal\ndomain has received some attention from the NLP community partly due to its\ntextual nature. Some tasks from this domain are represented by\nquestion-answering (QA) tasks. This work explores the legal domain\nMultiple-Choice QA (MCQA) for a low-resource language. The contribution of this\nwork is multi-fold. We first introduce JuRO, the first openly available\nRomanian legal MCQA dataset, comprising three different examinations and a\nnumber of 10,836 total questions. Along with this dataset, we introduce CROL,\nan organized corpus of laws that has a total of 93 distinct documents with\ntheir modifications from 763 time spans, that we leveraged in this work for\nInformation Retrieval (IR) techniques. 
Moreover, we are the first to propose\nLaw-RoG, a Knowledge Graph (KG) for the Romanian language, and this KG is\nderived from the aforementioned corpus. Lastly, we propose a novel approach for\nMCQA, Graph Retrieval Augmented by Facts (GRAF), which achieves competitive\nresults with generally accepted SOTA methods and even exceeds them in most\nsettings.\n","authors":["Cristian-George Crăciun","Răzvan-Alexandru Smădu","Dumitru-Clementin Cercel","Mihaela-Claudia Cercel"],"pdf_url":"https://arxiv.org/pdf/2412.04119v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.19574v2","updated":"2024-12-05T12:19:38Z","published":"2024-11-29T09:42:38Z","title":"KV Shifting Attention Enhances Language Modeling","summary":" Current large language models are mainly based on decoder-only\ntransformers, which have great in-context learning (ICL) capabilities. It is\ngenerally believed that the important foundation of their ICL capability is the\ninduction heads mechanism, which requires at least two layers of attention. To\nimplement the model's induction ability more efficiently, we revisit the\ninduction heads mechanism and propose KV shifting attention. We\ntheoretically prove that KV shifting attention reduces the model's\nrequirements for the depth and width of the induction heads mechanism. Our\nexperimental results demonstrate that KV shifting attention is beneficial to\nlearning induction heads and language modeling, which leads to better\nperformance or faster convergence, from toy models to pre-trained models\nwith more than 10B parameters.\n","authors":["Mingyu Xu","Wei Cheng","Bingning Wang","Weipeng Chen"],"pdf_url":"https://arxiv.org/pdf/2411.19574v2.pdf","comment":"22 pages"},{"id":"http://arxiv.org/abs/2412.04100v1","updated":"2024-12-05T12:10:42Z","published":"2024-12-05T12:10:42Z","title":"Missing Melodies: AI Music Generation and its \"Nearly\" Complete Omission\n of the Global South","summary":" Recent advances in generative AI have sparked renewed interest and expanded\npossibilities for music generation. However, the performance and versatility of\nthese systems across musical genres are heavily influenced by the availability\nof training data. We conducted an extensive analysis of over one million hours\nof audio datasets used in AI music generation research and manually reviewed\nmore than 200 papers from eleven prominent AI and music conferences and\norganizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,\nNeurIPS, NIME, SMC) to identify a critical gap in the fair representation and\ninclusion of the musical genres of the Global South in AI research. Our\nfindings reveal a stark imbalance: approximately 86% of the total dataset hours\nand over 93% of researchers focus primarily on music from the Global North.\nHowever, while around 40% of these datasets include some form of non-Western\nmusic, genres from the Global South account for only 14.6% of the data.\nFurthermore, approximately 51% of the papers surveyed concentrate on symbolic music\ngeneration, a method that often fails to capture the cultural nuances inherent\nin music from regions such as South Asia, the Middle East, and Africa. As AI\nincreasingly shapes the creation and dissemination of music, the significant\nunderrepresentation of music genres in datasets and research presents a serious\nthreat to global musical diversity.
We also propose some important steps to\nmitigate these risks and foster a more inclusive future for AI-driven music\ngeneration.\n","authors":["Atharva Mehta","Shivam Chauhan","Monojit Choudhury"],"pdf_url":"https://arxiv.org/pdf/2412.04100v1.pdf","comment":"Submitted to CACM, 12 pages, 2 figures"},{"id":"http://arxiv.org/abs/2412.04092v1","updated":"2024-12-05T11:56:48Z","published":"2024-12-05T11:56:48Z","title":"GEITje 7B Ultra: A Conversational Model for Dutch","summary":" Language models have rapidly evolved, predominantly focusing on English while\noften neglecting extensive pretraining in other languages. This approach has\nrequired initiatives to adapt powerful, English-centric models to other\nlinguistic contexts through finetuning. For Dutch, such a recent endeavour is\n``GEITje'' a model originally derived from the English-based Mistral 7B.\nBuilding on this fundamental work, the current research extends the\ncapabilities of GEITje by supervised finetuning on newly created high-quality\nsynthetic conversational datasets, along with an additional preference\nalignment procedure on a synthetic feedback dataset. Both the developed models\nand the created datasets are openly available.\n","authors":["Bram Vanroy"],"pdf_url":"https://arxiv.org/pdf/2412.04092v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11624v3","updated":"2024-12-05T11:47:49Z","published":"2024-06-17T15:07:55Z","title":"Words in Motion: Extracting Interpretable Control Vectors for Motion\n Transformers","summary":" Transformer-based models generate hidden states that are difficult to\ninterpret. In this work, we aim to interpret these hidden states and control\nthem at inference, with a focus on motion forecasting. We use linear probes to\nmeasure neural collapse towards interpretable motion features in hidden states.\nHigh probing accuracy implies meaningful directions and distances between\nhidden states of opposing features, which we use to fit interpretable control\nvectors for activation steering at inference. To optimize our control vectors,\nwe use sparse autoencoders with fully-connected, convolutional, MLPMixer layers\nand various activation functions. Notably, we show that enforcing sparsity in\nhidden states leads to a more linear relationship between control vector\ntemperatures and forecasts. Our approach enables mechanistic interpretability\nand zero-shot generalization to unseen dataset characteristics with negligible\ncomputational overhead. Our implementation is available at\nhttps://github.com/kit-mrt/future-motion\n","authors":["Omer Sahin Tas","Royden Wagner"],"pdf_url":"https://arxiv.org/pdf/2406.11624v3.pdf","comment":"Add autoencoders with convolutional, MLPMixer layers, and JumpReLU\n activations"},{"id":"http://arxiv.org/abs/2412.04067v1","updated":"2024-12-05T11:05:12Z","published":"2024-12-05T11:05:12Z","title":"Automated Medical Report Generation for ECG Data: Bridging Medical Text\n and Signal Processing with Deep Learning","summary":" Recent advances in deep learning and natural language generation have\nsignificantly improved image captioning, enabling automated, human-like\ndescriptions for visual content. In this work, we apply these captioning\ntechniques to generate clinician-like interpretations of ECG data. This study\nleverages existing ECG datasets accompanied by free-text reports authored by\nhealthcare professionals (HCPs) as training data. These reports, while often\ninconsistent, provide a valuable foundation for automated learning. 
We\nintroduce an encoder-decoder-based method that uses these reports to train\nmodels to generate detailed descriptions of ECG episodes. This represents a\nsignificant advancement in ECG analysis automation, with potential applications\nin zero-shot classification and automated clinical decision support.\n The model is tested on various datasets, including both 1- and 12-lead ECGs.\nIt significantly outperforms the state-of-the-art reference model by Qiu et\nal., achieving a METEOR score of 55.53% compared to 24.51% achieved by the\nreference model. Furthermore, several key design choices are discussed,\nproviding a comprehensive overview of current challenges and innovations in\nthis domain.\n The source codes for this research are publicly available in our Git\nrepository https://git.zib.de/ableich/ecg-comment-generation-public\n","authors":["Amnon Bleich","Antje Linnemann","Bjoern H. Diem","Tim OF Conrad"],"pdf_url":"https://arxiv.org/pdf/2412.04067v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.07122v2","updated":"2024-12-05T10:45:02Z","published":"2024-11-11T16:51:39Z","title":"SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering\n in LLMs","summary":" Large Language Models (LLMs) have demonstrated remarkable capabilities in\ngenerating human-like text, but their output may not be aligned with the user\nor even produce harmful content. This paper presents a novel approach to detect\nand steer concepts such as toxicity before generation. We introduce the Sparse\nConditioned Autoencoder (SCAR), a single trained module that extends the\notherwise untouched LLM. SCAR ensures full steerability, towards and away from\nconcepts (e.g., toxic content), without compromising the quality of the model's\ntext generation on standard evaluation benchmarks. We demonstrate the effective\napplication of our approach through a variety of concepts, including toxicity,\nsafety, and writing style alignment. As such, this work establishes a robust\nframework for controlling LLM generations, ensuring their ethical and safe\ndeployment in real-world applications.\n","authors":["Ruben Härle","Felix Friedrich","Manuel Brack","Björn Deiseroth","Patrick Schramowski","Kristian Kersting"],"pdf_url":"https://arxiv.org/pdf/2411.07122v2.pdf","comment":"Accepted at Socially Responsible Language Modelling Research (SoLaR)\n Workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.04046v1","updated":"2024-12-05T10:37:38Z","published":"2024-12-05T10:37:38Z","title":"Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting\n MPs","summary":" Numerous politicians use social media platforms, particularly X, to engage\nwith their constituents. This interaction allows constituents to pose questions\nand offer feedback but also exposes politicians to a barrage of hostile\nresponses, especially given the anonymity afforded by social media. They are\ntypically targeted in relation to their governmental role, but the comments\nalso tend to attack their personal identity. This can discredit politicians and\nreduce public trust in the government. It can also incite anger and disrespect,\nleading to offline harm and violence. While numerous models exist for detecting\nhostility in general, they lack the specificity required for political\ncontexts. Furthermore, addressing hostility towards politicians demands\ntailored approaches due to the distinct language and issues inherent to each\ncountry (e.g., Brexit for the UK). 
To bridge this gap, we construct a dataset\nof 3,320 English tweets spanning a two-year period manually annotated for\nhostility towards UK MPs. Our dataset also captures the targeted identity\ncharacteristics (race, gender, religion, none) in hostile tweets. We perform\nlinguistic and topical analyses to delve into the unique content of the UK\npolitical data. Finally, we evaluate the performance of pre-trained language\nmodels and large language models on binary hostility detection and multi-class\ntargeted identity type classification tasks. Our study offers valuable data and\ninsights for future research on the prevalence and nature of politics-related\nhostility specific to the UK.\n","authors":["Mugdha Pandya","Mali Jin","Kalina Bontcheva","Diana Maynard"],"pdf_url":"https://arxiv.org/pdf/2412.04046v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02788v2","updated":"2024-12-05T10:30:56Z","published":"2024-12-03T19:37:00Z","title":"Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset","summary":" Existing Scholarly Question Answering (QA) methods typically target\nhomogeneous data sources, relying solely on either text or Knowledge Graphs\n(KGs). However, scholarly information often spans heterogeneous sources,\nnecessitating the development of QA systems that integrate information from\nmultiple heterogeneous data sources. To address this challenge, we introduce\nHybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale\nQA dataset designed to facilitate answering questions incorporating both text\nand KG facts. The dataset consists of 10.5K question-answer pairs generated by\na large language model, leveraging the KGs DBLP and SemOpenAlex alongside\ncorresponding text from Wikipedia. In addition, we propose a RAG-based baseline\nhybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD\ntest set.\n","authors":["Tilahun Abedissa Taffa","Debayan Banerjee","Yaregal Assabie","Ricardo Usbeck"],"pdf_url":"https://arxiv.org/pdf/2412.02788v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04026v1","updated":"2024-12-05T10:00:58Z","published":"2024-12-05T10:00:58Z","title":"M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded\n Document-level Information Extraction","summary":" Multimodal information extraction (IE) tasks have attracted increasing\nattention because many studies have shown that multimodal information benefits\ntext information extraction. However, existing multimodal IE datasets mainly\nfocus on sentence-level image-facilitated IE in English text, and pay little\nattention to video-based multimodal IE and fine-grained visual grounding.\nTherefore, in order to promote the development of multimodal IE, we constructed\na multimodal multilingual multitask dataset, named M$^{3}$D, which has the\nfollowing features: (1) It contains paired document-level text and video to\nenrich multimodal information; (2) It supports two widely-used languages,\nnamely English and Chinese; (3) It includes more multimodal IE tasks such as\nentity recognition, entity chain extraction, relation extraction and visual\ngrounding. In addition, our dataset introduces an unexplored theme, i.e.,\nbiography, enriching the domains of multimodal IE resources. To establish a\nbenchmark for our dataset, we propose an innovative hierarchical multimodal IE\nmodel. This model effectively leverages and integrates multimodal information\nthrough a Denoised Feature Fusion Module (DFFM). 
Furthermore, in non-ideal\nscenarios, modal information is often incomplete. Thus, we designed a Missing\nModality Construction Module (MMCM) to alleviate the issues caused by missing\nmodalities. Our model achieved an average performance of 53.80% and 53.77% on\nfour tasks in English and Chinese datasets, respectively, which set a\nreasonable standard for subsequent research. In addition, we conducted more\nanalytical experiments to verify the effectiveness of our proposed module. We\nbelieve that our work can promote the development of the field of multimodal\nIE.\n","authors":["Jiang Liu","Bobo Li","Xinran Yang","Na Yang","Hao Fei","Mingyao Zhang","Fei Li","Donghong Ji"],"pdf_url":"https://arxiv.org/pdf/2412.04026v1.pdf","comment":"14 pages, 9 figures, 6 tables"},{"id":"http://arxiv.org/abs/2412.04025v1","updated":"2024-12-05T10:00:49Z","published":"2024-12-05T10:00:49Z","title":"Exploring the Influence of Label Aggregation on Minority Voices:\n Implications for Dataset Bias and Model Training","summary":" Resolving disagreement in manual annotation typically consists of removing\nunreliable annotators and using a label aggregation strategy such as majority\nvote or expert opinion to resolve disagreement. These may have the side-effect\nof silencing or under-representing minority but equally valid opinions. In this\npaper, we study the impact of standard label aggregation strategies on minority\nopinion representation in sexism detection. We investigate the quality and\nvalue of minority annotations, and then examine their effect on the class\ndistributions in gold labels, as well as how this affects the behaviour of\nmodels trained on the resulting datasets. Finally, we discuss the potential\nbiases introduced by each method and how they can be amplified by the models.\n","authors":["Mugdha Pandya","Nafise Sadat Moosavi","Diana Maynard"],"pdf_url":"https://arxiv.org/pdf/2412.04025v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.19846v6","updated":"2024-12-05T09:56:35Z","published":"2024-05-30T08:50:55Z","title":"Quest: Query-centric Data Synthesis Approach for Long-context Scaling of\n Large Language Model","summary":" Recent advancements in large language models (LLMs) have highlighted the\nimportance of extending context lengths for handling complex tasks. While\ntraditional methods for training on long contexts often use filtered long\ndocuments, these approaches lead to domain imbalances, limiting model\nperformance. To address this, techniques like random document concatenation\n(Standard) and similarity-based methods (KNN, ICLM) have been developed.\nHowever, they either sacrifice semantic coherence or diversity. To balance both\naspects, we introduce Quest, a query-centric data synthesis method aggregating\nsemantically relevant yet diverse documents. Quest uses a generative model to\npredict potential queries for each document, grouping documents with similar\nqueries and keywords. 
Extensive experiments demonstrate Quest's superior\nperformance on long-context tasks, achieving remarkable results with context\nlengths of up to 1M tokens and confirming its scalability across various model\nsizes.\n","authors":["Chaochen Gao","Xing Wu","Qi Fu","Songlin Hu"],"pdf_url":"https://arxiv.org/pdf/2405.19846v6.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04003v1","updated":"2024-12-05T09:26:58Z","published":"2024-12-05T09:26:58Z","title":"Marco-LLM: Bridging Languages via Massive Multilingual Training for\n Cross-Lingual Enhancement","summary":" Large Language Models (LLMs) have achieved remarkable progress in recent\nyears; however, their excellent performance is still largely limited to major\nworld languages, primarily English. Many LLMs continue to face challenges with\nmultilingual tasks, especially when it comes to low-resource languages. To\naddress this issue, we introduced Marco-LLM: Massive multilingual training for\ncross-lingual enhancement LLM. We have collected a substantial amount of\nmultilingual data for several low-resource languages and conducted extensive\ncontinual pre-training using the Qwen2 models. This effort has resulted in a\nmultilingual LLM named Marco-LLM. Through comprehensive evaluations on various\nmultilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA\nand many others, Marco-LLM has demonstrated substantial improvements over\nstate-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements\nin any-to-any machine translation tasks, showing the effectiveness of our\nmultilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not\nonly perform exceptionally well in multilingual tasks, including low-resource\nlanguages, but also maintain strong performance in English and other major\nlanguages, closing the performance gap between high- and low-resource language\ncapabilities. By bridging languages, this effort demonstrates our dedication to\nensuring LLMs work accurately across various languages.\n","authors":["Lingfeng Ming","Bo Zeng","Chenyang Lyu","Tianqi Shi","Yu Zhao","Xue Yang","Yefeng Liu","Yiyu Wang","Linlong Xu","Yangyang Liu","Xiaohu Zhao","Hao Wang","Heng Liu","Hao Zhou","Huifeng Yin","Zifu Shang","Haijun Li","Longyue Wang","Weihua Luo","Kaifu Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.04003v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03987v1","updated":"2024-12-05T09:05:30Z","published":"2024-12-05T09:05:30Z","title":"MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for\n Strengthening LLM","summary":" Large language models (LLMs) have shown limitations in tasks requiring\ncomplex logical reasoning and multi-step problem-solving. To address these\nchallenges, researchers have employed carefully designed prompts and\nflowcharts, simulating human cognitive processes to enhance LLM performance,\nsuch as the Chain of Thought approach. In this paper, we introduce MTMT\n(Multi-thinking Modes Tree), a novel method that interacts with LLMs to\nconstruct a thought tree, simulating various advanced cognitive processes,\nincluding but not limited to association, counterfactual thinking, task\ndecomposition, and comparison. By breaking down the original complex task into\nsimpler sub-questions, MTMT facilitates easier problem-solving for LLMs,\nenabling more effective utilization of the latent knowledge within LLMs. We\nevaluate the performance of MTMT under different parameter configurations,\nusing GPT-4o mini as the base model. 
Our results demonstrate that integrating\nmultiple modes of thinking significantly enhances the ability of LLMs to handle\ncomplex tasks.\n","authors":["Changcheng Li","Xiangyu Wang","Qiuju Chen","Xiren Zhou","Huanhuan Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03987v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03966v1","updated":"2024-12-05T08:33:52Z","published":"2024-12-05T08:33:52Z","title":"Demonstration Selection for In-Context Learning via Reinforcement\n Learning","summary":" Diversity in demonstration selection is crucial for enhancing model\ngeneralization, as it enables a broader coverage of structures and concepts.\nHowever, constructing an appropriate set of demonstrations has remained a focal\npoint of research. This paper presents the Relevance-Diversity Enhanced\nSelection (RDES), an innovative approach that leverages reinforcement learning\nto optimize the selection of diverse reference demonstrations for text\nclassification tasks using Large Language Models (LLMs), especially in few-shot\nprompting scenarios. RDES employs a Q-learning framework to dynamically\nidentify demonstrations that maximize both diversity and relevance to the\nclassification objective by calculating a diversity score based on label\ndistribution among selected demonstrations. This method ensures a balanced\nrepresentation of reference data, leading to improved classification accuracy.\nThrough extensive experiments on four benchmark datasets and involving 12\nclosed-source and open-source LLMs, we demonstrate that RDES significantly\nenhances classification accuracy compared to ten established baselines.\nFurthermore, we investigate the incorporation of Chain-of-Thought (CoT)\nreasoning in the reasoning process, which further enhances the model's\npredictive performance. The results underscore the potential of reinforcement\nlearning to facilitate adaptive demonstration selection and deepen the\nunderstanding of classification challenges.\n","authors":["Xubin Wang","Jianfei Wu","Yichen Yuan","Mingzhe Li","Deyu Cai","Weijia Jia"],"pdf_url":"https://arxiv.org/pdf/2412.03966v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03930v1","updated":"2024-12-05T07:12:53Z","published":"2024-12-05T07:12:53Z","title":"MIND: Effective Incorrect Assignment Detection through a Multi-Modal\n Structure-Enhanced Language Model","summary":" The rapid growth of academic publications has exacerbated the issue of author\nname ambiguity in online digital libraries. Despite advances in name\ndisambiguation algorithms, cumulative errors continue to undermine the\nreliability of academic systems. It is estimated that over 10% paper-author\nassignments are rectified when constructing the million-scale WhoIsWho\nbenchmark. Existing endeavors to detect incorrect assignments are either\nsemantic-based or graph-based approaches, which fall short of making full use\nof the rich text attributes of papers and implicit structural features defined\nvia the co-occurrence of paper attributes. To this end, this paper introduces a\nstructure-enhanced language model that combines key structural features from\ngraph-based methods with fine-grained semantic features from rich paper\nattributes to detect incorrect assignments. The proposed model is trained with\na highly effective multi-modal multi-turn instruction tuning framework, which\nincorporates task-guided instruction tuning, text-attribute modality, and\nstructural modality. 
Experimental results demonstrate that our model\noutperforms previous approaches, achieving top performance on the leaderboard\nof KDD Cup 2024. Our code is publicly available.\n","authors":["Yunhe Pang","Bo Chen","Fanjin Zhang","Yanghui Rao","Jie Tang"],"pdf_url":"https://arxiv.org/pdf/2412.03930v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2410.21216v2","updated":"2024-12-05T07:09:27Z","published":"2024-10-28T17:01:52Z","title":"HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced\n Context Awareness and Extrapolation","summary":" Many positional encodings (PEs) are designed to exhibit long-term decay,\nbased on an entrenched and long-standing inductive opinion: tokens farther away\nfrom the current position carry less relevant information. We argue that\nlong-term decay is outdated in the era of LLMs, as LLMs are now applied to\ntasks demanding precise retrieval of in-context information from arbitrary\npositions. Firstly, we present empirical analyses on various PEs, demonstrating\nthat models inherently learn attention with only a local-decay pattern while\nforming a U-shape pattern globally, contradicting the principle of long-term\ndecay. Furthermore, we conduct a detailed analysis of rotary position encoding\n(RoPE, a prevalent relative positional encoding in LLMs), and find that the\nU-shape attention is caused by some learned components, which are also the key\nfactor limiting RoPE's expressiveness and extrapolation. Inspired by these\ninsights, we propose High-frequency rotary Position Encoding (HoPE). HoPE\nreplaces the specific components in RoPE with position-independent ones,\nretaining only high-frequency signals, which also breaks the principle of\nlong-term decay in theory. HoPE achieves two major advantages: (1) Without\nconstraints imposed by long-term decay, contradictory factors that limit\nspontaneous attention optimization and model extrapolation performance are\nremoved. (2) Components representing positions and semantics are optimized.\nThese enhance the model's context awareness and extrapolation, as validated by\nextensive experiments.\n","authors":["Yuhan Chen","Ang Lv","Jian Luan","Bin Wang","Wei Liu"],"pdf_url":"https://arxiv.org/pdf/2410.21216v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2401.00741v3","updated":"2024-12-05T07:05:59Z","published":"2024-01-01T12:49:36Z","title":"ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of\n Large Language Models in Real-world Scenarios","summary":" Existing evaluations of tool learning primarily focus on validating the\nalignment of selected tools for large language models (LLMs) with expected\noutcomes. However, these approaches rely on a limited set of scenarios where\nanswers can be pre-determined, diverging from genuine needs. Furthermore, a\nsole emphasis on outcomes disregards the complex capabilities required for LLMs\nto effectively use tools. To tackle this issue, we propose ToolEyes, a\nfine-grained system tailored for the evaluation of the LLMs' tool learning\ncapabilities in authentic scenarios. The system meticulously examines seven\nreal-world scenarios, analyzing five dimensions crucial to LLMs in tool\nlearning: format alignment, intent comprehension, behavior planning, tool\nselection, and answer organization. Additionally, ToolEyes incorporates a tool\nlibrary boasting approximately 600 tools, serving as an intermediary between\nLLMs and the physical world.
Evaluations involving ten LLMs across three\ncategories reveal a preference for specific scenarios and limited cognitive\nabilities in tool learning. Intriguingly, expanding the model size even\nexacerbates the hindrance to tool learning. The code and data are available at\nhttps://github.com/Junjie-Ye/ToolEyes.\n","authors":["Junjie Ye","Guanyu Li","Songyang Gao","Caishuang Huang","Yilong Wu","Sixian Li","Xiaoran Fan","Shihan Dou","Tao Ji","Qi Zhang","Tao Gui","Xuanjing Huang"],"pdf_url":"https://arxiv.org/pdf/2401.00741v3.pdf","comment":"Accepted by COLING 2025 conference"},{"id":"http://arxiv.org/abs/2412.03331v2","updated":"2024-12-05T07:05:57Z","published":"2024-12-04T14:02:12Z","title":"LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence\n Embeddings","summary":" Sentence embedding models play a key role in various Natural Language\nProcessing tasks, such as in Topic Modeling, Document Clustering and\nRecommendation Systems. However, these models rely heavily on parallel data,\nwhich can be scarce for many low-resource languages, including Luxembourgish.\nThis scarcity results in suboptimal performance of monolingual and\ncross-lingual sentence embedding models for these languages. To address this\nissue, we compile a relatively small but high-quality human-generated\ncross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence\nembedding model for Luxembourgish with strong cross-lingual capabilities.\nAdditionally, we present evidence suggesting that including low-resource\nlanguages in parallel training datasets can be more advantageous for other\nlow-resource languages than relying solely on high-resource language pairs.\nFurthermore, recognizing the lack of sentence embedding benchmarks for\nlow-resource languages, we create a paraphrase detection benchmark specifically\nfor Luxembourgish, aiming to partially fill this gap and promote further\nresearch.\n","authors":["Fred Philippy","Siwen Guo","Jacques Klein","Tegawendé F. Bissyandé"],"pdf_url":"https://arxiv.org/pdf/2412.03331v2.pdf","comment":"Accepted at COLING 2025"},{"id":"http://arxiv.org/abs/2410.01485v2","updated":"2024-12-05T06:52:42Z","published":"2024-10-02T12:35:53Z","title":"A Little Goes a Long Way: Efficient Long Context Training and Inference\n with Partial Contexts","summary":" Training and serving long-context large language models (LLMs) incurs\nsubstantial overhead. To address this, two critical steps are often required: a\npretrained LLM typically undergoes a separate stage for context length\nextension by training on long-context data, followed by architectural\nmodifications to reduce the overhead of KV cache during serving. This paper\nargues that integrating length extension with a GPU-friendly KV cache reduction\narchitecture not only reduces training overhead during length extension, but\nalso achieves better long-context performance. This leads to our proposed\nLongGen, which finetunes a pretrained LLM into an efficient architecture during\nlength extension. LongGen builds on three key insights: (1) Sparse attention\npatterns, such as window attention (attending to recent tokens), attention sink\n(initial ones), and blockwise sparse attention (strided token blocks) are\nwell-suited for building efficient long-context models, primarily due to their\nGPU-friendly memory access patterns, enabling efficiency gains not just\ntheoretically but in practice as well. (2) It is essential for the model to\nhave direct access to all tokens. 
A hybrid architecture with 1/3 full attention\nlayers and 2/3 efficient ones achieves a balanced trade-off between efficiency\nand long-context performance. (3) Lightweight training on 5B long-context data\nis sufficient to extend the hybrid model's context length from 4K to 128K.\n We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its\neffectiveness across different scales. During training with 128K-long contexts,\nLongGen achieves 1.55x training speedup and reduces wall-clock time by 36%,\ncompared to a full-attention baseline. During inference, LongGen reduces KV\ncache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding\nspeedup.\n","authors":["Suyu Ge","Xihui Lin","Yunan Zhang","Jiawei Han","Hao Peng"],"pdf_url":"https://arxiv.org/pdf/2410.01485v2.pdf","comment":null}],"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.04661v1","updated":"2024-12-05T23:10:56Z","published":"2024-12-05T23:10:56Z","title":"HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and\n Representation Learning","summary":" Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by\nintegrating external document retrieval to provide domain-specific or\nup-to-date knowledge. The effectiveness of RAG depends on the relevance of\nretrieved documents, which is influenced by the semantic alignment of\nembeddings with the domain's specialized content. Although full fine-tuning can\nalign language models to specific domains, it is computationally intensive and\ndemands substantial data. This paper introduces Hierarchical Embedding\nAlignment Loss (HEAL), a novel method that leverages hierarchical fuzzy\nclustering with matrix factorization within contrastive learning to efficiently\nalign LLM embeddings with domain-specific content. HEAL computes\nlevel/depth-wise contrastive losses and incorporates hierarchical penalties to\nalign embeddings with the underlying relationships in label hierarchies. This\napproach enhances retrieval relevance and document classification, effectively\nreducing hallucinations in LLM outputs. In our experiments, we benchmark and\nevaluate HEAL across diverse domains, including Healthcare, Material Science,\nCyber-security, and Applied Maths.\n","authors":["Manish Bhattarai","Ryan Barron","Maksim Eren","Minh Vu","Vesselin Grantcharov","Ismael Boureima","Valentin Stanev","Cynthia Matuszek","Vladimir Valtchinov","Kim Rasmussen","Boian Alexandrov"],"pdf_url":"https://arxiv.org/pdf/2412.04661v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04637v1","updated":"2024-12-05T22:10:58Z","published":"2024-12-05T22:10:58Z","title":"Semantic Retrieval at Walmart","summary":" In product search, the retrieval of candidate products before re-ranking is\nmore critical and challenging than other search like web search, especially for\ntail queries, which have a complex and specific search intent. In this paper,\nwe present a hybrid system for e-commerce search deployed at Walmart that\ncombines traditional inverted index and embedding-based neural retrieval to\nbetter answer user tail queries. Our system significantly improved the\nrelevance of the search engine, measured by both offline and online\nevaluations. The improvements were achieved through a combination of different\napproaches. We present a new technique to train the neural model at scale. and\ndescribe how the system was deployed in production with little impact on\nresponse time. 
We highlight multiple learnings and practical tricks that were\nused in the deployment of this system.\n","authors":["Alessandro Magnani","Feng Liu","Suthee Chaidaroon","Sachin Yadav","Praveen Reddy Suram","Ajit Puthenputhussery","Sijie Chen","Min Xie","Anirudh Kashi","Tony Lee","Ciya Liao"],"pdf_url":"https://arxiv.org/pdf/2412.04637v1.pdf","comment":"9 page, 2 figures, 10 tables, KDD 2022"},{"id":"http://arxiv.org/abs/2412.04629v1","updated":"2024-12-05T21:51:05Z","published":"2024-12-05T21:51:05Z","title":"Argumentative Experience: Reducing Confirmation Bias on Controversial\n Issues through LLM-Generated Multi-Persona Debates","summary":" Large language models (LLMs) are enabling designers to give life to exciting\nnew user experiences for information access. In this work, we present a system\nthat generates LLM personas to debate a topic of interest from different\nperspectives. How might information seekers use and benefit from such a system?\nCan centering information access around diverse viewpoints help to mitigate\nthorny challenges like confirmation bias in which information seekers\nover-trust search results matching existing beliefs? How do potential biases\nand hallucinations in LLMs play out alongside human users who are also fallible\nand possibly biased?\n Our study exposes participants to multiple viewpoints on controversial issues\nvia a mixed-methods, within-subjects study. We use eye-tracking metrics to\nquantitatively assess cognitive engagement alongside qualitative feedback.\nCompared to a baseline search system, we see more creative interactions and\ndiverse information-seeking with our multi-persona debate system, which more\neffectively reduces user confirmation bias and conviction toward their initial\nbeliefs. Overall, our study contributes to the emerging design space of\nLLM-based information access systems, specifically investigating the potential\nof simulated personas to promote greater exposure to information diversity,\nemulate collective intelligence, and mitigate bias in information seeking.\n","authors":["Li Shi","Houjiang Liu","Yian Wong","Utkarsh Mujumdar","Dan Zhang","Jacek Gwizdka","Matthew Lease"],"pdf_url":"https://arxiv.org/pdf/2412.04629v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04466v1","updated":"2024-12-05T18:59:51Z","published":"2024-12-05T18:59:51Z","title":"User-item fairness tradeoffs in recommendations","summary":" In the basic recommendation paradigm, the most (predicted) relevant item is\nrecommended to each user. This may result in some items receiving lower\nexposure than they \"should\"; to counter this, several algorithmic approaches\nhave been developed to ensure item fairness. These approaches necessarily\ndegrade recommendations for some users to improve outcomes for items, leading\nto user fairness concerns. In turn, a recent line of work has focused on\ndeveloping algorithms for multi-sided fairness, to jointly optimize user\nfairness, item fairness, and overall recommendation quality. This induces the\nquestion: what is the tradeoff between these objectives, and what are the\ncharacteristics of (multi-objective) optimal solutions? Theoretically, we\ndevelop a model of recommendations with user and item fairness objectives and\ncharacterize the solutions of fairness-constrained optimization. We identify\ntwo phenomena: (a) when user preferences are diverse, there is \"free\" item and\nuser fairness; and (b) users whose preferences are misestimated can be\nespecially disadvantaged by item fairness constraints. 
Empirically, we\nprototype a recommendation system for preprints on arXiv and implement our\nframework, measuring the phenomena in practice and showing how these phenomena\ninform the design of markets with recommendation systems-intermediated\nmatching.\n","authors":["Sophie Greenwood","Sudalakshmee Chiniah","Nikhil Garg"],"pdf_url":"https://arxiv.org/pdf/2412.04466v1.pdf","comment":"Accepted at the Thirty-Eighth Annual Conference on Neural Information\n Processing Systems"},{"id":"http://arxiv.org/abs/2412.04276v1","updated":"2024-12-05T15:59:05Z","published":"2024-12-05T15:59:05Z","title":"Graph-Sequential Alignment and Uniformity: Toward Enhanced\n Recommendation Systems","summary":" Graph-based and sequential methods are two popular recommendation paradigms,\neach excelling in its domain but lacking the ability to leverage signals from\nthe other. To address this, we propose a novel method that integrates both\napproaches for enhanced performance. Our framework uses Graph Neural Network\n(GNN)-based and sequential recommenders as separate submodules while sharing a\nunified embedding space optimized jointly. To enable positive knowledge\ntransfer, we design a loss function that enforces alignment and uniformity both\nwithin and across submodules. Experiments on three real-world datasets\ndemonstrate that the proposed method significantly outperforms using either\napproach alone and achieves state-of-the-art results. Our implementations are\npublicly available at https://github.com/YuweiCao-UIC/GSAU.git.\n","authors":["Yuwei Cao","Liangwei Yang","Zhiwei Liu","Yuqing Liu","Chen Wang","Yueqing Liang","Hao Peng","Philip S. Yu"],"pdf_url":"https://arxiv.org/pdf/2412.04276v1.pdf","comment":"Under review"},{"id":"http://arxiv.org/abs/2412.04272v1","updated":"2024-12-05T15:54:16Z","published":"2024-12-05T15:54:16Z","title":"PoTable: Programming Standardly on Table-based Reasoning Like a Human\n Analyst","summary":" Table-based reasoning has garnered substantial research interest,\nparticularly in its integration with Large Language Models (LLMs), which have\nrevolutionized the general reasoning paradigm. Numerous LLM-based studies\nintroduce symbolic tools (e.g., databases, Python) as assistants to extend\nhuman-like abilities in structured table understanding and complex arithmetic\ncomputations. However, these studies leave room for improvement in simulating human\ncognitive behavior when using symbolic tools, as they still suffer from\nlimitations of non-standard logical splits and constrained operation pools. In\nthis study, we propose PoTable as a novel table-based reasoning method that\nsimulates a human tabular analyst, which integrates a Python interpreter as the\nreal-time executor accompanied by an LLM-based operation planner and code\ngenerator. Specifically, PoTable follows a human-like logical stage split and\nextends the operation pool into an open-world space without any constraints.\nThrough planning and executing in each distinct stage, PoTable standardly\ncompletes the entire reasoning process and produces superior reasoning results\nalong with highly accurate, step-by-step commented and completely executable\nprograms. Accordingly, the effectiveness and explainability of PoTable are\nfully demonstrated. Extensive experiments over three evaluation datasets from\ntwo public benchmarks on two backbones show the outstanding performance of our\napproach.
In particular, GPT-based PoTable achieves over 4% higher absolute\naccuracy than runner-ups on all evaluation datasets.\n","authors":["Qingyang Mao","Qi Liu","Zhi Li","Mingyue Cheng","Zheng Zhang","Rui Li"],"pdf_url":"https://arxiv.org/pdf/2412.04272v1.pdf","comment":"12 pages, 4 figures"},{"id":"http://arxiv.org/abs/2312.00326v4","updated":"2024-12-05T14:45:05Z","published":"2023-12-01T03:44:54Z","title":"Agent-OM: Leveraging LLM Agents for Ontology Matching","summary":" Ontology matching (OM) enables semantic interoperability between different\nontologies and resolves their conceptual heterogeneity by aligning related\nentities. OM systems currently have two prevailing design paradigms:\nconventional knowledge-based expert systems and newer machine learning-based\npredictive systems. While large language models (LLMs) and LLM agents have\nrevolutionised data engineering and have been applied creatively in many\ndomains, their potential for OM remains underexplored. This study introduces a\nnovel agent-powered LLM-based design paradigm for OM systems. With\nconsideration of several specific challenges in leveraging LLM agents for OM,\nwe propose a generic framework, namely Agent-OM (Agent for Ontology Matching),\nconsisting of two Siamese agents for retrieval and matching, with a set of\nsimple OM tools. Our framework is implemented in a proof-of-concept system.\nEvaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks\nover state-of-the-art OM systems show that our system can achieve results very\nclose to the long-standing best performance on simple OM tasks and can\nsignificantly improve the performance on complex and few-shot OM tasks.\n","authors":["Zhangcheng Qiang","Weiqing Wang","Kerry Taylor"],"pdf_url":"https://arxiv.org/pdf/2312.00326v4.pdf","comment":"14 pages, 13 figures, 4 tables"},{"id":"http://arxiv.org/abs/2409.16182v3","updated":"2024-12-05T14:28:42Z","published":"2024-09-24T15:26:38Z","title":"TiM4Rec: An Efficient Sequential Recommendation Model Based on\n Time-Aware Structured State Space Duality Model","summary":" The Sequential Recommendation modeling paradigm is shifting from Transformer\nto Mamba architecture, which comprises two generations: Mamba1, based on the\nState Space Model (SSM), and Mamba2, based on State Space Duality (SSD).\nAlthough SSD offers superior computational efficiency compared to SSM, it\nsuffers performance degradation in sequential recommendation tasks, especially\nin low-dimensional scenarios that are critical for these tasks. Considering\nthat time-aware enhancement methods are commonly employed to mitigate\nperformance loss, our analysis reveals that the performance decline of SSD can\nsimilarly be fundamentally compensated by leveraging mechanisms in time-aware\nmethods. Thus, we propose integrating time-awareness into the SSD framework to\naddress these performance issues. However, integrating current time-aware\nmethods, modeled after TiSASRec, into SSD faces the following challenges: 1)\nthe complexity of integrating these transformer-based mechanisms with the SSD\narchitecture, and 2) the computational inefficiency caused by the need for\ndimensionality expansion of time-difference modeling. To overcome these\nchallenges, we introduce a novel Time-aware Structured Masked Matrix that\nefficiently incorporates time-aware capabilities into SSD. 
Building on this, we\npropose Time-Aware Mamba for Recommendation (TiM4Rec), which mitigates\nperformance degradation in low-dimensional SSD contexts while preserving\ncomputational efficiency. This marks the inaugural application of a time-aware\nenhancement method specifically tailored for the Mamba architecture within the\ndomain of sequential recommendation. Extensive experiments conducted on three\nreal-world datasets demonstrate the superiority of our approach. The code for\nour model is accessible at https://github.com/AlwaysFHao/TiM4Rec.\n","authors":["Hao Fan","Mengyi Zhu","Yanrong Hu","Hailin Feng","Zhijie He","Hongjiu Liu","Qingyang Liu"],"pdf_url":"https://arxiv.org/pdf/2409.16182v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.03906v2","updated":"2024-12-05T12:56:40Z","published":"2024-11-06T13:37:28Z","title":"Lexicalization Is All You Need: Examining the Impact of Lexical\n Knowledge in a Compositional QALD System","summary":" In this paper, we examine the impact of lexicalization on Question Answering\nover Linked Data (QALD). It is well known that one of the key challenges in\ninterpreting natural language questions with respect to SPARQL lies in bridging\nthe lexical gap, that is mapping the words in the query to the correct\nvocabulary elements. We argue in this paper that lexicalization, that is\nexplicit knowledge about the potential interpretations of a word with respect\nto the given vocabulary, significantly eases the task and increases the\nperformance of QA systems. Towards this goal, we present a compositional QA\nsystem that can leverage explicit lexical knowledge in a compositional manner\nto infer the meaning of a question in terms of a SPARQL query. We show that\nsuch a system, given lexical knowledge, has a performance well beyond current\nQA systems, achieving up to a $35.8\\%$ increase in the micro $F_1$ score\ncompared to the best QA system on QALD-9. This shows the importance and\npotential of including explicit lexical knowledge. In contrast, we show that\nLLMs have limited abilities to exploit lexical knowledge, with only marginal\nimprovements compared to a version without lexical knowledge. This shows that\nLLMs have no ability to compositionally interpret a question on the basis of\nthe meaning of its parts, a key feature of compositional approaches. Taken\ntogether, our work shows new avenues for QALD research, emphasizing the\nimportance of lexicalization and compositionality.\n","authors":["David Maria Schmidt","Mohammad Fazleh Elahi","Philipp Cimiano"],"pdf_url":"https://arxiv.org/pdf/2411.03906v2.pdf","comment":"24th International Conference on Knowledge Engineering and Knowledge\n Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands"},{"id":"http://arxiv.org/abs/2412.04107v1","updated":"2024-12-05T12:17:56Z","published":"2024-12-05T12:17:56Z","title":"Pre-train, Align, and Disentangle: Empowering Sequential Recommendation\n with Large Language Models","summary":" Sequential recommendation (SR) aims to model the sequential dependencies in\nusers' historical interactions to better capture their evolving interests.\nHowever, existing SR approaches primarily rely on collaborative data, which\nleads to limitations such as the cold-start problem and sub-optimal\nperformance. Meanwhile, despite the success of large language models (LLMs),\ntheir application in industrial recommender systems is hindered by high\ninference latency, inability to capture all distribution statistics, and\ncatastrophic forgetting. 
To this end, we propose a novel Pre-train, Align, and\nDisentangle (PAD) paradigm to empower recommendation models with LLMs.\nSpecifically, we first pre-train both the SR and LLM models to get\ncollaborative and textual embeddings. Next, a characteristic\nrecommendation-anchored alignment loss is proposed using multi-kernel maximum\nmean discrepancy with Gaussian kernels. Finally, a triple-experts architecture,\nconsisting of aligned and modality-specific experts with disentangled embeddings,\nis fine-tuned in a frequency-aware manner. Experiments conducted on three\npublic datasets demonstrate the effectiveness of PAD, showing significant\nimprovements and compatibility with various SR backbone models, especially on\ncold items. The implementation code and datasets will be publicly available.\n","authors":["Yuhao Wang","Junwei Pan","Xiangyu Zhao","Pengyue Jia","Wanyu Wang","Yuan Wang","Yue Liu","Dapeng Liu","Jie Jiang"],"pdf_url":"https://arxiv.org/pdf/2412.04107v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2308.07426v4","updated":"2024-12-05T11:04:14Z","published":"2023-08-14T19:36:57Z","title":"A Survey on Point-of-Interest Recommendations Leveraging Heterogeneous\n Data","summary":" Tourism is an important application domain for recommender systems. In this\ndomain, recommender systems are, for example, tasked with providing personalized\nrecommendations for transportation, accommodation, points-of-interest (POIs),\netc. Among these tasks, in particular the problem of recommending POIs that are\nof likely interest to individual tourists has gained growing attention in\nrecent years. Providing POI recommendations to tourists can, however, be\nespecially challenging due to the variability of the user's context. With the\nrapid development of the Web and today's multitude of online services, vast\namounts of data from various sources have become available, and these\nheterogeneous data represent a huge potential to better address the challenges\nof POI recommendation problems. In this work, we provide a survey of published\nresearch on the problem of POI recommendation between 2021 and 2023. The\nliterature was surveyed to identify the information types, techniques and\nevaluation methods employed. Based on the analysis, it was observed that\ncurrent research tends to focus on a relatively narrow range of information\ntypes and there is significant potential for improving POI recommendation by\nleveraging heterogeneous data. As the first information-centric survey on POI\nrecommendation research, this study serves as a reference for researchers\naiming to develop increasingly accurate, personalized and context-aware POI\nrecommender systems.\n","authors":["Zehui Wang","Wolfram Höpken","Dietmar Jannach"],"pdf_url":"https://arxiv.org/pdf/2308.07426v4.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03913v1","updated":"2024-12-05T06:30:20Z","published":"2024-12-05T06:30:20Z","title":"Graph Disentangle Causal Model: Enhancing Causal Inference in Networked\n Observational Data","summary":" Estimating individual treatment effects (ITE) from observational data is a\ncritical task across various domains. However, many existing works on ITE\nestimation overlook the influence of hidden confounders, which remain\nunobserved at the individual unit level. 
To address this limitation,\nresearchers have utilized graph neural networks to aggregate neighbors'\nfeatures to capture the hidden confounders and mitigate confounding bias by\nminimizing the discrepancy of confounder representations between the treated\nand control groups. Despite the success of these approaches, existing methods\noften treat all features as confounders, while practical scenarios involve substantial\ndifferences in feature distributions between the treated and control groups.\nConfusing adjustment variables with confounders and enforcing strict balance on the\nconfounder representations could potentially undermine the effectiveness of\noutcome prediction. To mitigate this issue, we propose a novel framework called\nthe \textit{Graph Disentangle Causal model} (GDC) to conduct ITE estimation in\nthe network setting. GDC utilizes a causal disentangle module to separate unit\nfeatures into adjustment and confounder representations. Then we design a graph\naggregation module consisting of three distinct graph aggregators to obtain\nadjustment, confounder, and counterfactual confounder representations. Finally,\na causal constraint module is employed to enforce the disentangled\nrepresentations as true causal factors. The effectiveness of our proposed\nmethod is demonstrated by conducting comprehensive experiments on two networked\ndatasets.\n","authors":["Binbin Hu","Zhicheng An","Zhengwei Wu","Ke Tu","Ziqi Liu","Zhiqiang Zhang","Jun Zhou","Yufei Feng","Jiawei Chen"],"pdf_url":"https://arxiv.org/pdf/2412.03913v1.pdf","comment":"Accepted by WSDM 2025"},{"id":"http://arxiv.org/abs/2412.03875v1","updated":"2024-12-05T05:07:19Z","published":"2024-12-05T05:07:19Z","title":"Learning to Hash for Recommendation: A Survey","summary":" With the explosive growth of users and items, Recommender Systems (RS) are\nfacing unprecedented challenges in both retrieval efficiency and storage cost.\nFortunately, Learning to Hash (L2H) techniques have been shown to be a promising\nsolution to address the two dilemmas, whose core idea is encoding\nhigh-dimensional data into compact hash codes. To this end, L2H for RS (HashRec\nfor short) has recently received widespread attention to support large-scale\nrecommendations. In this survey, we present a comprehensive review of current\nHashRec algorithms. Specifically, we first introduce the commonly used\ntwo-tower models in the recall stage and identify two search strategies\nfrequently employed in L2H. Then, we categorize prior works into a two-tier\ntaxonomy based on: (i) the type of loss function and (ii) the optimization\nstrategy. We also introduce some commonly used evaluation metrics to measure\nthe performance of HashRec algorithms. Finally, we shed light on the\nlimitations of the current research and outline future research directions.\nFurthermore, the summary of HashRec methods reviewed in this survey can be\nfound at\n\href{https://github.com/Luo-Fangyuan/HashRec}{https://github.com/Luo-Fangyuan/HashRec}.\n","authors":["Fangyuan Luo","Honglei Zhang","Tong Li","Jun Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03875v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.12008v2","updated":"2024-12-05T21:50:27Z","published":"2024-11-18T19:48:18Z","title":"Compression of Higher Order Ambisonics with Multichannel RVQGAN","summary":" A multichannel extension to the RVQGAN neural coding method is proposed, and\nrealized for data-driven compression of third-order Ambisonics audio. 
The\ninput- and output layers of the generator and discriminator models are modified\nto accept multiple (16) channels without increasing the model bitrate. We also\npropose a loss function for accounting for spatial perception in immersive\nreproduction, and transfer learning from single-channel models. Listening test\nresults with 7.1.4 immersive playback show that the proposed extension is\nsuitable for coding scene-based, 16-channel Ambisonics content with good\nquality at 16 kbps.\n","authors":["Toni Hirvonen","Mahmoud Namazi"],"pdf_url":"https://arxiv.org/pdf/2411.12008v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.04307v1","updated":"2024-12-05T16:26:37Z","published":"2024-12-05T16:26:37Z","title":"Feature Coding in the Era of Large Models: Dataset, Test Conditions, and\n Benchmark","summary":" Large models have achieved remarkable performance across various tasks, yet\nthey incur significant computational costs and privacy concerns during both\ntraining and inference. Distributed deployment has emerged as a potential\nsolution, but it necessitates the exchange of intermediate information between\nmodel segments, with feature representations serving as crucial information\ncarriers. To optimize information exchange, feature coding methods are applied\nto reduce transmission and storage overhead. Despite its importance, feature\ncoding for large models remains an under-explored area. In this paper, we draw\nattention to large model feature coding and make three contributions to this\nfield. First, we introduce a comprehensive dataset encompassing diverse\nfeatures generated by three representative types of large models. Second, we\nestablish unified test conditions, enabling standardized evaluation pipelines\nand fair comparisons across future feature coding studies. Third, we introduce\ntwo baseline methods derived from widely used image coding techniques and\nbenchmark their performance on the proposed dataset. These contributions aim to\nadvance the field of feature coding, facilitating more efficient large model\ndeployment. All source code and the dataset will be made available on GitHub.\n","authors":["Changsheng Gao","Yifan Ma","Qiaoxi Chen","Yenan Xu","Dong Liu","Weisi Lin"],"pdf_url":"https://arxiv.org/pdf/2412.04307v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.17440v2","updated":"2024-12-05T15:54:00Z","published":"2024-11-26T13:58:24Z","title":"Identity-Preserving Text-to-Video Generation by Frequency Decomposition","summary":" Identity-preserving text-to-video (IPT2V) generation aims to create\nhigh-fidelity videos with consistent human identity. It is an important task in\nvideo generation but remains an open problem for generative models. This paper\npushes the technical frontier of IPT2V in two directions that have not been\nresolved in literature: (1) A tuning-free pipeline without tedious case-by-case\nfinetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based\ncontrol scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V\nmodel to keep human identity consistent in the generated video. Inspired by\nprior findings in frequency analysis of diffusion transformers, it employs\nidentity-control signals in the frequency domain, where facial features can be\ndecomposed into low-frequency global features and high-frequency intrinsic\nfeatures. 
First, from a low-frequency perspective, we introduce a global facial\nextractor, which encodes reference images and facial key points into a latent\nspace, generating features enriched with low-frequency information. These\nfeatures are then integrated into shallow layers of the network to alleviate\ntraining challenges associated with DiT. Second, from a high-frequency\nperspective, we design a local facial extractor to capture high-frequency\ndetails and inject them into transformer blocks, enhancing the model's ability\nto preserve fine-grained features. We propose a hierarchical training strategy\nto leverage frequency information for identity preservation, transforming a\nvanilla pre-trained video generation model into an IPT2V model. Extensive\nexperiments demonstrate that our frequency-aware heuristic scheme provides an\noptimal control solution for DiT-based models. Thanks to this scheme, our\nConsisID generates high-quality, identity-preserving videos, making strides\ntowards more effective IPT2V.\n","authors":["Shenghai Yuan","Jinfa Huang","Xianyi He","Yunyuan Ge","Yujun Shi","Liuhan Chen","Jiebo Luo","Li Yuan"],"pdf_url":"https://arxiv.org/pdf/2411.17440v2.pdf","comment":"12 pages, 8 figures, Code: https://github.com/PKU-YuanGroup/ConsisID"},{"id":"http://arxiv.org/abs/2212.05005v4","updated":"2024-12-05T10:52:25Z","published":"2022-12-09T17:45:36Z","title":"Memories are One-to-Many Mapping Alleviators in Talking Face Generation","summary":" Talking face generation aims at generating photo-realistic video portraits of\na target person driven by input audio. Due to its nature of one-to-many mapping\nfrom the input audio to the output video (e.g., one speech content may have\nmultiple feasible visual appearances), learning a deterministic mapping like\nprevious works brings ambiguity during training, and thus causes inferior\nvisual results. Although this one-to-many mapping could be alleviated in part\nby a two-stage framework (i.e., an audio-to-expression model followed by a\nneural-rendering model), it is still insufficient since the prediction is\nproduced without enough information (e.g., emotions, wrinkles, etc.). In this\npaper, we propose MemFace to complement the missing information with an\nimplicit memory and an explicit memory that follow the sense of the two stages\nrespectively. More specifically, the implicit memory is employed in the\naudio-to-expression model to capture high-level semantics in the\naudio-expression shared space, while the explicit memory is employed in the\nneural-rendering model to help synthesize pixel-level details. Our experimental\nresults show that our proposed MemFace surpasses all the state-of-the-art\nresults across multiple scenarios consistently and significantly.\n","authors":["Anni Tang","Tianyu He","Xu Tan","Jun Ling","Li Song"],"pdf_url":"https://arxiv.org/pdf/2212.05005v4.pdf","comment":"IEEE Transactions on Pattern Analysis and Machine Intelligence\n (2024). Project page: see https://memoryface.github.io"}]},"2024-12-04T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.01007v2","updated":"2024-12-04T20:01:42Z","published":"2024-12-01T23:54:12Z","title":"CoRNStack: High-Quality Contrastive Data for Better Code Ranking","summary":" Effective code retrieval plays a crucial role in advancing code generation,\nbug fixing, and software maintenance, particularly as software systems increase\nin complexity. 
While current code embedding models have demonstrated promise in\nretrieving code snippets for small-scale, well-defined tasks, they often\nunderperform in more demanding real-world applications such as bug localization\nwithin GitHub repositories. We hypothesize that a key issue is their reliance\non noisy and inconsistent datasets for training, which impedes their ability to\ngeneralize to more complex retrieval scenarios. To address these limitations,\nwe introduce CoRNStack, a large-scale, high-quality contrastive training\ndataset for code that spans multiple programming languages. This dataset is\ncurated using consistency filtering to eliminate noisy positives and is further\nenriched with mined hard negatives, thereby facilitating more effective\nlearning. We demonstrate that contrastive training of embedding models using\nCoRNStack leads to state-of-the-art performance across a variety of code\nretrieval tasks. Furthermore, the dataset can be leveraged for training code\nreranking models, a largely underexplored area compared to text reranking. Our\nfinetuned code reranking model significantly improves the ranking quality over\nthe retrieved results. Finally, by employing our code retriever and reranker\ntogether, we demonstrate significant improvements in function localization for\nGitHub issues, an important component of real-world software development.\n","authors":["Tarun Suresh","Revanth Gangi Reddy","Yifei Xu","Zach Nussbaum","Andriy Mulyar","Brandon Duderstadt","Heng Ji"],"pdf_url":"https://arxiv.org/pdf/2412.01007v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03557v1","updated":"2024-12-04T18:52:32Z","published":"2024-12-04T18:52:32Z","title":"Freshness and Informativity Weighted Cognitive Extent and Its\n Correlation with Cumulative Citation Count","summary":" In this paper, we revisit cognitive extent, originally defined as the number\nof unique phrases in a quota. We introduce Freshness and Informative Weighted\nCognitive Extent (FICE), calculated based on two novel weighting factors, the\nlifetime ratio and informativity of scientific entities. We model the lifetime\nof each scientific entity as the time-dependent document frequency, which is\nfit by the composition of multiple Gaussian profiles. The lifetime ratio is\nthen calculated as the cumulative document frequency at the publication time\n$t_0$ divided by the cumulative document frequency over its entire lifetime.\nThe informativity is calculated by normalizing the document frequency across\nall scientific entities recognized in a title. Using the ACL Anthology, we\nverified the trend formerly observed in several other domains that the number\nof unique scientific entities per quota increased gradually at a slower rate.\nWe found that FICE exhibits a strong correlation with the average cumulative\ncitation count within a quota. Our code is available at\n\\href{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}\n","authors":["Zihe Wang","Jian Wu"],"pdf_url":"https://arxiv.org/pdf/2412.03557v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03465v1","updated":"2024-12-04T16:54:58Z","published":"2024-12-04T16:54:58Z","title":"YT-30M: A multi-lingual multi-category dataset of YouTube comments","summary":" This paper introduces two large-scale multilingual comment datasets, YT-30M\n(and YT-100K) from YouTube. 
The analysis in this paper is performed on a\nsmaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and\nYT-100K (a randomly selected 100K sample from YT-30M), are publicly released for\nfurther research. YT-30M (YT-100K) contains 32236173 (108694) comments posted\nby YouTube channels that belong to YouTube categories. Each comment is\nassociated with a video ID, comment ID, commentor name, commentor channel ID,\ncomment text, upvotes, original channel ID and category of the YouTube channel\n(e.g., 'News & Politics', 'Science & Technology', etc.).\n","authors":["Hridoy Sankar Dutta"],"pdf_url":"https://arxiv.org/pdf/2412.03465v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03620v1","updated":"2024-12-04T15:03:47Z","published":"2024-12-04T15:03:47Z","title":"Recommender Systems for Sustainability: Overview and Research Issues","summary":" Sustainable Development Goals (SDGs) are regarded as a universal call to\naction with the overall objectives of protecting the planet, ending poverty, and\nensuring peace and prosperity for all people. In order to achieve these\nobjectives, different AI technologies play a major role. Specifically,\nrecommender systems can provide support for organizations and individuals to\nachieve the defined goals. Recommender systems integrate AI technologies such\nas machine learning, explainable AI (XAI), case-based reasoning, and constraint\nsolving in order to find and explain user-relevant alternatives from a\npotentially large set of options. In this article, we summarize the state of\nthe art in applying recommender systems to support the achievement of\nsustainable development goals. In this context, we discuss open issues for\nfuture research.\n","authors":["Alexander Felfernig","Manfred Wundara","Thi Ngoc Trang Tran","Seda Polat-Erdeniz","Sebastian Lubos","Merfat El-Mansi","Damian Garber","Viet-Man Le"],"pdf_url":"https://arxiv.org/pdf/2412.03620v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03193v1","updated":"2024-12-04T10:27:35Z","published":"2024-12-04T10:27:35Z","title":"Beyond Questions: Leveraging ColBERT for Keyphrase Search","summary":" While question-like queries are gaining popularity and search engines' users\nincreasingly adopt them, keyphrase search has traditionally been the\ncornerstone of web search. This query type is also prevalent in specialised\nsearch tasks such as academic or professional search, where experts rely on\nkeyphrases to articulate their information needs. However, current dense\nretrieval models often fail with keyphrase-like queries, primarily because they\nare mostly trained on question-like ones. This paper introduces a novel model\nthat employs the ColBERT architecture to enhance document ranking for keyphrase\nqueries. For that, given the lack of large keyphrase-based retrieval datasets,\nwe first explore how Large Language Models can convert question-like queries\ninto keyphrase format. Then, using those keyphrases, we train a keyphrase-based\nColBERT ranker (ColBERTKP_QD) to improve the performance when working with\nkeyphrase queries. Furthermore, to reduce the training costs associated with\ntraining the full ColBERT model, we investigate the feasibility of training\nonly a keyphrase query encoder while keeping the document encoder weights\nstatic (ColBERTKP_Q). We assess our proposals' ranking performance using both\nautomatically generated and manually annotated keyphrases. 
Our results reveal\nthe potential of the late interaction architecture when working under the\nkeyphrase search scenario.\n","authors":["Jorge Gabín","Javier Parapar","Craig Macdonald"],"pdf_url":"https://arxiv.org/pdf/2412.03193v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03097v1","updated":"2024-12-04T07:50:27Z","published":"2024-12-04T07:50:27Z","title":"Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing","summary":" This paper addresses key challenges in enhancing recommendation systems by\nleveraging Graph Neural Networks (GNNs) and addressing inherent limitations\nsuch as over-smoothing, which reduces model effectiveness as network hierarchy\ndeepens. The proposed approach introduces three GNN-based recommendation\nmodels, specifically designed to mitigate over-smoothing through innovative\nmechanisms like residual connections and identity mapping within the\naggregation propagation process. These modifications enable more effective\ninformation flow across layers, preserving essential user-item interaction\ndetails to improve recommendation accuracy. Additionally, the study emphasizes\nthe critical need for interpretability in recommendation systems, aiming to\nprovide transparent and justifiable suggestions tailored to dynamic user\npreferences. By integrating collaborative filtering with GNN architectures, the\nproposed models not only enhance predictive accuracy but also align\nrecommendations more closely with individual behaviors, adapting to nuanced\nshifts in user interests. This work advances the field by tackling both\ntechnical and user-centric challenges, contributing to the development of\nrobust and explainable recommendation systems capable of managing the\ncomplexity and scale of modern online environments.\n","authors":["Wenyi Liu","Ziqi Zhang","Xinshi Li","Jiacheng Hu","Yuanshuai Luo","Junliang Du"],"pdf_url":"https://arxiv.org/pdf/2412.03097v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2408.11646v2","updated":"2024-12-04T04:52:24Z","published":"2024-08-21T14:17:24Z","title":"Mathematical Information Retrieval: Search and Question Answering","summary":" Mathematical information is essential for technical work, but its creation,\ninterpretation, and search are challenging. To help address these challenges,\nresearchers have developed multimodal search engines and mathematical question\nanswering systems. This book begins with a simple framework characterizing the\ninformation tasks that people and systems perform as we work to answer\nmath-related questions. The framework is used to organize and relate the other\ncore topics of the book, including interactions between people and systems,\nrepresenting math formulas in sources, and evaluation. We close by addressing\nsome key questions and presenting directions for future work. This book is\nintended for students, instructors, and researchers interested in systems that\nhelp us find and use mathematical information.\n","authors":["Richard Zanibbi","Behrooz Mansouri","Anurag Agarwal"],"pdf_url":"https://arxiv.org/pdf/2408.11646v2.pdf","comment":"[DRAFT] Revised (2nd) draft"},{"id":"http://arxiv.org/abs/2412.02996v1","updated":"2024-12-04T03:29:56Z","published":"2024-12-04T03:29:56Z","title":"CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D\n Design Datasets","summary":" Three-dimensional (3D) objects have wide applications. 
Despite the growing\ninterest in 3D modeling in academia and industries, designing and/or creating\n3D objects from scratch remains time-consuming and challenging. With the\ndevelopment of generative artificial intelligence (AI), designers have discovered a\nnew way to create images for ideation. However, generative AIs are less useful\nin creating 3D objects with satisfactory quality. To allow 3D designers to\naccess a wide range of 3D objects for creative activities based on their\nspecific demands, we propose a machine learning (ML) enhanced framework CLAS -\nnamed after its four steps of capture, label, associate, and search - to enable\nfully automatic retrieval of 3D objects based on user specifications, leveraging\nexisting datasets of 3D objects. CLAS provides an effective and efficient\nmethod for any person or organization to benefit from their existing but not\nyet utilized 3D datasets. In addition, CLAS may also be used to produce\nhigh-quality 3D object synthesis datasets for training and evaluating 3D\ngenerative models. As a proof of concept, we created and showcased a search\nsystem with a web user interface (UI) for retrieving 6,778 3D objects of chairs\nin the ShapeNet dataset powered by CLAS. In a closed-set retrieval setting, our\nretrieval method achieves a mean reciprocal rank (MRR) of 0.58, top 1 accuracy\nof 42.27%, and top 10 accuracy of 89.64%.\n","authors":["XiuYu Zhang","Xiaolei Ye","Jui-Che Chang","Yue Fang"],"pdf_url":"https://arxiv.org/pdf/2412.02996v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2411.08307v2","updated":"2024-12-04T22:02:25Z","published":"2024-11-13T03:14:10Z","title":"PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for\n Long-Term Expressive Symbolic Music Generation","summary":" AI-based music generation has progressed significantly in recent years.\nHowever, creating symbolic music that is both long-structured and expressive\nremains a considerable challenge. In this paper, we propose PerceiverS\n(Segmentation and Scale), a novel architecture designed to address this issue\nby leveraging both Effective Segmentation and Multi-Scale attention mechanisms.\nOur approach enhances symbolic music generation by simultaneously learning\nlong-term structural dependencies and short-term expressive details. By\ncombining cross-attention and self-attention in a Multi-Scale setting,\nPerceiverS captures long-range musical structure while preserving musical\ndiversity. The proposed model has been evaluated using the Maestro dataset and\nhas demonstrated improvements in generating music of conventional length with\nexpressive nuances. The project demos and the generated music samples can be\naccessed through the link: https://perceivers.github.io\n","authors":["Yungang Yi","Weihua Li","Matthew Kuo","Quan Bai"],"pdf_url":"https://arxiv.org/pdf/2411.08307v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.03665v1","updated":"2024-12-04T19:01:06Z","published":"2024-12-04T19:01:06Z","title":"Personalizing Multimodal Large Language Models for Image Captioning: An\n Experimental Analysis","summary":" The task of image captioning demands an algorithm to generate natural\nlanguage descriptions of visual inputs. Recent advancements have seen a\nconvergence between image captioning research and the development of Large\nLanguage Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which\nextend the capabilities of text-only LLMs to multiple modalities. 
This paper\ninvestigates whether Multimodal LLMs can supplant traditional image captioning\nnetworks by evaluating their performance on various image description\nbenchmarks. We explore both the zero-shot capabilities of these models and\ntheir adaptability to different semantic domains through fine-tuning methods,\nincluding prompt learning, prefix tuning, and low-rank adaptation. Our results\ndemonstrate that while Multimodal LLMs achieve impressive zero-shot\nperformance, fine-tuning for specific domains while keeping their\ngeneralization capabilities intact remains challenging. We discuss the\nimplications of these findings for future research in image captioning and the\ndevelopment of more adaptable Multimodal LLMs.\n","authors":["Davide Bucciarelli","Nicholas Moratelli","Marcella Cornia","Lorenzo Baraldi","Rita Cucchiara"],"pdf_url":"https://arxiv.org/pdf/2412.03665v1.pdf","comment":"ECCV 2024 Workshop on Green Foundation Models"},{"id":"http://arxiv.org/abs/2412.03551v1","updated":"2024-12-04T18:49:26Z","published":"2024-12-04T18:49:26Z","title":"SPICE: Smart Projection Interface for Cooking Enhancement","summary":" Tangible User Interfaces (TUI) for human--computer interaction (HCI) provide\nthe user with physical representations of digital information with the aim of\novercoming the limitations of screen-based interfaces. Although many compelling\ndemonstrations of TUIs exist in the literature, there is a lack of research on\nTUIs intended for daily two-handed tasks and processes, such as cooking. In\nresponse to this gap, we propose SPICE (Smart Projection Interface for Cooking\nEnhancement). SPICE investigates TUIs in a kitchen setting, aiming to transform\nthe recipe-following experience from simply text-based to tangibly interactive.\nSPICE includes a tracking system, agent-based software, and vision large\nlanguage models to create and interpret a kitchen environment where recipe\ninformation is projected directly onto the cooking surface. We conducted a\ncomparative usability study of SPICE and text-based recipe following with 30\nparticipants, assessing the task difficulty, total duration, and efficiency, as\nwell as user confidence and taste perception. The results indicate that SPICE\nallowed participants to perform the recipe with fewer stops and in a shorter time\nwhile also improving self-reported efficiency, confidence, and taste. Despite\nthis, participants self-reported no change in overall difficulty, which is a\ndirection for future research. Overall, the SPICE project demonstrates the\npotential of using TUIs to improve everyday activities, paving the way for\nfuture research in HCI and new computing interfaces.\n","authors":["Vera Prohaska","Eduardo Castelló Ferrer"],"pdf_url":"https://arxiv.org/pdf/2412.03551v1.pdf","comment":"Article submitted to IUI 2025"},{"id":"http://arxiv.org/abs/2412.01064v2","updated":"2024-12-04T09:43:18Z","published":"2024-12-02T02:50:07Z","title":"FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking\n Portrait","summary":" With the rapid advancement of diffusion-based generative models, portrait\nimage animation has achieved remarkable results. However, it still faces\nchallenges in temporally consistent video generation and fast sampling due to\nits iterative sampling nature. This paper presents FLOAT, an audio-driven\ntalking portrait video generation method based on a flow matching generative\nmodel. 
We shift the generative modeling from the pixel-based latent space to a\nlearned motion latent space, enabling efficient design of temporally consistent\nmotion. To achieve this, we introduce a transformer-based vector field\npredictor with a simple yet effective frame-wise conditioning mechanism.\nAdditionally, our method supports speech-driven emotion enhancement, enabling a\nnatural incorporation of expressive motions. Extensive experiments demonstrate\nthat our method outperforms state-of-the-art audio-driven talking portrait\nmethods in terms of visual quality, motion fidelity, and efficiency.\n","authors":["Taekyung Ki","Dongchan Min","Gyeongsu Chae"],"pdf_url":"https://arxiv.org/pdf/2412.01064v2.pdf","comment":"Project page: https://deepbrainai-research.github.io/float/"},{"id":"http://arxiv.org/abs/2406.00758v3","updated":"2024-12-04T09:36:56Z","published":"2024-06-02T14:22:09Z","title":"Once-for-All: Controllable Generative Image Compression with Dynamic\n Granularity Adaption","summary":" Although recent generative image compression methods have demonstrated\nimpressive potential in optimizing the rate-distortion-perception trade-off,\nthey still face the critical challenge of flexible rate adaption to diverse\ncompression necessities and scenarios. To overcome this challenge, this paper\nproposes a Controllable Generative Image Compression framework, termed\nControl-GIC, the first capable of fine-grained bitrate adaption across a broad\nspectrum while ensuring high-fidelity and generality compression. Control-GIC\nis grounded in a VQGAN framework that encodes an image as a sequence of\nvariable-length codes (i.e. VQ-indices), which can be losslessly compressed and\nexhibits a direct positive correlation with the bitrates. Drawing inspiration\nfrom the classical coding principle, we correlate the information density of\nlocal image patches with their granular representations. Hence, we can flexibly\ndetermine a proper allocation of granularity for the patches to achieve dynamic\nadjustment for VQ-indices, resulting in desirable compression rates. We further\ndevelop a probabilistic conditional decoder capable of retrieving historic\nencoded multi-granularity representations according to transmitted codes, and\nthen reconstruct hierarchical granular features in the formalization of\nconditional probability, enabling more informative aggregation to improve\nreconstruction realism. Our experiments show that Control-GIC allows highly\nflexible and controllable bitrate adaption where the results demonstrate its\nsuperior performance over recent state-of-the-art methods.\n","authors":["Anqi Li","Feng Li","Yuxi Liu","Runmin Cong","Yao Zhao","Huihui Bai"],"pdf_url":"https://arxiv.org/pdf/2406.00758v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2404.06220v2","updated":"2024-12-04T01:47:08Z","published":"2024-04-09T11:14:45Z","title":"Zero-Shot Relational Learning for Multimodal Knowledge Graphs","summary":" Relational learning is an essential task in the domain of knowledge\nrepresentation, particularly in knowledge graph completion (KGC). While\nrelational learning in traditional single-modal settings has been extensively\nstudied, exploring it within a multimodal KGC context presents distinct\nchallenges and opportunities. One of the major challenges is inference on newly\ndiscovered relations without any associated training data. 
This zero-shot\nrelational learning scenario poses unique requirements for multimodal KGC,\ni.e., utilizing multimodality to facilitate relational learning. However,\nexisting works fail to leverage multimodal information and leave\nthe problem unexplored. In this paper, we propose a novel end-to-end framework,\nconsisting of three components, i.e., a multimodal learner, a structure\nconsolidator, and a relation embedding generator, to integrate diverse multimodal\ninformation and knowledge graph structures to facilitate zero-shot\nrelational learning. Evaluation results on three multimodal knowledge graphs\ndemonstrate the superior performance of our proposed method.\n","authors":["Rui Cai","Shichao Pei","Xiangliang Zhang"],"pdf_url":"https://arxiv.org/pdf/2404.06220v2.pdf","comment":"In the Proceedings of the 2024 IEEE International Conference on Big\n Data (IEEE BigData 2024)"},{"id":"http://arxiv.org/abs/2412.02946v1","updated":"2024-12-04T01:23:57Z","published":"2024-12-04T01:23:57Z","title":"Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large\n Vision-Language Model via Causality Analysis","summary":" Recent advancements in large vision-language models (LVLM) have significantly\nenhanced their ability to comprehend visual inputs alongside natural language.\nHowever, a major challenge in their real-world application is hallucination,\nwhere LVLMs generate non-existent visual elements, eroding user trust. The\nunderlying mechanism driving this multimodal hallucination is poorly\nunderstood. Little research has examined whether contexts such as sky,\ntree, or grass field lead the LVLM to hallucinate a frisbee. We\nhypothesize that hidden factors, such as objects, contexts, and semantic\nforeground-background structures, induce hallucination. This study proposes a\nnovel causal approach: a hallucination probing system to identify these hidden\nfactors. By analyzing the causality between images, text prompts, and network\nsaliency, we systematically explore interventions to block these factors. Our\nexperimental findings show that a straightforward technique based on our\nanalysis can significantly reduce hallucinations. Additionally, our analyses\nindicate the potential to edit network internals to minimize hallucinated\noutputs.\n","authors":["Po-Hsuan Huang","Jeng-Lin Li","Chin-Po Chen","Ming-Ching Chang","Wei-Chao Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02946v1.pdf","comment":"Accepted by WACV2025"}]},"2024-12-03T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.04506v1","updated":"2024-12-03T22:59:36Z","published":"2024-12-03T22:59:36Z","title":"Arctic-Embed 2.0: Multilingual Retrieval Without Compromise","summary":" This paper presents the training methodology of Arctic-Embed 2.0, a set of\nopen-source text embedding models built for accurate and efficient multilingual\nretrieval. While prior works have suffered from degraded English retrieval\nquality, Arctic-Embed 2.0 delivers competitive retrieval quality on\nmultilingual and English-only benchmarks, and supports Matryoshka\nRepresentation Learning (MRL) for efficient embedding storage with\nsignificantly lower compressed quality degradation compared to alternatives. We\ndetail the design and implementation, presenting several important open\nresearch questions that arose during model development. 
We conduct experiments\nexploring these research questions and include extensive discussion aimed at\nfostering further discussion in this field.\n","authors":["Puxuan Yu","Luke Merrick","Gaurav Nuti","Daniel Campos"],"pdf_url":"https://arxiv.org/pdf/2412.04506v1.pdf","comment":"10 pages, 5 figures, 3 tables"},{"id":"http://arxiv.org/abs/2409.11598v2","updated":"2024-12-03T22:23:53Z","published":"2024-09-17T23:10:04Z","title":"Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented\n Generation","summary":" Many language models now enhance their responses with retrieval capabilities,\nleading to the widespread adoption of retrieval-augmented generation (RAG)\nsystems. However, despite retrieval being a core component of RAG, much of the\nresearch in this area overlooks the extensive body of work on fair ranking,\nneglecting the importance of considering all stakeholders involved. This paper\npresents the first systematic evaluation of RAG systems integrated with fair\nrankings. We focus specifically on measuring the fair exposure of each relevant\nitem across the rankings utilized by RAG systems (i.e., item-side fairness),\naiming to promote equitable growth for relevant item providers. To gain a deep\nunderstanding of the relationship between item-fairness, ranking quality, and\ngeneration quality in the context of RAG, we analyze nine different RAG systems\nthat incorporate fair rankings across seven distinct datasets. Our findings\nindicate that RAG systems with fair rankings can maintain a high level of\ngeneration quality and, in many cases, even outperform traditional RAG systems,\ndespite the general trend of a tradeoff between ensuring fairness and\nmaintaining system-effectiveness. We believe our insights lay the groundwork\nfor responsible and equitable RAG systems and open new avenues for future\nresearch. We publicly release our codebase and dataset at\nhttps://github.com/kimdanny/Fair-RAG.\n","authors":["To Eun Kim","Fernando Diaz"],"pdf_url":"https://arxiv.org/pdf/2409.11598v2.pdf","comment":"Top 5 Spotlight at AFME Workshop at NeurIPS 2024"},{"id":"http://arxiv.org/abs/2412.02835v1","updated":"2024-12-03T21:00:10Z","published":"2024-12-03T21:00:10Z","title":"CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural\n Networks","summary":" We present CAISSON, a novel hierarchical approach to Retrieval-Augmented\nGeneration (RAG) that transforms traditional single-vector search into a\nmulti-view clustering framework. At its core, CAISSON leverages dual\nSelf-Organizing Maps (SOMs) to create complementary organizational views of the\ndocument space, where each view captures different aspects of document\nrelationships through specialized embeddings. The first view processes combined\ntext and metadata embeddings, while the second operates on metadata enriched\nwith concept embeddings, enabling a comprehensive multi-view analysis that\ncaptures both fine-grained semantic relationships and high-level conceptual\npatterns. This dual-view approach enables more nuanced document discovery by\ncombining evidence from different organizational perspectives. To evaluate\nCAISSON, we develop SynFAQA, a framework for generating synthetic financial\nanalyst notes and question-answer pairs that systematically tests different\naspects of information retrieval capabilities. 
Drawing on HotPotQA's\nmethodology for constructing multi-step reasoning questions, SynFAQA generates\ncontrolled test cases where each question is paired with the set of notes\ncontaining its ground-truth answer, progressing from simple single-entity\nqueries to complex multi-hop retrieval tasks involving multiple entities and\nconcepts. Our experimental results demonstrate substantial improvements over\nboth basic and enhanced RAG implementations, particularly for complex\nmulti-entity queries, while maintaining practical response times suitable for\ninteractive applications.\n","authors":["Igor Halperin"],"pdf_url":"https://arxiv.org/pdf/2412.02835v1.pdf","comment":"26 pages, 7 figures, 8 tables"},{"id":"http://arxiv.org/abs/2412.02588v1","updated":"2024-12-03T17:17:27Z","published":"2024-12-03T17:17:27Z","title":"Explainable CTR Prediction via LLM Reasoning","summary":" Recommendation Systems have become integral to modern user experiences, but\nlack transparency in their decision-making processes. Existing explainable\nrecommendation methods are hindered by reliance on a post-hoc paradigm, wherein\nexplanation generators are trained independently of the underlying recommender\nmodels. This paradigm necessitates substantial human effort in data\nconstruction and raises concerns about explanation reliability. In this paper,\nwe present ExpCTR, a novel framework that integrates large language model based\nexplanation generation directly into the CTR prediction process. Inspired by\nrecent advances in reinforcement learning, we employ two carefully designed\nreward mechanisms, LC alignment, which ensures explanations reflect user\nintentions, and IC alignment, which maintains consistency with traditional\nID-based CTR models. Our approach incorporates an efficient training paradigm\nwith LoRA and a three-stage iterative process. ExpCTR circumvents the need for\nextensive explanation datasets while fostering synergy between CTR prediction\nand explanation generation. Experimental results demonstrate that ExpCTR\nsignificantly enhances both recommendation accuracy and interpretability across\nthree real-world datasets.\n","authors":["Xiaohan Yu","Li Zhang","Chong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.02588v1.pdf","comment":"WSDM 2025"},{"id":"http://arxiv.org/abs/2412.00430v2","updated":"2024-12-03T15:43:49Z","published":"2024-11-30T10:56:30Z","title":"Predictive Models in Sequential Recommendations: Bridging Performance\n Laws with Data Quality Insights","summary":" Sequential Recommendation (SR) plays a critical role in predicting users'\nsequential preferences. Despite its growing prominence in various industries,\nthe increasing scale of SR models incurs substantial computational costs and\nunpredictability, challenging developers to manage resources efficiently. Under\nthis predicament, Scaling Laws have achieved significant success by examining\nthe loss as models scale up. However, there remains a disparity between loss\nand model performance, which is of greater concern in practical applications.\nMoreover, as data continues to expand, it incorporates repetitive and\ninefficient data. In response, we introduce the Performance Law for SR models,\nwhich aims to theoretically investigate and model the relationship between\nmodel performance and data quality. Specifically, we first fit the HR and NDCG\nmetrics to transformer-based SR models. 
Subsequently, we propose Approximate\nEntropy (ApEn) to assess data quality, presenting a more nuanced approach\ncompared to traditional data quantity metrics. Our method enables accurate\npredictions across various dataset scales and model sizes, demonstrating a\nstrong correlation in large SR models and offering insights into achieving\noptimal performance for any given model configuration.\n","authors":["Tingjia Shen","Hao Wang","Chuhan Wu","Jin Yao Chin","Wei Guo","Yong Liu","Huifeng Guo","Defu Lian","Ruiming Tang","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00430v2.pdf","comment":"12 pages, 5 figures"},{"id":"http://arxiv.org/abs/2310.09401v4","updated":"2024-12-03T14:10:01Z","published":"2023-10-13T20:53:50Z","title":"A Novel Approach to Comprehending Users' Preferences for Accurate\n Personalized News Recommendation","summary":" Personalized news recommendation aims to assist users in finding news\narticles that align with their interests, which plays a pivotal role in\nmitigating users' information overload problem. Although many recent works have\nbeen proposed for better personalized news recommendation, the following\nchallenges should be explored further: (C1) Comprehending manifold intents coupled\nwithin a news article, (C2) Differentiating varying post-read preferences of\nnews articles, and (C3) Addressing the cold-start user problem. To tackle the\naforementioned challenges together, in this paper, we propose a novel\npersonalized news recommendation framework (CROWN) that employs (1)\ncategory-guided intent disentanglement for (C1), (2) consistency-based news\nrepresentation for (C2), and (3) GNN-enhanced hybrid user representation for\n(C3). Furthermore, we incorporate a category prediction into the training\nprocess of CROWN as an auxiliary task, which provides supplementary supervisory\nsignals to enhance intent disentanglement. Extensive experiments on two\nreal-world datasets reveal that (1) CROWN provides consistent performance\nimprovements over ten state-of-the-art news recommendation methods and (2) the\nproposed strategies significantly improve the accuracy of CROWN.\n","authors":["Yunyong Ko","Seongeun Ryu","Sang-Wook Kim"],"pdf_url":"https://arxiv.org/pdf/2310.09401v4.pdf","comment":"10 pages, 6 figures, 8 tables"},{"id":"http://arxiv.org/abs/2412.02415v1","updated":"2024-12-03T12:20:56Z","published":"2024-12-03T12:20:56Z","title":"Knowledge-Enhanced Conversational Recommendation via Transformer-based\n Sequential Modelling","summary":" In conversational recommender systems (CRSs), conversations usually involve a\nset of items and item-related entities or attributes, e.g., director is a\nrelated entity of a movie. These items and item-related entities are often\nmentioned along the development of a dialog, leading to potential sequential\ndependencies among them. However, most existing CRSs neglect these potential\nsequential dependencies. In this article, we first propose a Transformer-based\nsequential conversational recommendation method, named TSCR, to model the\nsequential dependencies in the conversations to improve CRS. In TSCR, we\nrepresent conversations by items and the item-related entities, and construct\nuser sequences to discover user preferences by considering both the mentioned\nitems and item-related entities. Based on the constructed sequences, we deploy\na Cloze task to predict the recommended items along a sequence. 
Meanwhile, in\ncertain domains, knowledge graphs formed by the items and their related\nentities are readily available, which provide various kinds of\nassociations among them. Given that TSCR does not benefit from such knowledge\ngraphs, we then propose a knowledge graph enhanced version of TSCR, called\nTSCRKG. Specifically, we leverage the knowledge graph to initialize our\nmodel TSCRKG offline, and augment the user sequence of conversations (i.e., the sequence of\nmentioned items and item-related entities in the conversation) with\nmulti-hop paths in the knowledge graph. Experimental results demonstrate that\nour TSCR model significantly outperforms state-of-the-art baselines, and the\nenhanced version TSCRKG further improves recommendation performance on top of\nTSCR.\n","authors":["Jie Zou","Aixin Sun","Cheng Long","Evangelos Kanoulas"],"pdf_url":"https://arxiv.org/pdf/2412.02415v1.pdf","comment":"Accepted by ACM TOIS"},{"id":"http://arxiv.org/abs/2412.02310v1","updated":"2024-12-03T09:27:46Z","published":"2024-12-03T09:27:46Z","title":"Active Learning via Classifier Impact and Greedy Selection for\n Interactive Image Retrieval","summary":" Active Learning (AL) is a user-interactive approach aimed at reducing\nannotation costs by selecting the most crucial examples to label. Although AL\nhas been extensively studied for image classification tasks, the specific\nscenario of interactive image retrieval has received relatively little\nattention. This scenario presents unique characteristics, including an open-set\nand class-imbalanced binary classification, starting with very few labeled\nsamples. We introduce a novel batch-mode Active Learning framework named GAL\n(Greedy Active Learning) that better copes with this application. It\nincorporates a new acquisition function for sample selection that measures the\nimpact of each unlabeled sample on the classifier. We further embed this\nstrategy in a greedy selection approach, better exploiting the samples within\neach batch. We evaluate our framework with both linear (SVM) and non-linear\nMLP/Gaussian Process classifiers. For the Gaussian Process case, we show a\ntheoretical guarantee on the greedy approximation. Finally, we assess our\nperformance for the interactive content-based image retrieval task on several\nbenchmarks and demonstrate its superiority over existing approaches and common\nbaselines. Code is available at https://github.com/barleah/GreedyAL.\n","authors":["Leah Bar","Boaz Lerner","Nir Darshan","Rami Ben-Ari"],"pdf_url":"https://arxiv.org/pdf/2412.02310v1.pdf","comment":"Accepted to Transactions on Machine Learning Research (TMLR)"},{"id":"http://arxiv.org/abs/2412.02295v1","updated":"2024-12-03T09:09:52Z","published":"2024-12-03T09:09:52Z","title":"CADMR: Cross-Attention and Disentangled Learning for Multimodal\n Recommender Systems","summary":" The increasing availability and diversity of multimodal data in recommender\nsystems offer new avenues for enhancing recommendation accuracy and user\nsatisfaction. However, these systems must contend with high-dimensional, sparse\nuser-item rating matrices, where reconstructing the matrix with only small\nsubsets of preferred items for each user poses a significant challenge. To\naddress this, we propose CADMR, a novel autoencoder-based multimodal\nrecommender system framework. CADMR leverages multi-head cross-attention\nmechanisms and Disentangled Learning to effectively integrate and utilize\nheterogeneous multimodal data in reconstructing the rating matrix. 
Our approach\nfirst disentangles modality-specific features while preserving their\ninterdependence, thereby learning a joint latent representation. The multi-head\ncross-attention mechanism is then applied to enhance user-item interaction\nrepresentations with respect to the learned multimodal item latent\nrepresentations. We evaluate CADMR on three benchmark datasets, demonstrating\nsignificant performance improvements over state-of-the-art methods.\n","authors":["Yasser Khalafaoui","Martino Lovisetto","Basarab Matei","Nistor Grozavu"],"pdf_url":"https://arxiv.org/pdf/2412.02295v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02290v1","updated":"2024-12-03T09:07:13Z","published":"2024-12-03T09:07:13Z","title":"Characterizing Information Shared by Participants to Coding Challenges:\n The Case of Advent of Code","summary":" Advent of Code (AoC from now on) is a popular coding challenge requiring to\nsolve programming puzzles for a variety of skill sets and levels. AoC follows\nthe advent calendar, therefore it is an annual challenge that lasts for 25\ndays. AoC participants usually post their solutions on social networks and\ndiscuss them online. These challenges are interesting to study since they could\nhighlight the adoption of new tools, the evolution of the developer community,\nor the technological requirements of well-known companies. For these reasons,\nwe first create a dataset of the 2019-2021 AoC editions containing the\ndiscussion threads made on the subreddit {\\tt /r/adventofcode}. Then, we\npropose a model based on stream graphs to best study this context, where we\nrepresent its most important actors through time: participants, comments, and\nprogramming languages. Thanks to our model, we investigate user participation,\nadoption of new programming languages during a challenge and between two of\nthem, and resiliency of programming languages based on a Stack Overflow survey.\nWe find that the top-used programming languages are almost the same in the\nthree years, pointing out their importance. Moreover, participants tend to keep\nthe same programming language for the whole challenge, while the ones attending\ntwo AoCs usually change it in the next one. Finally, we observe interesting\nresults about the programming languages that are ``Popular'' or ``Loved''\naccording to the Stack Overflow survey. Firstly, these are the ones adopted for\nthe longest time in an AoC edition, thanks to which users have a high chance of\nreaching the end of the challenge. 
Secondly, they are the most chosen when a\nparticipant decides to change programming language during the same challenge.\n","authors":["Francesco Cauteruccio","Enrico Corradini","Luca Virgili"],"pdf_url":"https://arxiv.org/pdf/2412.02290v1.pdf","comment":"10 pages, 7 figures"},{"id":"http://arxiv.org/abs/2409.12161v2","updated":"2024-12-03T05:26:10Z","published":"2024-09-18T17:25:31Z","title":"Generalized compression and compressive search of large datasets","summary":" The Big Data explosion has necessitated the development of search algorithms\nthat scale sub-linearly in time and memory.\n While compression algorithms and search algorithms do exist independently,\nfew algorithms offer both, and those which do are domain-specific.\n We present panCAKES, a novel approach to compressive search, i.e., a way to\nperform $k$-NN and $\\rho$-NN search on compressed data while only decompressing\na small, relevant, portion of the data.\n panCAKES assumes the manifold hypothesis and leverages the low-dimensional\nstructure of the data to compress and search it efficiently.\n panCAKES is generic over any distance function for which the distance between\ntwo points is proportional to the memory cost of storing an encoding of one in\nterms of the other.\n This property holds for many widely-used distance functions, e.g. string edit\ndistances (Levenshtein, Needleman-Wunsch, etc.) and set dissimilarity measures\n(Jaccard, Dice, etc.).\n We benchmark panCAKES on a variety of datasets, including genomic, proteomic,\nand set data.\n We compare compression ratios to gzip, and search performance between the\ncompressed and uncompressed versions of the same dataset.\n panCAKES achieves compression ratios close to those of gzip, while offering\nsub-linear time performance for $k$-NN and $\\rho$-NN search.\n We conclude that panCAKES is an efficient, general-purpose algorithm for\nexact compressive search on large datasets that obey the manifold hypothesis.\n We provide an open-source implementation of panCAKES in the Rust programming\nlanguage.\n","authors":["Morgan E. Prior","Thomas Howard III","Emily Light","Najib Ishaq","Noah M. Daniels"],"pdf_url":"https://arxiv.org/pdf/2409.12161v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01269v2","updated":"2024-12-03T04:37:03Z","published":"2024-12-02T08:35:54Z","title":"CPRM: A LLM-based Continual Pre-training Framework for Relevance\n Modeling in Commercial Search","summary":" Relevance modeling between queries and items stands as a pivotal component in\ncommercial search engines, directly affecting the user experience. Given the\nremarkable achievements of large language models (LLMs) in various natural\nlanguage processing (NLP) tasks, LLM-based relevance modeling is gradually\nbeing adopted within industrial search systems. Nevertheless, foundational LLMs\nlack domain-specific knowledge and do not fully exploit the potential of\nin-context learning. Furthermore, structured item text remains underutilized,\nand there is a shortage in the supply of corresponding queries and background\nknowledge. We thereby propose CPRM (Continual Pre-training for Relevance\nModeling), a framework designed for the continual pre-training of LLMs to\naddress these issues. 
Our CPRM framework includes three modules: 1) employing\nboth queries and multi-field items to jointly pre-train for enhancing domain\nknowledge, 2) applying in-context pre-training, a novel approach where LLMs are\npre-trained on a sequence of related queries or items, and 3) conducting\nreading comprehension on items to produce associated domain knowledge and\nbackground information (e.g., generating summaries and corresponding queries)\nto further strengthen LLMs. Results on offline experiments and online A/B\ntesting demonstrate that our model achieves convincing performance compared to\nstrong baselines.\n","authors":["Kaixin Wu","Yixin Ji","Zeyuan Chen","Qiang Wang","Cunxiang Wang","Hong Liu","Baijun Ji","Jia Xu","Zhongyi Liu","Jinjie Gu","Yuan Zhou","Linjian Mo"],"pdf_url":"https://arxiv.org/pdf/2412.01269v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02155v1","updated":"2024-12-03T04:29:27Z","published":"2024-12-03T04:29:27Z","title":"CausalMob: Causal Human Mobility Prediction with LLMs-derived Human\n Intentions toward Public Events","summary":" Large-scale human mobility exhibits spatial and temporal patterns that can\nassist policymakers in decision making. Although traditional prediction models\nattempt to capture these patterns, they are often disrupted by non-periodic public\nevents, such as disasters and occasional celebrations. Since regular human\nmobility patterns are heavily affected by these events, estimating their causal\neffects is critical to accurate mobility predictions. Although news articles\nprovide unique perspectives on these events in an unstructured format,\nprocessing them is a challenge. In this study, we propose a causality-augmented\nprediction model, called \textbf{CausalMob}, to analyze the causal effects of\npublic events. We first utilize large language models (LLMs) to extract human\nintentions from news articles and transform them into features that act as\ncausal treatments. Next, the model learns representations of spatio-temporal\nregional covariates from multiple data sources to serve as confounders for\ncausal inference. Finally, we present a causal effect estimation framework to\nensure event features remain independent of confounders during prediction.\nBased on large-scale real-world data, the experimental results show that the\nproposed model excels in human mobility prediction, outperforming\nstate-of-the-art models.\n","authors":["Xiaojie Yang","Hangli Ge","Jiawei Wang","Zipei Fan","Renhe Jiang","Ryosuke Shibasaki","Noboru Koshizuka"],"pdf_url":"https://arxiv.org/pdf/2412.02155v1.pdf","comment":"Accepted by KDD 2025"},{"id":"http://arxiv.org/abs/2412.02149v1","updated":"2024-12-03T04:09:36Z","published":"2024-12-03T04:09:36Z","title":"Leveraging Large Language Models for Comparative Literature\n Summarization with Reflective Incremental Mechanisms","summary":" In this paper, we introduce ChatCite, a novel method leveraging large\nlanguage models (LLMs) for generating comparative literature summaries. The\nability to summarize research papers with a focus on key comparisons between\nstudies is an essential task in academic research. Existing summarization\nmodels, while effective at generating concise summaries, fail to provide deep\ncomparative insights. ChatCite addresses this limitation by incorporating a\nmulti-step reasoning mechanism that extracts critical elements from papers,\nincrementally builds a comparative summary, and refines the output through a\nreflective memory process. 
We evaluate ChatCite on a custom dataset,\nCompLit-LongContext, consisting of 1000 research papers with annotated\ncomparative summaries. Experimental results show that ChatCite outperforms\nseveral baseline methods, including GPT-4, BART, T5, and CoT, across various\nautomatic evaluation metrics such as ROUGE and the newly proposed G-Score.\nHuman evaluation further confirms that ChatCite generates more coherent,\ninsightful, and fluent summaries compared to these baseline models. Our method\nprovides a significant advancement in automatic literature review generation,\noffering researchers a powerful tool for efficiently comparing and synthesizing\nscientific research.\n","authors":["Fernando Gabriela Garcia","Spencer Burns","Harrison Fuller"],"pdf_url":"https://arxiv.org/pdf/2412.02149v1.pdf","comment":"8 pages"},{"id":"http://arxiv.org/abs/2412.02142v1","updated":"2024-12-03T03:59:03Z","published":"2024-12-03T03:59:03Z","title":"Personalized Multimodal Large Language Models: A Survey","summary":" Multimodal Large Language Models (MLLMs) have become increasingly important\ndue to their state-of-the-art performance and ability to integrate multiple\ndata modalities, such as text, images, and audio, to perform complex tasks with\nhigh accuracy. This paper presents a comprehensive survey on personalized\nmultimodal large language models, focusing on their architecture, training\nmethods, and applications. We propose an intuitive taxonomy for categorizing\nthe techniques used to personalize MLLMs to individual users, and discuss the\ntechniques accordingly. Furthermore, we discuss how such techniques can be\ncombined or adapted when appropriate, highlighting their advantages and\nunderlying rationale. We also provide a succinct summary of personalization\ntasks investigated in existing research, along with the evaluation metrics\ncommonly used. Additionally, we summarize the datasets that are useful for\nbenchmarking personalized MLLMs. Finally, we outline critical open challenges.\nThis survey aims to serve as a valuable resource for researchers and\npractitioners seeking to understand and advance the development of personalized\nmultimodal large language models.\n","authors":["Junda Wu","Hanjia Lyu","Yu Xia","Zhehao Zhang","Joe Barrow","Ishita Kumar","Mehrnoosh Mirtaheri","Hongjie Chen","Ryan A. Rossi","Franck Dernoncourt","Tong Yu","Ruiyi Zhang","Jiuxiang Gu","Nesreen K. Ahmed","Yu Wang","Xiang Chen","Hanieh Deilamsalehy","Namyong Park","Sungchul Kim","Huanrui Yang","Subrata Mitra","Zhengmian Hu","Nedim Lipka","Dang Nguyen","Yue Zhao","Jiebo Luo","Julian McAuley"],"pdf_url":"https://arxiv.org/pdf/2412.02142v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02122v1","updated":"2024-12-03T03:20:40Z","published":"2024-12-03T03:20:40Z","title":"Improving Sequential Recommender Systems with Online and In-store User\n Behavior","summary":" Online e-commerce platforms have been extending in-store shopping, which\nallows users to keep the canonical online browsing and checkout experience\nwhile exploring in-store shopping. However, the growing transition between\nonline and in-store becomes a challenge to sequential recommender systems for\nfuture online interaction prediction due to the lack of holistic modeling of\nhybrid user behaviors (online and in-store). The challenges are twofold. First,\ncombining online and in-store user behavior data into a single data schema and\nsupporting multiple stages in the model life cycle (pre-training, training,\ninference, etc.) 
organically needs a new data pipeline design. Second, online\nrecommender systems, which solely rely on online user behavior sequences, must\nbe redesigned to support online and in-store user data as input under the\nsequential modeling setting. To overcome the first challenge, we propose a\nhybrid, omnichannel data pipeline to compile online and in-store user behavior\ndata by caching information from diverse data sources. Later, we introduce a\nmodel-agnostic encoder module to the sequential recommender system to interpret\nthe user in-store transaction and augment the modeling capacity for better\nonline interaction prediction given the hybrid user behavior.\n","authors":["Luyi Ma","Aashika Padmanabhan","Anjana Ganesh","Shengwei Tang","Jiao Chen","Xiaohan Li","Lalitesh Morishetti","Kaushiki Nag","Malay Patel","Jason Cho","Sushant Kumar","Kannan Achan"],"pdf_url":"https://arxiv.org/pdf/2412.02122v1.pdf","comment":"6 pages, IEEE BigData 2024 Workshop"},{"id":"http://arxiv.org/abs/2412.02043v1","updated":"2024-12-03T00:01:48Z","published":"2024-12-03T00:01:48Z","title":"Future of Information Retrieval Research in the Age of Generative AI","summary":" In the fast-evolving field of information retrieval (IR), the integration of\ngenerative AI technologies such as large language models (LLMs) is transforming\nhow users search for and interact with information. Recognizing this paradigm\nshift at the intersection of IR and generative AI (IR-GenAI), a visioning\nworkshop supported by the Computing Community Consortium (CCC) was held in July\n2024 to discuss the future of IR in the age of generative AI. This workshop\nconvened 44 experts in information retrieval, natural language processing,\nhuman-computer interaction, and artificial intelligence from academia,\nindustry, and government to explore how generative AI can enhance IR and vice\nversa, and to identify the major challenges and opportunities in this rapidly\nadvancing field.\n This report contains a summary of discussions as potentially important\nresearch topics and contains a list of recommendations for academics, industry\npractitioners, institutions, evaluation campaigns, and funding agencies.\n","authors":["James Allan","Eunsol Choi","Daniel P. Lopresti","Hamed Zamani"],"pdf_url":"https://arxiv.org/pdf/2412.02043v1.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2409.08489v2","updated":"2024-12-03T23:17:44Z","published":"2024-09-13T02:32:10Z","title":"Resource-Efficient Reference-Free Evaluation of Audio Captions","summary":" To establish the trustworthiness of systems that automatically generate text\ncaptions for audio, images and video, existing reference-free metrics rely on\nlarge pretrained models which are impractical to accommodate in\nresource-constrained settings. To address this, we propose some metrics to\nelicit the model's confidence in its own generation. To assess how well these\nmetrics replace correctness measures that leverage reference captions, we test\ntheir calibration with correctness measures. We discuss why some of these\nconfidence metrics align better with certain correctness measures. Further, we\nprovide insight into why temperature scaling of confidence metrics is\neffective. 
Our main contribution is a suite of well-calibrated lightweight\nconfidence metrics for reference-free evaluation of captions in\nresource-constrained settings.\n","authors":["Rehana Mahfuz","Yinyi Guo","Erik Visser"],"pdf_url":"https://arxiv.org/pdf/2409.08489v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.02611v1","updated":"2024-12-03T17:41:23Z","published":"2024-12-03T17:41:23Z","title":"AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand\n Audio-Visual Information?","summary":" Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini\n1.5 Pro, and Reka Core, have expanded their capabilities to include vision and\naudio modalities. While these models demonstrate impressive performance across\na wide range of audio-visual applications, our proposed DeafTest reveals that\nMLLMs often struggle with simple tasks humans find trivial: 1) determining\nwhich of two sounds is louder, and 2) determining which of two sounds has a\nhigher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a\ncomprehensive audio-visual benchmark designed to assess whether those MLLMs can\ntruly understand the audio-visual information. This benchmark encompasses 4,555\ncarefully crafted problems, each incorporating text, visual, and audio\ncomponents. To successfully infer answers, models must effectively leverage\nclues from both visual and audio inputs. To ensure precise and objective\nevaluation of MLLM responses, we have structured the questions as\nmultiple-choice, eliminating the need for human evaluation or LLM-assisted\nassessment. We benchmark a series of closed-source and open-source models and\nsummarize the observations. By revealing the limitations of current models, we\naim to provide useful insight for future dataset collection and model\ndevelopment.\n","authors":["Kaixiong Gong","Kaituo Feng","Bohao Li","Yibing Wang","Mofan Cheng","Shijia Yang","Jiaming Han","Benyou Wang","Yutong Bai","Zhuoran Yang","Xiangyu Yue"],"pdf_url":"https://arxiv.org/pdf/2412.02611v1.pdf","comment":"Project page: https://av-odyssey.github.io/"},{"id":"http://arxiv.org/abs/2412.02575v1","updated":"2024-12-03T17:02:40Z","published":"2024-12-03T17:02:40Z","title":"Copy-Move Forgery Detection and Question Answering for Remote Sensing\n Image","summary":" This paper introduces the task of Remote Sensing Copy-Move Question Answering\n(RSCMQA). Unlike traditional Remote Sensing Visual Question Answering (RSVQA),\nRSCMQA focuses on interpreting complex tampering scenarios and inferring\nrelationships between objects. Based on the practical needs of national defense\nsecurity and land resource monitoring, we have developed an accurate and\ncomprehensive global dataset for remote sensing image copy-move question\nanswering, named RS-CMQA-2.1M. These images were collected from 29 different\nregions across 14 countries. Additionally, we have refined a balanced dataset,\nRS-CMQA-B, to address the long-standing issue of long-tail data in the remote\nsensing field. Furthermore, we propose a region-discriminative guided\nmultimodal CMQA model, which enhances the accuracy of answering questions about\ntampered images by leveraging prompt about the differences and connections\nbetween the source and tampered domains. Extensive experiments demonstrate that\nour method provides a stronger benchmark for RS-CMQA compared to general VQA\nand RSVQA models. 
Our dataset and code are available at\nhttps://github.com/shenyedepisa/RSCMQA.\n","authors":["Ze Zhang","Enyuan Zhao","Ziyi Wan","Jie Nie","Xinyue Liang","Lei Huang"],"pdf_url":"https://arxiv.org/pdf/2412.02575v1.pdf","comment":"7 figs, 7 tables"},{"id":"http://arxiv.org/abs/2412.02419v1","updated":"2024-12-03T12:31:44Z","published":"2024-12-03T12:31:44Z","title":"It Takes Two: Real-time Co-Speech Two-person's Interaction Generation\n via Reactive Auto-regressive Diffusion Model","summary":" Conversational scenarios are very common in real-world settings, yet existing\nco-speech motion synthesis approaches often fall short in these contexts, where\none person's audio and gestures will influence the other's responses.\nAdditionally, most existing methods rely on offline sequence-to-sequence\nframeworks, which are unsuitable for online applications. In this work, we\nintroduce an audio-driven, auto-regressive system designed to synthesize\ndynamic movements for two characters during a conversation. At the core of our\napproach is a diffusion-based full-body motion synthesis model, which is\nconditioned on the past states of both characters, speech audio, and a\ntask-oriented motion trajectory input, allowing for flexible spatial control.\nTo enhance the model's ability to learn diverse interactions, we have enriched\nexisting two-person conversational motion datasets with more dynamic and\ninteractive motions. We evaluate our system through multiple experiments to\nshow it outperforms across a variety of tasks, including single and two-person\nco-speech motion generation, as well as interactive motion generation. To the\nbest of our knowledge, this is the first system capable of generating\ninteractive full-body motions for two characters from speech in an online\nmanner.\n","authors":["Mingyi Shi","Dafei Qin","Leo Ho","Zhouyingcheng Liao","Yinghao Huang","Junichi Yamagishi","Taku Komura"],"pdf_url":"https://arxiv.org/pdf/2412.02419v1.pdf","comment":"15 pages, 10 figures"}]},"2024-12-02T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2402.16886v2","updated":"2024-12-02T21:35:55Z","published":"2024-02-07T22:15:15Z","title":"Using text embedding models as text classifiers with medical data","summary":" The advent of Large Language Models (LLMs) is promising and LLMs have been\napplied to numerous fields. However, it is not trivial to implement LLMs in the\nmedical field, due to the high standards for precision and accuracy. Currently,\nthe diagnosis of medical ailments must be done by hand, as it is costly to\nbuild a sufficiently broad LLM that can diagnose a wide range of diseases.\nHere, we explore the use of vector databases and embedding models as a means of\nencoding and classifying text with medical text data without the need to train\na new model altogether. We used various LLMs to generate the medical data, then\nencoded the data with a text embedding model and stored it in a vector\ndatabase. We hypothesized that higher embedding dimensions coupled with\ndescriptive data in the vector database would lead to better classifications\nand designed a robustness test to test our hypothesis. By using vector\ndatabases and text embedding models to classify a clinician's notes on a\npatient presenting with a certain ailment, we showed that these tools can be\nsuccessful at classifying medical text data. We found that a higher embedding\ndimension did indeed yield better results, however, querying with simple data\nin the database was optimal for performance. 
We have shown in this study the\napplicability of text embedding models and vector databases on a small scale,\nand our work lays the groundwork for applying these tools on a larger scale.\n","authors":["Rishabh Goel"],"pdf_url":"https://arxiv.org/pdf/2402.16886v2.pdf","comment":"15 pages, 6 figures"},{"id":"http://arxiv.org/abs/2412.01985v1","updated":"2024-12-02T21:29:16Z","published":"2024-12-02T21:29:16Z","title":"Improving feature interactions at Pinterest under industry constraints","summary":" Adopting advances in recommendation systems is often challenging in\nindustrial settings due to unique constraints. This paper aims to highlight\nthese constraints through the lens of feature interactions. Feature\ninteractions are critical for accurately predicting user behavior in\nrecommendation systems and online advertising. Despite numerous novel\ntechniques showing superior performance on benchmark datasets like Criteo,\ntheir direct application in industrial settings is hindered by constraints such\nas model latency, GPU memory limitations and model reproducibility. In this\npaper, we share our learnings from improving feature interactions in\nPinterest's Homefeed ranking model under such constraints. We provide details\nabout the specific challenges encountered, the strategies employed to address\nthem, and the trade-offs made to balance performance with practical\nlimitations. Additionally, we present a set of learning experiments that help\nguide the feature interaction architecture selection. We believe these insights\nwill be useful for engineers who are interested in improving their model\nthrough better feature interaction learning.\n","authors":["Siddarth Malreddy","Matthew Lawhon","Usha Amrutha Nookala","Aditya Mantha","Dhruvil Deven Badani"],"pdf_url":"https://arxiv.org/pdf/2412.01985v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01979v1","updated":"2024-12-02T21:16:47Z","published":"2024-12-02T21:16:47Z","title":"FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph\n Attention Networks and Transformer Encoders","summary":" Missing data is a pervasive challenge in wireless networks and many other\ndomains, often compromising the performance of machine learning and deep\nlearning models. To address this, we propose a novel framework, FGATT, that\ncombines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder\nto perform robust and accurate data imputation. FGAT leverages fuzzy rough sets\nand graph attention mechanisms to capture spatial dependencies dynamically,\neven in scenarios where predefined spatial information is unavailable. The\nTransformer encoder is employed to model temporal dependencies, utilizing its\nself-attention mechanism to focus on significant time-series patterns. A\nself-adaptive graph construction method is introduced to enable dynamic\nconnectivity learning, ensuring the framework's applicability to a wide range\nof wireless datasets. Extensive experiments demonstrate that our approach\noutperforms state-of-the-art methods in imputation accuracy and robustness,\nparticularly in scenarios with substantial missing data. 
The proposed model is\nwell-suited for applications in wireless sensor networks and IoT environments,\nwhere data integrity is critical.\n","authors":["Jinming Xing","Ruilin Xing","Yan Sun"],"pdf_url":"https://arxiv.org/pdf/2412.01979v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01940v1","updated":"2024-12-02T20:04:06Z","published":"2024-12-02T20:04:06Z","title":"Down with the Hierarchy: The 'H' in HNSW Stands for \"Hubs\"","summary":" Driven by recent breakthrough advances in neural representation learning,\napproximate near-neighbor (ANN) search over vector embeddings has emerged as a\ncritical computational workload. With the introduction of the seminal\nHierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have\nestablished themselves as the overwhelmingly dominant paradigm for efficient and\nscalable ANN search. As the name suggests, HNSW searches a layered hierarchical\ngraph to quickly identify neighborhoods of similar points to a given query\nvector. But is this hierarchy even necessary? A rigorous experimental analysis\nto answer this question would provide valuable insights into the nature of\nalgorithm design for ANN search and motivate directions for future work in this\nincreasingly crucial domain. To that end, we conduct an extensive benchmarking\nstudy covering more large-scale datasets than prior investigations of this\nquestion. We ultimately find that a flat graph retains all of the benefits of\nHNSW on high-dimensional datasets, with latency and recall performance\nessentially \emph{identical} to the original algorithm but with less memory\noverhead. Furthermore, we go a step further and study \emph{why} the hierarchy\nof HNSW provides no benefit in high dimensions, hypothesizing that navigable\nsmall world graphs contain a well-connected, frequently traversed ``highway\" of\nhub nodes that maintain the same purported function as the hierarchical layers.\nWe present compelling empirical evidence that the \emph{Hub Highway Hypothesis}\nholds for real datasets and investigate the mechanisms by which the highway\nforms. The implications of this hypothesis may also provide future research\ndirections in developing enhancements to graph-based ANN search.\n","authors":["Blaise Munyampirwa","Vihan Lakshman","Benjamin Coleman"],"pdf_url":"https://arxiv.org/pdf/2412.01940v1.pdf","comment":"10 pages"},{"id":"http://arxiv.org/abs/2409.05677v2","updated":"2024-12-02T18:13:28Z","published":"2024-09-09T14:44:19Z","title":"RIRAG: Regulatory Information Retrieval and Answer Generation","summary":" Regulatory documents, issued by governmental regulatory bodies, establish\nrules, guidelines, and standards that organizations must adhere to for legal\ncompliance. These documents, characterized by their length, complexity and\nfrequent updates, are challenging to interpret, requiring significant\nallocation of time and expertise on the part of organizations to ensure ongoing\ncompliance. Regulatory Natural Language Processing (RegNLP) is a\nmultidisciplinary field aimed at simplifying access to and interpretation of\nregulatory rules and obligations. We introduce a task of generating\nquestion-passage pairs, where questions are automatically created and paired\nwith relevant regulatory passages, facilitating the development of regulatory\nquestion-answering systems. 
We create the ObliQA dataset, containing 27,869\nquestions derived from the collection of Abu Dhabi Global Markets (ADGM)\nfinancial regulation documents, design a baseline Regulatory Information\nRetrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a\nnovel evaluation metric that tests whether generated answers accurately capture\nall relevant obligations while avoiding contradictions.\n","authors":["Tuba Gokhan","Kexin Wang","Iryna Gurevych","Ted Briscoe"],"pdf_url":"https://arxiv.org/pdf/2409.05677v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01626v1","updated":"2024-12-02T15:44:19Z","published":"2024-12-02T15:44:19Z","title":"Using Large Language Models in Automatic Hint Ranking and Generation\n Tasks","summary":" The use of Large Language Models (LLMs) has increased significantly recently,\nwith individuals frequently interacting with chatbots to receive answers to a\nwide range of questions. In an era where information is readily accessible, it\nis crucial to stimulate and preserve human cognitive abilities and maintain\nstrong reasoning skills. This paper addresses such challenges by promoting the\nuse of hints as an alternative or a supplement to direct answers. We first\nintroduce a manually constructed hint dataset, WIKIHINT, which includes 5,000\nhints created for 1,000 questions. We then finetune open-source LLMs such as\nLLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We\nassess the effectiveness of the hints with human participants who try to answer\nquestions with and without the aid of hints. Additionally, we introduce a\nlightweight evaluation method, HINTRANK, to evaluate and rank hints in both\nanswer-aware and answer-agnostic settings. Our findings show that (a) the\ndataset helps generate more effective hints, (b) including answer information\nalong with questions generally improves hint quality, and (c) encoder-based\nmodels perform better than decoder-based models in hint ranking.\n","authors":["Jamshid Mozafari","Florian Gerhold","Adam Jatowt"],"pdf_url":"https://arxiv.org/pdf/2412.01626v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2406.11251v2","updated":"2024-12-02T13:26:14Z","published":"2024-06-17T06:27:35Z","title":"Unifying Multimodal Retrieval via Document Screenshot Embedding","summary":" In the real world, documents are organized in different formats and varied\nmodalities. Traditional retrieval pipelines require tailored document parsing\ntechniques and content extraction modules to prepare input for indexing. This\nprocess is tedious, prone to errors, and incurs information loss. To this end, we\npropose Document Screenshot Embedding (DSE), a novel retrieval paradigm that\nregards document screenshots as a unified input format, which does not require\nany content extraction preprocessing and preserves all the information in a\ndocument (e.g., text, image and layout). DSE leverages a large vision-language\nmodel to directly encode document screenshots into dense representations for\nretrieval. To evaluate our method, we first craft the Wiki-SS dataset, a\ncorpus of 1.3M Wikipedia web page screenshots, to answer questions from\nthe Natural Questions dataset. In such a text-intensive document retrieval\nsetting, DSE shows competitive effectiveness compared to other text retrieval\nmethods relying on parsing. For example, DSE outperforms BM25 by 17 points in\ntop-1 retrieval accuracy. 
Additionally, in a mixed-modality task of slide\nretrieval, DSE significantly outperforms OCR text retrieval methods by over 15\npoints in nDCG@10. These experiments show that DSE is an effective document\nretrieval paradigm for diverse types of documents. Model checkpoints, code, and\nWiki-SS collection will be released.\n","authors":["Xueguang Ma","Sheng-Chieh Lin","Minghan Li","Wenhu Chen","Jimmy Lin"],"pdf_url":"https://arxiv.org/pdf/2406.11251v2.pdf","comment":"EMNLP2024 main"},{"id":"http://arxiv.org/abs/2412.01443v1","updated":"2024-12-02T12:32:19Z","published":"2024-12-02T12:32:19Z","title":"Multi-Facet Blending for Faceted Query-by-Example Retrieval","summary":" With the growing demand to fit fine-grained user intents, faceted\nquery-by-example (QBE), which retrieves similar documents conditioned on\nspecific facets, has gained recent attention. However, prior approaches mainly\ndepend on document-level comparisons using basic indicators like citations due\nto the lack of facet-level relevance datasets; yet, this limits their use to\ncitation-based domains and fails to capture the intricacies of facet\nconstraints. In this paper, we propose a multi-facet blending (FaBle)\naugmentation method, which exploits modularity by decomposing and recomposing\nto explicitly synthesize facet-specific training sets. We automatically\ndecompose documents into facet units and generate (ir)relevant pairs by\nleveraging LLMs' intrinsic distinguishing capabilities; then, dynamically\nrecomposing the units leads to facet-wise relevance-informed document pairs.\nOur modularization eliminates the need for pre-defined facet knowledge or\nlabels. Further, to prove the FaBle's efficacy in a new domain beyond\ncitation-based scientific paper retrieval, we release a benchmark dataset for\neducational exam item QBE. FaBle augmentation on 1K documents remarkably\nassists training in obtaining facet conditional embeddings.\n","authors":["Heejin Do","Sangwon Ryu","Jonghwi Kim","Gary Geunbae Lee"],"pdf_url":"https://arxiv.org/pdf/2412.01443v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01291v1","updated":"2024-12-02T09:04:16Z","published":"2024-12-02T09:04:16Z","title":"Global Estimation of Building-Integrated Facade and Rooftop Photovoltaic\n Potential by Integrating 3D Building Footprint and Spatio-Temporal Datasets","summary":" This research tackles the challenges of estimating Building-Integrated\nPhotovoltaics (BIPV) potential across various temporal and spatial scales,\naccounting for different geographical climates and urban morphology. We\nintroduce a holistic methodology for evaluating BIPV potential, integrating 3D\nbuilding footprint models with diverse meteorological data sources to account\nfor dynamic shadow effects. The approach enables the assessment of PV potential\non facades and rooftops at different levels-individual buildings, urban blocks,\nand cities globally. Through an analysis of 120 typical cities, we highlight\nthe importance of 3D building forms, cityscape morphology, and geographic\npositioning in measuring BIPV potential at various levels. In particular, our\nsimulation study reveals that among cities with optimal facade PV performance,\nthe average ratio of facade PV potential to rooftop PV potential is\napproximately 68.2%. 
Additionally, approximately 17.5% of the analyzed samples\ndemonstrate even higher facade PV potentials compared to rooftop installations.\nThis finding underscores the strategic value of incorporating facade PV\napplications into urban sustainable energy systems.\n","authors":["Qing Yu","Kechuan Dong","Zhiling Guo","Jiaxing Li","Hongjun Tan","Yanxiu Jin","Jian Yuan","Haoran Zhang","Junwei Liu","Qi Chen","Jinyue Yan"],"pdf_url":"https://arxiv.org/pdf/2412.01291v1.pdf","comment":"17 pages, 5 figures"},{"id":"http://arxiv.org/abs/2412.01290v1","updated":"2024-12-02T09:03:05Z","published":"2024-12-02T09:03:05Z","title":"Learning Smooth Distance Functions via Queries","summary":" In this work, we investigate the problem of learning distance functions\nwithin the query-based learning framework, where a learner is able to pose\ntriplet queries of the form: ``Is $x_i$ closer to $x_j$ or $x_k$?'' We\nestablish formal guarantees on the query complexity required to learn smooth,\nbut otherwise general, distance functions under two notions of approximation:\n$\omega$-additive approximation and $(1 + \omega)$-multiplicative\napproximation. For the additive approximation, we propose a global method whose\nquery complexity is quadratic in the size of a finite cover of the sample\nspace. For the (stronger) multiplicative approximation, we introduce a method\nthat combines global and local approaches, utilizing multiple Mahalanobis\ndistance functions to capture local geometry. This method has a query\ncomplexity that scales quadratically with both the size of the cover and the\nambient space dimension of the sample space.\n","authors":["Akash Kumar","Sanjoy Dasgupta"],"pdf_url":"https://arxiv.org/pdf/2412.01290v1.pdf","comment":"40 pages, 1 figure"},{"id":"http://arxiv.org/abs/2409.10825v3","updated":"2024-12-02T07:00:57Z","published":"2024-09-17T01:37:57Z","title":"Unveiling and Mitigating Bias in Large Language Model Recommendations: A\n Path to Fairness","summary":" Large Language Model (LLM)-based recommendation systems excel in delivering comprehensive suggestions by deeply analyzing content and\nuser behavior. However, they often inherit biases from skewed training data,\nfavoring mainstream content while underrepresenting diverse or non-traditional\noptions. This study explores the interplay between bias and LLM-based\nrecommendation systems, focusing on music, song, and book recommendations\nacross diverse demographic and cultural groups. This paper analyzes bias in\nLLM-based recommendation systems across multiple models (GPT, LLaMA, and\nGemini), revealing its deep and pervasive impact on outcomes. Intersecting\nidentities and contextual factors, like socioeconomic status, further amplify\nbiases, complicating fair recommendations across diverse groups. Our findings\nreveal that bias in these systems is deeply ingrained, yet even simple\ninterventions like prompt engineering can significantly reduce it. We further\npropose a retrieval-augmented generation strategy to mitigate bias more\neffectively. 
Numerical experiments validate these strategies, demonstrating\nboth the pervasive nature of bias and the impact of the proposed solutions.\n","authors":["Anindya Bijoy Das","Shahnewaz Karim Sakib"],"pdf_url":"https://arxiv.org/pdf/2409.10825v3.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01141v1","updated":"2024-12-02T05:31:22Z","published":"2024-12-02T05:31:22Z","title":"Lossless and Privacy-Preserving Graph Convolution Network for Federated\n Item Recommendation","summary":" Graph neural network (GNN) has emerged as a state-of-the-art solution for\nitem recommendation. However, existing GNN-based recommendation methods rely on\na centralized storage of fragmented user-item interaction sub-graphs and\ntraining on an aggregated global graph, which will lead to privacy concerns. As\na response, some recent works develop GNN-based federated recommendation\nmethods by exploiting decentralized and fragmented user-item sub-graphs in\norder to preserve user privacy. However, due to privacy constraints, the graph\nconvolution process in existing federated recommendation methods is incomplete\ncompared with the centralized counterpart, causing a degradation of the\nrecommendation performance. In this paper, we propose a novel lossless and\nprivacy-preserving graph convolution network (LP-GCN), which fully completes\nthe graph convolution process with decentralized user-item interaction\nsub-graphs while ensuring privacy. It is worth mentioning that its performance\nis equivalent to that of the non-federated (i.e., centralized) counterpart.\nMoreover, we validate its effectiveness through both theoretical analysis and\nempirical studies. Extensive experiments on three real-world datasets show that\nour LP-GCN outperforms the existing federated recommendation methods. The code\nwill be publicly available once the paper is accepted.\n","authors":["Guowei Wu","Weike Pan","Qiang Yang","Zhong Ming"],"pdf_url":"https://arxiv.org/pdf/2412.01141v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01127v1","updated":"2024-12-02T05:03:56Z","published":"2024-12-02T05:03:56Z","title":"Precision Profile Pollution Attack on Sequential Recommenders via\n Influence Function","summary":" Sequential recommendation approaches have demonstrated remarkable proficiency\nin modeling user preferences. Nevertheless, they are susceptible to profile\npollution attacks (PPA), wherein items are introduced into a user's interaction\nhistory deliberately to influence the recommendation list. Since retraining the\nmodel for each polluted item is time-consuming, recent PPAs estimate item\ninfluence based on gradient directions to identify the most effective attack\ncandidates. However, the actual item representations diverge significantly from\nthe gradients, resulting in disparate outcomes.To tackle this challenge, we\nintroduce an INFluence Function-based Attack approach INFAttack that offers a\nmore accurate estimation of the influence of polluting items. Specifically, we\ncalculate the modifications to the original model using the influence function\nwhen generating polluted sequences by introducing specific items. Subsequently,\nwe choose the sequence that has been most significantly influenced to\nsubstitute the original sequence, thus promoting the target item. 
Comprehensive\nexperiments conducted on five real-world datasets illustrate that INFAttack\nsurpasses all baseline methods and consistently delivers stable attack\nperformance for both popular and unpopular items.\n","authors":["Xiaoyu Du","Yingying Chen","Yang Zhang","Jinhui Tang"],"pdf_url":"https://arxiv.org/pdf/2412.01127v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01093v1","updated":"2024-12-02T04:05:49Z","published":"2024-12-02T04:05:49Z","title":"Automated Extraction of Acronym-Expansion Pairs from Scientific Papers","summary":" This project addresses challenges posed by the widespread use of\nabbreviations and acronyms in digital texts. We propose a novel method that\ncombines document preprocessing, regular expressions, and a large language\nmodel to identify abbreviations and map them to their corresponding expansions.\nThe regular expressions alone are often insufficient to extract expansions, at\nwhich point our approach leverages GPT-4 to analyze the text surrounding the\nacronyms. By limiting the analysis to only a small portion of the surrounding\ntext, we mitigate the risk of obtaining incorrect or multiple expansions for an\nacronym. There are several known challenges in processing text with acronyms,\nincluding polysemous acronyms, non-local and ambiguous acronyms. Our approach\nenhances the precision and efficiency of NLP techniques by addressing these\nissues with automated acronym identification and disambiguation. This study\nhighlights the challenges of working with PDF files and the importance of\ndocument preprocessing. Furthermore, the results of this work show that neither\nregular expressions nor GPT-4 alone can perform well. Regular expressions are\nsuitable for identifying acronyms but have limitations in finding their\nexpansions within the paper due to a variety of formats used for expressing\nacronym-expansion pairs and the tendency of authors to omit expansions within\nthe text. GPT-4, on the other hand, is an excellent tool for obtaining\nexpansions but struggles with correctly identifying all relevant acronyms.\nAdditionally, GPT-4 poses challenges due to its probabilistic nature, which may\nlead to slightly different results for the same input. Our algorithm employs\npreprocessing to eliminate irrelevant information from the text, regular\nexpressions for identifying acronyms, and a large language model to help find\nacronym expansions to provide the most accurate and consistent results.\n","authors":["Izhar Ali","Million Haileyesus","Serhiy Hnatyshyn","Jan-Lucas Ott","Vasil Hnatyshin"],"pdf_url":"https://arxiv.org/pdf/2412.01093v1.pdf","comment":"9 pages, 1 figure"},{"id":"http://arxiv.org/abs/2412.01011v1","updated":"2024-12-02T00:05:20Z","published":"2024-12-02T00:05:20Z","title":"e-Fold Cross-Validation for Recommender-System Evaluation","summary":" To combat the rising energy consumption of recommender systems, we implement a\nnovel alternative for k-fold cross validation. This alternative, named e-fold\ncross validation, aims to minimize the number of folds to achieve a reduction\nin power usage while keeping the reliability and robustness of the test results\nhigh. We tested our method on 5 recommender system algorithms across 6 datasets\nand compared it with 10-fold cross validation. On average, e-fold cross\nvalidation only needed 41.5% of the energy that 10-fold cross validation would\nneed, while its results only differed by 1.81%. 
We conclude that e-fold cross\nvalidation is a promising approach that has the potential to be an energy\nefficient but still reliable alternative to k-fold cross validation.\n","authors":["Moritz Baumgart","Lukas Wegmeth","Tobias Vente","Joeran Beel"],"pdf_url":"https://arxiv.org/pdf/2412.01011v1.pdf","comment":"This preprint has not undergone peer review (when applicable) or any\n post-submission improvements or corrections. The Version of Record of this\n contribution is published in [TBA], and is available online at [TBA]"}],"Multimedia":[{"id":"http://arxiv.org/abs/2412.01986v1","updated":"2024-12-02T21:35:33Z","published":"2024-12-02T21:35:33Z","title":"HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh\n Quality Assessment","summary":" Mesh quality assessment (MQA) models play a critical role in the design,\noptimization, and evaluation of mesh operation systems in a wide variety of\napplications. Current MQA models, whether model-based methods using\ntopology-aware features or projection-based approaches working on rendered 2D\nprojections, often fail to capture the intricate interactions between texture\nand 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid\nfull-reference colored MQA framework that integrates model-based and\nprojection-based approaches, capturing complex interactions between textural\ninformation and 3D structures for enriched quality representations. Our method\nemploys graph learning to extract detailed 3D representations, which are then\nprojected to 2D using a novel feature rendering process that precisely aligns\nthem with colored projections. This enables the exploration of geometry-texture\ninteractions via cross-attention, producing comprehensive mesh quality\nrepresentations. Extensive experiments demonstrate HybridMQA's superior\nperformance across diverse datasets, highlighting its ability to effectively\nleverage geometry-texture interactions for a thorough understanding of mesh\nquality. Our implementation will be made publicly available.\n","authors":["Armin Shafiee Sarvestani","Sheyang Tang","Zhou Wang"],"pdf_url":"https://arxiv.org/pdf/2412.01986v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01824v1","updated":"2024-12-02T18:59:26Z","published":"2024-12-02T18:59:26Z","title":"X-Prompt: Towards Universal In-Context Image Generation in\n Auto-Regressive Vision Language Foundation Models","summary":" In-context generation is a key component of large language models' (LLMs)\nopen-task generalization capability. By leveraging a few examples as context,\nLLMs can perform both in-domain and out-of-domain tasks. Recent advancements in\nauto-regressive vision-language models (VLMs) built upon LLMs have showcased\nimpressive performance in text-to-image generation. However, the potential of\nin-context learning for general image generation tasks remains largely\nunexplored. To address this, we introduce X-Prompt, a purely auto-regressive\nlarge-vision language model designed to deliver competitive performance across\na wide range of both seen and unseen image generation tasks, all within a\nunified in-context learning framework. X-Prompt incorporates a specialized\ndesign that efficiently compresses valuable features from in-context examples,\nsupporting longer in-context token sequences and improving its ability to\ngeneralize to unseen tasks. A unified training task for both text and image\nprediction enables X-Prompt to handle general image generation with enhanced\ntask awareness from in-context examples. 
Extensive experiments validate the\nmodel's performance across diverse seen image generation tasks and its capacity\nto generalize to previously unseen tasks.\n","authors":["Zeyi Sun","Ziyang Chu","Pan Zhang","Tong Wu","Xiaoyi Dong","Yuhang Zang","Yuanjun Xiong","Dahua Lin","Jiaqi Wang"],"pdf_url":"https://arxiv.org/pdf/2412.01824v1.pdf","comment":"code: https://github.com/SunzeY/X-Prompt"},{"id":"http://arxiv.org/abs/2412.01556v1","updated":"2024-12-02T14:44:39Z","published":"2024-12-02T14:44:39Z","title":"Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient\n Object Detection","summary":" RGB-Thermal Salient Object Detection aims to pinpoint prominent objects\nwithin aligned pairs of visible and thermal infrared images. Traditional\nencoder-decoder architectures, while designed for cross-modality feature\ninteractions, may not have adequately considered the robustness against noise\noriginating from defective modalities. Inspired by hierarchical human visual\nsystems, we propose the ConTriNet, a robust Confluent Triple-Flow Network\nemploying a Divide-and-Conquer strategy. Specifically, ConTriNet comprises\nthree flows: two modality-specific flows explore cues from RGB and Thermal\nmodalities, and a third modality-complementary flow integrates cues from both\nmodalities. ConTriNet presents several notable advantages. It incorporates a\nModality-induced Feature Modulator in the modality-shared union encoder to\nminimize inter-modality discrepancies and mitigate the impact of defective\nsamples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in\nthe separated flows enlarges the receptive field, allowing for the capture of\nmulti-scale contextual information. Furthermore, a Modality-aware Dynamic\nAggregation Module in the modality-complementary flow dynamically aggregates\nsaliency-related cues from both modality-specific flows. Leveraging the\nproposed parallel triple-flow framework, we further refine saliency maps\nderived from different flows through a flow-cooperative fusion strategy,\nyielding a high-quality, full-resolution saliency map for the final prediction.\nTo evaluate the robustness and stability of our approach, we collect a\ncomprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world\nchallenging scenarios. Extensive experiments on public benchmarks and our\nVT-IMAG dataset demonstrate that ConTriNet consistently outperforms\nstate-of-the-art competitors in both common and challenging scenarios.\n","authors":["Hao Tang","Zechao Li","Dong Zhang","Shengfeng He","Jinhui Tang"],"pdf_url":"https://arxiv.org/pdf/2412.01556v1.pdf","comment":"Accepted by IEEE TPAMI. Project page:\n https://cser-tang-hao.github.io/contrinet.html"},{"id":"http://arxiv.org/abs/2303.17550v6","updated":"2024-12-02T10:06:28Z","published":"2023-03-30T17:18:31Z","title":"DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with\n Diffusion Autoencoder","summary":" While recent research has made significant progress in speech-driven talking\nface generation, the quality of the generated video still lags behind that of\nreal recordings. One reason for this is the use of handcrafted intermediate\nrepresentations like facial landmarks and 3DMM coefficients, which are designed\nbased on human knowledge and are insufficient to precisely describe facial\nmovements. Additionally, these methods require an external pretrained model for\nextracting these representations, whose performance sets an upper bound on\ntalking face generation. 
To address these limitations, we propose a novel\nmethod called DAE-Talker that leverages data-driven latent representations\nobtained from a diffusion autoencoder (DAE). DAE contains an image encoder that\nencodes an image into a latent vector and a DDIM image decoder that\nreconstructs the image from it. We train our DAE on talking face video frames\nand then extract their latent representations as the training target for a\nConformer-based speech2latent model. This allows DAE-Talker to synthesize full\nvideo frames and produce natural head movements that align with the content of\nspeech, rather than relying on a predetermined head pose from a template video.\nWe also introduce pose modelling in speech2latent for pose controllability.\nAdditionally, we propose a novel method for generating continuous video frames\nwith the DDIM image decoder trained on individual frames, eliminating the need\nfor modelling the joint distribution of consecutive frames directly. Our\nexperiments show that DAE-Talker outperforms existing popular methods in\nlip-sync, video fidelity, and pose naturalness. We also conduct ablation\nstudies to analyze the effectiveness of the proposed techniques and demonstrate\nthe pose controllability of DAE-Talker.\n","authors":["Chenpeng Du","Qi Chen","Tianyu He","Xu Tan","Xie Chen","Kai Yu","Sheng Zhao","Jiang Bian"],"pdf_url":"https://arxiv.org/pdf/2303.17550v6.pdf","comment":"Accepted to ACM Multimedia 2023"},{"id":"http://arxiv.org/abs/2412.01316v1","updated":"2024-12-02T09:32:36Z","published":"2024-12-02T09:32:36Z","title":"Long Video Diffusion Generation with Segmented Cross-Attention and\n Content-Rich Video Data Curation","summary":" We introduce Presto, a novel video diffusion model designed to generate\n15-second videos with long-range coherence and rich content. Extending video\ngeneration methods to maintain scenario diversity over long durations presents\nsignificant challenges. To address this, we propose a Segmented Cross-Attention\n(SCA) strategy, which splits hidden states into segments along the temporal\ndimension, allowing each segment to cross-attend to a corresponding\nsub-caption. SCA requires no additional parameters, enabling seamless\nincorporation into current DiT-based architectures. To facilitate high-quality\nlong video generation, we build the LongTake-HD dataset, consisting of 261k\ncontent-rich videos with scenario coherence, annotated with an overall video\ncaption and five progressive sub-captions. Experiments show that our Presto\nachieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree,\noutperforming existing state-of-the-art video generation methods. This\ndemonstrates that our proposed Presto significantly enhances content richness,\nmaintains long-range coherence, and captures intricate textual details. More\ndetails are displayed on our project page: https://presto-video.github.io/.\n","authors":["Xin Yan","Yuxuan Cai","Qiuyue Wang","Yuan Zhou","Wenhao Huang","Huan Yang"],"pdf_url":"https://arxiv.org/pdf/2412.01316v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01202v1","updated":"2024-12-02T07:14:15Z","published":"2024-12-02T07:14:15Z","title":"Neuron Abandoning Attention Flow: Visual Explanation of Dynamics inside\n CNN Models","summary":" In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method\nto address the open problem of visually explaining the attention evolution\ndynamics inside CNNs when making their classification decisions. 
A novel\ncascading neuron abandoning back-propagation algorithm is designed to trace\nneurons in all layers of a CNN that are involved in making its prediction to address\nthe problem of significant interference from abandoned neurons. Firstly, a\nNeuron Abandoning Back-Propagation (NA-BP) module is proposed to generate\nBack-Propagated Feature Maps (BPFM) by using the inverse function of the\nintermediate layers of CNN models, on which the neurons not used for\ndecision-making are abandoned. Meanwhile, the cascading NA-BP modules calculate\nthe tensors of importance coefficients which are linearly combined with the\ntensors of BPFMs to form the NAFlow. Secondly, to be able to visualize\nattention flow for similarity metric-based CNN models, a new channel\ncontribution weights module is proposed to calculate the importance\ncoefficients via the Jacobian matrix. The effectiveness of the proposed NAFlow is\nvalidated on nine widely-used CNN models for various tasks of general image\nclassification, contrastive learning classification, few-shot image\nclassification, and image retrieval.\n","authors":["Yi Liao","Yongsheng Gao","Weichuan Zhang"],"pdf_url":"https://arxiv.org/pdf/2412.01202v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.01169v1","updated":"2024-12-02T06:13:01Z","published":"2024-12-02T06:13:01Z","title":"OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows","summary":" We introduce OmniFlow, a novel generative model designed for any-to-any\ngeneration tasks such as text-to-image, text-to-audio, and audio-to-image\nsynthesis. OmniFlow advances the rectified flow (RF) framework used in\ntext-to-image models to handle the joint distribution of multiple modalities.\nIt outperforms previous any-to-any models on a wide range of tasks, such as\ntext-to-image and text-to-audio synthesis. Our work offers three key\ncontributions: First, we extend RF to a multi-modal setting and introduce a\nnovel guidance mechanism, enabling users to flexibly control the alignment\nbetween different modalities in the generated outputs. Second, we propose a\nnovel architecture that extends the text-to-image MMDiT architecture of Stable\nDiffusion 3 and enables audio and text generation. The extended modules can be\nefficiently pretrained individually and merged with the vanilla text-to-image\nMMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design\nchoices of rectified flow transformers for large-scale audio and text\ngeneration, providing valuable insights into optimizing performance across\ndiverse modalities. The Code will be available at\nhttps://github.com/jacklishufan/OmniFlows.\n","authors":["Shufan Li","Konstantinos Kallidromitis","Akash Gokul","Zichun Liao","Yusuke Kato","Kazuki Kozuka","Aditya Grover"],"pdf_url":"https://arxiv.org/pdf/2412.01169v1.pdf","comment":"12 pages, 14 figures"}]},"2024-12-01T00:00:00Z":{"Information Retrieval":[{"id":"http://arxiv.org/abs/2412.00978v1","updated":"2024-12-01T21:58:44Z","published":"2024-12-01T21:58:44Z","title":"Patent-publication pairs for the detection of knowledge transfer from\n research to industry: reducing ambiguities with word embeddings and\n references","summary":" The performance of medical research can be viewed and evaluated not only from\nthe perspective of publication output, but also from the perspective of\neconomic exploitability. Patents can represent the exploitation of research\nresults and thus the transfer of knowledge from research to industry. 
In this\nstudy, we set out to identify publication-patent pairs in order to use patents\nas a proxy for the economic impact of research. To identify these pairs, we\nmatched scholarly publications and patents by comparing the names of authors\nand inventors. To resolve the ambiguities that arise in this name-matching\nprocess, we expanded our approach with two additional filter features, one used\nto assess the similarity of text content, the other to identify common\nreferences in the two document types. To evaluate text similarity, we extracted\nand transformed technical terms from a medical ontology (MeSH) into numerical\nvectors using word embeddings. We then calculated the results of the two\nsupporting features over an example five-year period. Furthermore, we developed\na statistical procedure which can be used to determine valid patent classes for\nthe domain of medicine. Our complete data processing pipeline is freely\navailable, from the raw data of the two document types right through to the\nvalidated publication-patent pairs.\n","authors":["Klaus Lippert","Konrad U. Förstner"],"pdf_url":"https://arxiv.org/pdf/2412.00978v1.pdf","comment":"16 Pages, 8 figures"},{"id":"http://arxiv.org/abs/2412.00934v1","updated":"2024-12-01T18:58:17Z","published":"2024-12-01T18:58:17Z","title":"QABISAR: Query-Article Bipartite Interactions for Statutory Article\n Retrieval","summary":" In this paper, we introduce QABISAR, a novel framework for statutory article\nretrieval, to overcome the semantic mismatch problem when modeling each\nquery-article pair in isolation, making it hard to learn representations that\ncan effectively capture multi-faceted information. QABISAR leverages bipartite\ninteractions between queries and articles to capture diverse aspects inherent\nin them. Further, we employ knowledge distillation to transfer enriched query\nrepresentations from the graph network into the query bi-encoder, to capture\nthe rich semantics present in the graph representations, despite the absence of\ngraph-based supervision for unseen queries during inference. Our experiments on\na real-world expert-annotated dataset demonstrate its effectiveness.\n","authors":["T. Y. S. S. Santosh","Hassan Sarwat","Matthias Grabmair"],"pdf_url":"https://arxiv.org/pdf/2412.00934v1.pdf","comment":"Accepted to COLING 2025"},{"id":"http://arxiv.org/abs/2209.05227v5","updated":"2024-12-01T16:50:02Z","published":"2022-09-12T13:26:26Z","title":"DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation\n Framework for Efficient Device Model Generalization","summary":" Device Model Generalization (DMG) is a practical yet under-investigated\nresearch topic for on-device machine learning applications. It aims to improve\nthe generalization ability of pre-trained models when deployed on\nresource-constrained devices, such as improving the performance of pre-trained\ncloud models on smart mobiles. While quite a lot of works have investigated the\ndata distribution shift across clouds and devices, most of them focus on model\nfine-tuning on personalized data for individual devices to facilitate DMG.\nDespite their promise, these approaches require on-device re-training, which\nis practically infeasible due to the overfitting problem and high time delay\nwhen performing gradient calculation on real-time data. 
In this paper, we argue\nthat the computational cost brought by fine-tuning can be rather unnecessary.\nWe consequently present a novel perspective to improving DMG without increasing\ncomputational cost, i.e., device-specific parameter generation which directly\nmaps data distribution to parameters. Specifically, we propose an efficient\nDevice-cloUd collaborative parametErs generaTion framework DUET. DUET is\ndeployed on a powerful cloud server that only requires the low cost of\nforwarding propagation and low time delay of data transmission between the\ndevice and the cloud. By doing so, DUET can rehearse the device-specific model\nweight realizations conditioned on the personalized real-time data for an\nindividual device. Importantly, our DUET elegantly connects the cloud and\ndevice as a 'duet' collaboration, frees the DMG from fine-tuning, and enables a\nfaster and more accurate DMG paradigm. We conduct an extensive experimental\nstudy of DUET on three public datasets, and the experimental results confirm\nour framework's effectiveness and generalisability for different DMG tasks.\n","authors":["Zheqi Lv","Wenqiao Zhang","Shengyu Zhang","Kun Kuang","Feng Wang","Yongwei Wang","Zhengyu Chen","Tao Shen","Hongxia Yang","Beng Chin Ooi","Fei Wu"],"pdf_url":"https://arxiv.org/pdf/2209.05227v5.pdf","comment":"Published on WWW'23: Proceedings of the ACM on Web Conference 2023\n (pp. 3077 - 3085)"},{"id":"http://arxiv.org/abs/2302.07335v3","updated":"2024-12-01T16:41:49Z","published":"2023-02-14T20:44:12Z","title":"Intelligent Model Update Strategy for Sequential Recommendation","summary":" Modern online platforms are increasingly employing recommendation systems to\naddress information overload and improve user engagement. There is an evolving\nparadigm in this research field that recommendation network learning occurs\nboth on the cloud and on edges with knowledge transfer in between (i.e.,\nedge-cloud collaboration). Recent works push this field further by enabling\nedge-specific context-aware adaptivity, where model parameters are updated in\nreal-time based on incoming on-edge data. However, we argue that frequent data\nexchanges between the cloud and edges often lead to inefficiency and waste of\ncommunication/computation resources, as considerable parameter updates might be\nredundant. To investigate this problem, we introduce Intelligent Edge-Cloud\nParameter Request Model, abbreviated as IntellectReq.\n IntellectReq is designed to operate on edge, evaluating the cost-benefit\nlandscape of parameter requests with minimal computation and communication\noverhead. We formulate this as a novel learning task, aimed at the detection of\nout-of-distribution data, thereby fine-tuning adaptive communication\nstrategies. Further, we employ statistical mapping techniques to convert\nreal-time user behavior into a normal distribution, thereby employing\nmulti-sample outputs to quantify the model's uncertainty and thus its\ngeneralization capabilities. Rigorous empirical validation on four\nwidely-adopted benchmarks evaluates our approach, evidencing a marked\nimprovement in the efficiency and generalizability of edge-cloud collaborative\nand dynamic recommendation systems.\n","authors":["Zheqi Lv","Wenqiao Zhang","Zhengyu Chen","Shengyu Zhang","Kun Kuang"],"pdf_url":"https://arxiv.org/pdf/2302.07335v3.pdf","comment":"Published on WWW'24(Oral): Proceedings of the ACM on Web Conference\n 2024 (pp. 
3117-3128)"},{"id":"http://arxiv.org/abs/2412.00813v1","updated":"2024-12-01T14:01:17Z","published":"2024-12-01T14:01:17Z","title":"Oracle-guided Dynamic User Preference Modeling for Sequential\n Recommendation","summary":" Sequential recommendation methods can capture dynamic user preferences from\nuser historical interactions to achieve better performance. However, most\nexisting methods only use past information extracted from user historical\ninteractions to train the models, leading to the deviations of user preference\nmodeling. Besides past information, future information is also available during\ntraining, which contains the ``oracle'' user preferences in the future and will\nbe beneficial to model dynamic user preferences. Therefore, we propose an\noracle-guided dynamic user preference modeling method for sequential\nrecommendation (Oracle4Rec), which leverages future information to guide model\ntraining on past information, aiming to learn ``forward-looking'' models.\nSpecifically, Oracle4Rec first extracts past and future information through two\nseparate encoders, then learns a forward-looking model through an\noracle-guiding module which minimizes the discrepancy between past and future\ninformation. We also tailor a two-phase model training strategy to make the\nguiding more effective. Extensive experiments demonstrate that Oracle4Rec is\nsuperior to state-of-the-art sequential methods. Further experiments show that\nOracle4Rec can be leveraged as a generic module in other sequential\nrecommendation methods to improve their performance with a considerable margin.\n","authors":["Jiafeng Xia","Dongsheng Li","Hansu Gu","Tun Lu","Peng Zhang","Li Shang","Ning Gu"],"pdf_url":"https://arxiv.org/pdf/2412.00813v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.06237v2","updated":"2024-12-01T13:31:14Z","published":"2024-11-09T17:38:01Z","title":"Leveraging Retrieval-Augmented Generation for Persian University\n Knowledge Retrieval","summary":" This paper introduces an innovative approach using Retrieval-Augmented\nGeneration (RAG) pipelines with Large Language Models (LLMs) to enhance\ninformation retrieval and query response systems for university-related\nquestion answering. By systematically extracting data from the university\nofficial webpage and employing advanced prompt engineering techniques, we\ngenerate accurate, contextually relevant responses to user queries.\n We developed a comprehensive university benchmark, UniversityQuestionBench\n(UQB), to rigorously evaluate our system performance, based on common key\nmetrics in the filed of RAG pipelines, assessing accuracy and reliability\nthrough various metrics and real-world scenarios. Our experimental results\ndemonstrate significant improvements in the precision and relevance of\ngenerated responses, enhancing user experience and reducing the time required\nto obtain relevant answers. 
In summary, this paper presents a novel application\nof RAG pipelines and LLMs, supported by a meticulously prepared university\nbenchmark, offering valuable insights into advanced AI techniques for academic\ndata retrieval and setting the stage for future research in this domain.\n","authors":["Arshia Hemmat","Kianoosh Vadaei","Mohammad Hassan Heydari","Afsaneh Fatemi"],"pdf_url":"https://arxiv.org/pdf/2411.06237v2.pdf","comment":"6 pages, 2 figures, 1 table, Submitted to 15th IKT conference"},{"id":"http://arxiv.org/abs/2411.17229v2","updated":"2024-12-01T13:20:02Z","published":"2024-11-26T08:51:46Z","title":"Efficient Data-aware Distance Comparison Operations for High-Dimensional\n Approximate Nearest Neighbor Search","summary":" High-dimensional approximate $K$ nearest neighbor search (AKNN) is a\nfundamental task for various applications, including information retrieval.\nMost existing algorithms for AKNN can be decomposed into two main components,\ni.e., candidate generation and distance comparison operations (DCOs). While\ndifferent methods have unique ways of generating candidates, they all share the\nsame DCO process. In this study, we focus on accelerating the process of DCOs\nthat dominates the time cost in most existing AKNN algorithms. To achieve this,\nwe propose an Data-Aware Distance Estimation approach, called DADE, which\napproximates the exact distance in a lower-dimensional space. We theoretically\nprove that the distance estimation in DADE is unbiased in terms of data\ndistribution. Furthermore, we propose an optimized estimation based on the\nunbiased distance estimation formulation. In addition, we propose a hypothesis\ntesting approach to adaptively determine the number of dimensions needed to\nestimate the exact distance with sufficient confidence. We integrate DADE into\nwidely-used AKNN search algorithms, e.g., IVF and HNSW, and conduct extensive\nexperiments to demonstrate the superiority.\n","authors":["Liwei Deng","Penghao Chen","Ximu Zeng","Tianfu Wang","Yan Zhao","Kai Zheng"],"pdf_url":"https://arxiv.org/pdf/2411.17229v2.pdf","comment":"Accepted by VLDB 2025"},{"id":"http://arxiv.org/abs/2412.00714v1","updated":"2024-12-01T07:27:20Z","published":"2024-12-01T07:27:20Z","title":"Scaling New Frontiers: Insights into Large Recommendation Models","summary":" Recommendation systems are essential for filtering data and retrieving\nrelevant information across various applications. Recent advancements have seen\nthese systems incorporate increasingly large embedding tables, scaling up to\ntens of terabytes for industrial use. However, the expansion of network\nparameters in traditional recommendation models has plateaued at tens of\nmillions, limiting further benefits from increased embedding parameters.\nInspired by the success of large language models (LLMs), a new approach has\nemerged that scales network parameters using innovative structures, enabling\ncontinued performance improvements. A significant development in this area is\nMeta's generative recommendation model HSTU, which illustrates the scaling laws\nof recommendation systems by expanding parameters to thousands of billions.\nThis new paradigm has achieved substantial performance gains in online\nexperiments. In this paper, we aim to enhance the understanding of scaling laws\nby conducting comprehensive evaluations of large recommendation models.\nFirstly, we investigate the scaling laws across different backbone\narchitectures of the large recommendation models. 
Secondly, we conduct\ncomprehensive ablation studies to explore the origins of these scaling laws. We\nthen further assess the performance of HSTU, as the representative of large\nrecommendation models, on complex user behavior modeling tasks to evaluate its\napplicability. Notably, we also analyze its effectiveness in ranking tasks for\nthe first time. Finally, we offer insights into future directions for large\nrecommendation models. Supplementary materials for our research are available\non GitHub at https://github.com/USTC-StarTeam/Large-Recommendation-Models.\n","authors":["Wei Guo","Hao Wang","Luankang Zhang","Jin Yao Chin","Zhongzhou Liu","Kai Cheng","Qiushi Pan","Yi Quan Lee","Wanqi Xue","Tingjia Shen","Kenan Song","Kefan Wang","Wenjia Xie","Yuyang Ye","Huifeng Guo","Yong Liu","Defu Lian","Ruiming Tang","Enhong Chen"],"pdf_url":"https://arxiv.org/pdf/2412.00714v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2405.18560v2","updated":"2024-12-01T05:22:22Z","published":"2024-05-28T20:10:06Z","title":"Potential Field Based Deep Metric Learning","summary":" Deep metric learning (DML) involves training a network to learn a\nsemantically meaningful representation space. Many current approaches mine\nn-tuples of examples and model interactions within each tuplets. We present a\nnovel, compositional DML model, inspired by electrostatic fields in physics\nthat, instead of in tuples, represents the influence of each example\n(embedding) by a continuous potential field, and superposes the fields to\nobtain their combined global potential field. We use attractive/repulsive\npotential fields to represent interactions among embeddings from images of the\nsame/different classes. Contrary to typical learning methods, where mutual\ninfluence of samples is proportional to their distance, we enforce reduction in\nsuch influence with distance, leading to a decaying field. We show that such\ndecay helps improve performance on real world datasets with large intra-class\nvariations and label noise. Like other proxy-based methods, we also use proxies\nto succinctly represent sub-populations of examples. We evaluate our method on\nthree standard DML benchmarks- Cars-196, CUB-200-2011, and SOP datasets where\nit outperforms state-of-the-art baselines.\n","authors":["Shubhang Bhatnagar","Narendra Ahuja"],"pdf_url":"https://arxiv.org/pdf/2405.18560v2.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00657v1","updated":"2024-12-01T03:28:26Z","published":"2024-12-01T03:28:26Z","title":"Improving Vietnamese Legal Document Retrieval using Synthetic Data","summary":" In the field of legal information retrieval, effective embedding-based models\nare essential for accurate question-answering systems. However, the scarcity of\nlarge annotated datasets poses a significant challenge, particularly for\nVietnamese legal texts. To address this issue, we propose a novel approach that\nleverages large language models to generate high-quality, diverse synthetic\nqueries for Vietnamese legal passages. This synthetic data is then used to\npre-train retrieval models, specifically bi-encoder and ColBERT, which are\nfurther fine-tuned using contrastive loss with mined hard negatives. 
Our\nexperiments demonstrate that these enhancements lead to strong improvement in\nretrieval accuracy, validating the effectiveness of synthetic data and\npre-training techniques in overcoming the limitations posed by the lack of\nlarge labeled datasets in the Vietnamese legal domain.\n","authors":["Son Pham Tien","Hieu Nguyen Doan","An Nguyen Dai","Sang Dinh Viet"],"pdf_url":"https://arxiv.org/pdf/2412.00657v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2412.00639v1","updated":"2024-12-01T01:36:41Z","published":"2024-12-01T01:36:41Z","title":"Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex\n Natural Language Queries on Multi-modal Data","summary":" Multi-modal data, such as image data sets, often miss the detailed\ndescriptions that properly capture the rich information encoded in them. This\nmakes answering complex natural language queries a major challenge in these\ndomains. In particular, unlike the traditional nearest-neighbor search, where\nthe tuples and the query are modeled as points in a data cube, the query and\nthe tuples are of different natures, making the traditional query answering\nsolutions not directly applicable for such settings. Existing literature\naddresses this challenge for image data through vector representations jointly\ntrained on natural language and images. This technique, however, underperforms\nfor complex queries due to various reasons.\n This paper takes a step towards addressing this challenge by introducing a\nGenerative-AI (GenAI) powered Monte Carlo method that utilizes foundation\nmodels to generate synthetic samples that capture the complexity of the natural\nlanguage query and transform it to the same space of the multi-modal data.\nFollowing this method, we develop a system for image data retrieval and propose\npractical solutions that enable leveraging future advancements in GenAI and\nvector representations for improving our system's performance. Our\ncomprehensive experiments on various benchmark datasets verify that our system\nsignificantly outperforms state-of-the-art techniques.\n","authors":["Mahdi Erfanian","Mohsen Dehghankar","Abolfazl Asudeh"],"pdf_url":"https://arxiv.org/pdf/2412.00639v1.pdf","comment":null},{"id":"http://arxiv.org/abs/2411.14592v2","updated":"2024-12-01T01:21:24Z","published":"2024-11-21T21:22:58Z","title":"G-RAG: Knowledge Expansion in Material Science","summary":" In the field of Material Science, effective information retrieval systems are\nessential for facilitating research. Traditional Retrieval-Augmented Generation\n(RAG) approaches in Large Language Models (LLMs) often encounter challenges\nsuch as outdated information, hallucinations, limited interpretability due to\ncontext constraints, and inaccurate retrieval. To address these issues, Graph\nRAG integrates graph databases to enhance the retrieval process. Our proposed\nmethod processes Material Science documents by extracting key entities\n(referred to as MatIDs) from sentences, which are then utilized to query\nexternal Wikipedia knowledge bases (KBs) for additional relevant information.\nWe implement an agent-based parsing technique to achieve a more detailed\nrepresentation of the documents. Our improved version of Graph RAG called G-RAG\nfurther leverages a graph database to capture relationships between these\nentities, improving both retrieval accuracy and contextual understanding. 
This\nenhanced approach demonstrates significant improvements in performance for\ndomains that require precise information retrieval, such as Material Science.\n","authors":["Radeen Mostafa","Mirza Nihal Baig","Mashaekh Tausif Ehsan","Jakir Hasan"],"pdf_url":"https://arxiv.org/pdf/2411.14592v2.pdf","comment":null}],"Multimedia":[{"id":"http://arxiv.org/abs/2308.05037v3","updated":"2024-12-01T15:17:03Z","published":"2023-08-09T16:09:44Z","title":"Separate Anything You Describe","summary":" Language-queried audio source separation (LASS) is a new paradigm for\ncomputational auditory scene analysis (CASA). LASS aims to separate a target\nsound from an audio mixture given a natural language query, which provides a\nnatural and scalable interface for digital audio applications. Recent works on\nLASS, despite attaining promising separation performance on specific sources\n(e.g., musical instruments, limited classes of audio events), are unable to\nseparate audio concepts in the open domain. In this work, we introduce\nAudioSep, a foundation model for open-domain audio source separation with\nnatural language queries. We train AudioSep on large-scale multimodal datasets\nand extensively evaluate its capabilities on numerous tasks including audio\nevent separation, musical instrument separation, and speech enhancement.\nAudioSep demonstrates strong separation performance and impressive zero-shot\ngeneralization ability using audio captions or text labels as queries,\nsubstantially outperforming previous audio-queried and language-queried sound\nseparation models. For reproducibility of this work, we will release the source\ncode, evaluation benchmark and pre-trained model at:\nhttps://github.com/Audio-AGI/AudioSep.\n","authors":["Xubo Liu","Qiuqiang Kong","Yan Zhao","Haohe Liu","Yi Yuan","Yuzhuo Liu","Rui Xia","Yuxuan Wang","Mark D. Plumbley","Wenwu Wang"],"pdf_url":"https://arxiv.org/pdf/2308.05037v3.pdf","comment":"Code, benchmark and pre-trained models:\n https://github.com/Audio-AGI/AudioSep"},{"id":"http://arxiv.org/abs/2401.17133v2","updated":"2024-12-01T04:06:27Z","published":"2024-01-30T16:07:44Z","title":"SongBsAb: A Dual Prevention Approach against Singing Voice Conversion\n based Illegal Song Covers","summary":" Singing voice conversion (SVC) automates song covers by converting a source\nsinging voice from a source singer into a new singing voice with the same\nlyrics and melody as the source, but sounds like being covered by the target\nsinger of some given target singing voices. However, it raises serious concerns\nabout copyright and civil right infringements. We propose SongBsAb, the first\nproactive approach to tackle SVC-based illegal song covers. SongBsAb adds\nperturbations to singing voices before releasing them, so that when they are\nused, the process of SVC will be interfered, leading to unexpected singing\nvoices. Perturbations are carefully crafted to (1) provide a dual prevention,\ni.e., preventing the singing voice from being used as the source and target\nsinging voice in SVC, by proposing a gender-transformation loss and a high/low\nhierarchy multi-target loss, respectively; and (2) be harmless, i.e., no\nside-effect on the enjoyment of protected songs, by refining a psychoacoustic\nmodel-based loss with the backing track as an additional masker, a unique\naccompanying element for singing voices compared to ordinary speech voices. 
We\nalso adopt a frame-level interaction reduction-based loss and encoder ensemble\nto enhance the transferability of SongBsAb to unknown SVC models. We\ndemonstrate the prevention effectiveness, harmlessness, and robustness of\nSongBsAb on five diverse and promising SVC models, using both English and\nChinese datasets, and both objective and human study-based subjective metrics.\nOur work fosters an emerging research direction for mitigating illegal\nautomated song covers.\n","authors":["Guangke Chen","Yedi Zhang","Fu Song","Ting Wang","Xiaoning Du","Yang Liu"],"pdf_url":"https://arxiv.org/pdf/2401.17133v2.pdf","comment":"In Proceedings of the 32nd Network and Distributed System Security\n (NDSS) Symposium 2025"}]}} \ No newline at end of file diff --git a/favicon.ico b/favicon.ico new file mode 100644 index 00000000..7f5166c7 Binary files /dev/null and b/favicon.ico differ diff --git a/index.css b/index.css new file mode 100644 index 00000000..9ded9d94 --- /dev/null +++ b/index.css @@ -0,0 +1,355 @@ +:root { + /* Palette: Nord (https://www.nordtheme.com)*/ + --nord00: #2e3440; + --nord01: #3b4252; + --nord02: #434c5e; + --nord03: #4c566a; + --nord04: #d8dee9; + --nord05: #e5e9f0; + --nord06: #eceff4; + --nord07: #8fbcbb; + --nord08: #88c0d0; + --nord09: #81a1c1; + --nord0A: #5e81ac; + --nord0B: #bf616a; + --nord0C: #d08770; + --nord0D: #ebcb8b; + --nord0E: #a3be8c; + --nord0F: #b48ead; + + + /* Typograph */ + --font-family-default: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen-Sans, Ubuntu, Cantarell, "Helvetica Neue", + sans-serif; + --font-size-scaler: 62.5%; + --font-size-m: 1.6rem; + --font-size-s: 1.4rem; + + /* Components */ + --body-color: var(--nord06); + --body-bg: var(--nord00); + + --header-title: var(--nord06); + --header-container: var(--nord00); + --header-title-preffix: var(--nord0F); + + --chip-font: var(--nord08); + --chip-color: var(--nord0B); + + --icons: var(--nord06); + --icons-hover: var(--nord0F); + + --day-container: var(--nord01); + --date: var(--nord09); + + --summary: var(--nord0E); + --summary-hover: var(--nord0F); + + --details-open: var(--nord02); + --details-content: var(--nord05); + --details-a: var(--nord07); + --details-a-hover: var(--nord0F); + + --highlight-title: var(--nord0B); + --highlight-author: var(--nord0B); + + --article-summary-hover-color: var(--nord0D); + --article-summary-color: var(--nord04); + + --article-title-color: var(--nord05); + --article-title-hover-color: var(--nord0E); + + --accordion-content-rail-color: var(--nord01); + --accordion-content-hover-rail-color: var(--nord0D); + --accordion-title-marker-color: var(--nord01); + --accordion-title-hover-marker-color: var(--nord0E); + + --footer-color: var(--nord04); + --footer-link-hover-color: var(--nord0D); +} + +[data-theme="light"] { + /* Theme design */ + + --color-primary: var(--nord07); + --color-primary-second: var(--nord00); + --color-info: var(--nord0A); + --color-success: var(--nord0E); + --color-warning: var(--nord0C); + --color-danger: var(--nord0B); + + --color-text: var(--nord00); + --color-hover: var(--nord0D); + --color-shadow: var(--nord03); + + --color-primary-h: var(--nord09); + --color-primary-s: var(--nord08); + --color-primary-l: var(--nord07); + + --color-contrast-higher-h: var(--nord01); + --color-contrast-higher-l: var(--nord02); + --color-contrast-higher-s: var(--nord03); + + --color-content: white; + + --background: var(--nord06); + --background-content: var(--nord05); + --background-color: var(--nord04); + + /* Components */ + + --chip-font: 
var(--nord06); + --chip-color: var(--nord09); + + --body-color: var(--background-color); + --body-bg: var(--background); + + --header-title: var(--color-shadow); + --header-container: var(--background); + --header-title-preffix: var(--color-primary-h); + + --icons: var(--color-shadow); + --icons-hover: var(--color-hover); + + --day-container: var(--background-content); + --date: var(--color-primary-l); + + --summary: var(--color-info); + --summary-hover: var(--color-success); + + --details-open: var(--color-content); + --details-content: var(--color-text); + --details-a: var(--color-primary-h); + --details-a-hover: var(--color-hover); + + --highlight-title: var(--color-danger); + --highlight-author: var(--color-warning); + + --article-summary-color: var(--color-text); + --article-summary-hover-color: var(--color-primary-s); + + --article-title-color: var(--color-primary); + --article-title-hover-color: var(--color-success); + + --accordion-content-rail-color: var(--color-warning); + --accordion-content-hover-rail-color: var(--color-warning); + --accordion-title-marker-color: var(--color-success); + --accordion-title-hover-marker-color: var(--color-success); + + --footer-color: var(--color-text); + --footer-link-hover-color: var(--color-hover); +} + +html { + font-size: var(--font-size-scaler); +} + +body { + background-color: var(--body-bg); + font-family: var(--font-family-default); + color: var(--body-color); + margin: 0; + padding-top: 16px; + display: grid; +} + +.header-container { + width: 90%; + max-width: 1200px; + background: var(--header-container); + margin: 0 auto; +} + +.header-title { + font-size: 32px; + font-weight: bold; + color: var(--header-title); + margin: 0; + padding-bottom: 14px; +} + +.header-title-preffix { + color: var(--header-title-preffix); +} + +.icons { + color: var(--icons); + padding-bottom: 16px; +} + +.icons a { + color: var(--icons); + text-decoration: none; +} + +.icons a:hover { + color: var(--icons-hover); +} + +.day-container { + padding: 16px 16px 16px 16px; + background: var(--day-container); + width: 90%; + max-width: 1200px; + margin: 0 auto; + margin-bottom: 8px; + border-radius: 10px; +} + +.date { + font-size: 24px; + font-weight: 700; + margin: 0; + color: var(--date); +} + +p { + margin: 0; +} + +summary { + font-weight: 600; + color: var(--summary); +} + +summary:hover { + text-decoration: underline; + cursor: pointer; + color: var(--summary-hover); +} + +details { + --border-color: transparent; + + padding: 2px 4px; + font-size: 20px; + border: 1px solid var(--border-color); + border-radius: 4px; +} + +details[open] { + background-color: var(--details-open); + margin-bottom: 8px; +} + +.details-content { + padding: 12px 3px; + gap: 16px; + color: var(--details-content); +} + +details a { + color: var(--details-a); +} + +details a:hover { + color: var(--details-a-hover); +} + +footer { + margin: 0 auto; + color: var(--footer-color); + font-size: var(--font-size-s); + display: flex; + padding: 0 16px; + justify-content: space-between; +} + +.description { + margin: 0 auto; + color: var(--footer-color); + font-size: var(--font-size-s); + display: flex; + padding: 0 16px; + text-align: center; +} + +.highlight-author { + color: var(--highlight-author); + font-weight: bold; +} + +.highlight-title { + color: var(--highlight-title); + font-weight: bold; +} + +.channel-description { + text-align: center; + font-size: var(--font-size-scaler); +} + +.article-summary-link { + color: var(--article-summary-color); + font-size: var(--font-size-s); + 
text-decoration: none; +} + +.article-summary-link:hover { + color: var(--article-summary-hover-color); + --accordion-content-rail-color: var(--accordion-content-hover-rail-color); +} + +.article-summary-box-outer { + display: block; + padding: 4px 8px 8px 4px; +} + +.article-summary-box-inner { + padding-left: 8px; + border-left: 1px solid var(--accordion-content-rail-color); + font-size: var(--font-size-m); +} + +.article-expander { + padding: 10px 4px; + border-radius: 4px; +} + +.article-authors { + font-size: var(--font-size-m); + padding: 0.25em 1em; +} + +.article-authors a { + text-decoration: none; +} + +.article-expander-title { + font-size: var(--font-size-m); + font-weight: 600; +} + +.article-expander-title:hover { + cursor: pointer; +} + +.article-expander-title::marker { + color: var(--accordion-title-marker-color); +} + +.article-expander-title:hover::marker { + color: var(--accordion-title-hover-marker-color); +} + +/* for switcher */ +.theme-switch { + display: inline-block; + position: relative; +} + +.theme-switch input { + display: none; +} + +/* chip */ +.chip { + font-size: 90%; + align-items: center; + color: var(--chip-font); + background: var(--chip-color); + border-radius: 5rem; + display: inline-flex; + padding: .2rem .4rem; + vertical-align: middle; +} \ No newline at end of file diff --git a/index.html b/index.html new file mode 100644 index 00000000..b64d838f --- /dev/null +++ b/index.html @@ -0,0 +1,21784 @@ + + + + + MyArxiv + + + + + + + + + + + + + + + +
+
+
+
+ MyArxiv +
+
+ +
+ +
+
+
+ +
+
+ +
+
+
+ + Computation and Language 72 + +
+
+
+ + ☆ TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft + + +
+ Collaboration is a cornerstone of society. In the real world, human teammates +make use of multi-sensory data to tackle challenging tasks in ever-changing +environments. It is essential for embodied agents collaborating in +visually-rich environments replete with dynamic interactions to understand +multi-modal observations and task specifications. To evaluate the performance +of generalizable multi-modal collaborative agents, we present TeamCraft, a +multi-modal multi-agent benchmark built on top of the open-world video game +Minecraft. The benchmark features 55,000 task variants specified by multi-modal +prompts, procedurally-generated expert demonstrations for imitation learning, +and carefully designed protocols to evaluate model generalization capabilities. +We also perform extensive analyses to better understand the limitations and +strengths of existing approaches. Our results indicate that existing models +continue to face significant challenges in generalizing to novel goals, scenes, +and unseen numbers of agents. These findings underscore the need for further +research in this area. The TeamCraft platform and dataset are publicly +available at https://github.com/teamcraft-bench/teamcraft. + +
+
+
+
+
+ + ☆ Uncertainty Quantification for Transformer Models for Dark-Pattern + Detection + + +
+ The opaque nature of transformer-based models, particularly in applications +susceptible to unethical practices such as dark-patterns in user interfaces, +requires models that integrate uncertainty quantification to enhance trust in +predictions. This study focuses on dark-pattern detection, deceptive design +choices that manipulate user decisions, undermining autonomy and consent. We +propose a differential fine-tuning approach implemented at the final +classification head via uncertainty quantification with transformer-based +pre-trained models. Employing a dense neural network (DNN) head architecture as +a baseline, we examine two methods capable of quantifying uncertainty: +Spectral-normalized Neural Gaussian Processes (SNGPs) and Bayesian Neural +Networks (BNNs). These methods are evaluated on a set of open-source +foundational models across multiple dimensions: model performance, variance in +certainty of predictions and environmental impact during training and inference +phases. Results demonstrate that integrating uncertainty quantification +maintains performance while providing insights into challenging instances +within the models. Moreover, the study reveals that the environmental impact +does not uniformly increase with the incorporation of uncertainty +quantification techniques. The study's findings demonstrate that uncertainty +quantification enhances transparency and provides measurable confidence in +predictions, improving the explainability and clarity of black-box models. This +facilitates informed decision-making and mitigates the influence of +dark-patterns on user interfaces. These results highlight the importance of +incorporating uncertainty quantification techniques in developing machine +learning models, particularly in domains where interpretability and +trustworthiness are critical. + +
+
+
+
+
+ + ☆ Enhancing FKG.in: automating Indian food composition analysis + + +
+ This paper presents a novel approach to compute food composition data for +Indian recipes using a knowledge graph for Indian food (FKG.in) and LLMs. The +primary focus is to provide a broad overview of an automated food composition +analysis workflow and describe its core functionalities: nutrition data +aggregation, food composition analysis, and LLM-augmented information +resolution. This workflow aims to complement FKG.in and iteratively supplement +food composition data from verified knowledge bases. Additionally, this paper +highlights the challenges of representing Indian food and accessing food +composition data digitally. It also reviews three key sources of food +composition data: the Indian Food Composition Tables, the Indian Nutrient +Databank, and the Nutritionix API. Furthermore, it briefly outlines how users +can interact with the workflow to obtain diet-based health recommendations and +detailed food composition information for numerous recipes. We then explore the +complex challenges of analyzing Indian recipe information across dimensions +such as structure, multilingualism, and uncertainty as well as present our +ongoing work on LLM-based solutions to address these issues. The methods +proposed in this workshop paper for AI-driven knowledge curation and +information resolution are application-agnostic, generalizable, and replicable +for any domain. + +
+
+ comment: 15 pages, 3 figures, 30 references, International Conference on + Pattern Recognition 2024 - Multimedia Assisted Dietary Management Workshop +
+
+
+
+
+ + ☆ MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at + Scale + + +
+ Open-source multimodal large language models (MLLMs) have shown significant +potential in a broad range of multimodal tasks. However, their reasoning +capabilities remain constrained by existing instruction-tuning datasets, which +were predominately repurposed from academic datasets such as VQA, AI2D, and +ChartQA. These datasets target simplistic tasks, and only provide phrase-level +answers without any intermediate rationales. To address these challenges, we +introduce a scalable and cost-effective method to construct a large-scale +multimodal instruction-tuning dataset with rich intermediate rationales +designed to elicit CoT reasoning. Using only open models, we create a dataset +containing 12M instruction-response pairs to cover diverse, reasoning-intensive +tasks with detailed and faithful rationales. Experiments demonstrate that +training MLLMs on this dataset significantly improves reasoning capabilities, +achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), +MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates +notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation +studies further highlight the importance of key components, such as rewriting +and self-filtering, in the dataset construction process. + +
+
+
+
+
+ + ☆ LIAR: Leveraging Alignment (Best-of-N) to Jailbreak LLMs in Seconds + + +
+ Many existing jailbreak techniques rely on solving discrete combinatorial +optimization, while more recent approaches involve training LLMs to generate +multiple adversarial prompts. However, both approaches require significant +computational resources to produce even a single adversarial prompt. We +hypothesize that the inefficiency of current approaches stems from an +inadequate characterization of the jailbreak problem. To address this gap, we +formulate the jailbreak problem in terms of alignment. By starting from an +available safety-aligned model, we leverage an unsafe reward to guide the safe +model towards generating unsafe outputs using alignment techniques (e.g., +reinforcement learning from human feedback), effectively performing +jailbreaking via alignment. We propose a novel jailbreak method called LIAR +(LeveragIng Alignment to jailbReak). To demonstrate the simplicity and +effectiveness of our approach, we employ a best-of-N method to solve the +alignment problem. LIAR offers significant advantages: lower computational +requirements without additional training, fully black-box operation, +competitive attack success rates, and more human-readable prompts. We provide +theoretical insights into the possibility of jailbreaking a safety-aligned +model, revealing inherent vulnerabilities in current alignment strategies for +LLMs. We also provide sub-optimality guarantees for the proposed LIAR method. +Experimentally, we achieve ASR comparable to the SoTA with a 10x improvement to +perplexity and a Time-to-Attack measured in seconds rather than tens of hours. +
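The best-of-N step described above can be pictured as a small, self-contained loop. This is only a sketch, not the authors' implementation: `generate_candidates` and `unsafe_reward` are hypothetical placeholders for an attacker-side sampler and a reward model that scores how unsafe a candidate prompt is likely to be.

```python
import random

def generate_candidates(prompt: str, n: int) -> list[str]:
    # Hypothetical stand-in for sampling n adversarial rewrites of the
    # prompt from an attacker language model at non-zero temperature.
    return [f"{prompt} [candidate {i}]" for i in range(n)]

def unsafe_reward(candidate: str) -> float:
    # Hypothetical stand-in for a reward model estimating how likely the
    # target model is to produce unsafe output for this candidate.
    return random.random()

def best_of_n(prompt: str, n: int = 64) -> str:
    """Best-of-N 'alignment': sample N candidates, keep the highest-reward one."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=unsafe_reward)

if __name__ == "__main__":
    print(best_of_n("original query", n=8))
```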
+
+
+
+
+ + ☆ BEExformer: A Fast Inferencing Transformer Architecture via Binarization + with Multiple Early Exits + + +
+ Large Language Models (LLMs) based on transformers achieve cutting-edge +results on a variety of applications. However, their enormous size and +processing requirements make deployment on devices with constrained resources +extremely difficult. Among various efficiency considerations, model +binarization and Early Exit (EE) are common effective solutions. However, +binarization may lead to performance loss due to reduced precision affecting +gradient estimation and parameter updates. Besides, the present early-exit +mechanisms are still in the nascent stages of research. To ameliorate these +issues, we propose Binarized Early Exit Transformer (BEExformer), the +first-ever selective learning transformer architecture to combine early exit +with binarization for textual inference. It improves the binarization process +through a differentiable second-order approximation to the impulse function. +This enables gradient computation concerning both the sign as well as the +magnitude of the weights. In contrast to absolute threshold-based EE, the +proposed EE mechanism hinges on fractional reduction in entropy among +intermediate transformer blocks with soft-routing loss estimation. While +binarization results in 18.44 times reduction in model size, early exit reduces +the FLOPs during inference by 54.85% and even improves accuracy by 5.98% +through resolving the "overthinking" problem inherent in deep networks. +Moreover, the proposed BEExformer simplifies training by not requiring +knowledge distillation from a full-precision LLM. Extensive evaluation on the +GLUE dataset and comparison with the SOTA works showcase its pareto-optimal +performance-efficiency trade-off. + +
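The entropy-based exit rule can be illustrated with a simplified sketch. The assumption here is that each transformer block exposes intermediate classification logits and that inference exits once the relative drop in predictive entropy between consecutive blocks becomes small; this is an illustrative reading of the criterion, not the paper's exact formulation with soft-routing loss.

```python
import numpy as np

def entropy(logits: np.ndarray) -> float:
    # Shannon entropy of the softmax distribution over class logits.
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def early_exit_predict(block_logits: list, min_frac_drop: float = 0.05):
    """Run blocks in order; exit when entropy stops dropping by at least
    `min_frac_drop` relative to the previous block."""
    prev_h = None
    for depth, logits in enumerate(block_logits):
        h = entropy(logits)
        if prev_h is not None and (prev_h - h) / max(prev_h, 1e-12) < min_frac_drop:
            return int(np.argmax(logits)), depth  # confidence has plateaued
        prev_h = h
    return int(np.argmax(block_logits[-1])), len(block_logits) - 1

# toy example: intermediate classifier outputs from three blocks
logits_per_block = [np.array([0.2, 0.1, 0.0]),
                    np.array([1.5, 0.2, -0.3]),
                    np.array([1.6, 0.2, -0.3])]
print(early_exit_predict(logits_per_block))
```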
+
+ comment: 15 pages, 15 figures, 3 tables +
+
+
+
+
+ + ☆ 100% Hallucination Elimination Using Acurai + + +
+ The issue of hallucinations in large language models (LLMs) remains a +critical barrier to the adoption of AI in enterprise and other high-stakes +applications. Despite advancements in retrieval-augmented generation (RAG) +systems, current state-of-the-art methods fail to achieve more than 80% +accuracy in generating faithful and factually correct outputs, even when +provided with relevant and accurate context. In this work, we introduce Acurai, +a novel systematic approach that achieves 100% hallucination-free responses in +LLMs by reformatting queries and context data prior to input. Leveraging a deep +understanding of LLM internal representations, the importance of noun-phrase +dominance, and the role of discrete functional units (DFUs), Acurai ensures +alignment between input context and generated output. We validate this method +using the RAGTruth corpus, demonstrating its ability to eliminate 100% +hallucinations for both GPT-4 and GPT-3.5 Turbo. Acurai sets a new standard for +achieving consistent, accurate, and faithful AI responses, marking a +significant step forward in the development of trustworthy AI systems. + +
+
+
+
+
+ + ☆ Evaluating and Aligning CodeLLMs on Human Preference + + +
+ Code large language models (codeLLMs) have made significant strides in code +generation. Most previous code-related benchmarks, which consist of various +programming exercises along with the corresponding test cases, are used as a +common measure to evaluate the performance and capabilities of code LLMs. +However, the current code LLMs focus on synthesizing the correct code snippet, +ignoring the alignment with human preferences, where the query should be +sampled from the practical application scenarios and the model-generated +responses should satisfy the human preference. To bridge the gap between the +model-generated response and human preference, we present a rigorous +human-curated benchmark CodeArena to emulate the complexity and diversity of +real-world coding tasks, comprising 397 high-quality samples spanning 40 categories +and 44 programming languages, carefully curated from user queries. Further, we +propose a diverse synthetic instruction corpus SynCode-Instruct (nearly 20B +tokens) by scaling instructions from the website to verify the effectiveness of +large-scale synthetic instruction fine-tuning, where Qwen2.5-SynCoder, trained +entirely on synthetic instruction data, can achieve top-tier performance among +open-source code LLMs. The results reveal performance differences between +execution-based benchmarks and CodeArena. Our systematic experiments of +CodeArena on 40+ LLMs reveal a notable performance gap between open SOTA code +LLMs (e.g. Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring +the importance of human preference +alignment (https://codearenaeval.github.io/). +
+
+
+
+
+ + ☆ ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented + Argumentation with LLM Judges + + +
+ Computational argumentation, which involves generating answers or summaries +for controversial topics like abortion bans and vaccination, has become +increasingly important in today's polarized environment. Sophisticated LLM +capabilities offer the potential to provide nuanced, evidence-based answers to +such questions through Retrieval-Augmented Argumentation (RAArg), leveraging +real-world evidence for high-quality, grounded arguments. However, evaluating +RAArg remains challenging, as human evaluation is costly and difficult for +complex, lengthy answers on complicated topics. At the same time, re-using +existing argumentation datasets is no longer sufficient, as they lack long, +complex arguments and realistic evidence from potentially misleading sources, +limiting holistic evaluation of retrieval effectiveness and argument quality. +To address these gaps, we investigate automated evaluation methods using +multiple fine-grained LLM judges, providing better and more interpretable +assessments than traditional single-score metrics and even previously reported +human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, +a new benchmark featuring long and complex human-authored arguments on debated +topics, grounded in real-world websites, allowing an exhaustive evaluation +across retrieval effectiveness, argument quality, and groundedness. We validate +our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed +LLM Judges and the ConQRet benchmark can enable rapid progress in computational +argumentation and can be naturally extended to other complex +retrieval-augmented generation tasks. + +
+
+
+
+
+ + ☆ QueEn: A Large Language Model for Quechua-English Translation + + +
+ Recent studies show that large language models (LLMs) are powerful tools for +working with natural language, bringing advances in many areas of computational +linguistics. However, these models face challenges when applied to low-resource +languages due to limited training data and difficulty in understanding cultural +nuances. In this paper, we propose QueEn, a novel approach for Quechua-English +translation that combines Retrieval-Augmented Generation (RAG) with +parameter-efficient fine-tuning techniques. Our method leverages external +linguistic resources through RAG and uses Low-Rank Adaptation (LoRA) for +efficient model adaptation. Experimental results show that our approach +substantially exceeds baseline models, with a BLEU score of 17.6 compared to +1.5 for standard GPT models. The integration of RAG with fine-tuning allows our +system to address the challenges of low-resource language translation while +maintaining computational efficiency. This work contributes to the broader goal +of preserving endangered languages through advanced language technologies. + +
+
+
+
+
+ + ☆ Benchmarking Open-ended Audio Dialogue Understanding for Large + Audio-Language Models + + +
+ Large Audio-Language Models (LALMs) have unlocked audio dialogue +capabilities, where audio dialogues are a direct exchange of spoken language +between LALMs and humans. Recent advances, such as GPT-4o, have enabled LALMs +to engage in back-and-forth audio dialogues with humans. This progression not only +underscores the potential of LALMs but also broadens their applicability across +a wide range of practical scenarios supported by audio dialogues. However, +given these advancements, a comprehensive benchmark to evaluate the performance +of LALMs in open-ended audio dialogue understanding is still absent. To address this gap, we propose an Audio Dialogue Understanding +Benchmark (ADU-Bench), which consists of 4 benchmark datasets. They assess the +open-ended audio dialogue ability of LALMs in 3 general scenarios, 12 skills, +9 multilingual languages, and 4 categories of ambiguity handling. Notably, we +are the first to propose the evaluation of ambiguity handling in audio dialogues that +express different intentions beyond the same literal meaning of sentences, +e.g., "Really!?" with different intonations. In summary, ADU-Bench includes +over 20,000 open-ended audio dialogues for the assessment of LALMs. Through +extensive experiments conducted on 13 LALMs, our analysis reveals that there is +still considerable room for improvement in the audio dialogue understanding +abilities of existing LALMs. In particular, they struggle with mathematical +symbols and formulas, understanding human behavior such as roleplay, +comprehending multiple languages, and handling audio dialogue ambiguities from +different phonetic elements, such as intonations, pause positions, and +homophones. +
+
+
+
+
+ + ☆ Multimodal Fact-Checking with Vision Language Models: A Probing + Classifier based Solution with Embedding Strategies COLING2025 + + +
+ This study evaluates the effectiveness of Vision Language Models (VLMs) in +representing and utilizing multimodal content for fact-checking. To be more +specific, we investigate whether incorporating multimodal content improves +performance compared to text-only models and how well VLMs utilize text and +image information to enhance misinformation detection. Furthermore we propose a +probing classifier based solution using VLMs. Our approach extracts embeddings +from the last hidden layer of selected VLMs and inputs them into a neural +probing classifier for multi-class veracity classification. Through a series of +experiments on two fact-checking datasets, we demonstrate that while +multimodality can enhance performance, fusing separate embeddings from text and +image encoders yielded superior results compared to using VLM embeddings. +Furthermore, the proposed neural classifier significantly outperformed KNN and +SVM baselines in leveraging extracted embeddings, highlighting its +effectiveness for multimodal fact-checking. + +
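A probing classifier of the kind described here is simply a small feed-forward network trained on frozen embeddings. The sketch below assumes the VLM embeddings have already been extracted into a feature matrix; the embedding dimension, batch size, and three-way veracity labels are illustrative placeholders.

```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """Small MLP trained on frozen last-hidden-layer embeddings."""
    def __init__(self, embed_dim: int = 4096, num_classes: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# toy training loop on random stand-in embeddings
emb = torch.randn(128, 4096)           # frozen VLM embeddings (placeholder)
labels = torch.randint(0, 3, (128,))   # veracity classes (placeholder)
model = ProbingClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(emb), labels)
    loss.backward()
    opt.step()
```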
+
+ comment: Accepted to COLING2025 +
+
+
+
+
+ + ☆ Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on + Developmentally Plausible Corpora + + +
+ The BabyLM Challenge is a community effort to close the data-efficiency gap +between human and computational language learners. Participants compete to +optimize language model training on a fixed language data budget of 100 million +words or less. This year, we released improved text corpora, as well as a +vision-and-language corpus to facilitate research into cognitively plausible +vision language models. Submissions were compared on evaluation tasks targeting +grammatical ability, (visual) question answering, pragmatic abilities, and +grounding, among other abilities. Participants could submit to a 10M-word +text-only track, a 100M-word text-only track, and/or a 100M-word and image +multimodal track. From 31 submissions employing diverse methods, a hybrid +causal-masked language model architecture outperformed other approaches. No +submissions outperformed the baselines in the multimodal track. In follow-up +analyses, we found a strong relationship between training FLOPs and average +performance across tasks, and that the best-performing submissions proposed +changes to the training data, training objective, and model architecture. This +year's BabyLM Challenge shows that there is still significant room for +innovation in this setting, in particular for image-text modeling, but +community-driven research can yield actionable insights about effective +strategies for small-scale language modeling. + +
+
+
+
+
+ + ☆ Explingo: Explaining AI Predictions using Large Language Models + + +
+ Explanations of machine learning (ML) model predictions generated by +Explainable AI (XAI) techniques such as SHAP are essential for people using ML +outputs for decision-making. We explore the potential of Large Language Models +(LLMs) to transform these explanations into human-readable, narrative formats +that align with natural communication. We address two key research questions: +(1) Can LLMs reliably transform traditional explanations into high-quality +narratives? and (2) How can we effectively evaluate the quality of narrative +explanations? To answer these questions, we introduce Explingo, which consists +of two LLM-based subsystems, a Narrator and Grader. The Narrator takes in ML +explanations and transforms them into natural-language descriptions. The Grader +scores these narratives on a set of metrics including accuracy, completeness, +fluency, and conciseness. + Our experiments demonstrate that LLMs can generate high-quality narratives +that achieve high scores across all metrics, particularly when guided by a +small number of human-labeled and bootstrapped examples. We also identified +areas that remain challenging, in particular for effectively scoring narratives +in complex domains. The findings from this work have been integrated into an +open-source tool that makes narrative explanations available for further +applications. + +
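The Narrator/Grader split can be pictured as two prompt templates wrapped around a generic chat-completion call. The sketch below is an assumption about how such a pipeline might be wired up, not Explingo's actual interface: `call_llm` is a hypothetical stub, and the metric list simply mirrors the four criteria named in the abstract.

```python
def call_llm(prompt: str) -> str:
    # Stub for any chat-completion API; replace with a real call.
    return "0.8"

def narrate(feature_contributions: dict[str, float]) -> str:
    """Narrator: turn SHAP-style (feature, contribution) pairs into prose."""
    table = "\n".join(f"- {k}: {v:+.3f}" for k, v in feature_contributions.items())
    return call_llm(
        "Explain the following feature contributions to a non-expert "
        f"in two or three sentences:\n{table}"
    )

def grade(narrative: str, feature_contributions: dict[str, float]) -> dict[str, float]:
    """Grader: score the narrative on the metrics named in the abstract."""
    scores = {}
    for metric in ("accuracy", "completeness", "fluency", "conciseness"):
        reply = call_llm(
            f"Rate the {metric} of this explanation from 0 to 1.\n"
            f"Explanation: {narrative}\nUnderlying values: {feature_contributions}\n"
            "Answer with a single number."
        )
        scores[metric] = float(reply)
    return scores

print(grade(narrate({"income": 0.42, "age": -0.13}), {"income": 0.42, "age": -0.13}))
```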
+
+ comment: To be presented in the 2024 IEEE International Conference on Big Data + (IEEE BigData) +
+
+
+
+
+ + ☆ A Practical Examination of AI-Generated Text Detectors for Large + Language Models + + +
+ The proliferation of large language models has raised growing concerns about +their misuse, particularly in cases where AI-generated text is falsely +attributed to human authors. Machine-generated content detectors claim to +effectively identify such text under various conditions and from any language +model. This paper critically evaluates these claims by assessing several +popular detectors (RADAR, Wild, T5Sentinel, Fast-DetectGPT, GPTID, LogRank, +Binoculars) on a range of domains, datasets, and models that these detectors +have not previously encountered. We employ various prompting strategies to +simulate adversarial attacks, demonstrating that even moderate efforts can +significantly evade detection. We emphasize the importance of the true positive +rate at a specific false positive rate (TPR@FPR) metric and demonstrate that +these detectors perform poorly in certain settings, with TPR@.01 as low as 0%. +Our findings suggest that both trained and zero-shot detectors struggle to +maintain high sensitivity while keeping the false positive rate reasonably low. +
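The TPR@FPR metric emphasized above is computed by picking the detection threshold from the scores of human-written (negative) texts and then measuring recall on machine-generated (positive) texts at that threshold. A minimal numpy version, assuming higher scores mean "more likely machine-generated":

```python
import numpy as np

def tpr_at_fpr(human_scores, machine_scores, fpr: float = 0.01) -> float:
    """True-positive rate of a detector at a fixed false-positive rate."""
    human_scores = np.asarray(human_scores)
    machine_scores = np.asarray(machine_scores)
    # Threshold chosen so that at most `fpr` of human texts are flagged.
    threshold = np.quantile(human_scores, 1.0 - fpr)
    return float((machine_scores > threshold).mean())

# toy example with overlapping score distributions
rng = np.random.default_rng(0)
human = rng.normal(0.0, 1.0, 10_000)
machine = rng.normal(1.5, 1.0, 10_000)
print(f"TPR@1%FPR = {tpr_at_fpr(human, machine, 0.01):.3f}")
```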
+
+ comment: 8 pages. Submitted to ARR October cycle +
+
+
+
+
+ + ☆ Unifying Dual-Space Embedding for Entity Alignment via Contrastive + Learning COLING2025 + + +
+ Entity alignment aims to match identical entities across different knowledge +graphs (KGs). Graph neural network-based entity alignment methods have achieved +promising results in Euclidean space. However, KGs often contain complex +structures, including both local and hierarchical ones, which make it +challenging to efficiently represent them within a single space. In this paper, +we propose a novel method, UniEA, which unifies dual-space embedding to +preserve the intrinsic structure of KGs. Specifically, we learn graph structure +embeddings in both Euclidean and hyperbolic spaces simultaneously to maximize +the consistency between the embeddings in the two spaces. Moreover, we employ +contrastive learning to mitigate the misalignment issues caused by similar +entities, where embeddings of similar neighboring entities within the KG become +too close in distance. Extensive experiments on benchmark datasets demonstrate +that our method achieves state-of-the-art performance in structure-based EA. +Our code is available at https://github.com/wonderCS1213/UniEA. +
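The hyperbolic half of a dual-space embedding is usually realized in the Poincaré ball, where distance is d(u,v) = arcosh(1 + 2‖u−v‖² / ((1−‖u‖²)(1−‖v‖²))). The sketch below implements that distance together with a simple penalty that keeps Euclidean and hyperbolic pairwise distances consistent; it illustrates the general idea and is not UniEA's actual loss.

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Distance in the Poincare ball (points must have norm < 1)."""
    sq = ((u - v) ** 2).sum(-1)
    denom = (1 - (u ** 2).sum(-1)) * (1 - (v ** 2).sum(-1))
    x = 1 + 2 * sq / denom.clamp_min(eps)
    return torch.acosh(x.clamp_min(1 + eps))

def consistency_loss(euc_u, euc_v, hyp_u, hyp_v):
    """Encourage the two spaces to agree on pairwise distances (illustrative)."""
    d_e = ((euc_u - euc_v) ** 2).sum(-1).sqrt()
    d_h = poincare_distance(hyp_u, hyp_v)
    return ((d_e - d_h) ** 2).mean()

# toy usage on random embeddings
u_e, v_e = torch.randn(8, 32), torch.randn(8, 32)
u_h, v_h = torch.rand(8, 32) * 0.1, torch.rand(8, 32) * 0.1  # inside the unit ball
print(consistency_loss(u_e, v_e, u_h, v_h))
```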
+
+ comment: Accepted by COLING2025 +
+
+
+
+
+ + ☆ Steps are all you need: Rethinking STEM Education with Prompt + Engineering + + +
+ Few shot and Chain-of-Thought prompting have shown promise when applied to +Physics Question Answering Tasks, but are limited by the lack of mathematical +ability inherent to LLMs, and are prone to hallucination. By utilizing a +Mixture of Experts (MoE) Model, along with analogical prompting, we are able to +show improved model performance when compared to the baseline on standard LLMs. +We also survey the limits of these prompting techniques and the effects they +have on model performance. Additionally, we propose Analogical CoT prompting, a +prompting technique designed to allow smaller, open source models to leverage +Analogical prompting, something they have struggled with, possibly due to a +lack of specialist training data. + +
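Analogical chain-of-thought prompting can be reduced to a template that asks the model to recall similar solved problems before attempting the target question. The wording below is an illustrative guess at such a template, not the exact prompt used in the paper.

```python
def analogical_cot_prompt(question: str, n_examples: int = 3) -> str:
    """Build an analogical chain-of-thought prompt for a physics question."""
    return (
        f"Problem: {question}\n\n"
        f"First, recall {n_examples} physics problems that are analogous to this one. "
        "For each, state the problem, the key principle, and its solution.\n"
        "Then, using those analogies, solve the original problem step by step.\n"
        "Finally, state the numerical answer on its own line prefixed with 'Answer:'."
    )

print(analogical_cot_prompt("A 2 kg block slides down a frictionless 30 degree "
                            "incline of length 5 m. What is its final speed?"))
```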
+
+
+
+
+ + ☆ PETapter: Leveraging PET-style classification heads for modular few-shot + parameter-efficient fine-tuning + + +
+ Few-shot learning and parameter-efficient fine-tuning (PEFT) are crucial to +overcome the challenges of data scarcity and ever growing language model sizes. +This applies in particular to specialized scientific domains, where researchers +might lack expertise and resources to fine-tune high-performing language models +to nuanced tasks. We propose PETapter, a novel method that effectively combines +PEFT methods with PET-style classification heads to boost few-shot learning +capabilities without the significant computational overhead typically +associated with full model training. We validate our approach on three +established NLP benchmark datasets and one real-world dataset from +communication research. We show that PETapter not only achieves comparable +performance to full few-shot fine-tuning using pattern-exploiting training +(PET), but also provides greater reliability and higher parameter efficiency +while enabling higher modularity and easy sharing of the trained modules, which +enables more researchers to utilize high-performing NLP-methods in their +research. + +
+
+
+
+
+ + ☆ Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for + Radiology Report Generation ACL 2024 + + +
+ We introduce a radiology-focused visual language model designed to generate +radiology reports from chest X-rays. Building on previous findings that large +language models (LLMs) can acquire multimodal capabilities when aligned with +pretrained vision encoders, we demonstrate similar potential with chest X-ray +images. This integration enhances the model's ability to understand and +describe chest X-ray images. Our model combines an image encoder with a +fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate +different sections of a radiology report with notable accuracy. The training +process involves a two-stage approach: (i) initial alignment of chest X-ray +features with the LLM, followed by (ii) fine-tuning for radiology report +generation. +
+
+ comment: Accepted by BioNLP@ACL 2024 +
+
+
+
+
+ + ☆ KaLM: Knowledge-aligned Autoregressive Language Modeling via Dual-view + Knowledge Graph Contrastive Learning + + +
+ Autoregressive large language models (LLMs) pre-trained by next token +prediction are inherently proficient in generative tasks. However, their +performance on knowledge-driven tasks such as factual knowledge querying +remains unsatisfactory. Knowledge graphs (KGs), as high-quality structured +knowledge bases, can provide reliable knowledge for LLMs, potentially +compensating for their knowledge deficiencies. Aligning LLMs with explicit, +structured knowledge from KGs has been a challenge; previous attempts either +failed to effectively align knowledge representations or compromised the +generative capabilities of LLMs, leading to less-than-optimal outcomes. This +paper proposes KaLM, a Knowledge-aligned Language Modeling +approach, which fine-tunes autoregressive LLMs to align with KG knowledge via +the joint objective of explicit knowledge alignment and implicit knowledge +alignment. The explicit knowledge alignment objective aims to directly optimize +the knowledge representation of LLMs through dual-view knowledge graph +contrastive learning. The implicit knowledge alignment objective focuses on +incorporating textual patterns of knowledge into LLMs through triple completion +language modeling. Notably, our method achieves a significant performance boost +in evaluations of knowledge-driven tasks, specifically embedding-based +knowledge graph completion and generation-based knowledge graph question +answering. +
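Dual-view contrastive learning of this kind typically reduces to an InfoNCE-style loss between two embeddings of the same knowledge-graph triple (for instance, a head-relation view and a tail/description view). The sketch below shows that generic loss; the choice of views and encoder is an assumption, not KaLM's exact recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(view_a: torch.Tensor, view_b: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE: matching rows of view_a/view_b are positives,
    all other rows in the batch serve as in-batch negatives."""
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage: two views of the same 16 knowledge-graph triples
view_a = torch.randn(16, 128)
view_b = view_a + 0.1 * torch.randn(16, 128)
print(info_nce(view_a, view_b))
```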
+
+
+
+
+ + ☆ C$^2$LEVA: Toward Comprehensive and Contamination-Free Language Model + Evaluation + + +
+ Recent advances in large language models (LLMs) have shown significant +promise, yet their evaluation raises concerns, particularly regarding data +contamination due to the lack of access to proprietary training data. To +address this issue, we present C$^2$LEVA, a comprehensive bilingual benchmark +featuring systematic contamination prevention. C$^2$LEVA firstly offers a +holistic evaluation encompassing 22 tasks, each targeting a specific +application or ability of LLMs, and secondly a trustworthy assessment due to +our contamination-free tasks, ensured by a systematic contamination prevention +strategy that fully automates test data renewal and enforces data protection +during benchmark data release. Our large-scale evaluation of 15 open-source and +proprietary models demonstrates the effectiveness of C$^2$LEVA. + +
+
+
+
+
+ + ☆ A Federated Approach to Few-Shot Hate Speech Detection for Marginalized + Communities + + +
+ Hate speech online remains an understudied issue for marginalized +communities, and has seen rising relevance, especially in the Global South, +which includes developing societies with increasing internet penetration. In +this paper, we aim to provide marginalized communities living in societies +where the dominant language is low-resource with a privacy-preserving tool to +protect themselves from hate speech on the internet by filtering offensive +content in their native languages. Our contribution in this paper is twofold: +1) we release REACT (REsponsive hate speech datasets Across ConTexts), a +collection of high-quality, culture-specific hate speech detection datasets +comprising seven distinct target groups in eight low-resource languages, +curated by experienced data collectors; 2) we propose a solution to few-shot +hate speech detection utilizing federated learning (FL), a privacy-preserving +and collaborative learning approach, to continuously improve a central model +that exhibits robustness when tackling different target groups and languages. +By keeping the training local to the users' devices, we ensure the privacy of +the users' data while benefitting from the efficiency of federated learning. +Furthermore, we personalize client models to target-specific training data and +evaluate their performance. Our results indicate the effectiveness of FL across +different target groups, whereas the benefits of personalization on few-shot +learning are not clear. + +
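+ For intuition, a bare-bones federated-averaging loop of the kind such a
+system could build on (a generic FedAvg sketch, not the authors' code; the
+classifier, data loaders, and uniform client weighting are all simplifying
+assumptions):
+
+import copy
+import torch
+import torch.nn as nn
+
+def local_update(global_model, loader, epochs=1, lr=1e-3):
+    # Each client fine-tunes a private copy on its own data; only weights leave the device.
+    model = copy.deepcopy(global_model)
+    opt = torch.optim.SGD(model.parameters(), lr=lr)
+    loss_fn = nn.CrossEntropyLoss()
+    for _ in range(epochs):
+        for x, y in loader:
+            opt.zero_grad()
+            loss_fn(model(x), y).backward()
+            opt.step()
+    return model.state_dict()
+
+def fed_avg(state_dicts):
+    # Uniform average of client weights; real systems weight by dataset size.
+    avg = copy.deepcopy(state_dicts[0])
+    for key in avg:
+        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(0)
+    return avg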
+
+
+
+
+ + ☆ Who Speaks Next? Multi-party AI Discussion Leveraging the Systematics of + Turn-taking in Murder Mystery Games + + +
+ Multi-agent systems utilizing large language models (LLMs) have shown great +promise in achieving natural dialogue. However, smooth dialogue control and +autonomous decision making among agents still remain challenges. In this study, +we focus on conversational norms such as adjacency pairs and turn-taking found +in conversation analysis and propose a new framework called "Murder Mystery +Agents" that applies these norms to AI agents' dialogue control. As an +evaluation target, we employed the "Murder Mystery" game, a reasoning-type +table-top role-playing game that requires complex social reasoning and +information manipulation. In this game, players need to unravel the truth of +the case based on fragmentary information through cooperation and bargaining. +The proposed framework integrates next speaker selection based on adjacency +pairs and a self-selection mechanism that takes agents' internal states into +account to achieve more natural and strategic dialogue. To verify the +effectiveness of this new approach, we analyzed utterances that led to dialogue +breakdowns and conducted automatic evaluation using LLMs, as well as human +evaluation using evaluation criteria developed for the Murder Mystery game. +Experimental results showed that the implementation of the next speaker +selection mechanism significantly reduced dialogue breakdowns and improved the +ability of agents to share information and perform logical reasoning. The +results of this study demonstrate that the systematics of turn-taking in human +conversation are also effective in controlling dialogue among AI agents, and +provide design guidelines for more advanced multi-agent dialogue systems. + +
+
+
+
+
+ + ☆ Probing the contents of semantic representations from text, behavior, + and brain data using the psychNorms metabase + + +
+ Semantic representations are integral to natural language processing,
+psycholinguistics, and artificial intelligence. Although such representations
+are often derived from internet text, recent years have seen a rise in the
+popularity of behavior-based (e.g., free associations) and brain-based (e.g.,
+fMRI) representations, which promise improvements in our ability to measure and
+model human representations. We carry out the first systematic evaluation of
+the similarities and differences between semantic representations derived from
+text, behavior, and brain data. Using representational similarity analysis, we
+show that word vectors derived from behavior and brain data encode information
+that differs from their text-derived cousins. Furthermore, drawing on our
+psychNorms metabase, alongside an interpretability method that we call
+representational content analysis, we find that, in particular, behavior
+representations capture unique variance on certain affective, agentic, and
+socio-moral dimensions. We thus establish behavior as an important complement
+to text for capturing human semantic representations. These results are
+broadly relevant to research aimed at learning human-aligned semantic
+representations, including work on evaluating and aligning large language
+models.
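+ A small illustration of representational similarity analysis as used here:
+compute pairwise-similarity matrices for two embedding spaces over the same
+word list and correlate their upper triangles. Random vectors stand in for
+text-, behavior-, or brain-derived embeddings; everything below is assumed for
+illustration.
+
+import numpy as np
+from scipy.stats import spearmanr
+
+def rsa(emb_a, emb_b):
+    def sim_matrix(e):
+        e = e / np.linalg.norm(e, axis=1, keepdims=True)
+        return e @ e.T
+    iu = np.triu_indices(emb_a.shape[0], k=1)   # upper triangle, no diagonal
+    return spearmanr(sim_matrix(emb_a)[iu], sim_matrix(emb_b)[iu]).correlation
+
+rng = np.random.default_rng(0)
+print(rsa(rng.normal(size=(50, 300)), rng.normal(size=(50, 64))))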
+
+ comment: 13 pages, 5 figures, 2 tables +
+
+
+
+
+ + ☆ Large Language Models for Ingredient Substitution in Food Recipes using + Supervised Fine-tuning and Direct Preference Optimization + + +
+ In this paper, we address the challenge of recipe personalization through
+ingredient substitution. We make use of Large Language Models (LLMs) to build
+an ingredient substitution system designed to predict plausible substitute
+ingredients within a given recipe context. Given that LLMs have rarely been
+applied to this task, we carry out an extensive set of experiments to determine
+the best LLM, prompt, and fine-tuning setup. We further experiment with methods
+such as multi-task learning, two-stage fine-tuning, and Direct Preference
+Optimization (DPO). The experiments are conducted using the publicly available
+Recipe1MSub corpus. The best results are produced by the Mistral7-Base LLM
+after fine-tuning and DPO, outperforming the strong baseline available for the
+same corpus with a Hit@1 score of 22.04. We thus believe that this research
+represents a significant step towards enabling personalized and creative
+culinary experiences through LLM-based ingredient substitution.
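+ For reference, the standard DPO objective used in such preference tuning can
+be sketched as follows; the log-probability inputs are placeholders for summed
+token log-probs of chosen/rejected substitutions under the policy and a frozen
+reference model, and beta is an assumed hyperparameter.
+
+import torch
+import torch.nn.functional as F
+
+def dpo_loss(policy_chosen_logp, policy_rejected_logp,
+             ref_chosen_logp, ref_rejected_logp, beta=0.1):
+    # Reward margins are implicit log-ratios between policy and reference.
+    chosen_ratio = policy_chosen_logp - ref_chosen_logp
+    rejected_ratio = policy_rejected_logp - ref_rejected_logp
+    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
+
+loss = dpo_loss(torch.tensor([-4.2]), torch.tensor([-5.0]),
+                torch.tensor([-4.5]), torch.tensor([-4.6]))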
+
+
+
+
+ + ☆ DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling + + +
+ Large language models (LLMs) have made dialogue one of the central modes of
+human-machine interaction, leading to the accumulation of vast amounts of
+conversation logs and increasing demand for dialogue generation. A
+conversational life-cycle spans from the Prelude through the Interlocution to
+the Epilogue, encompassing various elements. Despite the existence of numerous
+dialogue-related studies, there is a lack of benchmarks that encompass
+comprehensive dialogue elements, hindering precise modeling and systematic
+evaluation. To bridge this gap, we introduce an innovative research task
+$\textbf{D}$ialogue $\textbf{E}$lement $\textbf{MO}$deling, including
+$\textit{Element Awareness}$ and $\textit{Dialogue Agent Interaction}$, and
+propose a novel benchmark, $\textbf{DEMO}$, designed for comprehensive
+dialogue modeling and assessment. Inspired by imitation learning, we further
+build an agent that adeptly models dialogue elements based on the DEMO
+benchmark. Extensive experiments indicate that existing LLMs still exhibit
+considerable potential for enhancement, and our DEMO agent achieves superior
+performance on both in-domain and out-of-domain tasks.
+
+ comment: We release the code and data at https://github.com/MozerWang/DEMO +
+
+
+
+
+ + ☆ EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation + + +
+ Multimodal large language models (MLLMs) have achieved remarkable progress on
+various visual question answering and reasoning tasks by leveraging instruction
+fine-tuning on specific datasets. They can also learn from preference data
+annotated by humans to enhance their reasoning ability and mitigate
+hallucinations. Most preference data is generated by the model itself. However,
+existing methods require high-quality critical labels, which are costly and
+rely on humans or proprietary models like GPT-4V. In this work, we propose
+Enhancing Alignment in MLLMs via Critical Observation (EACO), which aligns
+MLLMs with self-generated preference data using only 5k images, at low cost.
+Our approach begins with collecting and refining a Scoring Evaluation
+Instruction-tuning dataset to train a critical evaluation model, termed the
+Critic. This Critic observes model responses across multiple dimensions,
+selecting preferred and non-preferred outputs for refined Direct Preference
+Optimization (DPO) tuning. To further enhance model performance, we employ an
+additional supervised fine-tuning stage after preference tuning. EACO reduces
+overall hallucinations by 65.6% on HallusionBench and improves reasoning
+ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement over
+LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also reveals
+the potential for critical-evaluation ability in open-source MLLMs,
+demonstrating that EACO is a viable path to boosting the competence of MLLMs.
+
+ comment: 19 pages +
+
+
+
+
+ + ☆ Building a Family of Data Augmentation Models for Low-cost LLM + Fine-tuning on the Cloud + + +
+ Specializing LLMs in various domain-specific tasks has emerged as a critical +step towards achieving high performance. However, the construction and +annotation of datasets in specific domains are always very costly. Apart from +using superior and expensive closed-source LLM APIs to construct datasets, some +open-source models have become strong enough to handle dataset construction in +many scenarios. Thus, we present a family of data augmentation models designed +to significantly improve the efficiency for model fine-tuning. These models, +trained based on sufficiently small LLMs, support key functionalities with low +inference costs: instruction expansion, instruction refinement, and +instruction-response pair expansion. To fulfill this goal, we first construct +an automatic data collection system with seed datasets generated from both +public repositories and our in-house datasets. This system leverages powerful +LLMs to expand, refine and re-write the instructions and responses, +incorporating quality assessment techniques. Following this, we introduce the +training process of our models, which effectively distills task-solving and +text synthesis abilities from teacher LLMs. Finally, we demonstrate how we +integrate these functionalities into a machine learning platform to support +low-cost LLM fine-tuning from both dataset preparation and training +perspectives for users. Experiments and an application study prove the +effectiveness of our approach. + +
+
+ comment: coling 2025 industry track +
+
+
+
+
+ + ☆ EXAONE 3.5: Series of Large Language Models for Real-world Use Cases + + +
+ This technical report introduces the EXAONE 3.5 instruction-tuned language +models, developed and released by LG AI Research. The EXAONE 3.5 language +models are offered in three configurations: 32B, 7.8B, and 2.4B. These models +feature several standout capabilities: 1) exceptional instruction following +capabilities in real-world scenarios, achieving the highest scores across seven +benchmarks, 2) outstanding long-context comprehension, attaining the top +performance in four benchmarks, and 3) competitive results compared to +state-of-the-art open models of similar sizes across nine general benchmarks. +The EXAONE 3.5 language models are open to anyone for research purposes and can +be downloaded from https://huggingface.co/LGAI-EXAONE. For commercial use, +please reach out to the official contact point of LG AI Research: +contact_us@lgresearch.ai. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2408.03541 +
+
+
+
+
+ + ☆ Breaking Event Rumor Detection via Stance-Separated Multi-Agent Debate + + +
+ The rapid spread of rumors on social media platforms during breaking events +severely hinders the dissemination of the truth. Previous studies reveal that +the lack of annotated resources hinders the direct detection of unforeseen +breaking events not covered in yesterday's news. Leveraging large language +models (LLMs) for rumor detection holds significant promise. However, it is +challenging for LLMs to provide comprehensive responses to complex or +controversial issues due to limited diversity. In this work, we propose the +Stance Separated Multi-Agent Debate (S2MAD) to address this issue. +Specifically, we firstly introduce Stance Separation, categorizing comments as +either supporting or opposing the original claim. Subsequently, claims are +classified as subjective or objective, enabling agents to generate reasonable +initial viewpoints with different prompt strategies for each type of claim. +Debaters then follow specific instructions through multiple rounds of debate to +reach a consensus. If a consensus is not reached, a judge agent evaluates the +opinions and delivers a final verdict on the claim's veracity. Extensive +experiments conducted on two real-world datasets demonstrate that our proposed +model outperforms state-of-the-art methods in terms of performance and +effectively improves the performance of LLMs in breaking event rumor detection. + +
+
+
+
+
+ + ☆ Adaptive Dropout for Pruning Conformers + + +
+ This paper proposes a method to effectively perform joint
+training-and-pruning based on adaptive dropout layers with unit-wise retention
+probabilities. The proposed method is based on the estimation of a unit-wise
+retention probability in a dropout layer. A unit that is estimated to have a
+small retention probability can be considered prunable. The retention
+probability of the unit is estimated using back-propagation and the
+Gumbel-Softmax technique. This pruning method is applied at several application
+points in Conformers such that the effective number of parameters can be
+significantly reduced. Specifically, adaptive dropout layers are introduced in
+three locations in each Conformer block: (a) the hidden layer of the
+feed-forward-net component, (b) the query vectors and the value vectors of the
+self-attention component, and (c) the input vectors of the LConv component. The
+proposed method is evaluated by conducting a speech recognition experiment on
+the LibriSpeech task. This approach is shown to simultaneously achieve a
+parameter reduction and an accuracy improvement: word error rates improved by
+approximately 1% while the number of parameters was reduced by 54%.
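+ One possible reading of such a unit-wise retention gate, sketched with a
+relaxed-Bernoulli (binary Gumbel-Softmax) sample; this is an illustration under
+our own assumptions, not the authors' implementation.
+
+import torch
+import torch.nn as nn
+
+class AdaptiveDropout(nn.Module):
+    def __init__(self, num_units, tau=1.0):
+        super().__init__()
+        self.logits = nn.Parameter(torch.zeros(num_units))  # retention logits
+        self.tau = tau
+
+    def forward(self, x):
+        if self.training:
+            # Relaxed Bernoulli sample via logistic (difference-of-Gumbel) noise.
+            u = torch.rand_like(self.logits).clamp(1e-6, 1 - 1e-6)
+            noise = torch.log(u) - torch.log1p(-u)
+            gate = torch.sigmoid((self.logits + noise) / self.tau)
+        else:
+            # Units with low learned retention probability are effectively pruned.
+            gate = (torch.sigmoid(self.logits) > 0.5).float()
+        return x * gate
+
+    def retention_probs(self):
+        return torch.sigmoid(self.logits)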
+
+
+
+
+ + ☆ Rethinking Time Series Forecasting with LLMs via Nearest Neighbor + Contrastive Learning + + +
+ Adapting Large Language Models (LLMs) that are extensively trained on
+abundant text data, and customizing the input prompt to enable time series
+forecasting, has received considerable attention. While recent work has shown
+great potential for adapting the learned prior of LLMs, formulating the prompt
+to fine-tune LLMs remains challenging, as the prompt should be aligned with
+time series data. Additionally, current approaches do not effectively leverage
+word token embeddings, which embody the rich representation space learned by
+LLMs. This emphasizes the need for a robust approach that formulates the prompt
+using word token embeddings while effectively representing the characteristics
+of the time series. To address these challenges, we propose NNCL-TLLM: Nearest
+Neighbor Contrastive Learning for Time series forecasting via LLMs. First, we
+generate time series compatible text prototypes such that each text prototype
+represents both word token embeddings in its neighborhood and time series
+characteristics via end-to-end finetuning. Next, we draw inspiration from
+Nearest Neighbor Contrastive Learning to formulate the prompt while obtaining
+the top-$k$ nearest neighbor time series compatible text prototypes. We then
+fine-tune the layer normalization and positional embeddings of the LLM, keeping
+the other layers intact, reducing the trainable parameters and decreasing the
+computational cost. Our comprehensive experiments demonstrate that NNCL-TLLM
+excels in few-shot forecasting while achieving competitive or superior
+performance compared to state-of-the-art methods in long-term and short-term
+forecasting tasks.
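+ The parameter-efficient part of this recipe, tuning only layer normalization
+and positional embeddings while freezing everything else, can be sketched as
+follows; the module-name matching follows GPT-2 naming and is an assumption,
+not the paper's code.
+
+from transformers import AutoModelForCausalLM
+
+model = AutoModelForCausalLM.from_pretrained("gpt2")
+for name, param in model.named_parameters():
+    # "ln_" matches GPT-2 LayerNorms, "wpe" its positional embedding table.
+    param.requires_grad = ("ln_" in name) or ("wpe" in name)
+
+trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+print(f"trainable parameters: {trainable:,}")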
+
+
+
+
+ + ☆ Direct Quantized Training of Language Models with Stochastic Rounding + + +
+ Although recent quantized Large Language Models (LLMs), such as BitNet, have +paved the way for significant reduction in memory usage during deployment with +binary or ternary weights, training these models still demands substantial +memory footprints. This is partly because high-precision (i.e., unquantized) +weight matrices required for straight-through estimation must be maintained +throughout the whole training process. To address this, we explore the +potential of directly updating the quantized low-precision weight matrices +without relying on the straight-through estimator during backpropagation, +thereby saving memory usage during training. Specifically, we employ a +stochastic rounding technique to minimize information loss caused by the use of +low-bit weights throughout training. Experimental results on our +LLaMA-structured models indicate that (1) training with only low-precision +weights is feasible even when they are constrained to ternary values, (2) +extending the bit width to 8 bits results in only a 5% loss degradation +compared to BitNet b1.58 while offering the potential for reduced memory usage +during training, and (3) our models can also perform inference using ternary +weights, showcasing their flexibility in deployment. + +
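+ A toy version of stochastic rounding to ternary weights: the expected value
+of the rounded weight equals the clipped full-precision weight, which is the
+property such training relies on instead of a straight-through estimator. The
+details below are illustrative only.
+
+import torch
+
+def stochastic_round_ternary(w):
+    w = w.clamp(-1.0, 1.0)
+    lower = torch.floor(w)              # nearest grid point below
+    prob_up = w - lower                 # P(round up) = fractional part
+    return lower + (torch.rand_like(w) < prob_up).float()
+
+w = torch.randn(5) * 0.5
+print(w, stochastic_round_ternary(w))   # values land in {-1, 0, 1}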
+
+ comment: work in progress +
+
+
+
+
+ + ☆ NLP-ADBench: NLP Anomaly Detection Benchmark SC + + +
+ Anomaly detection (AD) is a critical machine learning task with diverse +applications in web systems, including fraud detection, content moderation, and +user behavior analysis. Despite its significance, AD in natural language +processing (NLP) remains underexplored, limiting advancements in detecting +anomalies in text data such as harmful content, phishing attempts, or spam +reviews. In this paper, we introduce NLP-ADBench, the most comprehensive +benchmark for NLP anomaly detection (NLP-AD), comprising eight curated datasets +and evaluations of nineteen state-of-the-art algorithms. These include three +end-to-end methods and sixteen two-step algorithms that apply traditional +anomaly detection techniques to language embeddings generated by +bert-base-uncased and OpenAI's text-embedding-3-large models. + Our results reveal critical insights and future directions for NLP-AD. +Notably, no single model excels across all datasets, highlighting the need for +automated model selection. Moreover, two-step methods leveraging +transformer-based embeddings consistently outperform specialized end-to-end +approaches, with OpenAI embeddings demonstrating superior performance over BERT +embeddings. By releasing NLP-ADBench at +https://github.com/USC-FORTIS/NLP-ADBench, we provide a standardized framework +for evaluating NLP-AD methods, fostering the development of innovative +approaches. This work fills a crucial gap in the field and establishes a +foundation for advancing NLP anomaly detection, particularly in the context of +improving the safety and reliability of web-based systems. + +
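+ A hypothetical two-step pipeline of the kind benchmarked here: embed texts
+with a sentence encoder, then fit a classical detector on the embeddings.
+Model names, data, and thresholds are illustrative, not the benchmark's exact
+configuration.
+
+from sklearn.ensemble import IsolationForest
+from sentence_transformers import SentenceTransformer
+
+texts = ["great product, fast shipping", "click here to claim your prize!!!"]
+encoder = SentenceTransformer("all-MiniLM-L6-v2")
+embeddings = encoder.encode(texts)
+
+detector = IsolationForest(random_state=0).fit(embeddings)
+print(detector.decision_function(embeddings))  # lower scores = more anomalous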
+
+ comment: The project is available at https://github.com/USC-FORTIS/NLP-ADBench +
+
+
+
+
+ + ☆ Foundation Models for Low-Resource Language Education (Vision Paper) + + +
+ Recent studies show that large language models (LLMs) are powerful tools for +working with natural language, bringing advances in many areas of computational +linguistics. However, these models face challenges when applied to low-resource +languages due to limited training data and difficulty in understanding cultural +nuances. Research is now focusing on multilingual models to improve LLM +performance for these languages. Education in these languages also struggles +with a lack of resources and qualified teachers, particularly in underdeveloped +regions. Here, LLMs can be transformative, supporting innovative methods like +community-driven learning and digital platforms. This paper discusses how LLMs +could enhance education for low-resource languages, emphasizing practical +applications and benefits. + +
+
+
+
+
+ + ☆ Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free + Dynamic Triangular Attention Pattern + + +
+ The quadratic computational complexity of the attention mechanism in current
+Large Language Models (LLMs) renders inference with long contexts prohibitively
+expensive. To address this challenge, various approaches aim to retain critical
+portions of the context to optimally approximate Full Attention (FA) through
+Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of
+virtually unlimited text lengths in a streaming manner. However, these methods
+struggle to achieve performance levels comparable to FA, particularly in
+retrieval tasks. In this paper, our analysis of attention head patterns reveals
+that LLMs' attention distributions show strong local correlations, naturally
+reflecting a chunking mechanism for the input context. We propose the Ltri-LLM
+framework, which divides KVs into spans, stores them in an offline index, and
+retrieves the relevant KVs into memory for various queries. Experimental
+results on popular long text benchmarks show that Ltri-LLM can achieve
+performance close to FA while maintaining efficient, streaming-based inference.
+
+
+
+
+ + ☆ ChatNVD: Advancing Cybersecurity Vulnerability Assessment with Large + Language Models + + +
+ The increasing frequency and sophistication of cybersecurity vulnerabilities +in software systems underscore the urgent need for robust and effective methods +of vulnerability assessment. However, existing approaches often rely on highly +technical and abstract frameworks, which hinders understanding and increases +the likelihood of exploitation, resulting in severe cyberattacks. Given the +growing adoption of Large Language Models (LLMs) across diverse domains, this +paper explores their potential application in cybersecurity, specifically for +enhancing the assessment of software vulnerabilities. We propose ChatNVD, an +LLM-based cybersecurity vulnerability assessment tool leveraging the National +Vulnerability Database (NVD) to provide context-rich insights and streamline +vulnerability analysis for cybersecurity professionals, developers, and +non-technical users. We develop three variants of ChatNVD, utilizing three +prominent LLMs: GPT-4o mini by OpenAI, Llama 3 by Meta, and Gemini 1.5 Pro by +Google. To evaluate their efficacy, we conduct a comparative analysis of these +models using a comprehensive questionnaire comprising common security +vulnerability questions, assessing their accuracy in identifying and analyzing +software vulnerabilities. This study provides valuable insights into the +potential of LLMs to address critical challenges in understanding and +mitigation of software vulnerabilities. + +
+
+
+
+
+ + ☆ Question Answering for Decisionmaking in Green Building Design: A + Multimodal Data Reasoning Method Driven by Large Language Models + + +
+ In recent years, the critical role of green buildings in addressing energy +consumption and environmental issues has become widely acknowledged. Research +indicates that over 40% of potential energy savings can be achieved during the +early design stage. Therefore, decision-making in green building design (DGBD), +which is based on modeling and performance simulation, is crucial for reducing +building energy costs. However, the field of green building encompasses a broad +range of specialized knowledge, which involves significant learning costs and +results in low decision-making efficiency. Many studies have already applied +artificial intelligence (AI) methods to this field. Based on previous research, +this study innovatively integrates large language models with DGBD, creating +GreenQA, a question answering framework for multimodal data reasoning. +Utilizing Retrieval Augmented Generation, Chain of Thought, and Function Call +methods, GreenQA enables multimodal question answering, including weather data +analysis and visualization, retrieval of green building cases, and knowledge +query. Additionally, this study conducted a user survey using the GreenQA web +platform. The results showed that 96% of users believed the platform helped +improve design efficiency. This study not only effectively supports DGBD but +also provides inspiration for AI-assisted design. + +
+
+ comment: Published at Association for Computer Aided Design in Architecture + (ACADIA) 2024 +
+
+
+
+
+ + ☆ BESSTIE: A Benchmark for Sentiment and Sarcasm Classification for + Varieties of English + + +
+ Despite large language models (LLMs) being known to exhibit bias against
+non-mainstream varieties, there are no known labeled datasets for sentiment
+analysis across varieties of English. To address this gap, we introduce
+BESSTIE, a benchmark for sentiment and sarcasm classification for three
+varieties of English: Australian (en-AU), Indian (en-IN), and British (en-UK).
+Using web-based content from two domains, namely, Google Place reviews and
+Reddit comments, we collect datasets for these language varieties using two
+methods: location-based and topic-based filtering. Native speakers of the
+language varieties manually annotate the datasets with sentiment and sarcasm
+labels. Subsequently, we fine-tune nine large language models (representing a
+range of encoder/decoder and mono/multilingual models) on these datasets, and
+evaluate their performance on the two tasks. Our results reveal that the models
+consistently perform better on inner-circle varieties (i.e., en-AU and en-UK),
+with significant performance drops for en-IN, particularly in sarcasm
+detection. We also report challenges in cross-variety generalisation,
+highlighting the need for language variety-specific datasets such as ours.
+BESSTIE promises to be a useful evaluative benchmark for future research in
+equitable LLMs, specifically in terms of language varieties. The BESSTIE
+datasets, code, and models are currently available on request, while the paper
+is under review. Please email aditya.joshi@unsw.edu.au.
+
+ comment: 10 pages, 7 figures, under review +
+
+
+
+
+ + ☆ NoLoR: An ASR-Based Framework for Expedited Endangered Language + Documentation with Neo-Aramaic as a Case Study + + +
+ The documentation of the Neo-Aramaic dialects before their extinction has
+been described as the most urgent task in all of Semitology today. The death of
+this language will be an unfathomable loss to the descendants of the indigenous
+speakers of Aramaic, now predominantly diasporic after forced displacement due
+to violence. This paper develops an ASR model to expedite the documentation of
+this endangered language and generalizes the strategy into a new framework we
+call NoLoR.
+
+
+
+
+ + ☆ Transformers Struggle to Learn to Search + + +
+ Search is an ability foundational in many important tasks, and recent studies +have shown that large language models (LLMs) struggle to perform search +robustly. It is unknown whether this inability is due to a lack of data, +insufficient model parameters, or fundamental limitations of the transformer +architecture. In this work, we use the foundational graph connectivity problem +as a testbed to generate effectively limitless high-coverage data to train +small transformers and test whether they can learn to perform search. We find +that, when given the right training distribution, the transformer is able to +learn to search. + We analyze the algorithm that the transformer has learned through a novel +mechanistic interpretability technique that enables us to extract the +computation graph from the trained model. We find that for each vertex in the +input graph, transformers compute the set of vertices reachable from that +vertex. Each layer then progressively expands these sets, allowing the model to +search over a number of vertices exponential in the number of layers. + However, we find that as the input graph size increases, the transformer has +greater difficulty in learning the task. This difficulty is not resolved even +as the number of parameters is increased, suggesting that increasing model +scale will not lead to robust search abilities. We also find that performing +search in-context (i.e., chain-of-thought) does not resolve this inability to +learn to search on larger graphs. + +
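+ Generating effectively limitless graph-connectivity examples of this kind is
+straightforward; a sketch of our own (graph size, edge probability, and the
+BFS oracle used for labels are all illustrative choices):
+
+import random
+from collections import deque
+
+def sample_example(rng, num_vertices=8, edge_prob=0.2):
+    edges = [(u, v) for u in range(num_vertices) for v in range(num_vertices)
+             if u != v and rng.random() < edge_prob]
+    adj = {u: [] for u in range(num_vertices)}
+    for u, v in edges:
+        adj[u].append(v)
+    src, dst = rng.sample(range(num_vertices), 2)
+    seen, queue = {src}, deque([src])        # BFS oracle for reachability
+    while queue:
+        for nxt in adj[queue.popleft()]:
+            if nxt not in seen:
+                seen.add(nxt)
+                queue.append(nxt)
+    return {"edges": edges, "query": (src, dst), "reachable": dst in seen}
+
+print(sample_example(random.Random(0)))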
+
+
+
+
+ + ☆ Privacy-Preserving Retrieval Augmented Generation with Differential + Privacy + + +
+ With the recent remarkable advancement of large language models (LLMs), there +has been a growing interest in utilizing them in the domains with highly +sensitive data that lies outside their training data. For this purpose, +retrieval augmented generation (RAG) is particularly effective -- it assists +LLMs by directly providing relevant information from the external knowledge +sources. However, without extra privacy safeguards, RAG outputs risk leaking +sensitive information from the external data source. In this work, we explore +RAG under differential privacy (DP), a formal guarantee of data privacy. The +main challenge with differentially private RAG is how to generate long accurate +answers within a moderate privacy budget. We address this by proposing an +algorithm that smartly spends privacy budget only for the tokens that require +the sensitive information and uses the non-private LLM for other tokens. Our +extensive empirical evaluations reveal that our algorithm outperforms the +non-RAG baseline under a reasonable privacy budget of $\epsilon\approx 10$ +across different models and datasets. + +
+
+
+
+
+ + ☆ LLM-Align: Utilizing Large Language Models for Entity Alignment in + Knowledge Graphs + + +
+ Entity Alignment (EA) seeks to identify and match corresponding entities
+across different Knowledge Graphs (KGs), playing a crucial role in knowledge
+fusion and integration. Embedding-based EA has recently gained considerable
+attention, resulting in the emergence of many innovative approaches. Initially,
+these approaches concentrated on learning entity embeddings based on the
+structural features of KGs as defined by relation triples. Subsequent methods
+have integrated entities' names and attributes as supplementary information to
+improve the embeddings used for EA. However, existing methods lack a deep
+semantic understanding of entity attributes and relations. In this paper, we
+propose a Large Language Model (LLM) based Entity Alignment method, LLM-Align,
+which explores the instruction-following and zero-shot capabilities of Large
+Language Models to infer alignments of entities. LLM-Align uses heuristic
+methods to select important attributes and relations of entities, and then
+feeds the selected triples of entities to an LLM to infer the alignment
+results. To guarantee the quality of alignment results, we design a multi-round
+voting mechanism to mitigate the hallucination and positional bias issues that
+occur with LLMs. Experiments on three EA datasets demonstrate that our approach
+achieves state-of-the-art performance compared to existing EA methods.
+
+
+
+
+ + ♻ ☆ A Practitioner's Guide to Continual Multimodal Pretraining NeurIPS + 2024 + + +
+ Multimodal foundation models serve numerous applications at the intersection +of vision and language. Still, despite being pretrained on extensive data, they +become outdated over time. To keep models updated, research into continual +pretraining mainly explores scenarios with either (1) infrequent, +indiscriminate updates on large-scale new data, or (2) frequent, sample-level +updates. However, practical model deployment often operates in the gap between +these two limit cases, as real-world applications often demand adaptation to +specific subdomains, tasks or concepts -- spread over the entire, varying life +cycle of a model. In this work, we complement current perspectives on continual +pretraining through a research test bed as well as provide comprehensive +guidance for effective continual model updates in such scenarios. We first +introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with +realistic compute constraints and practical deployment requirements, +constructed over 63 datasets with diverse visual and semantic coverage. Using +FoMo-in-Flux, we explore the complex landscape of practical continual +pretraining through multiple perspectives: (1) A data-centric investigation of +data mixtures and stream orderings that emulate real-world deployment +situations, (2) a method-centric investigation ranging from simple fine-tuning +and traditional continual learning strategies to parameter-efficient updates +and model merging, (3) meta learning rate schedules and mechanistic design +choices, and (4) the influence of model and compute scaling. Together, our +insights provide a practitioner's guide to continual multimodal pretraining for +real-world deployment. Our benchmark and code is here: +https://github.com/ExplainableML/fomo_in_flux. + +
+
+ comment: Technical Report. 52 pages. Shorter version published at the NeurIPS + 2024 Dataset & Benchmarks track +
+
+
+
+
+ + ♻ ☆ Is Your Paper Being Reviewed by an LLM? Investigating AI Text + Detectability in Peer Review + + +
+ Peer review is a critical process for ensuring the integrity of published +scientific research. Confidence in this process is predicated on the assumption +that experts in the relevant domain give careful consideration to the merits of +manuscripts which are submitted for publication. With the recent rapid +advancements in the linguistic capabilities of large language models (LLMs), a +new potential risk to the peer review process is that negligent reviewers will +rely on LLMs to perform the often time consuming process of reviewing a paper. +In this study, we investigate the ability of existing AI text detection +algorithms to distinguish between peer reviews written by humans and different +state-of-the-art LLMs. Our analysis shows that existing approaches fail to +identify many GPT-4o written reviews without also producing a high number of +false positive classifications. To address this deficiency, we propose a new +detection approach which surpasses existing methods in the identification of +GPT-4o written peer reviews at low levels of false positive classifications. +Our work reveals the difficulty of accurately identifying AI-generated text at +the individual review level, highlighting the urgent need for new tools and +methods to detect this type of unethical application of generative AI. + +
+
+
+
+
+ + ♻ ☆ Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues + + +
+ Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and +DeltaNet have emerged as efficient alternatives to Transformers in large +language modeling, offering linear scaling with sequence length and improved +training efficiency. However, LRNNs struggle to perform state-tracking which +may impair performance in tasks such as code evaluation or tracking a chess +game. Even parity, the simplest state-tracking task, which non-linear RNNs like +LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et +al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity +stems from restricting the value range of their diagonal state-transition +matrices to $[0, 1]$ and that incorporating negative values can resolve this +issue. We extend this result to non-diagonal LRNNs, which have recently shown +promise in models such as DeltaNet. We prove that finite precision LRNNs with +state-transition matrices having only positive eigenvalues cannot solve parity, +while complex eigenvalues are needed to count modulo $3$. Notably, we also +prove that LRNNs can learn any regular language when their state-transition +matrices are products of identity minus vector outer product matrices, each +with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that +extending the eigenvalue range of models like Mamba and DeltaNet to include +negative values not only enables them to solve parity but consistently improves +their performance on state-tracking tasks. Furthermore, pre-training LRNNs with +an extended eigenvalue range for language modeling achieves comparable +performance and stability while showing promise on code and math data. Our work +enhances the expressivity of modern LRNNs, broadening their applicability +without changing the cost of training or inference. + +
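+ A tiny numerical check of the parity argument: with an input-dependent
+diagonal transition that can equal -1, the sign of the state tracks the parity
+of the ones seen so far, a mechanism unavailable when transitions are
+restricted to [0, 1]. The scalar recurrence below is our simplification of the
+general result.
+
+def parity_via_negative_eigenvalue(bits):
+    h = 1.0
+    for x in bits:
+        h = (1.0 - 2.0 * x) * h     # transition is +1 for x=0, -1 for x=1
+    return 0 if h > 0 else 1        # sign of the state encodes the parity
+
+print(parity_via_negative_eigenvalue([1, 0, 1, 1]))  # three ones -> parity 1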
+
+ comment: Main changes: Correction to Theorem 1 and 2 (we excluded from the + only if condition complex eigenvalues with modulus strictly less than one). + Correction to point 3 of Proposition 3 +
+
+
+
+
+ + ♻ ☆ Random Tree Model of Meaningful Memory + + +
+ Traditional studies of memory for meaningful narratives focus on specific +stories and their semantic structures but do not address common quantitative +features of recall across different narratives. We introduce a statistical +ensemble of random trees to represent narratives as hierarchies of key points, +where each node is a compressed representation of its descendant leaves, which +are the original narrative segments. Recall is modeled as constrained by +working memory capacity from this hierarchical structure. Our analytical +solution aligns with observations from large-scale narrative recall +experiments. Specifically, our model explains that (1) average recall length +increases sublinearly with narrative length, and (2) individuals summarize +increasingly longer narrative segments in each recall sentence. Additionally, +the theory predicts that for sufficiently long narratives, a universal, +scale-invariant limit emerges, where the fraction of a narrative summarized by +a single recall sentence follows a distribution independent of narrative +length. + +
+
+ comment: 16 pages, 4 figures +
+
+
+
+
+ + ♻ ☆ MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal + Large Language Models + + +
+ Despite the superior capabilities of Multimodal Large Language Models (MLLMs) +across diverse tasks, they still face significant trustworthiness challenges. +Yet, current literature on the assessment of trustworthy MLLMs remains limited, +lacking a holistic evaluation to offer thorough insights into future +improvements. In this work, we establish MultiTrust, the first comprehensive +and unified benchmark on the trustworthiness of MLLMs across five primary +aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark +employs a rigorous evaluation strategy that addresses both multimodal risks and +cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. +Extensive experiments with 21 modern MLLMs reveal some previously unexplored +trustworthiness issues and risks, highlighting the complexities introduced by +the multimodality and underscoring the necessity for advanced methodologies to +enhance their reliability. For instance, typical proprietary models still +struggle with the perception of visually confusing images and are vulnerable to +multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to +disclose privacy in text and reveal ideological and cultural biases even when +paired with irrelevant images in inference, indicating that the multimodality +amplifies the internal risks from base LLMs. Additionally, we release a +scalable toolbox for standardized trustworthiness research, aiming to +facilitate future advancements in this important field. Code and resources are +publicly available at: https://multi-trust.github.io/. + +
+
+ comment: 100 pages, 84 figures, 33 tables +
+
+
+
+
+ + ♻ ☆ Sensitive Content Classification in Social Media: A Holistic Resource + and Evaluation + + +
+ The detection of sensitive content in large datasets is crucial for ensuring +that shared and analysed data is free from harmful material. However, current +moderation tools, such as external APIs, suffer from limitations in +customisation, accuracy across diverse sensitive categories, and privacy +concerns. Additionally, existing datasets and open-source models focus +predominantly on toxic language, leaving gaps in detecting other sensitive +categories such as substance abuse or self-harm. In this paper, we put forward +a unified dataset tailored for social media content moderation across six +sensitive categories: conflictual language, profanity, sexually explicit +material, drug-related content, self-harm, and spam. By collecting and +annotating data with consistent retrieval strategies and guidelines, we address +the shortcomings of previous focalised research. Our analysis demonstrates that +fine-tuning large language models (LLMs) on this novel dataset yields +significant improvements in detection performance compared to open +off-the-shelf models such as LLaMA, and even proprietary OpenAI models, which +underperform by 10-15% overall. This limitation is even more pronounced on +popular moderation APIs, which cannot be easily tailored to specific sensitive +content categories, among others. + +
+
+
+
+
+ + ♻ ☆ An Evolved Universal Transformer Memory + + +
+ Prior methods propose to offset the escalating costs of modern foundation +models by dropping specific parts of their contexts with hand-designed rules, +while attempting to preserve their original performance. We overcome this +trade-off with Neural Attention Memory Models (NAMMs), introducing a learned +network for memory management that improves both the performance and efficiency +of transformers. We evolve NAMMs atop pre-trained transformers to provide +different latent contexts focusing on the most relevant information for +individual layers and attention heads. NAMMs are universally applicable to any +model using self-attention as they condition exclusively on the values in the +produced attention matrices. Learning NAMMs on a small set of problems, we +achieve substantial performance improvements across multiple long-context +benchmarks while cutting the model's input contexts up to a fraction of the +original sizes. We show the generality of our conditioning enables zero-shot +transfer of NAMMs trained only on language to entirely new transformer +architectures even across input modalities, with their benefits carrying over +to vision and reinforcement learning. + +
+
+ comment: Preprint, under submission. Source code is available at + https://github.com/SakanaAI/evo-memory +
+
+
+
+
+ + ♻ ☆ Hallucination Detection in LLMs: Fast and Memory-Efficient Fine-Tuned + Models + + +
+ Uncertainty estimation is a necessary component when implementing AI in
+high-risk settings, such as autonomous cars, medicine, or insurance. Large
+Language Models (LLMs) have seen a surge in popularity in recent years, but
+they are subject to hallucinations, which may cause serious harm in high-risk
+settings. Despite their success, LLMs are expensive to train and run: they need
+a large amount of computation and memory, preventing the use of ensembling
+methods in practice. In this work, we present a novel method that allows for
+fast and memory-friendly training of LLM ensembles. We show that the resulting
+ensembles can detect hallucinations and are a viable approach in practice as
+only one GPU is needed for training and inference.
+
+ comment: 6 pages, 3 figures +
+
+
+
+
+ + ♻ ☆ LLMs May Perform MCQA by Selecting the Least Incorrect Option COLING 2025 + + +
+ In the field of NLP, Large Language Models (LLMs) have markedly enhanced
+performance across a variety of tasks. However, the comprehensive evaluation of
+LLMs remains an inevitable challenge for the community. Recently, the adoption
+of Multiple Choice Question Answering (MCQA) as a benchmark for assessing LLMs
+has gained considerable traction. However, concerns regarding the robustness of
+this evaluative method persist. Building upon previous discussions on the issue
+of \textit{variability}, we reveal an additional dimension of concern: LLMs may
+perform MCQA by selecting the least incorrect option rather than a distinctly
+correct one. This observation suggests that LLMs might regard multiple options
+as correct, which could undermine the reliability of MCQA as a metric for
+evaluating LLMs. To address this challenge, we introduce an enhanced dataset
+augmentation method for MCQA, termed MCQA+, to provide a more accurate
+reflection of model performance, thereby highlighting the necessity for more
+sophisticated evaluation mechanisms in the assessment of LLM capabilities.
+
+ comment: COLING 2025 +
+
+
+
+
+ + ♻ ☆ Densing Law of LLMs + + +
+ Large Language Models (LLMs) have emerged as a milestone in artificial
+intelligence, and their performance can improve as the model size increases.
+However, this scaling brings great challenges to training and inference
+efficiency, particularly for deploying LLMs in resource-constrained
+environments, and the scaling trend is becoming increasingly unsustainable.
+This paper introduces the concept of ``\textit{capacity density}'' as a new
+metric to evaluate the quality of the LLMs across different scales and
+describes the trend of LLMs in terms of both effectiveness and efficiency. To
+calculate the capacity density of a given target LLM, we first introduce a set
+of reference models and develop a scaling law to predict the downstream
+performance of these reference models based on their parameter sizes. We then
+define the \textit{effective parameter size} of the target LLM as the parameter
+size required by a reference model to achieve equivalent performance, and
+formalize the capacity density as the ratio of the effective parameter size to
+the actual parameter size of the target LLM. Capacity density provides a
+unified framework for assessing both model effectiveness and efficiency. Our
+further analysis of recent open-source base LLMs reveals an empirical law (the
+densing law) that the capacity density of LLMs grows exponentially over time.
+More specifically, using some widely used benchmarks for evaluation, the
+capacity density of LLMs doubles approximately every three months. The law
+provides new perspectives to guide future LLM development, emphasizing the
+importance of improving capacity density to achieve optimal results with
+minimal computational overhead.
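+ The definition can be illustrated with a toy computation; the scaling-law
+form and coefficients below are made up for illustration, whereas the paper
+fits its own law to a set of reference models.
+
+import math
+
+def predicted_score(params_in_billions, a=0.12, b=0.35):
+    # Assumed reference scaling law: score = a * ln(params) + b.
+    return a * math.log(params_in_billions) + b
+
+def capacity_density(actual_params_b, measured_score, a=0.12, b=0.35):
+    # Invert the law to get the effective parameter size, then take the ratio.
+    effective_params_b = math.exp((measured_score - b) / a)
+    return effective_params_b / actual_params_b
+
+print(capacity_density(actual_params_b=7.0, measured_score=0.62))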
+
+
+
+
+ + ♻ ☆ Docling Technical Report AAAI 25 + + +
+ We introduce Docling, an easy-to-use, self-contained, MIT-licensed,
+open-source toolkit for document conversion that can parse several types of
+popular document formats into a unified, richly structured representation. It
+is powered by state-of-the-art specialized AI models for layout analysis
+(DocLayNet) and table structure recognition (TableFormer), and runs efficiently
+on commodity hardware with a small resource budget. Docling is released as a
+Python package and can be used as a Python API or as a CLI tool. Docling's
+modular architecture and efficient document representation, known as
+DoclingDocument, make it easy to implement extensions, new features, models,
+and customizations. Docling has already been integrated into other popular
+open-source frameworks (e.g., LlamaIndex, LangChain, spaCy), making it a
+natural fit for the processing of documents and the development of high-end
+applications. The open-source community has fully engaged in using, promoting,
+and developing for Docling, which gathered 10k stars on GitHub in less than a
+month and was reported as the No. 1 trending repository in GitHub worldwide in
+November 2024.
+
+ comment: Submitted to AAAI 25: Workshop on Open-Source AI for Mainstream Use +
+
+
+
+
+ + ♻ ☆ Endless Jailbreaks with Bijection Learning + + +
+ Despite extensive safety measures, LLMs are vulnerable to adversarial inputs, +or jailbreaks, which can elicit unsafe behaviors. In this work, we introduce +bijection learning, a powerful attack algorithm which automatically fuzzes LLMs +for safety vulnerabilities using randomly-generated encodings whose complexity +can be tightly controlled. We leverage in-context learning to teach models +bijective encodings, pass encoded queries to the model to bypass built-in +safety mechanisms, and finally decode responses back into English. Our attack +is extremely effective on a wide range of frontier language models. Moreover, +by controlling complexity parameters such as number of key-value mappings in +the encodings, we find a close relationship between the capability level of the +attacked LLM and the average complexity of the most effective bijection +attacks. Our work highlights that new vulnerabilities in frontier models can +emerge with scale: more capable models are more severely jailbroken by +bijection attacks. + +
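+ An illustrative bijection-style encoding (not the paper's exact scheme): a
+random letter-to-letter bijection whose complexity is controlled by how many
+letters are left fixed, used to encode a query and decode a reply end-to-end.
+All parameters and the example string are assumptions for illustration.
+
+import random
+import string
+
+def make_bijection(num_fixed=10, seed=0):
+    rng = random.Random(seed)
+    letters = list(string.ascii_lowercase)
+    fixed = set(rng.sample(letters, num_fixed))        # map to themselves
+    movable = [c for c in letters if c not in fixed]
+    shuffled = movable[:]
+    rng.shuffle(shuffled)
+    mapping = {c: c for c in fixed}
+    mapping.update(dict(zip(movable, shuffled)))
+    return mapping
+
+mapping = make_bijection(num_fixed=6)
+inverse = {v: k for k, v in mapping.items()}
+encoded = "".join(mapping.get(c, c) for c in "where is the documentation")
+decoded = "".join(inverse.get(c, c) for c in encoded)
+print(encoded, "->", decoded)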
+
+
+
+
+ + ♻ ☆ Enhancing Zero-shot Chain of Thought Prompting via Uncertainty-Guided + Strategy Selection COLING 2025 + + +
+ Chain-of-thought (CoT) prompting has significantly enhanced the capability of +large language models (LLMs) by structuring their reasoning processes. However, +existing methods face critical limitations: handcrafted demonstrations require +extensive human expertise, while trigger phrases are prone to inaccuracies. In +this paper, we propose the Zero-shot Uncertainty-based Selection (ZEUS) method, +a novel approach that improves CoT prompting by utilizing uncertainty estimates +to select effective demonstrations without needing access to model parameters. +Unlike traditional methods, ZEUS offers high sensitivity in distinguishing +between helpful and ineffective questions, ensuring more precise and reliable +selection. Our extensive evaluation shows that ZEUS consistently outperforms +existing CoT strategies across four challenging reasoning benchmarks, +demonstrating its robustness and scalability. + +
+
+ comment: Accepted in COLING 2025 +
+
+
+
+
+ + ♻ ☆ Acquired TASTE: Multimodal Stance Detection with Textual and Structural + Embeddings COLING + + +
+ Stance detection plays a pivotal role in enabling an extensive range of +downstream applications, from discourse parsing to tracing the spread of fake +news and the denial of scientific facts. While most stance classification +models rely on textual representation of the utterance in question, prior work +has demonstrated the importance of the conversational context in stance +detection. In this work we introduce TASTE -- a multimodal architecture for +stance detection that harmoniously fuses Transformer-based content embedding +with unsupervised structural embedding. Through the fine-tuning of a pretrained +transformer and the amalgamation with social embedding via a Gated Residual +Network (GRN) layer, our model adeptly captures the complex interplay between +content and conversational structure in determining stance. TASTE achieves +state-of-the-art results on common benchmarks, significantly outperforming an +array of strong baselines. Comparative evaluations underscore the benefits of +social grounding -- emphasizing the criticality of concurrently harnessing both +content and structure for enhanced stance detection. + +
+
+ comment: The modified camera ready version will be published in January 2025 + at COLING +
+
+
+
+
+ + ♻ ☆ On the Proper Treatment of Tokenization in Psycholinguistics EMNLP 2024 + + +
+ Language models are widely used in computational psycholinguistics to test +theories that relate the negative log probability (the surprisal) of a region +of interest (a substring of characters) under a language model to its cognitive +cost experienced by readers, as operationalized, for example, by gaze duration +on the region. However, the application of modern language models to +psycholinguistic studies is complicated by the practice of using tokenization +as an intermediate step in training a model. Doing so results in a language +model over token strings rather than one over character strings. Vexingly, +regions of interest are generally misaligned with these token strings. The +paper argues that token-level language models should be (approximately) +marginalized into character-level language models before they are used in +psycholinguistic studies to compute the surprisal of a region of interest; +then, the marginalized character-level language model can be used to compute +the surprisal of an arbitrary character substring, which we term a focal area, +that the experimenter may wish to use as a predictor. Our proposal of +marginalizing a token-level model into a character-level one solves this +misalignment issue independently of the tokenization scheme. Empirically, we +discover various focal areas whose surprisal is a better psychometric predictor +than the surprisal of the region of interest itself. + +
+
+ comment: Main conference long paper at EMNLP 2024. New version: copy-editing + and updated bib +
+
+
+
+
+ + ♻ ☆ Autoformalize Mathematical Statements by Symbolic Equivalence and + Semantic Consistency NeurIPS 2024 + + +
+ Autoformalization, the task of automatically translating natural language +descriptions into a formal language, poses a significant challenge across +various domains, especially in mathematics. Recent advancements in large +language models (LLMs) have unveiled their promising capabilities to formalize +even competition-level math problems. However, we observe a considerable +discrepancy between pass@1 and pass@k accuracies in LLM-generated +formalizations. To address this gap, we introduce a novel framework that scores +and selects the best result from k autoformalization candidates based on two +complementary self-consistency methods: symbolic equivalence and semantic +consistency. Elaborately, symbolic equivalence identifies the logical +homogeneity among autoformalization candidates using automated theorem provers, +and semantic consistency evaluates the preservation of the original meaning by +informalizing the candidates and computing the similarity between the +embeddings of the original and informalized texts. Our extensive experiments on +the MATH and miniF2F datasets demonstrate that our approach significantly +enhances autoformalization accuracy, achieving up to 0.22-1.35x relative +improvements across various LLMs and baseline methods. + +
+
+ comment: Published as a conference paper at NeurIPS 2024. Code is available at + https://github.com/Miracle-Messi/Isa-AutoFormal +
+
+
+
+
+ + ♻ ☆ COBias and Debias: Minimizing Language Model Pairwise Accuracy Bias via + Nonlinear Integer Programming + + +
+ When performing classification tasks with language models, would you prefer +having only one highly accurate class or having every class deliver reliable +performance? Obviously, a more balanced accuracy among classes better reflects +the expectations of the majority of users. Especially for large language models +(LLMs), the fact that they achieve a fair overall accuracy by in-context +learning (ICL) obscures a large difference in individual class accuracies. In +this work, we uncover and tackle language models' imbalance in per-class +prediction accuracy by reconceptualizing it as the Contextual Oddity Bias +(COBias), and we are the first to engage nonlinear integer programming (NIP) to +debias it. Briefly, the proposed COBias metric measures accuracy differences +among class pairs, with which we reveal the large per-class accuracy +differences exhibited in LLMs of varied scales and families. Then we propose +Debiasing as Nonlinear Integer Programming (DNIP) to correct ICL per-class +probabilities towards lower COBias and higher overall accuracy. Our +optimization objective is directly based on the evaluation scores by COBias and +accuracy metrics, which is non-differentiable and solved by the simulated +annealing metaheuristic. Evaluations on three LLMs across seven NLP +classification tasks show that DNIP simultaneously achieves significant COBias +reduction (-27%) and accuracy improvement (+12%) over the conventional ICL +approach, suggesting that modeling pairwise class accuracy differences is a +direction in pushing forward more accurate, more reliable LLM predictions. + +
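+ Read literally, the pairwise accuracy-difference idea can be sketched as the
+mean absolute difference in per-class accuracy over all class pairs; this is
+our simplified reading, and the paper's exact formula may differ.
+
+from itertools import combinations
+import numpy as np
+
+def cobias(per_class_accuracy):
+    pairs = combinations(range(len(per_class_accuracy)), 2)
+    return float(np.mean([abs(per_class_accuracy[i] - per_class_accuracy[j])
+                          for i, j in pairs]))
+
+print(cobias([0.92, 0.55, 0.70]))   # imbalanced class accuracies -> high COBias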
+
+
+
+
+ + ♻ ☆ DiversityMedQA: Assessing Demographic Biases in Medical Diagnosis using + Large Language Models EMNLP 2024 + + +
+ As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. Furthermore, to ensure the perturbations were accurate, we also propose a filtering strategy that validates each perturbation. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
+
+ comment: Published in NLP4PI @ EMNLP 2024, Accepted to AIM-FM @ NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Plentiful Jailbreaks with String Compositions NeurIPS + + +
+ Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent vulnerability even in advanced LLMs.
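The framework of invertible string transformations lends itself to a short sketch: each transformation is an (encode, decode) pair, compositions are applied in order and inverted in reverse order, and a best-of-n attack samples many compositions. The transform set and sampling scheme below are illustrative assumptions, not the paper's exact configuration.

```python
import base64
import codecs
import random

# A few invertible string transformations as (encode, decode) pairs.
TRANSFORMS = {
    "base64": (
        lambda s: base64.b64encode(s.encode()).decode(),
        lambda s: base64.b64decode(s.encode()).decode(),
    ),
    "rot13": (
        lambda s: codecs.encode(s, "rot13"),
        lambda s: codecs.decode(s, "rot13"),
    ),
    "reverse": (lambda s: s[::-1], lambda s: s[::-1]),
}

def encode(composition: list[str], text: str) -> str:
    for name in composition:
        text = TRANSFORMS[name][0](text)
    return text

def decode(composition: list[str], text: str) -> str:
    # Apply the inverse transforms in reverse order.
    for name in reversed(composition):
        text = TRANSFORMS[name][1](text)
    return text

def sample_compositions(n: int, max_len: int = 3, seed: int = 0) -> list[list[str]]:
    """Best-of-n style sampling over the combinatorial space of compositions."""
    rng = random.Random(seed)
    names = list(TRANSFORMS)
    return [[rng.choice(names) for _ in range(rng.randint(1, max_len))] for _ in range(n)]

if __name__ == "__main__":
    prompt = "example request"
    for comp in sample_compositions(3):
        enc = encode(comp, prompt)
        assert decode(comp, enc) == prompt  # invertibility check
        print(comp, "->", enc)
```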
+
+ comment: NeurIPS SoLaR Workshop 2024 +
+
+
+
+
+ + ♻ ☆ U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills + in LLMs + + +
+ The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, primarily focus on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. The solution assessment proves challenging for LLMs, with the best LLM judge having an F1-score of 80% on $\mu$-MATH.
+
+
+
+
+ + ♻ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware + Omni-Modal Perception of Long Videos + + +
+ Despite impressive advancements in video understanding, most efforts remain +limited to coarse-grained or visual-only video tasks. However, real-world +videos encompass omni-modal information (vision, audio, and speech) with a +series of events forming a cohesive storyline. The lack of multi-modal video +data with fine-grained event annotations and the high cost of manual labeling +are major obstacles to comprehensive omni-modality video perception. To address +this gap, we propose an automatic pipeline consisting of high-quality +multi-modal video filtering, semantically coherent omni-modal event boundary +detection, and cross-modal correlation-aware event captioning. In this way, we +present LongVALE, the first-ever Vision-Audio-Language Event understanding +benchmark comprising 105K omni-modal events with precise temporal boundaries +and detailed relation-aware captions within 8.4K high-quality long videos. +Further, we build a baseline that leverages LongVALE to enable video large +language models (LLMs) for omni-modality fine-grained temporal video +understanding for the first time. Extensive experiments demonstrate the +effectiveness and great potential of LongVALE in advancing comprehensive +multi-modal video understanding. + +
+
+ comment: 18 pages, 15 figures +
+
+
+
+
+ + ♻ ☆ OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure ACL + + +
+ Autoregressive language models demonstrate excellent performance in various scenarios. However, their inference efficiency is limited by the one-step-one-word generation mode, which has become a pressing problem as models grow increasingly large. Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we propose OPT-Tree, an algorithm to construct adaptive and scalable draft trees. It searches for the tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at https://github.com/Jikai0Wang/OPT-Tree.
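A toy sketch of the draft-tree idea, under the common assumption that a drafted token is accepted only if all of its ancestors are, so the expected acceptance length is the sum of path probabilities over tree nodes and can be maximized greedily; `expand_fn`, the node budget, and the probabilities are hypothetical stand-ins for the draft model.

```python
import heapq

def build_draft_tree(root_children_probs, expand_fn, node_budget: int):
    """Greedy sketch: repeatedly add the candidate node with the highest path
    probability. If a drafted token is accepted iff all its ancestors are,
    the expected acceptance length equals the sum of path probabilities over
    tree nodes, so adding the most probable node first maximizes that sum.

    `expand_fn(path) -> list[(token, prob)]` is a hypothetical draft-model call
    that proposes children for a partial path."""
    tree = []       # (path, path_prob) pairs kept in the draft tree
    frontier = []   # max-heap (via negated probabilities) of candidate nodes
    for tok, p in root_children_probs:
        heapq.heappush(frontier, (-p, [tok]))
    while frontier and len(tree) < node_budget:
        neg_p, path = heapq.heappop(frontier)
        path_prob = -neg_p
        tree.append((path, path_prob))
        for tok, p in expand_fn(path):
            heapq.heappush(frontier, (-(path_prob * p), path + [tok]))
    expected_acceptance = sum(p for _, p in tree)
    return tree, expected_acceptance

if __name__ == "__main__":
    # Toy draft model: always proposes the same two children with fixed probabilities.
    toy_expand = lambda path: [("a", 0.6), ("b", 0.3)]
    tree, exp_len = build_draft_tree([("a", 0.6), ("b", 0.3)], toy_expand, node_budget=8)
    print(len(tree), round(exp_len, 3))
```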
+
+ comment: Accepted at TACL; pre-MIT Press publication version +
+
+
+
+
+ + ♻ ☆ PsycoLLM: Enhancing LLM for Psychological Understanding and Evaluation ACL + + +
+ Mental health has attracted substantial attention in recent years, and LLMs can be an effective technology for addressing this problem owing to their capabilities in text understanding and dialogue. However, existing research in this domain often suffers from limitations, such as training on datasets lacking crucial prior knowledge and evidence, and the absence of comprehensive evaluation methods. In this paper, we propose a specialized psychological large language model (LLM), named PsycoLLM, trained on a proposed high-quality psychological dataset, including single-turn QA, multi-turn dialogues and knowledge-based QA. Specifically, we construct multi-turn dialogues through a three-step pipeline comprising multi-turn QA generation, evidence judgment, and dialogue refinement. We augment this process with real-world psychological case backgrounds extracted from online platforms, enhancing the relevance and applicability of the generated data. Additionally, to compare the performance of PsycoLLM with other LLMs, we develop a comprehensive psychological benchmark based on authoritative psychological counseling examinations in China, which includes assessments of professional ethics, theoretical proficiency, and case analysis. The experimental results on the benchmark illustrate the effectiveness of PsycoLLM, which demonstrates superior performance compared to other LLMs.
+
+ comment: Accepted by IEEE Transactions on Computational Social Systems. + https://github.com/MACLAB-HFUT/PsycoLLM +
+
+
+
+
+ + ♻ ☆ CUE-M: Contextual Understanding and Enhanced Search with Multimodal + Large Language Model + + +
+ The integration of Retrieval-Augmented Generation (RAG) with Multimodal Large Language Models (MLLMs) has revolutionized information retrieval and expanded the practical applications of AI. However, current systems struggle to accurately interpret user intent, employ diverse retrieval strategies, and effectively filter unintended or inappropriate responses, limiting their effectiveness. This paper introduces Contextual Understanding and Enhanced Search with MLLM (CUE-M), a novel multimodal search framework that addresses these challenges through a multi-stage pipeline comprising image context enrichment, intent refinement, contextual query generation, external API integration, and relevance-based filtering. CUE-M incorporates a robust filtering pipeline combining image-based, text-based, and multimodal classifiers, dynamically adapting to instance- and category-specific concerns defined by organizational policies. Evaluations on a multimodal Q&A dataset and a public safety benchmark demonstrate that CUE-M outperforms baselines in accuracy, knowledge integration, and safety, advancing the capabilities of multimodal retrieval systems.
+
+ comment: Preprint. Under review +
+
+
+
+
+ + ♻ ☆ DRS: Deep Question Reformulation With Structured Output + + +
+ Question answering represents a core capability of large language models (LLMs). However, when individuals encounter unfamiliar knowledge in texts, they often formulate questions that the text itself cannot answer due to insufficient understanding of the underlying information. Recent studies reveal that while LLMs can detect unanswerable questions, they struggle to assist users in reformulating these questions. Even advanced models like GPT-3.5 demonstrate limited effectiveness in this regard. To address this limitation, we propose DRS: Deep Question Reformulation with Structured Output, a novel zero-shot method aimed at enhancing LLMs' ability to assist users in reformulating questions to extract relevant information from new documents. DRS combines the strengths of LLMs with a DFS-based algorithm to iteratively explore potential entity combinations and constrain outputs using predefined entities. This structured approach significantly enhances the reformulation capabilities of LLMs. Comprehensive experimental evaluations demonstrate that DRS improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, while also enhancing the performance of open-source models, such as Gemma2-9B, from 26.35% to 56.75%.
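A minimal sketch of the DFS-over-entity-combinations idea: progressively constrain the reformulated question with subsets of document entities until a checker accepts it. `is_answerable` and the reformulation template are hypothetical stand-ins for the LLM calls DRS actually makes.

```python
def reformulate(question: str, entities: list[str], is_answerable, max_entities: int = 3):
    """DFS over entity combinations: constrain a reformulated question to a
    subset of predefined entities, returning the first combination that the
    (hypothetical) answerability checker accepts."""
    def dfs(chosen: list[str], start: int):
        if chosen:
            candidate = f"{question} (restricted to: {', '.join(chosen)})"
            if is_answerable(candidate):
                return candidate
        if len(chosen) == max_entities:
            return None
        for i in range(start, len(entities)):
            result = dfs(chosen + [entities[i]], i + 1)
            if result is not None:
                return result
        return None
    return dfs([], 0)

if __name__ == "__main__":
    # Toy checker: pretend the question becomes answerable once it mentions "solar panel".
    answerable = lambda q: "solar panel" in q
    print(reformulate("How is energy produced?",
                      ["wind turbine", "solar panel", "battery"], answerable))
```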
+
+
+
+
+ + ♻ ☆ Scaling Inference-Time Search with Vision Value Model for Improved + Visual Comprehension + + +
+ Despite significant advancements in vision-language models (VLMs), effective approaches for enhancing response quality by scaling inference-time computation are still lacking. This capability is known to be a core step toward self-improving models in recent large language model studies. In this paper, we present Vision Value Model (VisVM) that can guide VLM inference-time search to generate responses with better visual comprehension. Specifically, VisVM not only evaluates the generated sentence quality in the current search step, but also anticipates the quality of subsequent sentences that may result from the current step, thus providing a long-term value. In this way, VisVM steers VLMs away from generating sentences prone to hallucinations or insufficient detail, thereby producing higher quality responses. Experimental results demonstrate that VisVM-guided search significantly enhances VLMs' ability to generate descriptive captions with richer visual details and fewer hallucinations, compared with greedy decoding and search methods with other visual reward signals. Furthermore, we find that self-training the model with the VisVM-guided captions improves the VLM's performance across a wide range of multimodal benchmarks, indicating the potential for developing self-improving VLMs. Our value model and code are available at https://github.com/si0wang/VisVM.
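A hedged sketch of value-guided, sentence-level inference-time search: at each step, several candidate next sentences are sampled and the one with the highest predicted long-term value is kept. `propose_sentences` and `value_model` are hypothetical stand-ins for the VLM sampler and VisVM, stubbed so the example runs.

```python
def value_guided_decode(prompt: str, propose_sentences, value_model,
                        n_candidates: int = 4, max_sentences: int = 3) -> str:
    """Greedy sentence-level search: at each step, sample several candidate next
    sentences and keep the one with the highest predicted long-term value,
    rather than the one with the highest likelihood."""
    response = ""
    for _ in range(max_sentences):
        candidates = propose_sentences(prompt, response, n_candidates)
        if not candidates:
            break
        best = max(candidates, key=lambda s: value_model(prompt, response, s))
        response += " " + best
    return response.strip()

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs: the "VLM" proposes canned sentences,
    # and the "value model" simply prefers longer, more detailed ones.
    propose = lambda p, r, n: [f"Detail {len(r.split()) + i}." for i in range(n)]
    value = lambda p, r, s: len(s)
    print(value_guided_decode("Describe the image.", propose, value))
```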
+
+
+
+
+ + ♻ ☆ Logic Agent: Enhancing Validity with Logic Rule Invocation + + +
+ Chain-of-Thought (CoT) prompting has emerged as a pivotal technique for augmenting the inferential capabilities of language models during reasoning tasks. Despite its advancements, CoT often struggles to ensure the validity of its reasoning and the informativeness of its outputs. Addressing these limitations, this paper introduces the Logic Agent (LA), an agent-based framework aimed at enhancing the validity of reasoning processes in Large Language Models (LLMs) through strategic logic rule invocation. Unlike conventional approaches, LA transforms LLMs into logic agents that dynamically apply propositional logic rules, initiating the reasoning process by converting natural language inputs into structured logic forms. The logic agent leverages a comprehensive set of predefined functions to systematically navigate the reasoning process. This methodology not only promotes the structured and coherent generation of reasoning constructs but also significantly improves their interpretability and logical coherence. Through extensive experimentation, we demonstrate LA's capacity to scale effectively across various model sizes, markedly improving the precision of complex reasoning across diverse tasks.
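For illustration, a minimal propositional forward-chaining loop (repeated modus ponens over structured logic forms); the representation and rule set are simplified assumptions rather than the paper's predefined function library.

```python
def forward_chain(facts: set[str], rules: list[tuple[frozenset, str]]) -> set[str]:
    """Repeatedly apply modus ponens: if all premises of a rule are known facts,
    add its conclusion, until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived

if __name__ == "__main__":
    # "If it rains and the window is open, the floor gets wet; a wet floor is slippery."
    rules = [(frozenset({"rain", "window_open"}), "wet_floor"),
             (frozenset({"wet_floor"}), "slippery")]
    print(forward_chain({"rain", "window_open"}, rules))
```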
+
+ comment: The experiment is subject to certain errors +
+
+
+
+
+ + ♻ ☆ Tulu 3: Pushing Frontiers in Open Language Model Post-Training + + +
+ Language model post-training is applied to refine behaviors and unlock new +skills across a wide range of recent language models, but open recipes for +applying these techniques lag behind proprietary ones. The underlying training +data and recipes for post-training are simultaneously the most important pieces +of the puzzle and the portion with the least transparency. To bridge this gap, +we introduce Tulu 3, a family of fully-open state-of-the-art post-trained +models, alongside its data, code, and training recipes, serving as a +comprehensive guide for modern post-training techniques. Tulu 3, which builds +on Llama 3.1 base models, achieves results surpassing the instruct versions of +Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and +Claude 3.5-Haiku. The training algorithms for our models include supervised +finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we +call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we +introduce a multi-task evaluation scheme for post-training recipes with +development and unseen evaluations, standard benchmark implementations, and +substantial decontamination of existing open datasets on said benchmarks. We +conclude with analysis and discussion of training methods that did not reliably +improve performance. + In addition to the Tulu 3 model weights and demo, we release the complete +recipe -- including datasets for diverse core skills, a robust toolkit for data +curation and evaluation, the training code and infrastructure, and, most +importantly, a detailed report for reproducing and further adapting the Tulu 3 +approach to more domains. + +
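A minimal sketch of what a verifiable reward could look like, assuming the reward is 1 when an automatically extracted final answer matches a checkable reference and 0 otherwise; the extraction heuristic and the binary form are assumptions, not the exact RLVR formulation used for Tulu 3.

```python
import re

def verifiable_reward(completion: str, reference_answer: str) -> float:
    """Binary reward from an automatic check rather than a learned reward model:
    1.0 if the last number in the completion matches the reference, else 0.0.
    The last-number extraction heuristic is an illustrative assumption."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    predicted = numbers[-1] if numbers else None
    return 1.0 if predicted == reference_answer else 0.0

if __name__ == "__main__":
    print(verifiable_reward("... so the total is 42.", "42"))  # 1.0
    print(verifiable_reward("... so the total is 41.", "42"))  # 0.0
```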
+
+
+
+
+ + ♻ ☆ ELBA: Learning by Asking for Embodied Visual Navigation and Task + Completion WACV 2025 + + +
+ The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Although there have been significant advancements in related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents with the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. To address this gap, we propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask to dynamically acquire additional information for completing the task. We evaluate ELBA on the TEACh vision-dialog navigation and task completion dataset. Experimental results show that the proposed method achieves improved task performance compared to baseline models without question-answering capabilities.
+
+ comment: 14 pages, 10 figures, WACV 2025 +
+
+
+
+
+
+
+
+ + Computer Vision and Pattern Recognition 150 + +
+
+
+ + ☆ Stag-1: Towards Realistic 4D Driving Simulation with Video Generation + Model + + +
+ 4D driving simulation is essential for developing realistic autonomous +driving simulators. Despite advancements in existing methods for generating +driving scenes, significant challenges remain in view transformation and +spatial-temporal dynamic modeling. To address these limitations, we propose a +Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct +real-world scenes and design a controllable generative network to achieve 4D +simulation. Stag-1 constructs continuous 4D point cloud scenes using +surround-view data from autonomous vehicles. It decouples spatial-temporal +relationships and produces coherent keyframe videos. Additionally, Stag-1 +leverages video generation models to obtain photo-realistic and controllable 4D +driving simulation videos from any perspective. To expand the range of view +generation, we train vehicle motion videos based on decomposed camera poses, +enhancing modeling capabilities for distant scenes. Furthermore, we reconstruct +vehicle camera trajectories to integrate 3D points across consecutive views, +enabling comprehensive scene understanding along the temporal dimension. +Following extensive multi-level scene training, Stag-1 can simulate from any +desired viewpoint and achieve a deep understanding of scene evolution under +static spatial-temporal conditions. Compared to existing methods, our approach +shows promising performance in multi-view scene consistency, background +coherence, and accuracy, and contributes to the ongoing advancements in +realistic autonomous driving simulation. Code: https://github.com/wzzheng/Stag. + +
+
+ comment: Code is available at: https://github.com/wzzheng/Stag +
+
+
+
+
+ + ☆ Perturb-and-Revise: Flexible 3D Editing with Generative Trajectories + + +
+ The fields of 3D reconstruction and text-based 3D editing have advanced significantly with the evolution of text-based diffusion models. While existing 3D editing methods excel at modifying color, texture, and style, they struggle with extensive geometric or appearance changes, thus limiting their applications. We propose Perturb-and-Revise, which makes possible a wide variety of NeRF edits. First, we perturb the NeRF parameters with random initializations to create a versatile initialization. We automatically determine the perturbation magnitude through analysis of the local loss landscape. Then, we revise the edited NeRF via generative trajectories. Combined with the generative process, we impose identity-preserving gradients to refine the edited NeRF. Extensive experiments demonstrate that Perturb-and-Revise facilitates flexible, effective, and consistent editing of color, appearance, and geometry in 3D. For 360° results, please visit our project page: https://susunghong.github.io/Perturb-and-Revise.
+
+ comment: Project page: https://susunghong.github.io/Perturb-and-Revise +
+
+
+
+
+ + ☆ Birth and Death of a Rose + + +
+ We study the problem of generating temporal object intrinsics -- temporally +evolving sequences of object geometry, reflectance, and texture, such as a +blooming rose -- from pre-trained 2D foundation models. Unlike conventional 3D +modeling and animation techniques that require extensive manual effort and +expertise, we introduce a method that generates such assets with signals +distilled from pre-trained 2D diffusion models. To ensure the temporal +consistency of object intrinsics, we propose Neural Templates for +temporal-state-guided distillation, derived automatically from image features +from self-supervised learning. Our method can generate high-quality temporal +object intrinsics for several natural phenomena and enable the sampling and +controllable rendering of these dynamic objects from any viewpoint, under any +environmental lighting conditions, at any time of their lifespan. Project +website: https://chen-geng.com/rose4d + +
+
+ comment: Project website: https://chen-geng.com/rose4d +
+
+
+
+
+ + ☆ Sparse autoencoders reveal selective remapping of visual concepts during + adaptation + + +
+ Adapting foundation models for specific purposes has become a standard +approach to build machine learning systems for downstream applications. Yet, it +is an open question which mechanisms take place during adaptation. Here we +develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named +PatchSAE, to extract interpretable concepts at granular levels (e.g. shape, +color, or semantics of an object) and their patch-wise spatial attributions. We +explore how these concepts influence the model output in downstream image +classification tasks and investigate how recent state-of-the-art prompt-based +adaptation techniques change the association of model inputs to these concepts. +While activations of concepts slightly change between adapted and non-adapted +models, we find that the majority of gains on common adaptation tasks can be +explained with the existing concepts already present in the non-adapted +foundation model. This work provides a concrete framework to train and use SAEs +for Vision Transformers and provides insights into explaining adaptation +mechanisms. + +
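A minimal sparse-autoencoder sketch of the kind described: an overcomplete latent with ReLU activations trained with a reconstruction-plus-L1-sparsity loss over patch-token features; the dimensions and loss weight are assumptions, not PatchSAE's actual configuration.

```python
import torch
import torch.nn as nn

class PatchSparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over patch-token activations:
    overcomplete latent, ReLU non-negativity, L1 sparsity penalty."""
    def __init__(self, d_model: int = 768, d_latent: int = 8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # per-patch concept activations
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    recon = (x - x_hat).pow(2).mean()     # reconstruction term
    sparsity = z.abs().mean()             # L1 sparsity on latent activations
    return recon + l1_coef * sparsity

if __name__ == "__main__":
    sae = PatchSparseAutoencoder()
    patches = torch.randn(2, 196, 768)    # (batch, patches, d_model), e.g. ViT-B/16 tokens
    x_hat, z = sae(patches)
    loss = sae_loss(patches, x_hat, z)
    loss.backward()
    print(loss.item(), z.shape)           # patch-wise concept activations: (2, 196, 8192)
```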
+
+ comment: A demo is available at github.com/dynamical-inference/patchsae +
+
+
+
+
+ + ☆ Text to Blind Motion NeurIPS 2024 + + +
+ People who are blind perceive the world differently than those who are +sighted, which can result in distinct motion characteristics. For instance, +when crossing at an intersection, blind individuals may have different patterns +of movement, such as veering more from a straight path or using touch-based +exploration around curbs and obstacles. These behaviors may appear less +predictable to motion models embedded in technologies such as autonomous +vehicles. Yet, the ability of 3D motion models to capture such behavior has not +been previously studied, as existing datasets for 3D human motion currently +lack diversity and are biased toward people who are sighted. In this work, we +introduce BlindWays, the first multimodal motion benchmark for pedestrians who +are blind. We collect 3D motion data using wearable sensors with 11 blind +participants navigating eight different routes in a real-world urban setting. +Additionally, we provide rich textual descriptions that capture the distinctive +movement characteristics of blind pedestrians and their interactions with both +the navigation aid (e.g., a white cane or a guide dog) and the environment. We +benchmark state-of-the-art 3D human prediction models, finding poor performance +with off-the-shelf and pre-training-based methods for our novel task. To +contribute toward safer and more reliable systems that can seamlessly reason +over diverse human movements in their environments, our text-and-motion +benchmark is available at https://blindways.github.io. + +
+
+ comment: Accepted at NeurIPS 2024 +
+
+
+
+
+ + ☆ MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models + + +
+ Text-to-video models have demonstrated impressive capabilities in producing +diverse and captivating video content, showcasing a notable advancement in +generative AI. However, these models generally lack fine-grained control over +motion patterns, limiting their practical applicability. We introduce +MotionFlow, a novel framework designed for motion transfer in video diffusion +models. Our method utilizes cross-attention maps to accurately capture and +manipulate spatial and temporal dynamics, enabling seamless motion transfers +across various contexts. Our approach does not require training and works on +test-time by leveraging the inherent capabilities of pre-trained video +diffusion models. In contrast to traditional approaches, which struggle with +comprehensive scene changes while maintaining consistent motion, MotionFlow +successfully handles such complex transformations through its attention-based +mechanism. Our qualitative and quantitative experiments demonstrate that +MotionFlow significantly outperforms existing models in both fidelity and +versatility even during drastic scene alterations. + +
+
+ comment: Project Page: https://motionflow-diffusion.github.io +
+
+
+
+
+ + ☆ SimC3D: A Simple Contrastive 3D Pretraining Framework Using RGB Images + + +
+ The 3D contrastive learning paradigm has demonstrated remarkable performance in downstream tasks through pretraining on point cloud data. Recent advances involve additional 2D image priors associated with 3D point clouds for further improvement. Nonetheless, these existing frameworks are constrained by the restricted range of available point cloud datasets, primarily due to the high costs of obtaining point cloud data. To this end, we propose SimC3D, a simple but effective 3D contrastive learning framework, for the first time, pretraining 3D backbones from pure RGB image data. SimC3D performs contrastive 3D pretraining with three appealing properties. (1) Pure image data: SimC3D removes the dependency on costly 3D point clouds and pretrains 3D backbones using solely RGB images. By employing depth estimation and suitable data processing, the monocular synthesized point cloud shows great potential for 3D pretraining. (2) Simple framework: Traditional multi-modal frameworks facilitate 3D pretraining with 2D priors by utilizing an additional 2D backbone, thereby increasing computational expense. In this paper, we empirically demonstrate that the primary benefit of the 2D modality stems from the incorporation of locality information. Inspired by this insightful observation, SimC3D directly employs 2D positional embeddings as a stronger contrastive objective, eliminating the necessity for 2D backbones and leading to considerable performance improvements. (3) Strong performance: SimC3D outperforms previous approaches that leverage ground-truth point cloud data for pretraining in various downstream tasks. Furthermore, the performance of SimC3D can be further enhanced by combining multiple image datasets, showcasing its significant potential for scalability. The code will be available at https://github.com/Dongjiahua/SimC3D.
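A small sketch of the depth-to-point-cloud step that makes pure-RGB pretraining possible: a monocular depth map is unprojected with a standard pinhole camera model. The intrinsics below are placeholders, and in the actual pipeline the depth would come from a monocular depth estimator applied to an RGB image.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Unproject an (H, W) depth map into an (H*W, 3) point cloud using a
    pinhole camera model: X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

if __name__ == "__main__":
    # Placeholder intrinsics and a synthetic depth map, just to show shapes.
    depth = np.ones((480, 640), dtype=np.float32) * 2.0
    points = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=320.0, cy=240.0)
    print(points.shape)  # (307200, 3)
```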
+
+
+
+
+ + ☆ Expanding Performance Boundaries of Open-Source Multimodal Models with + Model, Data, and Test-Time Scaling + + +
+ We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data quality. In this work, we delve into the relationship between model scaling and performance, systematically exploring the performance trends in vision encoders, language models, dataset sizes, and test-time configurations. Through extensive evaluations on a wide range of benchmarks, including multi-discipline reasoning, document understanding, multi-image/video understanding, real-world comprehension, multimodal hallucination detection, visual grounding, multilingual capabilities, and pure language processing, InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. Notably, our model is the first open-source MLLM to surpass 70% on the MMMU benchmark, achieving a 3.7-point improvement through Chain-of-Thought (CoT) reasoning and showcasing strong potential for test-time scaling. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems. HuggingFace demo: https://huggingface.co/spaces/OpenGVLab/InternVL
+
+ comment: Technical Report +
+
+
+
+
+ + ☆ DenseMatcher: Learning 3D Semantic Correspondence for Category-Level + Manipulation from a Single Demo + + +
+ Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object categories. To this end, we present DenseMatcher, a method capable of computing 3D correspondences between in-the-wild objects that share similar structures. DenseMatcher first computes vertex features by projecting multiview 2D features onto meshes and refining them with a 3D network, and subsequently finds dense correspondences with the obtained features using functional maps. In addition, we craft the first 3D matching dataset that contains colored object meshes across diverse categories. In our experiments, we show that DenseMatcher significantly outperforms prior 3D matching baselines by 43.5%. We demonstrate the downstream effectiveness of DenseMatcher in (i) robotic manipulation, where it achieves cross-instance and cross-category generalization on long-horizon complex manipulation tasks from observing only one demo; (ii) zero-shot color mapping between digital assets, where appearance can be transferred between different objects with relatable geometry.
+
+ comment: Project Page: https://tea-lab.github.io/DenseMatcher/ +
+
+
+
+
+ + ☆ Mind the Time: Temporally-Controlled Multi-Event Video Generation + + +
+ Real-world videos consist of sequences of events. Generating such sequences +with precise temporal control is infeasible with existing video generators that +rely on a single paragraph of text as input. When tasked with generating +multiple events described using a single prompt, such methods often ignore some +of the events or fail to arrange them in the correct order. To address this +limitation, we present MinT, a multi-event video generator with temporal +control. Our key insight is to bind each event to a specific period in the +generated video, which allows the model to focus on one event at a time. To +enable time-aware interactions between event captions and video tokens, we +design a time-based positional encoding method, dubbed ReRoPE. This encoding +helps to guide the cross-attention operation. By fine-tuning a pre-trained +video diffusion transformer on temporally grounded data, our approach produces +coherent videos with smoothly connected events. For the first time in the +literature, our model offers control over the timing of events in generated +videos. Extensive experiments demonstrate that MinT outperforms existing +open-source models by a large margin. + +
+
+ comment: Project Page: https://mint-video.github.io/ +
+
+
+
+
+ + ☆ Extrapolated Urban View Synthesis Benchmark + + +
+ Photorealistic simulators are essential for the training and evaluation of +vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis +(NVS), a crucial capability that generates diverse unseen viewpoints to +accommodate the broad and continuous pose distribution of AVs. Recent advances +in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic +rendering at real-time speeds and have been widely used in modeling large-scale +driving scenes. However, their performance is commonly evaluated using an +interpolated setup with highly correlated training and test views. In contrast, +extrapolation, where test views largely deviate from training views, remains +underexplored, limiting progress in generalizable simulation technology. To +address this gap, we leverage publicly available AV datasets with multiple +traversals, multiple vehicles, and multiple cameras to build the first +Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct +quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting +methods across different difficulty levels. Our results show that Gaussian +Splatting is prone to overfitting to training views. Besides, incorporating +diffusion priors and improving geometry cannot fundamentally improve NVS under +large view changes, highlighting the need for more robust approaches and +large-scale training. We have released our data to help advance self-driving +and urban robotics simulation technology. + +
+
+ comment: Project page: https://ai4ce.github.io/EUVS-Benchmark/ +
+
+
+
+
+ + ☆ TeamCraft: A Benchmark for Multi-Modal Multi-Agent Systems in Minecraft + + +
+ Collaboration is a cornerstone of society. In the real world, human teammates +make use of multi-sensory data to tackle challenging tasks in ever-changing +environments. It is essential for embodied agents collaborating in +visually-rich environments replete with dynamic interactions to understand +multi-modal observations and task specifications. To evaluate the performance +of generalizable multi-modal collaborative agents, we present TeamCraft, a +multi-modal multi-agent benchmark built on top of the open-world video game +Minecraft. The benchmark features 55,000 task variants specified by multi-modal +prompts, procedurally-generated expert demonstrations for imitation learning, +and carefully designed protocols to evaluate model generalization capabilities. +We also perform extensive analyses to better understand the limitations and +strengths of existing approaches. Our results indicate that existing models +continue to face significant challenges in generalizing to novel goals, scenes, +and unseen numbers of agents. These findings underscore the need for further +research in this area. The TeamCraft platform and dataset are publicly +available at https://github.com/teamcraft-bench/teamcraft. + +
+
+
+
+
+ + ☆ From classical techniques to convolution-based models: A review of + object detection algorithms + + +
+ Object detection is a fundamental task in computer vision and image +understanding, with the goal of identifying and localizing objects of interest +within an image while assigning them corresponding class labels. Traditional +methods, which relied on handcrafted features and shallow models, struggled +with complex visual data and showed limited performance. These methods combined +low-level features with contextual information and lacked the ability to +capture high-level semantics. Deep learning, especially Convolutional Neural +Networks (CNNs), addressed these limitations by automatically learning rich, +hierarchical features directly from data. These features include both semantic +and high-level representations essential for accurate object detection. This +paper reviews object detection frameworks, starting with classical computer +vision methods. We categorize object detection approaches into two groups: (1) +classical computer vision techniques and (2) CNN-based detectors. We compare +major CNN models, discussing their strengths and limitations. In conclusion, +this review highlights the significant advancements in object detection through +deep learning and identifies key areas for further research to improve +performance. + +
+
+
+
+
+ + ☆ CompCap: Improving Multimodal Large Language Models with Composite + Captions + + +
+ How well can Multimodal Large Language Models (MLLMs) understand composite +images? Composite images (CIs) are synthetic visuals created by merging +multiple visual elements, such as charts, posters, or screenshots, rather than +being captured directly by a camera. While CIs are prevalent in real-world +applications, recent MLLM developments have primarily focused on interpreting +natural images (NIs). Our research reveals that current MLLMs face significant +challenges in accurately understanding CIs, often struggling to extract +information or perform complex reasoning based on these images. We find that +existing training data for CIs are mostly formatted for question-answer tasks +(e.g., in datasets like ChartQA and ScienceQA), while high-quality +image-caption datasets, critical for robust vision-language alignment, are only +available for NIs. To bridge this gap, we introduce Composite Captions +(CompCap), a flexible framework that leverages Large Language Models (LLMs) and +automation tools to synthesize CIs with accurate and detailed captions. Using +CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs +across six CI types. We validate the effectiveness of CompCap-118K by +supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and +LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K +significantly enhances MLLMs' understanding of CIs, yielding average gains of +1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively. + +
+
+
+
+
+ + ☆ MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at + Scale + + +
+ Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks, and only provide phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
+
+
+
+
+ + ☆ ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And + Segmentation Of GI Bleeding + + +
+ This study presents an integrated deep learning model for automatic detection and classification of gastrointestinal bleeding in frames extracted from Wireless Capsule Endoscopy (WCE) videos. The dataset has been released as part of the Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model attained the highest performance among the 75 teams that took part in this competition. It efficiently utilizes CNN-based models, namely DenseNet and U-Net, to detect and segment bleeding and non-bleeding areas in this complex real-world dataset. The model achieves an overall accuracy of 80%, which can help a skilled doctor carry out further diagnostics.
+
+
+
+
+ + ☆ Archaeoscape: Bringing Aerial Laser Scanning Archaeology to the Deep + Learning Era NeurIPS 2023 + + +
+ Airborne Laser Scanning (ALS) technology has transformed modern archaeology +by unveiling hidden landscapes beneath dense vegetation. However, the lack of +expert-annotated, open-access resources has hindered the analysis of ALS data +using advanced deep learning techniques. We address this limitation with +Archaeoscape (available at https://archaeoscape.ai), a novel large-scale +archaeological ALS dataset spanning 888 km$^2$ in Cambodia with 31,141 +annotated archaeological features from the Angkorian period. Archaeoscape is +over four times larger than comparable datasets, and the first ALS archaeology +resource with open-access data, annotations, and models. + We benchmark several recent segmentation models to demonstrate the benefits +of modern vision techniques for this problem and highlight the unique +challenges of discovering subtle human-made structures under dense jungle +canopies. By making Archaeoscape available in open access, we hope to bridge +the gap between traditional archaeology and modern computer vision methods. + +
+
+ comment: NeurIPS 2023 - Datasets & Benchmarks Track +
+
+
+
+
+ + ☆ SurgBox: Agent-Driven Operating Room Sandbox with Surgery Copilot + + +
+ Surgical interventions, particularly in neurology, represent complex and +high-stakes scenarios that impose substantial cognitive burdens on surgical +teams. Although deliberate education and practice can enhance cognitive +capabilities, surgical training opportunities remain limited due to patient +safety concerns. To address these cognitive challenges in surgical training and +operation, we propose SurgBox, an agent-driven sandbox framework to +systematically enhance the cognitive capabilities of surgeons in immersive +surgical simulations. Specifically, our SurgBox leverages large language models +(LLMs) with tailored Retrieval-Augmented Generation (RAG) to authentically +replicate various surgical roles, enabling realistic training environments for +deliberate practice. In particular, we devise Surgery Copilot, an AI-driven +assistant to actively coordinate the surgical information stream and support +clinical decision-making, thereby diminishing the cognitive workload of +surgical teams during surgery. By incorporating a novel Long-Short Memory +mechanism, our Surgery Copilot can effectively balance immediate procedural +assistance with comprehensive surgical knowledge. Extensive experiments using +real neurosurgical procedure records validate our SurgBox framework in both +enhancing surgical cognitive capabilities and supporting clinical +decision-making. By providing an integrated solution for training and +operational support to address cognitive challenges, our SurgBox framework +advances surgical education and practice, potentially transforming surgical +outcomes and healthcare quality. The code is available at +https://github.com/franciszchen/SurgBox. + +
+
+ comment: This work is accepted by IEEE Big Data 2024 +
+
+
+
+
+ + ☆ One-shot Federated Learning via Synthetic Distiller-Distillate + Communication NeurIPS 2024 + + +
+ One-shot Federated learning (FL) is a powerful technology facilitating collaborative training of machine learning models in a single round of communication. While its superiority lies in communication efficiency and privacy preservation compared to iterative FL, one-shot FL often compromises model performance. Prior research has primarily focused on employing data-free knowledge distillation to optimize data generators and ensemble models for better aggregating local knowledge into the server model. However, these methods typically struggle with data heterogeneity, where inconsistent local data distributions can cause teachers to provide misleading knowledge. Additionally, they may encounter scalability issues with complex datasets due to inherent two-step information loss: first, during local training (from data to model), and second, when transferring knowledge to the server model (from model to inversed data). In this paper, we propose FedSD2C, a novel and practical one-shot FL framework designed to address these challenges. FedSD2C introduces a distiller to synthesize informative distillates directly from local data to reduce information loss and proposes sharing synthetic distillates instead of inconsistent local models to tackle data heterogeneity. Our empirical results demonstrate that FedSD2C consistently outperforms other one-shot FL methods on more complex and real datasets, achieving up to 2.6× the performance of the best baseline. Code: https://github.com/Carkham/FedSD2C
+
+ comment: Accepted by NeurIPS 2024 +
+
+
+
+
+ + ☆ LinVT: Empower Your Image-level Large Language Model to Understand + Videos + + +
+ Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
+
+
+
+
+ + ☆ DreamColour: Controllable Video Colour Editing without Training + + +
+ Video colour editing is a crucial task for content creation, yet existing +solutions either require painstaking frame-by-frame manipulation or produce +unrealistic results with temporal artefacts. We present a practical, +training-free framework that makes precise video colour editing accessible +through an intuitive interface while maintaining professional-quality output. +Our key insight is that by decoupling spatial and temporal aspects of colour +editing, we can better align with users' natural workflow -- allowing them to +focus on precise colour selection in key frames before automatically +propagating changes across time. We achieve this through a novel technical +framework that combines: (i) a simple point-and-click interface merging +grid-based colour selection with automatic instance segmentation for precise +spatial control, (ii) bidirectional colour propagation that leverages inherent +video motion patterns, and (iii) motion-aware blending that ensures smooth +transitions even with complex object movements. Through extensive evaluation on +diverse scenarios, we demonstrate that our approach matches or exceeds +state-of-the-art methods while eliminating the need for training or specialized +hardware, making professional-quality video colour editing accessible to +everyone. + +
+
+ comment: Project page available at https://chaitron.github.io/DreamColour-demo +
+
+
+
+
+ + ☆ Spatially-Adaptive Hash Encodings For Neural Surface Reconstruction + + +
+ Positional encodings are a common component of neural scene reconstruction +methods, and provide a way to bias the learning of neural fields towards +coarser or finer representations. Current neural surface reconstruction methods +use a "one-size-fits-all" approach to encoding, choosing a fixed set of +encoding functions, and therefore bias, across all scenes. Current +state-of-the-art surface reconstruction approaches leverage grid-based +multi-resolution hash encoding in order to recover high-detail geometry. We +propose a learned approach which allows the network to choose its encoding +basis as a function of space, by masking the contribution of features stored at +separate grid resolutions. The resulting spatially adaptive approach allows the +network to fit a wider range of frequencies without introducing noise. We test +our approach on standard benchmark surface reconstruction datasets and achieve +state-of-the-art performance on two benchmark datasets. + +
+
+
+
+
+ + ☆ DNF: Unconditional 4D Generation with Dictionary-based Neural Fields + + +
+ While remarkable success has been achieved through diffusion-based 3D +generative models for shapes, 4D generative modeling remains challenging due to +the complexity of object deformations over time. We propose DNF, a new 4D +representation for unconditional generative modeling that efficiently models +deformable shapes with disentangled shape and motion while capturing +high-fidelity details in the deforming objects. To achieve this, we propose a +dictionary learning approach to disentangle 4D motion from shape as neural +fields. Both shape and motion are represented as learned latent spaces, where +each deformable shape is represented by its shape and motion global latent +codes, shape-specific coefficient vectors, and shared dictionary information. +This captures both shape-specific detail and global shared information in the +learned dictionary. Our dictionary-based representation well balances fidelity, +contiguity and compression -- combined with a transformer-based diffusion +model, our method is able to generate effective, high-fidelity 4D animations. + +
+
+ comment: Project page: https://xzhang-t.github.io/project/DNF/ +
+
+
+
+
+ + ☆ Gaining Explainability from a CNN for Stereotype Detection Based on Mice + Stopping Behavior ICPR + + +
+ Understanding the behavior of laboratory animals is key to finding answers about diseases and neurodevelopmental disorders that also affect humans. One behavior of interest is stopping, as it correlates with the exploration, feeding and sleeping habits of individuals. To improve comprehension of animal behavior, we focus on identifying traits that reveal the age/sex of mice through the series of stopping spots of each individual. We track 4 mice using the LiveMouseTracker (LMT) system over 3 days. Then, we build a stack of 2D histograms of the stop positions. This stack of histograms passes through a shallow CNN architecture to classify mice in terms of age and sex. We observe that female mice show more recognizable behavioral patterns, reaching a classification accuracy of more than 90%, while males, which do not present as many distinguishable patterns, reach an accuracy of 62.5%. To gain explainability from the model, we look at the activations of the convolutional layers and find that some regions of the cage are preferentially explored by females. Males, especially juveniles, present behavior patterns that oscillate between those of juvenile females and adult males.
+
+ comment: to be published in VAIB - Visual observation and analysis of + Vertebrate And Insect Behavior (ICPR) 2024 +
+
+
+
+
+ + ☆ Towards Flexible 3D Perception: Object-Centric Occupancy Completion + Augments 3D Object Detection NeurIPS 2024 + + +
+ While 3D object bounding box (bbox) representation has been widely used in +autonomous driving perception, it lacks the ability to capture the precise +details of an object's intrinsic geometry. Recently, occupancy has emerged as a +promising alternative for 3D scene perception. However, constructing a +high-resolution occupancy map remains infeasible for large scenes due to +computational constraints. Recognizing that foreground objects only occupy a +small portion of the scene, we introduce object-centric occupancy as a +supplement to object bboxes. This representation not only provides intricate +details for detected objects but also enables higher voxel resolution in +practical applications. We advance the development of object-centric occupancy +perception from both data and algorithm perspectives. On the data side, we +construct the first object-centric occupancy dataset from scratch using an +automated pipeline. From the algorithmic standpoint, we introduce a novel +object-centric occupancy completion network equipped with an implicit shape +decoder that manages dynamic-size occupancy generation. This network accurately +predicts the complete object-centric occupancy volume for inaccurate object +proposals by leveraging temporal information from long sequences. Our method +demonstrates robust performance in completing object shapes under noisy +detection and tracking conditions. Additionally, we show that our occupancy +features significantly enhance the detection results of state-of-the-art 3D +object detectors, especially for incomplete or distant objects in the Waymo +Open Dataset. + +
+
+ comment: NeurIPS 2024 +
+
+
+
+
+ + ☆ BIAS: A Body-based Interpretable Active Speaker Approach + + +
+ State-of-the-art Active Speaker Detection (ASD) approaches heavily rely on +audio and facial features to perform, which is not a sustainable approach in +wild scenarios. Although these methods achieve good results in the standard +AVA-ActiveSpeaker set, a recent wilder ASD dataset (WASD) showed the +limitations of such models and raised the need for new approaches. As such, we +propose BIAS, a model that, for the first time, combines audio, face, and body +information, to accurately predict active speakers in varying/challenging +conditions. Additionally, we design BIAS to provide interpretability by +proposing a novel use for Squeeze-and-Excitation blocks, namely in attention +heatmaps creation and feature importance assessment. For a full +interpretability setup, we annotate an ASD-related actions dataset (ASD-Text) +to finetune a ViT-GPT2 for text scene description to complement BIAS +interpretability. The results show that BIAS is state-of-the-art in challenging +conditions where body-based features are of utmost importance (Columbia, +open-settings, and WASD), and yields competitive results in AVA-ActiveSpeaker, +where face is more influential than body for ASD. BIAS interpretability also +shows the features/aspects more relevant towards ASD prediction in varying +settings, making it a strong baseline for further developments in interpretable +ASD models, and is available at https://github.com/Tiago-Roxo/BIAS. + +
+
+
+
+
+ + ☆ LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style + Conditioned Image Generation + + +
+ Recent advancements in image generation models have enabled personalized +image creation with both user-defined subjects (content) and styles. Prior +works achieved personalization by merging corresponding low-rank adaptation +parameters (LoRAs) through optimization-based methods, which are +computationally demanding and unsuitable for real-time use on +resource-constrained devices like smartphones. To address this, we introduce +LoRA.rar, a method that not only improves image quality but also achieves a +remarkable speedup of over $4000\times$ in the merging process. LoRA.rar +pre-trains a hypernetwork on a diverse set of content-style LoRA pairs, +learning an efficient merging strategy that generalizes to new, unseen +content-style pairs, enabling fast, high-quality personalization. Moreover, we +identify limitations in existing evaluation metrics for content-style quality +and propose a new protocol using multimodal large language models (MLLM) for +more accurate assessment. Our method significantly outperforms the current +state of the art in both content and style fidelity, as validated by MLLM +assessments and human evaluations. + +
+
+ comment: 17 pages, 20 figures +
+
+
+
+
+ + ☆ How to Squeeze An Explanation Out of Your Model + + +
+ Deep learning models are widely used nowadays for their reliability in performing various tasks. However, they do not typically provide the reasoning behind their decisions, which is a significant drawback, particularly for more sensitive areas such as biometrics, security and healthcare. The most commonly used approaches to provide interpretability create visual attention heatmaps of regions of interest on an image based on a model's gradient backpropagation. Although this is a viable approach, current methods are targeted toward image settings and default/standard deep learning models, meaning that they require significant adaptations to work on video/multi-modal settings and custom architectures. This paper proposes an approach for interpretability that is model-agnostic, based on a novel use of the Squeeze and Excitation (SE) block that creates visual attention heatmaps. By including an SE block prior to the classification layer of any model, we are able to retrieve the most influential features via SE vector manipulation, one of the key components of the SE block. Our results show that this new SE-based interpretability can be applied to various models in image and video/multi-modal settings, namely biometrics of facial features with CelebA and behavioral biometrics using Active Speaker Detection datasets. Furthermore, our proposal does not compromise model performance on the original task, and has competitive results with current interpretability approaches on state-of-the-art object datasets, highlighting its robustness across varied data beyond the biometric context.
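A minimal Squeeze-and-Excitation block of the kind described, placed before a classification head; reading the excitation vector as per-channel feature importance and summing the top channels into a spatial heatmap follows the idea above, while the dimensions and the specific heatmap recipe are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global-average-pool ("squeeze"), then a small
    bottleneck MLP with sigmoid ("excitation") that rescales each channel.
    The excitation vector doubles as a per-channel importance score."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))                  # squeeze: (B, C)
        w = self.fc(s)                          # excitation weights in [0, 1]
        return x * w.view(b, c, 1, 1), w        # re-weighted features + importances

if __name__ == "__main__":
    se = SEBlock(channels=64)
    feats = torch.randn(1, 64, 14, 14)          # features just before the classifier head
    reweighted, importance = se(feats)
    top = importance[0].topk(5).indices         # most influential channels
    heatmap = feats[0, top].sum(dim=0)          # a simple spatial attention heatmap
    print(reweighted.shape, top.tolist(), heatmap.shape)
```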
+
+
+
+
+ + ☆ The Silent Prompt: Initial Noise as Implicit Guidance for Goal-Driven + Image Generation + + +
+ Text-to-image synthesis (T2I) has advanced remarkably with the emergence of +large-scale diffusion models. In the conventional setup, the text prompt +provides explicit, user-defined guidance, directing the generation process by +denoising a randomly sampled Gaussian noise. In this work, we reveal that the +often-overlooked noise itself encodes inherent generative tendencies, acting as +a "silent prompt" that implicitly guides the output. This implicit guidance, +embedded in the noise scheduler design of diffusion model formulations and +their training stages, generalizes across a wide range of T2I models and +backbones. Building on this insight, we introduce NoiseQuery, a novel strategy +that selects optimal initial noise from a pre-built noise library to meet +diverse user needs. Our approach not only enhances high-level semantic +alignment with text prompts, but also allows for nuanced adjustments of +low-level visual attributes, such as texture, sharpness, shape, and color, +which are typically challenging to control through text alone. Extensive +experiments across various models and target attributes demonstrate the strong +performance and zero-shot transferability of our approach, requiring no +additional optimization. + +
+
+ comment: 18 pages, 18 figures, 6 tables +
+
+
+
+
+ + ☆ SoPo: Text-to-Motion Generation Using Semi-Online Preference + Optimization + + +
+ Text-to-motion generation is essential for advancing the creative industry +but often presents challenges in producing consistent, realistic motions. To +address this, we focus on fine-tuning text-to-motion models to consistently +favor high-quality, human-preferred motions, a critical yet largely unexplored +problem. In this work, we theoretically investigate DPO under both online +and offline settings, and reveal their respective limitations: overfitting in +offline DPO, and biased sampling in online DPO. Building on our theoretical +insights, we introduce Semi-online Preference Optimization (SoPo), a DPO-based +method for training text-to-motion models using "semi-online" data pairs, +consisting of unpreferred motions from the online distribution and preferred motions +from offline datasets. This method leverages both online and offline DPO, +allowing each to compensate for the other's limitations. Extensive experiments +demonstrate that SoPo outperforms other preference alignment methods, with an +MM-Dist of 3.25% (vs., e.g., 0.76% for MoDiPO) on the MLD model and 2.91% (vs., e.g., +0.66% for MoDiPO) on the MDM model, respectively. Additionally, the MLD model +fine-tuned by our SoPo surpasses the SoTA model in terms of R-precision and MM +Dist. Visualization results also show the efficacy of our SoPo in preference +alignment. Our project page is https://sopo-motion.github.io. + +
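For context, the standard DPO objective that such a semi-online scheme would optimize is sketched below, with the preferred sample taken from an offline dataset and the unpreferred one drawn from the current model. Log-probabilities are placeholders; this is not the paper's training code.

```python
# Hedged sketch of a DPO-style loss on a "semi-online" pair: preferred motion
# from offline data, unpreferred motion sampled online from the current model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_pref_policy, logp_unpref_policy,
             logp_pref_ref, logp_unpref_ref, beta: float = 0.1):
    """-log sigmoid(beta * (policy log-ratio margin - reference log-ratio margin))."""
    policy_margin = logp_pref_policy - logp_unpref_policy
    ref_margin = logp_pref_ref - logp_unpref_ref
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with placeholder log-probabilities under policy and frozen reference.
logp_pref_policy = torch.tensor([-12.3, -10.1])    # offline, human-preferred motions
logp_unpref_policy = torch.tensor([-11.8, -13.0])  # online samples treated as unpreferred
logp_pref_ref = torch.tensor([-12.0, -10.5])
logp_unpref_ref = torch.tensor([-11.5, -12.7])
loss = dpo_loss(logp_pref_policy, logp_unpref_policy, logp_pref_ref, logp_unpref_ref)
```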
+
+
+
+
+ + ☆ Reconstructing Quantitative Cerebral Perfusion Images Directly From + Measured Sinogram Data Acquired Using C-arm Cone-Beam CT + + +
+ To shorten the door-to-puncture time for better treating patients with acute +ischemic stroke, it is highly desired to obtain quantitative cerebral perfusion +images using C-arm cone-beam computed tomography (CBCT) equipped in the +interventional suite. However, limited by the slow gantry rotation speed, the +temporal resolution and temporal sampling density of typical C-arm CBCT are +much poorer than those of multi-detector-row CT in the diagnostic imaging +suite. The current quantitative perfusion imaging includes two cascaded steps: +time-resolved image reconstruction and perfusion parametric estimation. For +time-resolved image reconstruction, the technical challenge imposed by poor +temporal resolution and poor sampling density causes inaccurate quantification +of the temporal variation of cerebral artery and tissue attenuation values. For +perfusion parametric estimation, it remains a technical challenge to +appropriately design the handcrafted regularization for better solving the +associated deconvolution problem. These two challenges together prevent +obtaining quantitatively accurate perfusion images using C-arm CBCT. The +purpose of this work is to simultaneously address these two challenges by +combining the two cascaded steps into a single joint optimization problem and +reconstructing quantitative perfusion images directly from the measured +sinogram data. In the developed direct cerebral perfusion parametric image +reconstruction technique, TRAINER in short, the quantitative perfusion images +have been represented as a subject-specific conditional generative model +trained under the constraint of the time-resolved CT forward model, perfusion +convolutional model, and the subject's own measured sinogram data. Results +shown in this paper demonstrated that using TRAINER, quantitative cerebral +perfusion images can be accurately obtained using C-arm CBCT in the +interventional suite. + +
+
+
+
+
+ + ☆ Spinal ligaments detection on vertebrae meshes using registration and 3D + edge detection + + +
+ Spinal ligaments are crucial elements in complex biomechanical simulation +models, as they transfer forces to the bony structure, guide and limit movements, +and stabilize the spine. The spinal ligaments encompass seven major groups +responsible for maintaining functional interrelationships among the other +spinal components. Determining the ligament origin and insertion points on +3D vertebra models is an essential step in building accurate and complex +spine biomechanical models. In our paper, we propose a pipeline that is able to +detect 66 spinal ligament attachment points by using a step-wise approach. Our +method incorporates a fast vertebra registration that strategically extracts +only 15 3D points to compute the transformation, and edge detection for a +precise projection of the registered ligaments onto any given patient-specific +vertebra model. Our method shows high accuracy, particularly in identifying +landmarks on the anterior part of the vertebra, with an average distance of 2.24 +mm for anterior longitudinal ligament landmarks and 1.26 mm for posterior longitudinal +ligament landmarks. The landmark detection requires approximately 3.0 seconds +per vertebra, providing a substantial improvement over existing methods. +Clinical relevance: using the proposed method, the required landmarks that +represent origin and insertion points for forces in biomechanical spine +models can be localized automatically in an accurate and time-efficient manner. + +
+
+
+
+
+ + ☆ Improving analytical color and texture similarity estimation methods for + dataset-agnostic person reidentification + + +
+ This paper studies a combined person reidentification (re-id) method that +uses human parsing, analytical feature extraction and similarity estimation +schemes. One of its prominent features is its low computational requirements, so +it can be implemented on edge devices. The method allows direct comparison of +specific image regions using interpretable features which consist of color and +texture channels. It is proposed to analyze and compare colors in the CIE-Lab color +space using histogram smoothing for noise reduction. A novel pre-configured +latent space (LS) supervised autoencoder (SAE) is proposed for texture analysis, +which encodes input textures as LS points. This makes it possible to obtain more accurate +similarity measures than simplistic label comparison. The proposed +method also does not rely upon photos or other re-id data for training, which +makes it completely re-id dataset-agnostic. The viability of the proposed +method is verified by computing rank-1, rank-10, and mAP re-id metrics on the +Market1501 dataset. The results are comparable to those of conventional deep +learning methods, and potential ways to further improve the method are +discussed. + +
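A minimal sketch of the color part of such a pipeline: comparing two parsed body-part regions via smoothed CIE-Lab histograms. The bin count, smoothing width, and similarity measure are assumptions rather than the paper's exact choices.

```python
# Hedged sketch: region color comparison with smoothed CIE-Lab histograms.
import cv2
import numpy as np

def lab_histogram(region_bgr: np.ndarray, bins: int = 32, smooth: int = 3) -> np.ndarray:
    """Per-channel L*a*b* histograms, smoothed with a moving average to reduce noise."""
    lab = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2LAB)
    kernel = np.ones(smooth) / smooth
    hists = []
    for ch in range(3):
        h, _ = np.histogram(lab[..., ch], bins=bins, range=(0, 256), density=True)
        hists.append(np.convolve(h, kernel, mode="same"))   # histogram smoothing
    return np.concatenate(hists)

def color_similarity(region_a: np.ndarray, region_b: np.ndarray) -> float:
    """Histogram intersection ratio in [0, 1]; higher means more similar colors."""
    ha, hb = lab_histogram(region_a), lab_histogram(region_b)
    return float(np.minimum(ha, hb).sum() / max(np.maximum(ha, hb).sum(), 1e-8))

# Toy usage with random crops standing in for parsed body-part regions.
region_a = np.random.randint(0, 256, (64, 32, 3), dtype=np.uint8)
region_b = np.random.randint(0, 256, (64, 32, 3), dtype=np.uint8)
print(color_similarity(region_a, region_b))
```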
+
+ comment: 8 pages, 2 figures, 3 tables, 3 equations +
+
+
+
+
+ + ☆ LoFi: Vision-Aided Label Generator for Wi-Fi Localization and Tracking + + +
+ Wi-Fi localization and tracking have shown immense potential due to their +privacy-friendliness, wide coverage, permeability, independence from lighting +conditions, and low cost. Current methods can be broadly categorized as +model-based and data-driven approaches, where data-driven methods show better +performance and require fewer specialized devices, but struggle +with limited datasets for training. Due to limitations in current data +collection methods, most datasets only provide coarse-grained ground truth (GT) +or a limited number of labeled points, which greatly hinders the development of +data-driven methods. Even though lidar can provide accurate GT, its high cost +makes it inaccessible to many users. To address these challenges, we propose +LoFi, a vision-aided label generator for Wi-Fi localization and tracking, which +can generate ground truth position coordinates solely based on 2D images. This +easy and quick data collection method also helps data-driven methods +deploy in practice, since Wi-Fi is a low-generalization modality and +relevant methods typically require fine-tuning on newly +collected data. Based on our method, we also collect a Wi-Fi tracking and +localization dataset using an ESP32-S3 and a webcam. To facilitate future +research, we will make our code and dataset publicly available upon +publication. + +
+
+
+
+
+ + ☆ BimArt: A Unified Approach for the Synthesis of 3D Bimanual Interaction + with Articulated Objects + + +
+ We present BimArt, a novel generative approach for synthesizing 3D bimanual +hand interactions with articulated objects. Unlike prior works, we do not rely +on a reference grasp, a coarse hand trajectory, or separate modes for grasping +and articulating. To achieve this, we first generate distance-based contact +maps conditioned on the object trajectory with an articulation-aware feature +representation, revealing rich bimanual patterns for manipulation. The learned +contact prior is then used to guide our hand motion generator, producing +diverse and realistic bimanual motions for object movement and articulation. +Our work offers key insights into feature representation and contact prior for +articulated objects, demonstrating their effectiveness in taming the complex, +high-dimensional space of bimanual hand-object interactions. Through +comprehensive quantitative experiments, we demonstrate a clear step towards +simplified and high-quality hand-object animations that excel over the +state-of-the-art in motion quality and diversity. + +
+
+
+
+
+ + ☆ Reconstruction of 3D lumbar spine models from incomplete segmentations + using landmark detection + + +
+ Patient-specific 3D spine models serve as a foundation for spinal treatment +and surgery planning as well as analysis of loading conditions in biomechanical +and biomedical research. Despite advancements in imaging technologies, the +reconstruction of complete 3D spine models often faces challenges due to +limitations in imaging modalities such as planar X-Ray, and because certain +spinal structures, such as the spinous or transverse processes, are missing from volumetric +medical images and the resulting segmentations. In this study, we present a novel, +accurate, and time-efficient method to reconstruct complete 3D lumbar spine +models from incomplete 3D vertebral bodies obtained from segmented magnetic +resonance images (MRI). In our method, we use an affine transformation to align +artificial vertebra models with patient-specific incomplete vertebrae. The +transformation matrix is derived from vertebra landmarks, which are +automatically detected on the vertebra endplates. The results of our evaluation +demonstrate the high accuracy of the performed registration, achieving an +average point-to-model distance of 1.95 mm. Additionally, in assessing the +morphological properties of the vertebrae and intervertebral characteristics, +our method demonstrated a mean absolute error (MAE) of 3.4{\deg} in the angles +of functional spine units (FSUs), emphasizing its effectiveness in maintaining +important spinal features throughout the transformation process of individual +vertebrae. Our method achieves the registration of the entire lumbar spine, +spanning segments L1 to L5, in just 0.14 seconds, showcasing its +time-efficiency. Clinical relevance: the fast and accurate reconstruction of +spinal models from incomplete input data such as segmentations provides a +foundation for many applications in spine diagnostics, treatment planning, and +the development of spinal healthcare solutions. + +
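The core step, an affine transform estimated from a handful of landmark correspondences, can be sketched as a least-squares fit; the landmark names and point counts below are illustrative, not taken from the paper.

```python
# Hedged sketch: fit an affine transform that maps template vertebra landmarks
# to patient-specific landmarks, then apply it to the full template mesh.
import numpy as np

def fit_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Return a 3x4 affine matrix A such that dst ~= src @ A[:, :3].T + A[:, 3]."""
    n = src.shape[0]
    src_h = np.hstack([src, np.ones((n, 1))])          # homogeneous coordinates (n, 4)
    A, *_ = np.linalg.lstsq(src_h, dst, rcond=None)    # (4, 3) least-squares solution
    return A.T                                          # (3, 4)

def apply_affine(points: np.ndarray, A: np.ndarray) -> np.ndarray:
    return points @ A[:, :3].T + A[:, 3]

# Toy usage: six endplate landmarks on the template and on the patient vertebra.
template_landmarks = np.random.rand(6, 3)
true_A = np.hstack([np.eye(3) * 1.1, np.array([[2.0], [0.5], [-1.0]])])
patient_landmarks = apply_affine(template_landmarks, true_A)
A = fit_affine(template_landmarks, patient_landmarks)
template_mesh_vertices = np.random.rand(1000, 3)
registered = apply_affine(template_mesh_vertices, A)    # aligned artificial vertebra
```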
+
+
+
+
+ + ☆ EvTTC: An Event Camera Dataset for Time-to-Collision Estimation + + +
+ Time-to-Collision (TTC) estimation lies in the core of the forward collision +warning (FCW) functionality, which is key to all Automatic Emergency Braking +(AEB) systems. Although the success of solutions using frame-based cameras +(e.g., Mobileye's solutions) has been witnessed in normal situations, some +extreme cases, such as the sudden variation in the relative speed of leading +vehicles and the sudden appearance of pedestrians, still pose significant risks +that cannot be handled. This is due to the inherent imaging principles of +frame-based cameras, where the time interval between adjacent exposures +introduces considerable system latency to AEB. Event cameras, as a novel +bio-inspired sensor, offer ultra-high temporal resolution and can +asynchronously report brightness changes at the microsecond level. To explore +the potential of event cameras in the above-mentioned challenging cases, we +propose EvTTC, which is, to the best of our knowledge, the first multi-sensor +dataset focusing on TTC tasks under high-relative-speed scenarios. EvTTC +consists of data collected using standard cameras and event cameras, covering +various potential collision scenarios in daily driving and involving multiple +collision objects. Additionally, LiDAR and GNSS/INS measurements are provided +for the calculation of ground-truth TTC. Considering the high cost of testing +TTC algorithms on full-scale mobile platforms, we also provide a small-scale +TTC testbed for experimental validation and data augmentation. All the data and +the design of the testbed are open sourced, and they can serve as a benchmark +that will facilitate the development of vision-based TTC techniques. + +
+
+ comment: 8 pages, 7 figures, 5 tables +
+
+
+
+
+ + ☆ ReF-LDM: A Latent Diffusion Model for Reference-based Face Image + Restoration NeurIPS 2024 + + +
+ While recent works on blind face image restoration have successfully produced +impressive high-quality (HQ) images with abundant details from low-quality (LQ) +input images, the generated content may not accurately reflect the real +appearance of a person. To address this problem, incorporating well-shot +personal images as additional reference inputs could be a promising strategy. +Inspired by the recent success of the Latent Diffusion Model (LDM), we propose +ReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned +on one LQ image and multiple HQ reference images. Our model integrates an +effective and efficient mechanism, CacheKV, to leverage the reference images +during the generation process. Additionally, we design a timestep-scaled +identity loss, enabling our LDM-based model to focus on learning the +discriminating features of human faces. Lastly, we construct FFHQ-Ref, a +dataset consisting of 20,405 high-quality (HQ) face images with corresponding +reference images, which can serve as both training and evaluation data for +reference-based face restoration models. + +
+
+ comment: NeurIPS 2024, project page + https://chiweihsiao.github.io/refldm.github.io/ +
+
+
+
+
+ + ☆ Improving Post-Earthquake Crack Detection using Semi-Synthetic Generated + Images ECCV2024 + + +
+ Following an earthquake, it is vital to quickly evaluate the safety of the +impacted areas. Damage detection systems, powered by computer vision and deep +learning, can assist experts in this endeavor. However, the lack of extensive, +labeled datasets poses a challenge to the development of these systems. In this +study, we introduce a technique for generating semi-synthetic images to be used +as data augmentation during the training of a damage detection system. We +specifically aim to generate images of cracks, which are a prevalent and +indicative form of damage. The central concept is to employ parametric +meta-annotations to guide the process of generating cracks on 3D models of +real-world structures. The governing parameters of these meta-annotations can be +adjusted iteratively to yield images that are optimally suited for improving +detectors' performance. Comparative evaluations demonstrated that a crack +detection system trained with a combination of real and semi-synthetic images +outperforms a system trained on real images alone. + +
+
+ comment: Accepted at ECCV2024 Workshop: SyntheticData4CV 2024 +
+
+
+
+
+ + ☆ SMIC: Semantic Multi-Item Compression based on CLIP dictionary + + +
+ Semantic compression, a compression scheme where the distortion metric, +typically MSE, is replaced with semantic fidelity metrics, is becoming increasingly +popular. Most recent semantic compression schemes rely on the +foundation model CLIP. In this work, we extend such a scheme to image +collection compression, where inter-item redundancy is taken into account +during the coding phase. For that purpose, we first show that CLIP's latent +space allows for easy semantic additions and subtractions. From this property, +we define a dictionary-based multi-item codec that outperforms state-of-the-art +generative codecs in terms of compression rate, around $10^{-5}$ BPP per image, +while not sacrificing semantic fidelity. We also show that the learned +dictionary is of a semantic nature and works as a semantic projector for the +semantic content of images. + +
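The two properties mentioned above (embedding arithmetic and dictionary-based coding) can be illustrated as follows; random vectors stand in for real CLIP features, and the sparse-coding scheme is an assumption, not the paper's codec.

```python
# Hedged sketch: semantic addition/subtraction in a CLIP-like embedding space,
# and coding an embedding as a sparse combination of dictionary atoms.
import torch

def normalize(x: torch.Tensor) -> torch.Tensor:
    return x / x.norm(dim=-1, keepdim=True)

# Semantic arithmetic: (beach photo) - (daylight) + (night) ~ night beach photo.
d = 512
e_beach, e_day, e_night = normalize(torch.randn(3, d))
e_query = normalize(e_beach - e_day + e_night)

dictionary = normalize(torch.randn(256, d))             # learned semantic atoms

def encode(embedding: torch.Tensor, k: int = 8):
    """Keep the k best-matching atoms; only indices + k coefficients are transmitted."""
    scores = dictionary @ embedding                      # cosine scores (unit-norm atoms)
    idx = scores.topk(k).indices
    coeffs = torch.linalg.lstsq(dictionary[idx].T, embedding.unsqueeze(-1)).solution.squeeze(-1)
    return idx, coeffs

def decode(idx: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    return normalize(coeffs @ dictionary[idx])

idx, coeffs = encode(e_query)
reconstruction = decode(idx, coeffs)                     # semantic reconstruction target
```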
+
+ comment: 12 pages, 14 figures, 3 tables, journal paper, preprint +
+
+
+
+
+ + ☆ SAMCL: Empowering SAM to Continually Learn from Dynamic Domains + + +
+ Segment Anything Model (SAM) struggles with segmenting objects in the open +world, especially across diverse and dynamic domains. Continual segmentation +(CS) is a potential technique to solve this issue, but a significant obstacle +is the intractable balance between previous domains (stability) and new domains +(plasticity) during CS. Furthermore, how to utilize two kinds of features of +SAM, images and prompts, in an efficient and effective CS manner remains a +significant hurdle. In this work, we propose a novel CS method, termed SAMCL, +to address these challenges. It is the first study to empower SAM with the CS +ability across dynamic domains. SAMCL decouples stability and plasticity during +CS by two components: $\textit{AugModule}$ and $\textit{Module Selector}$. +Specifically, SAMCL leverages individual $\textit{AugModule}$ to effectively +and efficiently learn new relationships between images and prompts in each +domain. $\textit{Module Selector}$ selects the appropriate module during +testing, based on the inherent ability of SAM to distinguish between different +domains. These two components enable SAMCL to realize a task-agnostic method +without any interference across different domains. Experimental results +demonstrate that SAMCL outperforms state-of-the-art methods, achieving an +exceptionally low average forgetting of just $0.5$%, along with at least a +$2.5$% improvement in transferring to unseen domains. Moreover, the tunable +parameter consumption in AugModule is about $0.236$MB, marking at least a +$23.3$% reduction compared to other fine-tuning methods. + +
+
+ comment: 14 pages, 11 figures +
+
+
+
+
+ + ☆ Backdooring Outlier Detection Methods: A Novel Attack Approach + + +
+ There have been several efforts in backdoor attacks, but these have primarily +focused on the closed-set performance of classifiers (i.e., classification). +This has left a gap in addressing the threat to classifiers' open-set +performance, referred to as outlier detection in the literature. Reliable +outlier detection is crucial for deploying classifiers in critical real-world +applications such as autonomous driving and medical image analysis. First, we +show that existing backdoor attacks fall short in affecting the open-set +performance of classifiers, as they have been specifically designed to confuse +intra-closed-set decision boundaries. In contrast, an effective backdoor attack +for outlier detection needs to confuse the decision boundary between the closed +and open sets. Motivated by this, in this study, we propose BATOD, a novel +Backdoor Attack targeting the Outlier Detection task. Specifically, we design +two categories of triggers to shift inlier samples to outliers and vice versa. +We evaluate BATOD using various real-world datasets and demonstrate its +superior ability to degrade the open-set performance of classifiers compared to +previous attacks, both before and after applying defenses. + +
+
+
+
+
+ + ☆ SLayR: Scene Layout Generation with Rectified Flow + + +
+ We introduce SLayR, Scene Layout Generation with Rectified flow. +State-of-the-art text-to-image models achieve impressive results. However, they +generate images end-to-end, exposing no fine-grained control over the process. +SLayR presents a novel transformer-based rectified flow model for layout +generation over a token space that can be decoded into bounding boxes and +corresponding labels, which can then be transformed into images using existing +models. We show that established metrics for generated images are inconclusive +for evaluating their underlying scene layout, and introduce a new benchmark +suite, including a carefully designed repeatable human-evaluation procedure +that assesses the plausibility and variety of generated layouts. In contrast to +previous works, which perform well on either variety or plausibility, we +show that our approach performs well on both of these axes at the same time. It +is also at least 5x smaller in the number of parameters and 37% faster +than the baselines. Our complete text-to-image pipeline demonstrates the added +benefits of an interpretable and editable intermediate representation. + +
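As background, sampling from a rectified-flow model amounts to integrating a learned velocity field from noise to data; the sketch below uses a placeholder velocity network and assumed token dimensions, and is not SLayR's architecture.

```python
# Hedged sketch: Euler integration of a rectified flow over layout tokens.
import torch
import torch.nn as nn

velocity_net = nn.Sequential(nn.Linear(64 + 1, 128), nn.SiLU(), nn.Linear(128, 64))

@torch.no_grad()
def sample_layout_tokens(num_tokens: int = 16, dim: int = 64, steps: int = 50):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (layout tokens)."""
    x = torch.randn(num_tokens, dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((num_tokens, 1), i * dt)
        x = x + dt * velocity_net(torch.cat([x, t], dim=-1))  # straight-line flow step
    return x                                           # decode into boxes + labels downstream

tokens = sample_layout_tokens()
```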
+
+ comment: 34 pages, 29 figures, 5 tables +
+
+
+
+
+ + ☆ ETLNet: An Efficient TCN-BiLSTM Network for Road Anomaly Detection Using + Smartphone Sensors ICPR 2024 + + +
+ Road anomalies can be defined as irregularities on the road surface or in the +surface itself. Some may be intentional (such as speedbumps), accidental (such +as materials falling off a truck), or the result of excessive road use or little +to no maintenance, such as potholes. Despite their varying origins, these +irregularities often harm vehicles substantially. Speed bumps are intentionally +placed for safety but are dangerous due to their non-standard shape, size, and +lack of proper markings. Potholes are unintentional and can also cause severe +damage. To address the detection of these anomalies, we need an automated road +monitoring system. Today, various systems exist that use visual information to +track these anomalies. Still, due to poor lighting conditions and improper or +missing markings, they may go undetected and have severe consequences for +public transport, automated vehicles, etc. In this paper, the Enhanced +Temporal-BiLSTM Network (ETLNet) is introduced as a novel approach that +integrates two Temporal Convolutional Network (TCN) layers with a Bidirectional +Long Short-Term Memory (BiLSTM) layer. This combination is tailored to detect +anomalies effectively irrespective of lighting conditions, as it depends not on +visuals but on smartphone inertial sensor data. Our methodology employs +accelerometer and gyroscope sensors, typically found in smartphones, to gather data +on road conditions. Empirical evaluations demonstrate that the ETLNet model +achieves an F1-score of 99.3% for detecting speed bumps. The ETLNet model's +robustness and efficiency significantly advance automated road surface +monitoring technologies. + +
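A minimal sketch of a TCN-plus-BiLSTM detector over inertial windows follows; the channel sizes, dilation pattern, and window length are assumptions, not ETLNet's published configuration.

```python
# Hedged sketch: two temporal convolution layers followed by a BiLSTM over
# smartphone accelerometer + gyroscope windows.
import torch
import torch.nn as nn

class TCNBiLSTM(nn.Module):
    def __init__(self, in_channels: int = 6, hidden: int = 64, num_classes: int = 2):
        super().__init__()
        self.tcn = nn.Sequential(                        # two dilated temporal conv layers
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1, dilation=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(),
        )
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                                # x: (batch, time, 6) accel + gyro
        feats = self.tcn(x.transpose(1, 2)).transpose(1, 2)   # (batch, time, hidden)
        _, (h, _) = self.bilstm(feats)                   # h: (2, batch, hidden)
        summary = torch.cat([h[0], h[1]], dim=-1)        # concat both directions
        return self.head(summary)                        # per-window anomaly logits

# Toy usage: a 2-second window sampled at 50 Hz (100 steps) of 3-axis accel + gyro.
model = TCNBiLSTM()
window = torch.randn(4, 100, 6)
logits = model(window)                                   # (4, 2): normal vs anomaly
```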
+
+ comment: Presented in ICPR 2024, Kolkata, December 1-5, 2024 (First Workshop + on Intelligent Mobility in Unstructured Environments) +
+
+
+
+
+ + ☆ Power Plant Detection for Energy Estimation using GIS with Remote + Sensing, CNN & Vision Transformers + + +
+ In this research, we propose a hybrid model for power plant detection to +assist energy estimation applications, by pipelining GIS (Geographical +Information Systems) with Remote Sensing capabilities, CNNs (Convolutional +Neural Networks), and ViTs (Vision Transformers). Our proposed approach enables +real-time analysis of multiple data types on a common map via the GIS, +provides feature-extraction capabilities through the CNN, and captures long-range +dependencies through the ViT. This hybrid approach is found to enhance +classification, thus helping in the monitoring and operational management of +power plants, hence assisting energy estimation and sustainable energy planning +in the future. It exemplifies the effective deployment of machine learning methods +in conjunction with domain-specific approaches to enhance performance. + +
+
+
+
+
+ + ☆ MixedGaussianAvatar: Realistically and Geometrically Accurate Head + Avatar via Mixed 2D-3D Gaussian Splatting + + +
+ Reconstructing high-fidelity 3D head avatars is crucial in various +applications such as virtual reality. The pioneering methods reconstruct +realistic head avatars with Neural Radiance Fields (NeRF), which have been +limited by training and rendering speed. Recent methods based on 3D Gaussian +Splatting (3DGS) significantly improve the efficiency of training and +rendering. However, the surface inconsistency of 3DGS results in subpar +geometric accuracy; later, 2DGS uses 2D surfels to enhance geometric accuracy +at the expense of rendering fidelity. To leverage the benefits of both 2DGS and +3DGS, we propose a novel method named MixedGaussianAvatar for realistically and +geometrically accurate head avatar reconstruction. Our main idea is to utilize +2D Gaussians to reconstruct the surface of the 3D head, ensuring geometric +accuracy. We attach the 2D Gaussians to the triangular mesh of the FLAME model +and connect additional 3D Gaussians to those 2D Gaussians where the rendering +quality of 2DGS is inadequate, creating a mixed 2D-3D Gaussian representation. +These 2D-3D Gaussians can then be animated using FLAME parameters. We further +introduce a progressive training strategy that first trains the 2D Gaussians +and then fine-tunes the mixed 2D-3D Gaussians. We demonstrate the superiority +of MixedGaussianAvatar through comprehensive experiments. The code will be +released at: https://github.com/ChenVoid/MGA/. + +
+
+ comment: Project: https://chenvoid.github.io/MGA/ +
+
+
+
+
+ + ☆ Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for + Radiology Report Generation ACL 2024 + + +
+ We introduce a radiology-focused visual language model designed to generate +radiology reports from chest X-rays. Building on previous findings that large +language models (LLMs) can acquire multimodal capabilities when aligned with +pretrained vision encoders, we demonstrate similar potential with chest X-ray +images. This integration enhances the ability of the model to understand and +describe chest X-ray images. Our model combines an image encoder with a +fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate +different sections of a radiology report with notable accuracy. The training +process involves a two-stage approach: (i) initial alignment of chest X-ray +features with the LLM, followed by (ii) fine-tuning for radiology report +generation. + +
+
+ comment: Accepted by BioNLP@ACL 2024 +
+
+
+
+
+ + ☆ HOLa: HoloLens Object Labeling + + +
+ In the context of medical Augmented Reality (AR) applications, object +tracking is a key challenge and requires a significant number of annotation +masks. As segmentation foundation models like the Segment Anything Model (SAM) +begin to emerge, zero-shot segmentation requires only minimal human +participation to obtain high-quality object masks. We introduce a +HoloLens-Object-Labeling (HOLa) Unity and Python application based on the +SAM-Track algorithm that offers fully automatic single-object annotation for +HoloLens 2 while requiring minimal human participation. HOLa does not have to +be adjusted to a specific image appearance and could thus facilitate AR research +in any application field. We evaluate HOLa for different degrees of image +complexity in open liver surgery and in medical phantom experiments. Using HOLa +for image annotation can increase the labeling speed by more than 500 times +while providing Dice scores between 0.875 and 0.982, which are comparable to +human annotators. Our code is publicly available at: +https://github.com/mschwimmbeck/HOLa + +
+
+ comment: accepted by BMT 2024 +
+
+
+
+
+ + ☆ Verb Mirage: Unveiling and Assessing Verb Concept Hallucinations in + Multimodal Large Language Models + + +
+ Multimodal Large Language Models (MLLMs) have garnered significant attention +recently and demonstrate outstanding capabilities in various tasks such as OCR, +VQA, captioning, $\textit{etc}$. However, hallucination remains a persistent +issue. While numerous methods have been proposed to mitigate hallucinations, +achieving notable improvements, these methods primarily focus on mitigating +hallucinations about $\textbf{object/noun-related}$ concepts. Verb concepts, +crucial for understanding human actions, have been largely overlooked. In this +paper, to the best of our knowledge, we are the $\textbf{first}$ to investigate +the $\textbf{verb hallucination}$ phenomenon of MLLMs from various +perspectives. Our findings reveal that most state-of-the-art MLLMs suffer from +severe verb hallucination. To assess the effectiveness of existing mitigation +methods for object concept hallucination on verb hallucination, we evaluated +these methods and found that they do not effectively address verb +hallucination. To address this issue, we propose a novel rich verb +knowledge-based tuning method to mitigate verb hallucination. The experiment +results demonstrate that our method significantly reduces hallucinations +related to verbs. $\textit{Our code and data will be made publicly available}$. + +
+
+
+
+
+ + ☆ Uncertainty-aware retinal layer segmentation in OCT through + probabilistic signed distance functions + + +
+ In this paper, we present a new approach for uncertainty-aware retinal layer +segmentation in Optical Coherence Tomography (OCT) scans using probabilistic +signed distance functions (SDF). Traditional pixel-wise and regression-based +methods encounter difficulties in precise segmentation and lack +geometrical grounding, respectively. To address these shortcomings, our +methodology refines the segmentation by predicting a signed distance function +(SDF) that effectively parameterizes the retinal layer shape via a level set. We +further enhance the framework by integrating probabilistic modeling, applying +Gaussian distributions to encapsulate the uncertainty in the shape +parameterization. This ensures a robust representation of the retinal layer +morphology even in the presence of ambiguous input, imaging noise, and +unreliable segmentations. Both quantitative and qualitative evaluations +demonstrate superior performance when compared to other methods. Additionally, +we conducted experiments on artificially distorted datasets with various noise +types (shadowing, blinking, speckle, and motion) common in OCT scans to showcase +the effectiveness of our uncertainty estimation. Our findings demonstrate the +possibility of obtaining reliable segmentation of retinal layers, as well as an +initial step towards the characterization of layer integrity, a key biomarker +for disease progression. Our code is available at +\url{https://github.com/niazoys/RLS_PSDF}. + +
+
+
+
+
+ + ☆ DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object + Detection + + +
+ Object detection in poor-illumination environments is a challenging task as +objects are usually not clearly visible in RGB images. As infrared images +provide additional clear edge information that complements RGB images, fusing +RGB and infrared images has the potential to enhance the detection ability in +poor-illumination environments. However, existing works involving both visible +and infrared images only focus on image fusion, instead of object detection. +Moreover, they directly fuse the two kinds of image modalities, which ignores +the mutual interference between them. To fuse the two modalities to maximize +the advantages of cross-modality, we design a dual-enhancement-based +cross-modality object detection network DEYOLO, in which semantic-spatial cross +modality and novel bi-directional decoupled focus modules are designed to +achieve the detection-centered mutual enhancement of RGB-infrared (RGB-IR). +Specifically, a dual semantic enhancing channel weight assignment module (DECA) +and a dual spatial enhancing pixel weight assignment module (DEPA) are first +proposed to aggregate cross-modality information in the feature space to +improve the feature representation ability, such that feature fusion can aim at +the object detection task. Meanwhile, a dual-enhancement mechanism, including +enhancements for two-modality fusion and single modality, is designed in both +DECA and DEPA to reduce interference between the two kinds of image modalities. +Then, a novel bi-directional decoupled focus is developed to enlarge the +receptive field of the backbone network in different directions, which improves +the representation quality of DEYOLO. Extensive experiments on M3FD and LLVIP +show that our approach outperforms SOTA object detection algorithms by a clear +margin. Our code is available at https://github.com/chips96/DEYOLO. + +
+
+
+
+
+ + ☆ Video Decomposition Prior: A Methodology to Decompose Videos into Layers ICLR + + +
+ In the evolving landscape of video enhancement and editing methodologies, a +majority of deep learning techniques often rely on extensive datasets of +observed input and ground truth sequence pairs for optimal performance. Such +reliance often falters when acquiring data becomes challenging, especially in +tasks like video dehazing and relighting, where replicating identical motions +and camera angles in both corrupted and ground truth sequences is complicated. +Moreover, these conventional methodologies perform best when the test +distribution closely mirrors the training distribution. Recognizing these +challenges, this paper introduces a novel video decomposition prior +`\texttt{VDP}' framework which derives inspiration from professional video +editing practices. Our methodology does not mandate task-specific external data +corpus collection; instead, it pivots to utilizing the motion and appearance of the +input video. The \texttt{VDP} framework decomposes a video sequence into a set of +multiple RGB layers and associated opacity levels. These layers are then +manipulated individually to obtain the desired results. We address tasks such +as video object segmentation, dehazing, and relighting. Moreover, we introduce +a novel logarithmic video decomposition formulation for video relighting tasks, +setting a new benchmark over the existing methodologies. We observe the +property of relighting emerge as we optimize for our novel relighting +decomposition formulation. We evaluate our approach on standard video datasets +like DAVIS, REVIDE, \& SDSD and show qualitative results on a diverse array of +internet videos. Project Page - +https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video +results. + +
+
+ comment: Project Page - + https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video + results. Extended version of ICLR publication +
+
+
+
+
+ + ☆ Continuous Video Process: Modeling Videos as Continuous + Multi-Dimensional Processes for Video Prediction CVPR + + +
+ Diffusion models have made significant strides in image generation, mastering +tasks such as unconditional image synthesis, text-image translation, and +image-to-image conversions. However, their capability falls short in the realm +of video prediction, mainly because they treat videos as a collection of +independent images, relying on external constraints such as temporal attention +mechanisms to enforce temporal coherence. In our paper, we introduce a novel +model class that treats video as a continuous multi-dimensional process rather +than a series of discrete frames. We also report a 75\% reduction in the sampling +steps required to sample a new frame, thus making our framework more efficient +during inference. Through extensive experimentation, we establish +state-of-the-art performance in video prediction, validated on benchmark +datasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project +page https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results. + +
+
+ comment: Navigate to the project page + https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results. + Extended version of published CVPR paper +
+
+
+
+
+ + ☆ $S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization + of Vision-Language Models + + +
+ Recently, many studies have been conducted to enhance the zero-shot +generalization ability of vision-language models (e.g., CLIP) by addressing the +semantic misalignment between image and text embeddings in downstream tasks. +Although many efforts have been made, existing methods barely consider the fact +that a class of images can be described by notably different textual concepts +due to well-known lexical variation in natural language processing, which +heavily affects the zero-shot generalization of CLIP. Therefore, this paper +proposes a \textbf{S}ynonymous \textbf{S}emantic \textbf{S}pace ($S^3$) for +each image class, rather than relying on a single textual concept, achieving +more stable semantic alignment and improving the zero-shot generalization of +CLIP. Specifically, our $S^3$ method first generates several synonymous +concepts based on the label of each class by using large language models, and +constructs a continuous yet compact synonymous semantic space based on the +Vietoris-Rips complex of the generated synonymous concepts. Furthermore, we +explore the effect of several point-to-space metrics on our $S^3$, while +presenting a point-to-local-center metric to compute similarity between image +embeddings and the synonymous semantic space of each class, accomplishing +effective zero-shot predictions. Extensive experiments are conducted across 17 +benchmarks, including fine-grained zero-shot classification, natural +distribution zero-shot classification, and open-vocabulary segmentation, and +the results show that our $S^3$ outperforms state-of-the-art methods. + +
+
+
+
+
+ + ☆ Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video + Object Detection WACV 2025 + + +
+ The primary challenge in Video Object Detection (VOD) is effectively +exploiting temporal information to enhance object representations. Traditional +strategies, such as aggregating region proposals, often suffer from feature +variance due to the inclusion of background information. We introduce a novel +instance mask-based feature aggregation approach, significantly refining this +process and deepening the understanding of object dynamics across video frames. +We present FAIM, a new VOD method that enhances temporal Feature Aggregation by +leveraging Instance Mask features. In particular, we propose the lightweight +Instance Feature Extraction Module (IFEM) to learn instance mask features and +the Temporal Instance Classification Aggregation Module (TICAM) to aggregate +instance mask and classification features across video frames. Using YOLOX as a +base detector, FAIM achieves 87.9% mAP on the ImageNet VID dataset at 33 FPS on +a single 2080Ti GPU, setting a new benchmark for the speed-accuracy trade-off. +Additional experiments on multiple datasets validate that our approach is +robust, method-agnostic, and effective in multi-object tracking, demonstrating +its broader applicability to video understanding tasks. + +
+
+ comment: To appear in WACV 2025 +
+
+
+
+
+ + ☆ UniMIC: Towards Universal Multi-modality Perceptual Image Compression + + +
+ We present UniMIC, a universal multi-modality image compression framework, +intending to unify the rate-distortion-perception (RDP) optimization for +multiple image codecs simultaneously through excavating cross-modality +generative priors. Unlike most existing works that need to design and optimize +image codecs from scratch, our UniMIC introduces the visual codec repository, +which incorporates a number of representative image codecs and directly uses +them as the basic codecs for various practical applications. Moreover, we +propose multi-grained textual coding, where variable-length content prompts and +compression prompts are designed and encoded to assist the perceptual +reconstruction through multi-modality conditional generation. In +particular, a universal perception compensator is proposed to improve the +perception quality of decoded images from all basic codecs at the decoder side +by reusing text-assisted diffusion priors from Stable Diffusion. With the +cooperation of the above three strategies, our UniMIC achieves a significant +improvement in RDP optimization for different compression codecs, e.g., +traditional and learnable codecs, and different compression costs, e.g., +ultra-low bitrates. The code will be available at +https://github.com/Amygyx/UniMIC . + +
+
+
+
+
+ + ☆ EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation + + +
+ Multimodal large language models (MLLMs) have achieved remarkable progress on +various visual question answering and reasoning tasks by leveraging instruction +fine-tuning on specific datasets. They can also learn from preference data +annotated by humans to enhance their reasoning ability and mitigate +hallucinations. Most preference data is generated by the model itself. +However, existing methods require high-quality critical labels, which are +costly and rely on humans or proprietary models like GPT-4V. In this work, we +propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which +aligns MLLMs using self-generated preference data from only 5k images, +economically. Our approach begins with collecting and refining a Scoring +Evaluation Instruction-tuning dataset to train a critical evaluation model, +termed the Critic. This Critic observes model responses across multiple +dimensions, selecting preferred and non-preferred outputs for refined Direct +Preference Optimization (DPO) tuning. To further enhance model performance, we +employ an additional supervised fine-tuning stage after preference tuning. EACO +reduces overall hallucinations by 65.6% on HallusionBench and improves +reasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement +over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also +shows potential critical ability in open-source MLLMs, demonstrating that +EACO is a viable path to boost the competence of MLLMs. + +
+
+ comment: 19 pages +
+
+
+
+
+ + ☆ Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised + Pretraining with Pseudo-Label Refinement + + +
+ Deep learning models rely heavily on large volumes of labeled data to achieve +high performance. However, real-world datasets often contain noisy labels due +to human error, ambiguity, or resource constraints during the annotation +process. Instance-dependent label noise (IDN), where the probability of a label +being corrupted depends on the input features, poses a significant challenge +because it is more prevalent and harder to address than instance-independent +noise. In this paper, we propose a novel hybrid framework that combines +self-supervised learning using SimCLR with iterative pseudo-label refinement to +mitigate the effects of IDN. The self-supervised pre-training phase enables the +model to learn robust feature representations without relying on potentially +noisy labels, establishing a noise-agnostic foundation. Subsequently, we employ +an iterative training process with pseudo-label refinement, where confidently +predicted samples are identified through a multistage approach and their labels +are updated to improve label quality progressively. We evaluate our method on +the CIFAR-10 and CIFAR-100 datasets augmented with synthetic instance-dependent +noise at varying noise levels. Experimental results demonstrate that our +approach significantly outperforms several state-of-the-art methods, +particularly under high noise conditions, achieving notable improvements in +classification accuracy and robustness. Our findings suggest that integrating +self-supervised learning with iterative pseudo-label refinement offers an +effective strategy for training deep neural networks on noisy datasets +afflicted by instance-dependent label noise. + +
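One refinement round of the hybrid strategy described above could look like the sketch below: keep only confidently predicted samples and overwrite their possibly noisy labels with the model's predictions. The confidence threshold and single-round structure are assumptions.

```python
# Hedged sketch of one pseudo-label refinement round after self-supervised
# (SimCLR-style) pre-training and initial fine-tuning.
import numpy as np

def refine_labels(probs: np.ndarray, noisy_labels: np.ndarray, threshold: float = 0.95):
    """probs: (n, num_classes) softmax outputs; returns updated labels and a keep mask."""
    confidence = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    keep = confidence >= threshold                     # confidently predicted subset
    refined = noisy_labels.copy()
    refined[keep] = predictions[keep]                  # overwrite suspected noisy labels
    return refined, keep

# Toy usage: 5 samples, 3 classes; sample 0 has a likely corrupted label.
probs = np.array([[0.97, 0.02, 0.01],
                  [0.40, 0.35, 0.25],
                  [0.01, 0.98, 0.01],
                  [0.30, 0.30, 0.40],
                  [0.05, 0.05, 0.90]])
noisy_labels = np.array([2, 1, 1, 0, 2])
refined, keep = refine_labels(probs, noisy_labels)     # sample 0 is relabeled to class 0
```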
+
+
+
+
+ + ☆ Comprehensive Analysis and Improvements in Pansharpening Using Deep + Learning + + +
+ Pansharpening is a crucial task in remote sensing, enabling the generation of +high-resolution multispectral images by fusing low-resolution multispectral +data with high-resolution panchromatic images. This paper provides a +comprehensive analysis of traditional and deep learning-based pansharpening +methods. While state-of-the-art deep learning methods have significantly +improved image quality, issues like spectral distortions persist. To address +this, we propose enhancements to the PSGAN framework by introducing novel +regularization techniques for the generator loss function. Experimental results +on images from the Worldview-3 dataset demonstrate that the proposed +modifications improve spectral fidelity and achieve superior performance across +multiple quantitative metrics while delivering visually superior results. + +
+
+
+
+
+ + ☆ Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large + Scene Reconstruction + + +
+ 3D Gaussian Splatting has demonstrated notable success in large-scale scene +reconstruction, but challenges persist due to high training memory consumption +and storage overhead. Hybrid representations that integrate implicit and +explicit features offer a way to mitigate these limitations. However, when +applied in parallelized block-wise training, two critical issues arise: +reconstruction accuracy deteriorates due to reduced data diversity when +training each block independently, and parallel training restricts the number +of divided blocks to the number of available GPUs. To address these issues, we +propose Momentum-GS, a novel approach that leverages momentum-based +self-distillation to promote consistency and accuracy across the blocks while +decoupling the number of blocks from the physical GPU count. Our method +maintains a teacher Gaussian decoder updated with momentum, ensuring a stable +reference during training. This teacher provides each block with global +guidance in a self-distillation manner, promoting spatial consistency in +reconstruction. To further ensure consistency across the blocks, we incorporate +block weighting, dynamically adjusting each block's weight according to its +reconstruction accuracy. Extensive experiments on large-scale scenes show that +our method consistently outperforms existing techniques, achieving a 12.8% +improvement in LPIPS over CityGaussian with far fewer divided blocks and +establishing a new state of the art. Project page: +https://jixuan-fan.github.io/Momentum-GS_Page/ + +
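The momentum-updated teacher at the heart of such self-distillation is typically an exponential moving average of the student parameters; the sketch below shows that update in isolation, with the momentum value and the tiny stand-in decoder being assumptions.

```python
# Hedged sketch: EMA (momentum) teacher update for self-distillation across blocks.
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def momentum_update(teacher: nn.Module, student: nn.Module, m: float = 0.999):
    """teacher <- m * teacher + (1 - m) * student, parameter by parameter."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(m).add_(p_s, alpha=1.0 - m)

# Toy usage: a small decoder standing in for a Gaussian attribute decoder.
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)                            # teacher is never trained directly

features = torch.randn(16, 32)
distill_loss = nn.functional.mse_loss(student(features), teacher(features))
momentum_update(teacher, student)                      # called after each optimizer step
```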
+
+
+
+
+ + ☆ AI-Driven Non-Invasive Detection and Staging of Steatosis in Fatty Liver + Disease Using a Novel Cascade Model and Information Fusion Techniques + + +
+ Non-alcoholic fatty liver disease (NAFLD) is one of the most widespread liver +disorders on a global scale, posing a significant threat of progressing to more +severe conditions like nonalcoholic steatohepatitis (NASH), liver fibrosis, +cirrhosis, and hepatocellular carcinoma. Diagnosing and staging NAFLD presents +challenges due to its non-specific symptoms and the invasive nature of liver +biopsies. Our research introduces a novel artificial intelligence cascade model +employing ensemble learning and feature fusion techniques. We developed a +non-invasive, robust, and reliable diagnostic artificial intelligence tool that +utilizes anthropometric and laboratory parameters, facilitating early detection +and intervention in NAFLD progression. Our novel artificial intelligence model +achieved an 86% accuracy rate for the NASH steatosis staging task (non-NASH, +steatosis grade 1, steatosis grade 2, and steatosis grade 3) and an impressive +96% AUC-ROC for distinguishing between NASH (steatosis grade 1, grade 2, and +grade 3) and non-NASH cases, outperforming current state-of-the-art models. This +notable improvement in diagnostic performance underscores the potential +application of artificial intelligence in the early diagnosis and treatment of +NAFLD, leading to better patient outcomes and a reduced healthcare burden +associated with advanced liver disease. + +
+
+
+
+
+ + ☆ MozzaVID: Mozzarella Volumetric Image Dataset + + +
+ Influenced by the complexity of volumetric imaging, there is a shortage of +established datasets useful for benchmarking volumetric deep-learning models. +As a consequence, new and existing models are not easily comparable, limiting +the development of architectures optimized specifically for volumetric data. To +counteract this trend, we introduce MozzaVID - a large, clean, and versatile +volumetric classification dataset. Our dataset contains X-ray computed +tomography (CT) images of mozzarella microstructure and enables the +classification of 25 cheese types and 149 cheese samples. We provide data in +three different resolutions, resulting in three dataset instances containing +from 591 to 37,824 images. While being general-purpose, the dataset also +facilitates investigating mozzarella structure properties. The structure of +food directly affects its functional properties and thus its consumption +experience. Understanding food structure helps tune the production and +mimicking it enables sustainable alternatives to animal-derived food products. +The complex and disordered nature of food structures brings a unique challenge, +where a choice of appropriate imaging method, scale, and sample size is not +trivial. With this dataset we aim to address these complexities, contributing +to more robust structural analysis models. The dataset can be downloaded from: +https://archive.compute.dtu.dk/files/public/projects/MozzaVID/. + +
+
+
+
+
+ + ☆ Automatic Tissue Differentiation in Parotidectomy using Hyperspectral + Imaging + + +
+ In head and neck surgery, continuous intraoperative tissue differentiation is +of great importance to avoid injury to sensitive structures such as nerves and +vessels. Hyperspectral imaging (HSI) with neural network analysis could support +the surgeon in tissue differentiation. A 3D Convolutional Neural Network with +hyperspectral data in the range of $400-1000$ nm is used in this work. The +acquisition system consisted of two multispectral snapshot cameras creating a +stereo-HSI-system. For the analysis, 27 images with annotations of glandular +tissue, nerve, muscle, skin and vein in 18 patients undergoing parotidectomy +are included. Three patients are removed for evaluation following the +leave-one-subject-out principle. The remaining images are used for training, +with the data randomly divided into a training group and a validation group. In +the validation, an overall accuracy of $98.7\%$ is achieved, indicating robust +training. In the evaluation on the excluded patients, an overall accuracy of +$83.4\%$ has been achieved showing good detection and identification abilities. +The results clearly show that it is possible to achieve robust intraoperative +tissue differentiation using hyperspectral imaging. Especially the high +sensitivity in parotid or nerve tissue is of clinical importance. It is +interesting to note that vein was often confused with muscle. This requires +further analysis and shows that a very good and comprehensive data basis is +essential. This is a major challenge, especially in surgery. + +
+
+ comment: Accepted and presented at 58th Annual Conference of the German + Society for Biomedical Engineering in press at Current Directions in + Biomedical Engineering +
+
+
+
+
+ + ☆ MANTA: A Large-Scale Multi-View and Visual-Text Anomaly Detection + Dataset for Tiny Objects + + +
+ We present MANTA, a visual-text anomaly detection dataset for tiny objects. +The visual component comprises over 137.3K images across 38 object categories +spanning five typical domains, of which 8.6K images are labeled as anomalous +with pixel-level annotations. Each image is captured from five distinct +viewpoints to ensure comprehensive object coverage. The text component consists +of two subsets: Declarative Knowledge, including 875 words that describe common +anomalies across various domains and specific categories, with detailed +explanations for <what, why, how>, including causes and visual +characteristics; and Constructivist Learning, providing 2K multiple-choice +questions with varying levels of difficulty, each paired with images and +corresponding answer explanations. We also propose a baseline for visual-text +tasks and conduct extensive benchmarking experiments to evaluate advanced +methods across different settings, highlighting the challenges and efficacy of +our dataset. + +
+
+ comment: https://grainnet.github.io/MANTA +
+
+
+
+
+ + ☆ GS-Matching: Reconsidering Feature Matching task in Point Cloud + Registration + + +
+ Traditional point cloud registration (PCR) methods for feature matching often +employ the nearest neighbor policy. This leads to many-to-one matches and +numerous potential inliers without any corresponding point. Recently, some +approaches have framed the feature matching task as an assignment problem to +achieve optimal one-to-one matches. We argue that the transition to the +assignment problem is not reliable for general correspondence-based PCR. In +this paper, we propose a heuristic stable matching policy called GS-matching, +inspired by the Gale-Shapley algorithm. Compared to other matching +policies, our method performs efficiently and finds more non-repetitive +inliers under low-overlap conditions. Furthermore, we employ +probability theory to analyze the feature matching task, providing new insights +into this research problem. Extensive experiments validate the effectiveness of +our matching policy, achieving better registration recall on multiple datasets. + +
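For reference, a Gale-Shapley style stable matching over a feature-similarity matrix can be sketched as below; this is a simplified stand-in for the GS-matching policy, with the proposer/acceptor roles and toy similarities as assumptions.

```python
# Hedged sketch: Gale-Shapley stable matching on a similarity matrix, yielding
# one-to-one source/target correspondences.
import numpy as np

def gale_shapley_matching(similarity: np.ndarray):
    """similarity: (n_src, n_tgt). Sources 'propose' to targets in similarity order;
    each target keeps its best proposer so far. Returns a list of (src, tgt) pairs."""
    n_src, n_tgt = similarity.shape
    preference = np.argsort(-similarity, axis=1)       # each source's ranked targets
    next_choice = np.zeros(n_src, dtype=int)           # next target index to propose to
    engaged_to = -np.ones(n_tgt, dtype=int)            # current partner of each target
    free = list(range(n_src))
    while free:
        s = free.pop()
        if next_choice[s] >= n_tgt:
            continue                                   # source exhausted its preferences
        t = preference[s, next_choice[s]]
        next_choice[s] += 1
        current = engaged_to[t]
        if current == -1:
            engaged_to[t] = s
        elif similarity[s, t] > similarity[current, t]:
            engaged_to[t] = s                          # target trades up
            free.append(current)
        else:
            free.append(s)                             # rejected, propose again later
    return [(s, t) for t, s in enumerate(engaged_to) if s != -1]

# Toy usage: cosine similarities between 5 source and 6 target descriptors.
sim = np.random.rand(5, 6)
matches = gale_shapley_matching(sim)                   # one-to-one, stable under sim
```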
+
+
+
+
+ + ☆ SleeperMark: Towards Robust Watermark against Fine-Tuning Text-to-image + Diffusion Models + + +
+ Recent advances in large-scale text-to-image (T2I) diffusion models have +enabled a variety of downstream applications, including style customization, +subject-driven personalization, and conditional generation. As T2I models +require extensive data and computational resources for training, they +constitute highly valued intellectual property (IP) for their legitimate +owners, yet this also makes them attractive targets for unauthorized fine-tuning by +adversaries seeking to leverage these models for customized, usually profitable +applications. Existing IP protection methods for diffusion models generally +involve embedding watermark patterns and then verifying ownership through +examination of generated outputs, or inspecting the model's feature space. +However, these techniques are inherently ineffective in practical scenarios +when the watermarked model undergoes fine-tuning, and the feature space is +inaccessible during verification (i.e., the black-box setting). The model is prone +to forgetting the previously learned watermark knowledge when it adapts to a +new task. To address this challenge, we propose SleeperMark, a novel framework +designed to embed resilient watermarks into T2I diffusion models. SleeperMark +explicitly guides the model to disentangle the watermark information from the +semantic concepts it learns, allowing the model to retain the embedded +watermark while continuing to be fine-tuned for new downstream tasks. Our +extensive experiments demonstrate the effectiveness of SleeperMark across +various types of diffusion models, including latent diffusion models (e.g., +Stable Diffusion) and pixel diffusion models (e.g., DeepFloyd-IF), showing +robustness against downstream fine-tuning and various attacks at both the image +and model levels, with minimal impact on the model's generative capability. The +code is available at https://github.com/taco-group/SleeperMark. + +
+
+
+
+
+ + ☆ UniMLVG: Unified Framework for Multi-view Long Video Generation with + Comprehensive Control Capabilities for Autonomous Driving + + +
+ The creation of diverse and realistic driving scenarios has become essential +to enhance perception and planning capabilities of the autonomous driving +system. However, generating long-duration, surround-view consistent driving +videos remains a significant challenge. To address this, we present UniMLVG, a +unified framework designed to generate extended street multi-perspective videos +under precise control. By integrating single- and multi-view driving videos +into the training data, our approach updates cross-frame and cross-view modules +across three stages with different training objectives, substantially boosting +the diversity and quality of generated visual content. Additionally, we employ +the explicit viewpoint modeling in multi-view video generation to effectively +improve motion transition consistency. Capable of handling various input +reference formats (e.g., text, images, or video), our UniMLVG generates +high-quality multi-view videos according to the corresponding condition +constraints such as 3D bounding boxes or frame-level text descriptions. +Compared to the best models with similar capabilities, our framework achieves +improvements of 21.4% in FID and 36.5% in FVD. + +
+
+
+
+
+ + ☆ Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards + for Visuomotor Robot Policy Alignment + + +
+ Visuomotor robot policies, increasingly pre-trained on large-scale datasets, +promise significant advancements across robotics domains. However, aligning +these policies with end-user preferences remains a challenge, particularly when +the preferences are hard to specify. While reinforcement learning from human +feedback (RLHF) has become the predominant mechanism for alignment in +non-embodied domains like large language models, it has not seen the same +success in aligning visuomotor policies due to the prohibitive amount of human +feedback required to learn visual reward functions. To address this limitation, +we propose Representation-Aligned Preference-based Learning (RAPL), an +observation-only method for learning visual rewards from significantly less +human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback +on fine-tuning pre-trained vision encoders to align with the end-user's visual +representation and then constructs a dense visual reward via feature matching +in this aligned representation space. We first validate RAPL through simulation +experiments in the X-Magical benchmark and Franka Panda robotic manipulation, +demonstrating that it can learn rewards aligned with human preferences, more +efficiently uses preference data, and generalizes across robot embodiments. +Finally, our hardware experiments align pre-trained Diffusion Policies for +three object manipulation tasks. We find that RAPL can fine-tune these policies +with 5x less real human preference data, taking the first step towards +minimizing human feedback while maximizing visuomotor robot policy alignment. + +
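+ For intuition, a minimal sketch of how a dense visual reward could be built
+by feature matching in an aligned representation space; the encoder interface,
+goal-image conditioning, and plain L2 distance are assumptions for the example
+rather than RAPL's actual formulation.
+
+```python
+import torch
+import torch.nn as nn
+
+def dense_visual_reward(encoder: nn.Module,
+                        observation: torch.Tensor,
+                        goal_image: torch.Tensor) -> torch.Tensor:
+    """Toy dense reward: negative distance between the current observation and
+    a goal image in the embedding space of a (preference-aligned) encoder."""
+    with torch.no_grad():
+        obs_feat = encoder(observation)     # (B, D) features of current frames
+        goal_feat = encoder(goal_image)     # (B, D) features of the desired outcome
+    return -torch.norm(obs_feat - goal_feat, dim=-1)   # higher reward when closer
+```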
+
+ comment: Submitted to IJRR, this paper is an extended journal version of the + conference paper arXiv:2310.07932 with new results and discussion. arXiv + admin note: substantial text overlap with arXiv:2310.07932 +
+
+
+
+
+ + ☆ Customized Generation Reimagined: Fidelity and Editability Harmonized ECCV 2024 + + +
+ Customized generation aims to incorporate a novel concept into a pre-trained
+text-to-image model, enabling new generations of the concept in novel contexts
+guided by textual prompts. However, customized generation suffers from an
+inherent trade-off between concept fidelity and editability, i.e., between
+precisely modeling the concept and faithfully adhering to the prompts. Previous
+methods reluctantly seek a compromise and struggle to achieve both high concept
+fidelity and ideal prompt alignment simultaneously. In this paper, we propose a
+Divide, Conquer, then Integrate (DCI) framework, which performs a surgical
+adjustment in the early stage of denoising to liberate the fine-tuned model
+from the fidelity-editability trade-off at inference. The two conflicting
+components in the trade-off are decoupled and individually conquered by two
+collaborative branches, which are then selectively integrated to preserve high
+concept fidelity while achieving faithful prompt adherence. To obtain a better
+fine-tuned model, we introduce an Image-specific Context Optimization (ICO)
+strategy for model customization. ICO replaces manual prompt templates with
+learnable image-specific contexts, providing an adaptive and precise
+fine-tuning direction to promote the overall performance. Extensive experiments
+demonstrate the effectiveness of our method in reconciling the
+fidelity-editability trade-off.
+
+
+
+ comment: 18 pages, 12 figures, ECCV 2024 +
+
+
+
+
+ + ☆ DAug: Diffusion-based Channel Augmentation for Radiology Image Retrieval + and Classification + + +
+ Medical image understanding requires meticulous examination of fine visual +details, with particular regions requiring additional attention. While +radiologists build such expertise over years of experience, it is challenging +for AI models to learn where to look with limited amounts of training data. +This limitation results in unsatisfying robustness in medical image +understanding. To address this issue, we propose Diffusion-based Feature +Augmentation (DAug), a portable method that improves a perception model's +performance with a generative model's output. Specifically, we extend a +radiology image to multiple channels, with the additional channels being the +heatmaps of regions where diseases tend to develop. A diffusion-based +image-to-image translation model was used to generate such heatmaps conditioned +on selected disease classes. Our method is motivated by the fact that +generative models learn the distribution of normal and abnormal images, and +such knowledge is complementary to image understanding tasks. In addition, we +propose the Image-Text-Class Hybrid Contrastive learning to utilize both text +and class labels. With two novel approaches combined, our method surpasses +baseline models without changing the model architecture, and achieves +state-of-the-art performance on both medical image retrieval and classification +tasks. + +
+
+
+
+
+ + ☆ PanoDreamer: 3D Panorama Synthesis from a Single Image + + +
+ In this paper, we present PanoDreamer, a novel method for producing a +coherent 360$^\circ$ 3D scene from a single input image. Unlike existing +methods that generate the scene sequentially, we frame the problem as +single-image panorama and depth estimation. Once the coherent panoramic image +and its corresponding depth are obtained, the scene can be reconstructed by +inpainting the small occluded regions and projecting them into 3D space. Our +key contribution is formulating single-image panorama and depth estimation as +two optimization tasks and introducing alternating minimization strategies to +effectively solve their objectives. We demonstrate that our approach +outperforms existing techniques in single-image 360$^\circ$ scene +reconstruction in terms of consistency and overall quality. + +
+
+ comment: Project page: https://people.engr.tamu.edu/nimak/Papers/PanoDreamer, + Code: https://github.com/avinashpaliwal/PanoDreamer +
+
+
+
+
+ + ☆ Pushing Rendering Boundaries: Hard Gaussian Splatting + + +
+ 3D Gaussian Splatting (3DGS) has demonstrated impressive Novel View Synthesis
+(NVS) results with real-time rendering. During training, it relies heavily on
+the average magnitude of view-space positional gradients to grow Gaussians and
+reduce rendering loss. However, this averaging smooths the positional gradients
+from different viewpoints and the rendering errors from different pixels,
+hindering the growth and optimization of many defective Gaussians. This leads
+to strong spurious artifacts in some areas. To address this problem, we propose
+Hard Gaussian Splatting, dubbed HGS, which considers multi-view significant
+positional gradients and rendering errors to grow hard Gaussians that fill the
+gaps of classical Gaussian Splatting on 3D scenes, thus achieving superior NVS
+results. In detail, we present positional gradient driven HGS, which leverages
+multi-view significant positional gradients to uncover hard Gaussians.
+Moreover, we propose rendering error guided HGS, which identifies noticeable
+pixel rendering errors and potentially over-large Gaussians to jointly mine
+hard Gaussians. By growing and optimizing these hard Gaussians, our method
+helps to resolve blurring and needle-like artifacts. Experiments on various
+datasets demonstrate that our method achieves state-of-the-art rendering
+quality while maintaining real-time efficiency.
+
+
+
+
+
+
+ + ☆ LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment + + +
+ Recent advancements in text-to-video (T2V) generative models have shown
+impressive capabilities. However, these models are still inadequate in aligning
+synthesized videos with human preferences (e.g., accurately reflecting text
+descriptions), which is particularly difficult to address, as human preferences
+are inherently subjective and challenging to formalize as objective functions.
+Therefore, this paper proposes LiFT, a novel fine-tuning method leveraging
+human feedback for T2V model alignment. Specifically, we first construct a
+Human Rating Annotation dataset, LiFT-HRA, consisting of approximately 10k
+human annotations, each including a score and its corresponding rationale.
+Based on this, we train a reward model, LiFT-Critic, to learn the reward
+function effectively; it serves as a proxy for human judgment, measuring the
+alignment between given videos and human expectations. Lastly, we leverage the
+learned reward function to align the T2V model by maximizing the
+reward-weighted likelihood. As a case study, we apply our pipeline to
+CogVideoX-2B, showing that the fine-tuned model outperforms CogVideoX-5B
+across all 16 metrics, highlighting the potential of human feedback in
+improving the alignment and quality of synthesized videos.
+
+
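+ As a hedged sketch of what "maximizing the reward-weighted likelihood" can
+look like in code: per-sample likelihood terms are weighted by critic scores
+before being summed into a loss. The softmax weighting, temperature, and
+variable names are illustrative assumptions, not the paper's exact objective.
+
+```python
+import torch
+
+def reward_weighted_loss(log_probs: torch.Tensor,
+                         rewards: torch.Tensor,
+                         beta: float = 1.0) -> torch.Tensor:
+    """Toy reward-weighted likelihood objective.
+
+    log_probs: per-sample log-likelihood proxies of generated videos under the
+    model (for diffusion models, a denoising-loss-based surrogate).
+    rewards: scalar scores from a learned critic (e.g., a LiFT-Critic-like model).
+    """
+    weights = torch.softmax(beta * rewards, dim=0).detach()   # normalize critic scores
+    return -(weights * log_probs).sum()                       # minimizing maximizes the weighted likelihood
+```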
+
+ comment: project page: https://codegoat24.github.io/LiFT +
+
+
+
+
+ + ☆ Automatic Prediction of Stroke Treatment Outcomes: Latest Advances and + Perspectives + + +
+ Stroke is a major global health problem that causes mortality and morbidity.
+Predicting the outcomes of stroke intervention can facilitate clinical
+decision-making and improve patient care. Developing deep learning techniques
+can help to analyse large and diverse medical data, including brain scans,
+medical reports, and other sensor information such as EEG, ECG, and EMG.
+Despite the common data standardisation challenge within the medical image
+analysis domain, the future of deep learning in stroke outcome prediction lies
+in using multimodal information, including final infarct data, to achieve
+better prediction of long-term functional outcomes. This article provides a
+broad review of recent advances and applications of deep learning in the
+prediction of stroke outcomes, including (i) the data and models used, (ii) the
+prediction tasks and measures of success, (iii) the current challenges and
+limitations, and (iv) future directions and potential benefits. This
+comprehensive review aims to provide researchers, clinicians, and policy makers
+with an up-to-date understanding of this rapidly evolving and promising field.
+
+
+
+ comment: The paper is under consideration at Biomedical Engineering Letters + (Springer) +
+
+
+
+
+ + ☆ Modality Decoupling is All You Need: A Simple Solution for Unsupervised + Hyperspectral Image Fusion + + +
+ Hyperspectral Image Fusion (HIF) aims to fuse low-resolution hyperspectral
+images (LR-HSIs) and high-resolution multispectral images (HR-MSIs) to
+reconstruct high spatial and high spectral resolution images. Current methods
+typically apply direct fusion of the two modalities without valid supervision,
+failing to fully perceive the deep modality-complementary information and
+hence resulting in a superficial understanding of inter-modality connections.
+To bridge this gap, we propose a simple and effective solution for unsupervised
+HIF with the assumption that modality decoupling is essential for HIF. We
+introduce a modality clustering loss that provides clear guidance for the
+modality decoupling, steering the model towards modality-shared features while
+steering clear of modality-complementary ones. We also propose an end-to-end
+Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework that decouples
+shared and complementary information across modalities and aggregates a
+concise representation of the LR-HSI and HR-MSI to reduce the modality
+redundancy. Systematic experiments over multiple datasets demonstrate that our
+simple and effective approach consistently outperforms the existing HIF
+methods while requiring considerably fewer parameters and reduced inference
+time.
+
+
+
+
+
+
+ + ☆ DrIFT: Autonomous Drone Dataset with Integrated Real and Synthetic Data, + Flexible Views, and Transformed Domains WACV2025 + + +
+ Dependable visual drone detection is crucial for the secure integration of +drones into the airspace. However, drone detection accuracy is significantly +affected by domain shifts due to environmental changes, varied points of view, +and background shifts. To address these challenges, we present the DrIFT +dataset, specifically developed for visual drone detection under domain shifts. +DrIFT includes fourteen distinct domains, each characterized by shifts in point +of view, synthetic-to-real data, season, and adverse weather. DrIFT uniquely +emphasizes background shift by providing background segmentation maps to enable +background-wise metrics and evaluation. Our new uncertainty estimation metric, +MCDO-map, features lower postprocessing complexity, surpassing traditional +methods. We use the MCDO-map in our uncertainty-aware unsupervised domain +adaptation method, demonstrating superior performance to SOTA unsupervised +domain adaptation techniques. The dataset is available at: +https://github.com/CARG-uOttawa/DrIFT.git. + +
+
+ comment: WACV2025 +
+
+
+
+
+ + ☆ Slicing Vision Transformer for Flexible Inference NeurIPS 2024 + + +
+ Vision Transformers (ViTs) are known for their scalability. In this work, we
+aim to scale down a ViT so that it fits in an environment with dynamically
+changing resource constraints. We observe that smaller ViTs are intrinsically
+sub-networks of a larger ViT with different widths. Thus, we propose a general
+framework, named Scala, to enable a single network to represent multiple
+smaller ViTs with flexible inference capability, which aligns with the inherent
+design of ViTs to vary in width. Concretely, Scala activates several subnets
+during training, introduces Isolated Activation to disentangle the smallest
+sub-network from other subnets, and leverages Scale Coordination to ensure each
+sub-network receives simplified, steady, and accurate learning objectives.
+Comprehensive empirical validations on different tasks demonstrate that with
+only one-shot training, Scala learns slimmable representations without
+modifying the original ViT structure and matches the performance of Separate
+Training. Compared with the prior art, Scala achieves an average improvement
+of 1.6% on ImageNet-1K with fewer parameters.
+
+
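+ To make the "smaller ViTs are sub-networks of a larger ViT" view concrete,
+here is a small illustrative helper that slices a trained linear layer to a
+fraction of its width; it is a simplification for intuition, not Scala's
+training procedure, and the function name and slicing rule are assumptions.
+
+```python
+import torch
+import torch.nn as nn
+
+def sliced_linear(layer: nn.Linear, width_ratio: float) -> nn.Linear:
+    """Build a narrower Linear layer that reuses the first output channels of
+    a trained layer, mimicking a width-wise sub-network of the full model."""
+    out_keep = max(1, int(layer.out_features * width_ratio))
+    sub = nn.Linear(layer.in_features, out_keep, bias=layer.bias is not None)
+    with torch.no_grad():
+        sub.weight.copy_(layer.weight[:out_keep])       # keep the leading rows of the weight
+        if layer.bias is not None:
+            sub.bias.copy_(layer.bias[:out_keep])
+    return sub
+```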
+
+ comment: Accepted by NeurIPS 2024 +
+
+
+
+
+ + ☆ KNN-MMD: Cross Domain Wi-Fi Sensing Based on Local Distribution + Alignment + + +
+ As a key technology in Integrated Sensing and Communications (ISAC), Wi-Fi
+sensing has gained widespread application in various settings such as homes,
+offices, and public spaces. By analyzing the patterns of Channel State
+Information (CSI), we can obtain information about people's actions for tasks
+like person identification, gesture recognition, and fall detection. However,
+the CSI is heavily influenced by the environment, such that even minor
+environmental changes can significantly alter the CSI patterns. This causes
+performance deterioration and even failure when a Wi-Fi sensing model trained
+in one environment is applied to another. To address this problem, we
+introduce a K-Nearest Neighbors Maximum Mean Discrepancy (KNN-MMD) model, a
+few-shot method for cross-domain Wi-Fi sensing. We propose a local distribution
+alignment method within each category, which outperforms traditional Domain
+Adaptation (DA) methods based on global alignment. Besides, our method can
+determine when to stop training, which most DA methods cannot. As a result,
+our method is more stable and can be better used in practice. The effectiveness
+of our method is evaluated on several cross-domain Wi-Fi sensing tasks,
+including gesture recognition, person identification, fall detection, and
+action recognition, using both a public dataset and a self-collected dataset.
+In the one-shot scenario, our method achieves accuracies of 93.26%, 81.84%,
+77.62%, and 75.30% on the four tasks, respectively. To facilitate future
+research, we will make our code and dataset publicly available upon
+publication.
+
+
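+ For reference, a minimal sketch of the Maximum Mean Discrepancy that a
+per-category (local) alignment scheme would minimize between source and target
+features; the RBF kernel with a single fixed bandwidth is an illustrative
+assumption rather than the paper's configuration.
+
+```python
+import torch
+
+def mmd_rbf(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
+    """Squared MMD with an RBF kernel between source features x and target
+    features y (e.g., CSI features of a single gesture class)."""
+    def k(a, b):
+        d2 = torch.cdist(a, b) ** 2                 # pairwise squared distances
+        return torch.exp(-d2 / (2 * sigma ** 2))
+    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
+```
+In a local alignment setting, this quantity would be computed and minimized
+separately for each category instead of once over the whole feature set.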
+
+
+
+
+ + ☆ Megatron: Evasive Clean-Label Backdoor Attacks against Vision + Transformer + + +
+ Vision transformers have achieved impressive performance in various +vision-related tasks, but their vulnerability to backdoor attacks is +under-explored. A handful of existing works focus on dirty-label attacks with +wrongly-labeled poisoned training samples, which may fail if a benign model +trainer corrects the labels. In this paper, we propose Megatron, an evasive +clean-label backdoor attack against vision transformers, where the attacker +injects the backdoor without manipulating the data-labeling process. To +generate an effective trigger, we customize two loss terms based on the +attention mechanism used in transformer networks, i.e., latent loss and +attention diffusion loss. The latent loss aligns the last attention layer +between triggered samples and clean samples of the target label. The attention +diffusion loss emphasizes the attention diffusion area that encompasses the +trigger. A theoretical analysis is provided to underpin the rationale behind +the attention diffusion loss. Extensive experiments on CIFAR-10, GTSRB, +CIFAR-100, and Tiny ImageNet demonstrate the effectiveness of Megatron. +Megatron can achieve attack success rates of over 90% even when the position of +the trigger is slightly shifted during testing. Furthermore, Megatron achieves +better evasiveness than baselines regarding both human visual inspection and +defense strategies (i.e., DBAVT, BAVT, Beatrix, TeCo, and SAGE). + +
+
+
+
+
+ + ☆ Revitalizing Reconstruction Models for Multi-class Anomaly Detection via + Class-Aware Contrastive Learning + + +
+ For anomaly detection (AD), early approaches often train separate models for +individual classes, yielding high performance but posing challenges in +scalability and resource management. Recent efforts have shifted toward +training a single model capable of handling multiple classes. However, directly +extending early AD methods to multi-class settings often results in degraded +performance. In this paper, we analyze this degradation observed in +reconstruction-based methods, identifying two key issues: catastrophic +forgetting and inter-class confusion. To this end, we propose a plug-and-play +modification by incorporating class-aware contrastive learning (CL). By +explicitly leveraging raw object category information (e.g., carpet or wood) as +supervised signals, we apply local CL to fine-tune multiscale features and +global CL to learn more compact feature representations of normal patterns, +thereby effectively adapting the models to multi-class settings. Experiments +across four datasets (over 60 categories) verify the effectiveness of our +approach, yielding significant improvements and superior performance compared +to advanced methods. Notably, ablation studies show that even using +pseudo-class labels can achieve comparable performance. + +
+
+ comment: https://lgc-ad.github.io/ +
+
+
+
+
+ + ☆ DAWN-SI: Data-Aware and Noise-Informed Stochastic Interpolation for + Solving Inverse Problems + + +
+ Inverse problems, which involve estimating parameters from incomplete or +noisy observations, arise in various fields such as medical imaging, +geophysics, and signal processing. These problems are often ill-posed, +requiring regularization techniques to stabilize the solution. In this work, we +employ $\textit{Stochastic Interpolation}$ (SI), a generative framework that +integrates both deterministic and stochastic processes to map a simple +reference distribution, such as a Gaussian, to the target distribution. Our +method $\textbf{DAWN-SI}$: $\textbf{D}$ata-$\textbf{AW}$are and +$\textbf{N}$oise-informed $\textbf{S}$tochastic $\textbf{I}$nterpolation +incorporates data and noise embedding, allowing the model to access +representations about the measured data explicitly and also account for noise +in the observations, making it particularly robust in scenarios where data is +noisy or incomplete. By learning a time-dependent velocity field, SI not only +provides accurate solutions but also enables uncertainty quantification by +generating multiple plausible outcomes. Unlike pre-trained diffusion models, +which may struggle in highly ill-posed settings, our approach is trained +specifically for each inverse problem and adapts to varying noise levels. We +validate the effectiveness and robustness of our method through extensive +numerical experiments on tasks such as image deblurring and tomography. + +
+
+ comment: 20 pages, 11 figures, 6 tables +
+
+
+
+
+ + ☆ Latent Space Characterization of Autoencoder Variants + + +
+ Understanding the latent spaces learned by deep learning models is crucial in +exploring how they represent and generate complex data. Autoencoders (AEs) have +played a key role in the area of representation learning, with numerous +regularization techniques and training principles developed not only to enhance +their ability to learn compact and robust representations, but also to reveal +how different architectures influence the structure and smoothness of the +lower-dimensional non-linear manifold. We strive to characterize the structure +of the latent spaces learned by different autoencoders including convolutional +autoencoders (CAEs), denoising autoencoders (DAEs), and variational +autoencoders (VAEs) and how they change with the perturbations in the input. By +characterizing the matrix manifolds corresponding to the latent spaces, we +provide an explanation for the well-known observation that the latent spaces of +CAE and DAE form non-smooth manifolds, while that of VAE forms a smooth +manifold. We also map the points of the matrix manifold to a Hilbert space +using distance preserving transforms and provide an alternate view in terms of +the subspaces generated in the Hilbert space as a function of the distortion in +the input. The results show that the latent manifolds of CAE and DAE are +stratified with each stratum being a smooth product manifold, while the +manifold of VAE is a smooth product manifold of two symmetric positive definite +matrices and a symmetric positive semi-definite matrix. + +
+
+ comment: 8 pages, 6 figures, and 1 table +
+
+
+
+
+ + ☆ Machine learning algorithms to predict the risk of rupture of + intracranial aneurysms: a systematic review + + +
+ Purpose: Subarachnoid haemorrhage is a potentially fatal consequence of +intracranial aneurysm rupture, however, it is difficult to predict if aneurysms +will rupture. Prophylactic treatment of an intracranial aneurysm also involves +risk, hence identifying rupture-prone aneurysms is of substantial clinical +importance. This systematic review aims to evaluate the performance of machine +learning algorithms for predicting intracranial aneurysm rupture risk. + Methods: MEDLINE, Embase, Cochrane Library and Web of Science were searched +until December 2023. Studies incorporating any machine learning algorithm to +predict the risk of rupture of an intracranial aneurysm were included. Risk of +bias was assessed using the Prediction Model Risk of Bias Assessment Tool +(PROBAST). PROSPERO registration: CRD42023452509. Results: Out of 10,307 +records screened, 20 studies met the eligibility criteria for this review +incorporating a total of 20,286 aneurysm cases. The machine learning models +gave a 0.66-0.90 range for performance accuracy. The models were compared to +current clinical standards in six studies and gave mixed results. Most studies +posed high or unclear risks of bias and concerns for applicability, limiting +the inferences that can be drawn from them. There was insufficient homogenous +data for a meta-analysis. + Conclusions: Machine learning can be applied to predict the risk of rupture +for intracranial aneurysms. However, the evidence does not comprehensively +demonstrate superiority to existing practice, limiting its role as a clinical +adjunct. Further prospective multicentre studies of recent machine learning +tools are needed to prove clinical validation before they are implemented in +the clinic. + +
+
+ comment: Clin Neuroradiol (2024) +
+
+
+
+
+ + ☆ Decomposed Distribution Matching in Dataset Condensation + + +
+ Dataset Condensation (DC) aims to reduce deep neural network training efforts
+by synthesizing a small dataset that is as effective as the original large
+dataset. Conventionally, DC relies on a costly bi-level optimization which
+limits its practicality. Recent research formulates DC as a distribution
+matching problem which circumvents the costly bi-level optimization. However,
+this efficiency sacrifices DC performance. To investigate this performance
+degradation, we decompose the dataset distribution into content and style. Our
+observations indicate two major shortcomings: 1) a style discrepancy between
+original and condensed data, and 2) limited intra-class diversity of the
+condensed dataset. We present a simple yet effective method to match the style
+information between original and condensed data, employing statistical moments
+of feature maps as well-established style indicators. Moreover, we enhance the
+intra-class diversity by maximizing the Kullback-Leibler divergence within each
+synthetic class, i.e., content. We demonstrate the efficacy of our method
+through experiments on diverse datasets of varying size and resolution,
+achieving improvements of up to 4.1% on CIFAR10, 4.2% on CIFAR100, 4.3% on
+TinyImageNet, 2.0% on ImageNet-1K, 3.3% on ImageWoof, 2.5% on ImageNette, and
+5.5% in continual learning accuracy.
+
+
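+ As a hedged illustration of matching "statistical moments of feature maps as
+style indicators", the snippet below penalizes differences in channel-wise mean
+and standard deviation between real and condensed batches; the feature shape
+and the plain L2 penalty are assumptions for the example.
+
+```python
+import torch
+
+def style_moment_loss(real_feats: torch.Tensor,
+                      synth_feats: torch.Tensor) -> torch.Tensor:
+    """Match first- and second-order feature-map moments (channel-wise mean
+    and std) between original and condensed data; features are (B, C, H, W)."""
+    def moments(f):
+        return f.mean(dim=(0, 2, 3)), f.std(dim=(0, 2, 3))
+    mu_r, sd_r = moments(real_feats)
+    mu_s, sd_s = moments(synth_feats)
+    return ((mu_r - mu_s) ** 2).sum() + ((sd_r - sd_s) ** 2).sum()
+```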
+
+
+
+
+ + ☆ Fair Diagnosis: Leveraging Causal Modeling to Mitigate Medical Bias + + +
+ In medical image analysis, model predictions can be affected by sensitive +attributes, such as race and gender, leading to fairness concerns and potential +biases in diagnostic outcomes. To mitigate this, we present a causal modeling +framework, which aims to reduce the impact of sensitive attributes on +diagnostic predictions. Our approach introduces a novel fairness criterion, +\textbf{Diagnosis Fairness}, and a unique fairness metric, leveraging +path-specific fairness to control the influence of demographic attributes, +ensuring that predictions are primarily informed by clinically relevant +features rather than sensitive attributes. By incorporating adversarial +perturbation masks, our framework directs the model to focus on critical image +regions, suppressing bias-inducing information. Experimental results across +multiple datasets demonstrate that our framework effectively reduces bias +directly associated with sensitive attributes while preserving diagnostic +accuracy. Our findings suggest that causal modeling can enhance both fairness +and interpretability in AI-powered clinical decision support systems. + +
+
+
+
+
+ + ☆ Espresso: High Compression For Rich Extraction From Videos for Your + Vision-Language Model + + +
+ Most of the current vision-language models (VLMs) for videos struggle to +understand videos longer than a few seconds. This is primarily due to the fact +that they do not scale to utilizing a large number of frames. In order to +address this limitation, we propose Espresso, a novel method that extracts and +compresses spatial and temporal information separately. Through extensive +evaluations, we show that spatial and temporal compression in Espresso each +have a positive impact on the long-form video understanding capabilities; when +combined, their positive impact increases. Furthermore, we show that Espresso's +performance scales well with more training data, and that Espresso is far more +effective than the existing projectors for VLMs in long-form video +understanding. Moreover, we devise a more difficult evaluation setting for +EgoSchema called "needle-in-a-haystack" that multiplies the lengths of the +input videos. Espresso achieves SOTA performance on this task, outperforming +the SOTA VLMs that have been trained on much more training data. + +
+
+ comment: 11 pages +
+
+
+
+
+ + ☆ Learning to Translate Noise for Robust Image Denoising + + +
+ Deep learning-based image denoising techniques often struggle with poor +generalization performance to out-of-distribution real-world noise. To tackle +this challenge, we propose a novel noise translation framework that performs +denoising on an image with translated noise rather than directly denoising an +original noisy image. Specifically, our approach translates complex, unknown +real-world noise into Gaussian noise, which is spatially uncorrelated and +independent of image content, through a noise translation network. The +translated noisy images are then processed by an image denoising network +pretrained to effectively remove Gaussian noise, enabling robust and consistent +denoising performance. We also design well-motivated loss functions and +architectures for the noise translation network by leveraging the mathematical +properties of Gaussian noise. Experimental results demonstrate that the +proposed method substantially improves robustness and generalizability, +outperforming state-of-the-art methods across diverse benchmarks. Visualized +denoising results and the source code are available on our project page. + +
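+ For intuition, a minimal two-stage inference sketch of the idea: a translation
+network first maps real-world noise to (approximately) Gaussian noise, and a
+denoiser pretrained on Gaussian noise then removes it. The module interfaces
+and function name are assumptions, not the authors' code.
+
+```python
+import torch
+import torch.nn as nn
+
+def denoise_with_translation(noisy: torch.Tensor,
+                             translator: nn.Module,
+                             gaussian_denoiser: nn.Module) -> torch.Tensor:
+    """Stage 1: translate unknown real noise into Gaussian-like noise.
+    Stage 2: remove it with a denoiser pretrained on Gaussian noise."""
+    with torch.no_grad():
+        translated = translator(noisy)            # image now carries Gaussian-like noise
+        clean = gaussian_denoiser(translated)     # standard Gaussian denoising
+    return clean
+```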
+
+ comment: The project page is available at + https://hij1112.github.io/learning-to-translate-noise/ +
+
+
+
+
+ + ♻ ☆ A Practitioner's Guide to Continual Multimodal Pretraining NeurIPS + 2024 + + +
+ Multimodal foundation models serve numerous applications at the intersection +of vision and language. Still, despite being pretrained on extensive data, they +become outdated over time. To keep models updated, research into continual +pretraining mainly explores scenarios with either (1) infrequent, +indiscriminate updates on large-scale new data, or (2) frequent, sample-level +updates. However, practical model deployment often operates in the gap between +these two limit cases, as real-world applications often demand adaptation to +specific subdomains, tasks or concepts -- spread over the entire, varying life +cycle of a model. In this work, we complement current perspectives on continual +pretraining through a research test bed as well as provide comprehensive +guidance for effective continual model updates in such scenarios. We first +introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with +realistic compute constraints and practical deployment requirements, +constructed over 63 datasets with diverse visual and semantic coverage. Using +FoMo-in-Flux, we explore the complex landscape of practical continual +pretraining through multiple perspectives: (1) A data-centric investigation of +data mixtures and stream orderings that emulate real-world deployment +situations, (2) a method-centric investigation ranging from simple fine-tuning +and traditional continual learning strategies to parameter-efficient updates +and model merging, (3) meta learning rate schedules and mechanistic design +choices, and (4) the influence of model and compute scaling. Together, our +insights provide a practitioner's guide to continual multimodal pretraining for +real-world deployment. Our benchmark and code is here: +https://github.com/ExplainableML/fomo_in_flux. + +
+
+ comment: Technical Report. 52 pages. Shorter version published at the NeurIPS + 2024 Dataset & Benchmarks track +
+
+
+
+
+ + ♻ ☆ FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models + + +
+ The advent of foundation models (FMs) in healthcare offers unprecedented
+opportunities to enhance medical diagnostics through automated classification
+and segmentation tasks. However, these models also raise significant concerns
+about their fairness, especially when applied to diverse and underrepresented
+populations in healthcare applications. Currently, there is a lack of
+comprehensive benchmarks, standardized pipelines, and easily adaptable
+libraries to evaluate and understand the fairness performance of FMs in medical
+imaging, leading to considerable challenges in formulating and implementing
+solutions that ensure equitable outcomes across diverse patient populations. To
+fill this gap, we introduce FairMedFM, a fairness benchmark for FM research in
+medical imaging. FairMedFM integrates with 17 popular medical imaging datasets,
+encompassing different modalities, dimensionalities, and sensitive attributes.
+It explores 20 widely used FMs, with various usages such as zero-shot learning,
+linear probing, parameter-efficient fine-tuning, and prompting in various
+downstream tasks -- classification and segmentation. Our exhaustive analysis
+evaluates the fairness performance over different evaluation metrics from
+multiple perspectives, revealing the existence of bias, varied utility-fairness
+trade-offs on different FMs, consistent disparities on the same datasets
+regardless of the FM used, and limited effectiveness of existing unfairness
+mitigation methods. Check out FairMedFM's project page and open-sourced
+codebase, which support extensible functionalities and applications and are
+intended to serve long-term studies of FMs in medical imaging.
+
+
+
+ comment: 29 pages, 17 figures +
+
+
+
+
+ + ♻ ☆ Aesthetic Post-Training Diffusion Models from Generic Preferences with + Step-by-step Preference Optimization + + +
+ Generating visually appealing images is fundamental to modern text-to-image
+generation models. A potential solution to better aesthetics is direct
+preference optimization (DPO), which has been applied to diffusion models to
+improve general image quality including prompt alignment and aesthetics.
+Popular DPO methods propagate preference labels from clean image pairs to all
+the intermediate steps along the two generation trajectories. However,
+preference labels provided in existing datasets are blended with layout and
+aesthetic opinions, which may disagree with aesthetic preferences. Even if
+aesthetic labels were provided (at substantial cost), it would be hard for the
+two-trajectory methods to capture nuanced visual differences at different
+steps. To improve aesthetics economically, this paper uses existing generic
+preference data and introduces step-by-step preference optimization (SPO) that
+discards the propagation strategy and allows fine-grained image details to be
+assessed. Specifically, at each denoising step, we 1) sample a pool of
+candidates by denoising from a shared noise latent, 2) use a step-aware
+preference model to find a suitable win-lose pair to supervise the diffusion
+model, and 3) randomly select one from the pool to initialize the next
+denoising step. This strategy ensures that the diffusion model focuses on
+subtle, fine-grained visual differences instead of layout aspects. We find that
+aesthetics can be significantly enhanced by accumulating these minor
+improvements. When fine-tuning Stable Diffusion v1.5 and SDXL, SPO yields
+significant improvements in aesthetics compared with existing DPO methods while
+not sacrificing image-text alignment compared with vanilla models. Moreover,
+SPO converges much faster than DPO methods due to the step-by-step alignment of
+fine-grained visual details. Code and models are available at
+https://github.com/RockeyCoss/SPO.
+
+
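+ A schematic sketch of one SPO denoising step as described above, under the
+assumption of two callables: denoise_fn(x_t, t) returning one stochastic
+candidate and preference_fn(x, t) returning a step-aware score. Both callables,
+the pool size, and the random continuation are illustrative assumptions.
+
+```python
+import random
+import torch
+
+def spo_step(x_t: torch.Tensor, t: int, denoise_fn, preference_fn, pool_size: int = 4):
+    """One step-by-step preference optimization iteration (sketch).
+
+    Returns a (win, lose) pair for the DPO-style update at this step and the
+    latent used to continue the trajectory."""
+    candidates = [denoise_fn(x_t, t) for _ in range(pool_size)]   # shared-noise candidate pool
+    scores = [float(preference_fn(c, t)) for c in candidates]     # step-aware preference scores
+    win = candidates[scores.index(max(scores))]
+    lose = candidates[scores.index(min(scores))]
+    next_latent = random.choice(candidates)                       # random pick initializes the next step
+    return win, lose, next_latent
+```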
+
+
+
+
+ + ♻ ☆ Understanding Multi-Granularity for Open-Vocabulary Part Segmentation NeurIPS 2024 + + +
+ Open-vocabulary part segmentation (OVPS) is an emerging research area focused +on segmenting fine-grained entities using diverse and previously unseen +vocabularies. Our study highlights the inherent complexities of part +segmentation due to intricate boundaries and diverse granularity, reflecting +the knowledge-based nature of part identification. To address these challenges, +we propose PartCLIPSeg, a novel framework utilizing generalized parts and +object-level contexts to mitigate the lack of generalization in fine-grained +parts. PartCLIPSeg integrates competitive part relationships and attention +control, alleviating ambiguous boundaries and underrepresented parts. +Experimental results demonstrate that PartCLIPSeg outperforms existing +state-of-the-art OVPS methods, offering refined segmentation and an advanced +understanding of part relationships within images. Through extensive +experiments, our model demonstrated a significant improvement over the +state-of-the-art models on the Pascal-Part-116, ADE20K-Part-234, and +PartImageNet datasets. + +
+
+ comment: NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Beyond Pixels: Text Enhances Generalization in Real-World Image + Restoration + + +
+ Generalization has long been a central challenge in real-world image +restoration. While recent diffusion-based restoration methods, which leverage +generative priors from text-to-image models, have made progress in recovering +more realistic details, they still encounter "generative capability +deactivation" when applied to out-of-distribution real-world data. To address +this, we propose using text as an auxiliary invariant representation to +reactivate the generative capabilities of these models. We begin by identifying +two key properties of text input: richness and relevance, and examine their +respective influence on model performance. Building on these insights, we +introduce Res-Captioner, a module that generates enhanced textual descriptions +tailored to image content and degradation levels, effectively mitigating +response failures. Additionally, we present RealIR, a new benchmark designed to +capture diverse real-world scenarios. Extensive experiments demonstrate that +Res-Captioner significantly enhances the generalization abilities of +diffusion-based restoration models, while remaining fully plug-and-play. + +
+
+
+
+
+ + ♻ ☆ HunyuanVideo: A Systematic Framework For Large Video Generative Models + + +
+ Recent advancements in video generation have significantly impacted daily +life for both individuals and industries. However, the leading video generation +models remain closed-source, resulting in a notable performance gap between +industry capabilities and those available to the public. In this report, we +introduce HunyuanVideo, an innovative open-source video foundation model that +demonstrates performance in video generation comparable to, or even surpassing, +that of leading closed-source models. HunyuanVideo encompasses a comprehensive +framework that integrates several key elements, including data curation, +advanced architectural design, progressive model scaling and training, and an +efficient infrastructure tailored for large-scale model training and inference. +As a result, we successfully trained a video generative model with over 13 +billion parameters, making it the largest among all open-source models. We +conducted extensive experiments and implemented a series of targeted designs to +ensure high visual quality, motion dynamics, text-video alignment, and advanced +filming techniques. According to evaluations by professionals, HunyuanVideo +outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, +and three top-performing Chinese video generative models. By releasing the code +for the foundation model and its applications, we aim to bridge the gap between +closed-source and open-source communities. This initiative will empower +individuals within the community to experiment with their ideas, fostering a +more dynamic and vibrant video generation ecosystem. The code is publicly +available at https://github.com/Tencent/HunyuanVideo. + +
+
+
+
+
+ + ♻ ☆ Comparing ImageNet Pre-training with Digital Pathology Foundation Models + for Whole Slide Image-Based Survival Analysis + + +
+ The abundance of information present in Whole Slide Images (WSIs) renders
+them an essential tool for survival analysis. Several Multiple Instance
+Learning (MIL) frameworks proposed for this task utilize a ResNet50 backbone
+pre-trained on natural images. By leveraging recently released
+histopathological foundation models such as UNI and Hibou, the predictive
+prowess of existing MIL networks can be enhanced. Furthermore, deploying an
+ensemble of digital pathology foundation models yields higher baseline
+accuracy, although the benefits appear to diminish with more complex MIL
+architectures. Our code will be made publicly available upon acceptance.
+
+
+
+
+
+
+ + ♻ ☆ Fine-Tuning CLIP's Last Visual Projector: A Few-Shot Cornucopia + + +
+ We consider the problem of adapting a contrastively pretrained +vision-language model like CLIP (Radford et al., 2021) for few-shot +classification. The literature addresses this problem by learning a linear +classifier of the frozen visual features, optimizing word embeddings, or +learning external feature adapters. This paper introduces an alternative way +for CLIP adaptation without adding 'external' parameters to optimize. We find +that simply fine-tuning the last projection matrix of the vision encoder leads +to performance better than all baselines. Furthermore, we show that +regularizing training with the distance between the fine-tuned and pretrained +matrices adds reliability for adapting CLIP. This simple approach, coined +ProLIP, yields state-of-the-art performance on 11 few-shot classification +benchmarks, few-shot domain generalization, cross-dataset transfer, base-to-new +class generalization, and test-time adaptation. Code will be made available at: +https://github.com/astra-vision/ProLIP . + +
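+ To illustrate the flavor of fine-tuning only the last visual projection with a
+pull-back to the pretrained weights, here is a hedged sketch of a few-shot
+objective; the shapes, logit scale, and squared-Frobenius regularizer are
+assumptions made for the example, not the paper's exact loss.
+
+```python
+import torch
+import torch.nn.functional as F
+
+def projector_finetune_loss(image_feats: torch.Tensor,      # (B, D_in) frozen pre-projection features
+                            class_weights: torch.Tensor,    # (C, D_out) text/class embeddings
+                            labels: torch.Tensor,           # (B,) class indices
+                            proj: torch.nn.Parameter,       # (D_in, D_out) the only trainable tensor
+                            proj_pretrained: torch.Tensor,
+                            lam: float = 1.0) -> torch.Tensor:
+    """Cross-entropy on features passed through the (trainable) last projection,
+    plus a penalty keeping the projection close to its pretrained value."""
+    img = F.normalize(image_feats @ proj, dim=-1)
+    txt = F.normalize(class_weights, dim=-1)
+    ce = F.cross_entropy(100.0 * img @ txt.T, labels)         # 100 ~ a CLIP-like logit scale
+    reg = torch.norm(proj - proj_pretrained) ** 2             # stay close to the pretrained matrix
+    return ce + lam * reg
+```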
+
+
+
+
+ + ♻ ☆ GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D + Occupancy Prediction + + +
+ 3D semantic occupancy prediction is an important task for robust +vision-centric autonomous driving, which predicts fine-grained geometry and +semantics of the surrounding scene. Most existing methods leverage dense +grid-based scene representations, overlooking the spatial sparsity of the +driving scenes. Although 3D semantic Gaussian serves as an object-centric +sparse alternative, most of the Gaussians still describe the empty region with +low efficiency. To address this, we propose a probabilistic Gaussian +superposition model which interprets each Gaussian as a probability +distribution of its neighborhood being occupied and conforms to probabilistic +multiplication to derive the overall geometry. Furthermore, we adopt the exact +Gaussian mixture model for semantics calculation to avoid unnecessary +overlapping of Gaussians. To effectively initialize Gaussians in non-empty +region, we design a distribution-based initialization module which learns the +pixel-aligned occupancy distribution instead of the depth of surfaces. We +conduct extensive experiments on nuScenes and KITTI-360 datasets and our +GaussianFormer-2 achieves state-of-the-art performance with high efficiency. +Code: https://github.com/huang-yh/GaussianFormer. + +
+
+ comment: Code is available at: https://github.com/huang-yh/GaussianFormer +
+
+
+
+
+ + ♻ ☆ EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online + Scene Understanding + + +
+ 3D occupancy prediction provides a comprehensive description of the
+surrounding scenes and has become an essential task for 3D perception. Most
+existing methods focus on offline perception from one or a few views and cannot
+be applied to embodied agents, which must gradually perceive the scene through
+progressive embodied exploration. In this paper, we formulate an embodied 3D
+occupancy prediction task to target this practical scenario and propose a
+Gaussian-based EmbodiedOcc framework to accomplish it. We initialize the global
+scene with uniform 3D semantic Gaussians and progressively update local regions
+observed by the embodied agent. For each update, we extract semantic and
+structural features from the observed image and efficiently incorporate them
+via deformable cross-attention to refine the regional Gaussians. Finally, we
+employ Gaussian-to-voxel splatting to obtain the global 3D occupancy from the
+updated 3D Gaussians. Our EmbodiedOcc assumes an unknown (i.e., uniformly
+distributed) environment and maintains an explicit global memory of it with 3D
+Gaussians. It gradually gains knowledge through the local refinement of
+regional Gaussians, which is consistent with how humans understand new scenes
+through embodied exploration. We reorganize an EmbodiedOcc-ScanNet benchmark
+based on local annotations to facilitate the evaluation of the embodied 3D
+occupancy prediction task. Experiments demonstrate that our EmbodiedOcc
+outperforms existing local prediction methods and accomplishes embodied
+occupancy prediction with high accuracy and strong expandability. Code:
+https://github.com/YkiWu/EmbodiedOcc.
+
+
+
+ comment: Code: https://github.com/YkiWu/EmbodiedOcc +
+
+
+
+
+ + ♻ ☆ LUDVIG: Learning-free Uplifting of 2D Visual features to Gaussian + Splatting scenes + + +
+ We address the problem of extending the capabilities of vision foundation +models such as DINO, SAM, and CLIP, to 3D tasks. Specifically, we introduce a +novel method to uplift 2D image features into 3D Gaussian Splatting scenes. +Unlike traditional approaches that rely on minimizing a reconstruction loss, +our method employs a simpler and more efficient feature aggregation technique, +augmented by a graph diffusion mechanism. Graph diffusion enriches features +from a given model, such as CLIP, by leveraging 3D geometry and pairwise +similarities induced by another strong model such as DINOv2. Our approach +achieves performance comparable to the state of the art on multiple downstream +tasks while delivering significant speed-ups. Notably, we obtain competitive +segmentation results using generic DINOv2 features, despite DINOv2 not being +trained on millions of annotated segmentation masks like SAM. When applied to +CLIP features, our method demonstrates strong performance in open-vocabulary +object detection tasks, highlighting the versatility of our approach. + +
+
+
+
+
+ + ♻ ☆ Probabilistic Language-Image Pre-Training + + +
+ Vision-language models (VLMs) embed aligned image-text pairs into a joint +space but often rely on deterministic embeddings, assuming a one-to-one +correspondence between images and texts. This oversimplifies real-world +relationships, which are inherently many-to-many, with multiple captions +describing a single image and vice versa. We introduce Probabilistic +Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained +on a billion-scale image-text dataset using only probabilistic objectives, +achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot +accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an +"uncertainty token" without extra parameters. We also introduce a novel +inclusion loss that enforces distributional inclusion relationships between +image-text pairs and between original and masked inputs. Experiments +demonstrate that, by leveraging uncertainty estimates, ProLIP benefits +downstream tasks and aligns with intuitive notions of uncertainty, e.g., +shorter texts being more uncertain and more general inputs including specific +ones. Utilizing text uncertainties, we further improve ImageNet accuracy from +74.6% to 75.8% (under a few-shot setting), supporting the practical advantages +of our probabilistic approach. The code is available at +https://github.com/naver-ai/prolip + +
+
+ comment: Code: https://github.com/naver-ai/prolip HuggingFace Hub: + https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291 + 31 pages, 4.29 MB +
+
+
+
+
+ + ♻ ☆ Scaling Efficient Masked Image Modeling on Large Remote Sensing Dataset + + +
+ Masked Image Modeling (MIM) has become an essential method for building +foundational visual models in remote sensing (RS). However, the limitations in +size and diversity of existing RS datasets restrict the ability of MIM methods +to learn generalizable representations. Additionally, conventional MIM +techniques, which require reconstructing all tokens, introduce unnecessary +computational overhead. To address these issues, we present a new pre-training +pipeline for RS models, featuring the creation of a large-scale RS dataset and +an efficient MIM approach. We curated a high-quality dataset named +OpticalRS-13M by collecting publicly available RS datasets and processing them +through exclusion, slicing, and deduplication. OpticalRS-13M comprises 13 +million optical images covering various RS tasks, such as object detection and +pixel segmentation. To enhance efficiency, we propose SelectiveMAE, a +pre-training method that dynamically encodes and reconstructs semantically rich +patch tokens, thereby reducing the inefficiencies of traditional MIM models +caused by redundant background pixels in RS images. Extensive experiments +demonstrate that OpticalRS-13M significantly improves classification, +detection, and segmentation performance, while SelectiveMAE increases training +efficiency over 2 times. This highlights the effectiveness and scalability of +our pipeline in developing RS foundational models. + +
+
+
+
+
+ + ♻ ☆ MC-NeRF: Multi-Camera Neural Radiance Fields for Multi-Camera Image + Acquisition Systems + + +
+ Neural Radiance Fields (NeRF) use multi-view images for 3D scene
+representation, demonstrating remarkable performance. As one of the primary
+sources of multi-view images, multi-camera systems encounter challenges such as
+varying intrinsic parameters and frequent pose changes. Most previous
+NeRF-based methods assume a single camera and rarely consider multi-camera
+scenarios. Besides, some NeRF methods that can optimize intrinsic and extrinsic
+parameters remain susceptible to suboptimal solutions when these parameters are
+poorly initialized. In this paper, we propose MC-NeRF, a method that enables
+joint optimization of both intrinsic and extrinsic parameters alongside NeRF.
+The method also supports each image corresponding to independent camera
+parameters. First, we tackle the coupling issue and the degenerate cases that
+arise from the joint optimization of intrinsic and extrinsic parameters.
+Second, based on the proposed solutions, we introduce an efficient calibration
+image acquisition scheme for multi-camera systems, including the design of the
+calibration object. Finally, we present an end-to-end network with a training
+sequence that enables the estimation of intrinsic and extrinsic parameters,
+along with the rendering network. Furthermore, recognizing that most existing
+datasets are designed for a single camera, we construct a real multi-camera
+image acquisition system and create a corresponding new dataset, which includes
+both simulated data and real-world captured images. Experiments confirm the
+effectiveness of our method when each image corresponds to different camera
+parameters. Specifically, we use multiple cameras, each with different
+intrinsic and extrinsic parameters, in a real-world system to achieve 3D scene
+representation without providing initial poses.
+
+
+
+ comment: This manuscript is currently under review +
+
+
+
+
+ + ♻ ☆ Leveraging Bi-Focal Perspectives and Granular Feature Integration for + Accurate Reliable Early Alzheimer's Detection + + +
+ Alzheimer's disease (AD) is the most common neurodegenerative disease,
+diagnosed annually in millions of patients. Current medical practice still
+faces challenges in the exact diagnosis and classification of AD from
+neuroimaging data. Traditional CNNs can extract a good amount of low-level
+information from an image but fail to capture high-level, fine-grained
+details, which is a significant challenge in detecting AD from MRI scans. To
+overcome this, we propose a novel Granular Feature Integration method that
+combines information extraction at different scales with an efficient
+information flow, enabling the model to capture both broad and fine-grained
+features simultaneously. We also propose a Bi-Focal Perspective mechanism to
+highlight the subtle neurofibrillary tangles and amyloid plaques in the MRI
+scans, ensuring that critical pathological markers are accurately identified.
+Our model achieved an F1-Score of 99.31%, precision of 99.24%, and recall of
+99.51%. These scores show that our model performs significantly better than
+existing state-of-the-art (SOTA) CNNs.
+
+
+
+ comment: 14 pages, 12 figures, 6 tables +
+
+
+
+
+ + ♻ ☆ OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary + Understanding NeurIPS2024 + + +
+ This paper introduces OpenGaussian, a method based on 3D Gaussian Splatting
+(3DGS) capable of 3D point-level open vocabulary understanding. Our primary
+motivation stems from observing that existing 3DGS-based open vocabulary
+methods mainly focus on 2D pixel-level parsing. These methods struggle with 3D
+point-level tasks due to weak feature expressiveness and inaccurate 2D-3D
+feature associations. To ensure robust feature representation and 3D
+point-level understanding, we first employ SAM masks without cross-frame
+associations to train instance features with 3D consistency. These features
+exhibit both intra-object consistency and inter-object distinction. Then, we
+propose a two-stage codebook to discretize these features from coarse to fine
+levels. At the coarse level, we consider the positional information of 3D
+points to achieve location-based clustering, which is then refined at the fine
+level. Finally, we introduce an instance-level 3D-2D feature association method
+that links 3D points to 2D masks, which are further associated with 2D CLIP
+features. Extensive experiments, including open vocabulary-based 3D object
+selection, 3D point cloud understanding, click-based 3D object selection, and
+ablation studies, demonstrate the effectiveness of our proposed method. The
+source code is available at our project page:
+https://3d-aigc.github.io/OpenGaussian
+
+
+
+ comment: NeurIPS2024 +
+
+
+
+
+ + ♻ ☆ MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal + Large Language Models + + +
+ Despite the superior capabilities of Multimodal Large Language Models (MLLMs) +across diverse tasks, they still face significant trustworthiness challenges. +Yet, current literature on the assessment of trustworthy MLLMs remains limited, +lacking a holistic evaluation to offer thorough insights into future +improvements. In this work, we establish MultiTrust, the first comprehensive +and unified benchmark on the trustworthiness of MLLMs across five primary +aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark +employs a rigorous evaluation strategy that addresses both multimodal risks and +cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. +Extensive experiments with 21 modern MLLMs reveal some previously unexplored +trustworthiness issues and risks, highlighting the complexities introduced by +the multimodality and underscoring the necessity for advanced methodologies to +enhance their reliability. For instance, typical proprietary models still +struggle with the perception of visually confusing images and are vulnerable to +multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to +disclose privacy in text and reveal ideological and cultural biases even when +paired with irrelevant images in inference, indicating that the multimodality +amplifies the internal risks from base LLMs. Additionally, we release a +scalable toolbox for standardized trustworthiness research, aiming to +facilitate future advancements in this important field. Code and resources are +publicly available at: https://multi-trust.github.io/. + +
+
+ comment: 100 pages, 84 figures, 33 tables +
+
+
+
+
+ + ♻ ☆ LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing + Layer Execution Order + + +
+ Due to their architecture and how they are trained, artificial neural
+networks are typically not robust toward pruning or shuffling layers at test
+time. However, such properties would be desirable for different applications,
+such as distributed neural network architectures where the order of execution
+cannot be guaranteed or parts of the network can fail during inference. In this
+work, we address these issues through a number of training approaches for
+vision transformers whose most important component is randomizing the execution
+order of attention modules at training time. With our proposed approaches,
+vision transformers are capable of adapting to arbitrary layer execution orders
+at test time, assuming one tolerates a reduction (about 20%) in accuracy at the
+same model size. We analyse the feature representations of our trained models
+as well as how each layer contributes to the model's prediction based on its
+position during inference. Our analysis shows that layers learn to contribute
+differently based on their position in the network. Finally, we layer-prune our
+models at test time and find that their performance declines gracefully. Code
+available at https://github.com/matfrei/layershuffle.
+
+
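+ A minimal sketch of the core training trick, randomizing block execution
+order, assuming a plain stack of residual transformer blocks; the class name
+and the absence of any position-conditioning are simplifications for the
+example.
+
+```python
+import random
+import torch
+import torch.nn as nn
+
+class ShuffledEncoder(nn.Module):
+    """Run a stack of transformer blocks in a random order at training time so
+    the network learns layers that tolerate arbitrary execution orders."""
+    def __init__(self, blocks: nn.ModuleList):
+        super().__init__()
+        self.blocks = blocks
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        order = list(range(len(self.blocks)))
+        if self.training:
+            random.shuffle(order)                # new execution order every training step
+        for idx in order:
+            x = self.blocks[idx](x)
+        return x
+```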
+
+
+
+
+ + ♻ ☆ Cross-modal semantic segmentation for indoor environmental perception + using single-chip millimeter-wave radar raw data + + +
+ In the context of firefighting and rescue operations, a cross-modal semantic +segmentation model based on a single-chip millimeter-wave (mmWave) radar for +indoor environmental perception is proposed and discussed. To efficiently +obtain high-quality labels, an automatic label generation method utilizing +LiDAR point clouds and occupancy grid maps is introduced. The proposed +segmentation model is based on U-Net. A spatial attention module is +incorporated, which enhances the performance of the model. The results +demonstrate that cross-modal semantic segmentation provides a more intuitive +and accurate representation of indoor environments. Unlike traditional methods, +the model's segmentation performance is minimally affected by azimuth. Although +performance declines with increasing distance, this can be mitigated by a +well-designed model. Additionally, it was found that using raw ADC data as +input is ineffective; compared to RA tensors, RD tensors are more suitable for +the proposed model. + +
+
+ comment: 5291 words, 17 pages, 11 figures +
+
+
+
+
+ + ♻ ☆ NVComposer: Boosting Generative Novel View Synthesis with Multiple + Sparse and Unposed Images + + +
+ Recent advancements in generative models have significantly improved novel +view synthesis (NVS) from multi-view data. However, existing methods depend on +external multi-view alignment processes, such as explicit pose estimation or +pre-reconstruction, which limits their flexibility and accessibility, +especially when alignment is unstable due to insufficient overlap or occlusions +between views. In this paper, we propose NVComposer, a novel approach that +eliminates the need for explicit external alignment. NVComposer enables the +generative model to implicitly infer spatial and geometric relationships +between multiple conditional views by introducing two key components: 1) an +image-pose dual-stream diffusion model that simultaneously generates target +novel views and condition camera poses, and 2) a geometry-aware feature +alignment module that distills geometric priors from dense stereo models during +training. Extensive experiments demonstrate that NVComposer achieves +state-of-the-art performance in generative multi-view NVS tasks, removing the +reliance on external alignment and thus improving model accessibility. Our +approach shows substantial improvements in synthesis quality as the number of +unposed input views increases, highlighting its potential for more flexible and +accessible generative NVS systems. Our project page is available at +https://lg-li.github.io/project/nvcomposer + +
+
+ comment: Project Page: https://lg-li.github.io/project/nvcomposer +
+
+
+
+
+ + ♻ ☆ Bridging Text and Image for Artist Style Transfer via Contrastive + Learning + + +
+ Image style transfer has attracted widespread attention in the past few +years. Despite its remarkable results, it requires additional style images +available as references, making it less flexible and inconvenient. Using text +is the most natural way to describe the style. More importantly, text can +describe implicit abstract styles, like styles of specific artists or art +movements. In this paper, we propose a Contrastive Learning for Artistic Style +Transfer (CLAST) that leverages advanced image-text encoders to control +arbitrary style transfer. We introduce a supervised contrastive training +strategy to effectively extract style descriptions from the image-text model +(i.e., CLIP), which aligns stylization with the text description. To this end, +we also propose a novel and efficient adaLN-based state space model that +explores style-content fusion. Finally, we achieve text-driven image style +transfer. Extensive experiments demonstrate that our approach outperforms the +state-of-the-art methods in artistic style transfer. More importantly, it does +not require online fine-tuning and can render a 512x512 image in 0.03s. + +
+
+ comment: 18 pages, 8 figures. arXiv admin note: substantial text overlap with + arXiv:2202.13562 +
+
+
+
+
+ + ♻ ☆ StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing + + +
+ A significant research effort is focused on exploiting the amazing capacities +of pretrained diffusion models for the editing of images. They either finetune +the model, or invert the image in the latent space of the pretrained model. +However, they suffer from two problems: (1) Unsatisfying results for selected +regions and unexpected changes in non-selected regions. (2) They require careful +text prompt editing where the prompt should include all visual objects in the +input image. To address this, we propose two improvements: (1) Only optimizing +the input of the value linear network in the cross-attention layers is +sufficiently powerful to reconstruct a real image. (2) We propose attention +regularization to preserve the object-like attention maps after reconstruction +and editing, enabling us to obtain accurate style editing without invoking +significant structural changes. We further improve the editing technique that +is used for the unconditional branch of classifier-free guidance as used by +P2P. Extensive experimental prompt-editing results on a variety of images +demonstrate qualitatively and quantitatively that our method has superior +editing capabilities compared to existing and concurrent works. See our +accompanying code in Stylediffusion: +\url{https://github.com/sen-mao/StyleDiffusion}. + +
+
+ comment: Accepted by Computational Visual Media
+
+
+
+
+ + ♻ ☆ Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ + Documents + + +
+ Large multimodal models (LMMs) have achieved impressive progress in +vision-language understanding, yet they face limitations in real-world +applications requiring complex reasoning over a large number of images. +Existing benchmarks for multi-image question-answering are limited in scope: +each question is paired with only up to 30 images, which does not fully capture +the demands of large-scale retrieval tasks encountered in real-world +usage. To reduce these gaps, we introduce two document haystack benchmarks, +dubbed DocHaystack and InfoHaystack, designed to evaluate LMM performance on +large-scale visual document retrieval and understanding. Additionally, we +propose V-RAG, a novel, vision-centric retrieval-augmented generation (RAG) +framework that leverages a suite of multimodal vision encoders, each optimized +for specific strengths, and a dedicated question-document relevance module. +V-RAG sets a new standard, with a 9% and 11% improvement in Recall@1 on the +challenging DocHaystack-1000 and InfoHaystack-1000 benchmarks, respectively, +compared to the previous best baseline models. Additionally, integrating V-RAG +with LMMs enables them to efficiently operate across thousands of images, +yielding significant improvements on our DocHaystack and InfoHaystack +benchmarks. Our code and datasets are available at +https://github.com/Vision-CAIR/dochaystacks + +
+
+ comment: the correct arxiv version +
+
+
+
+
+ + ♻ ☆ GameGen-X: Interactive Open-world Game Video Generation + + +
+ We introduce GameGen-X, the first diffusion transformer model specifically +designed for both generating and interactively controlling open-world game +videos. This model facilitates high-quality, open-domain generation by +simulating an extensive array of game engine features, such as innovative +characters, dynamic environments, complex actions, and diverse events. +Additionally, it provides interactive controllability, predicting and altering +future content based on the current clip, thus allowing for gameplay +simulation. To realize this vision, we first collected and built an Open-World +Video Game Dataset from scratch. It is the first and largest dataset for +open-world game video generation and control, which comprises over a million +diverse gameplay video clips sampling from over 150 games with informative +captions from GPT-4o. GameGen-X undergoes a two-stage training process, +consisting of foundation model pre-training and instruction tuning. Firstly, +the model was pre-trained via text-to-video generation and video continuation, +endowing it with the capability for long-sequence, high-quality open-domain +game video generation. Further, to achieve interactive controllability, we +designed InstructNet to incorporate game-related multi-modal control signal +experts. This allows the model to adjust latent representations based on user +inputs, unifying character interaction and scene content control for the first +time in video generation. During instruction tuning, only the InstructNet is +updated while the pre-trained foundation model is frozen, enabling the +integration of interactive controllability without loss of diversity and +quality of generated video content. + +
+
+ comment: Homepage: https://gamegen-x.github.io/ Github: + https://github.com/GameGen-X/GameGen-X +
+
+
+
+
+ + ♻ ☆ Quantum-Hybrid Stereo Matching With Nonlinear Regularization and Spatial + Pyramids 3DV + + +
+ Quantum visual computing is advancing rapidly. This paper presents a new +formulation for stereo matching with nonlinear regularizers and spatial +pyramids on quantum annealers as a maximum a posteriori inference problem that +minimizes the energy of a Markov Random Field. Our approach is hybrid (i.e., +quantum-classical) and is compatible with modern D-Wave quantum annealers, +i.e., it includes a quadratic unconstrained binary optimization (QUBO) +objective. Previous quantum annealing techniques for stereo matching are +limited to using linear regularizers, and thus, they do not exploit the +fundamental advantages of the quantum computing paradigm in solving +combinatorial optimization problems. In contrast, our method utilizes the full +potential of quantum annealing for stereo matching, as nonlinear regularizers +create optimization problems which are NP-hard. On the Middlebury benchmark, we +achieve an improved root mean squared accuracy over the previous state of the +art in quantum stereo matching of 2% and 22.5% when using different solvers. + +
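+ For readers unfamiliar with the setup, MAP inference over a pairwise Markov Random Field and its encoding as a QUBO can be written schematically as follows; the concrete data term, the nonlinear regularizer, and the binary encoding used by the paper are not reproduced here.
+
+     $$ \min_{d} \; \sum_{p} D_p(d_p) \;+\; \lambda \sum_{(p,q)\in\mathcal{N}} V(d_p, d_q) \qquad\Longrightarrow\qquad \min_{x \in \{0,1\}^n} \; x^{\top} Q\, x , $$
+
+ where $d_p$ is the disparity at pixel $p$, $D_p$ the data term, $V$ the (possibly nonlinear) pairwise regularizer over the neighborhood $\mathcal{N}$, and $x$ a binary encoding of the disparities whose interactions are collected in the QUBO matrix $Q$.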
+
+ comment: 26 pages, 15 figures. To be published in the International Conference + on 3D Vision (3DV) 2024 +
+
+
+
+
+ + ♻ ☆ Enhancing Dynamic CT Image Reconstruction with Neural Fields and Optical + Flow + + +
+ In this paper, we investigate image reconstruction for dynamic Computed +Tomography. The motion of the target with respect to the measurement +acquisition rate leads to measurements that are highly resolved in time but highly undersampled in +space. Such problems pose a major challenge: not accounting for +the dynamics of the process leads to a poor reconstruction with non-realistic +motion. Variational approaches that penalize time evolution have been proposed +to relate subsequent frames and improve image quality based on classical +grid-based discretizations. Neural fields have emerged as a novel way to +parameterize the quantity of interest using a neural network with a +low-dimensional input, benefiting from being lightweight, continuous, and +biased towards smooth representations. The latter property has been exploited +when solving dynamic inverse problems with neural fields by minimizing a +data-fidelity term only. We investigate and show the benefits of introducing +explicit motion regularizers for dynamic inverse problems based on partial +differential equations, namely, the optical flow equation, for the optimization +of neural fields. We compare it against its unregularized counterpart and show +the improvements in the reconstruction. We also compare neural fields against a +grid-based solver and show that the former outperforms the latter in terms of +PSNR in this task. + +
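+ Schematically, the regularized objective described above can be written as below; the notation, the weight $\lambda$, and the joint parameterization of image and motion are assumptions for illustration ($u_\theta$ is the neural-field image, $v_\phi$ a velocity field, $A_t$ the per-frame forward operator, $y_t$ the measurements).
+
+     $$ \min_{\theta,\phi} \; \sum_{t} \big\| A_t\, u_\theta(\cdot, t) - y_t \big\|_2^2 \;+\; \lambda \int \big( \partial_t u_\theta + v_\phi \cdot \nabla u_\theta \big)^2 \, dx\, dt , $$
+
+ where the second term penalizes violations of the optical flow equation $\partial_t u + v \cdot \nabla u = 0$.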
+
+
+
+
+ + ♻ ☆ Retina-Inspired Object Motion Segmentation for Event-Cameras + + +
+ Event-cameras have emerged as a revolutionary technology with a high temporal +resolution that far surpasses standard active pixel cameras. This technology +draws biological inspiration from photoreceptors and the initial retinal +synapse. This research showcases the potential of additional retinal +functionalities to extract visual features. We provide a domain-agnostic and +efficient algorithm for ego-motion compensation based on Object Motion +Sensitivity (OMS), one of the multiple features computed within the mammalian +retina. We develop a method based on experimental neuroscience that translates +OMS' biological circuitry to a low-overhead algorithm to suppress camera motion, +bypassing the need for deep networks and learning. Our system processes event +data from dynamic scenes to perform pixel-wise object motion segmentation using +real and synthetic datasets. This paper introduces a bio-inspired computer +vision method that dramatically reduces the number of parameters by +factors of $10^3$ to $10^6$ compared to +previous approaches. Our work paves the way for robust, high-speed, and +low-bandwidth decision-making for in-sensor computations. + +
+
+
+
+
+ + ♻ ☆ $\textit{X}^2$-DFD: A framework for e${X}$plainable and e${X}$tendable + Deepfake Detection + + +
+ Detecting deepfakes has become an important task. Most existing detection +methods provide only real/fake predictions without offering +human-comprehensible explanations. Recent studies leveraging MLLMs for deepfake +detection have shown improvements in explainability. However, the performance +of pre-trained MLLMs (e.g., LLaVA) remains limited due to a lack of +understanding of their capabilities for this task and strategies to enhance +them. In this work, we empirically assess the strengths and weaknesses of MLLMs +specifically in deepfake detection via forgery feature analysis. Building on +these assessments, we propose a novel framework called ${X}^2$-DFD, consisting +of three core modules. The first module, Model Feature Assessment (MFA), +measures the detection capabilities of forgery features intrinsic to MLLMs, and +gives a descending ranking of these features. The second module, Strong Feature +Strengthening (SFS), enhances the detection and explanation capabilities by +fine-tuning the MLLM on a dataset constructed based on the top-ranked features. +The third module, Weak Feature Supplementing (WFS), improves the fine-tuned +MLLM's capabilities on lower-ranked features by integrating external dedicated +deepfake detectors. To verify the effectiveness of this framework, we further +present a practical implementation, where an automated forgery feature +generation, evaluation, and ranking procedure is designed for the MFA module; an +automated generation procedure of the fine-tuning dataset containing real and +fake images with explanations based on top-ranked features is developed for the SFS +module; and an external conventional deepfake detector focusing on blending +artifacts, which corresponds to a low detection capability in the pre-trained +MLLM, is integrated for the WFS module. Experiments show that our approach enhances +both detection and explanation performance. + +
+
+
+
+
+ + ♻ ☆ Transition Rate Scheduling for Quantization-Aware Training + + +
+ Quantization-aware training (QAT) simulates a quantization process during +training to lower bit-precision of weights/activations. It learns quantized +weights indirectly by updating latent weights, i.e., full-precision inputs to a +quantizer, using gradient-based optimizers. We claim that coupling a +user-defined learning rate (LR) with these optimizers is sub-optimal for QAT. +Quantized weights transit discrete levels of a quantizer only if the corresponding +latent weights pass transition points, where the quantizer changes discrete +states. This suggests that the changes of quantized weights are affected by +both the LR for latent weights and their distributions. It is thus difficult to +control the degree of change of quantized weights by scheduling the LR +manually. We conjecture that the degree of parameter changes in QAT is related +to the number of quantized weights transiting discrete levels. Based on this, +we introduce a transition rate (TR) scheduling technique that controls the +number of transitions of quantized weights explicitly. Instead of scheduling an +LR for latent weights, we schedule a target TR of quantized weights, and update +the latent weights with a novel transition-adaptive LR (TALR), which takes the +degree of change of the quantized weights into account during QAT. +Experimental results demonstrate the effectiveness of our approach on standard +benchmarks. + +
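+ A rough sketch of the idea, not the authors' algorithm: one can measure the fraction of quantized weights that change level after an update (the transition rate) and nudge the latent-weight step size toward a target value. The toy quantizer, the multiplicative correction, and all constants below are assumptions.
+
+     import torch
+
+     def quantize(w, step=0.05):
+         # toy uniform quantizer applied to latent (full-precision) weights
+         return torch.round(w / step) * step
+
+     def tr_scheduled_step(latent_w, grad, lr, target_tr=0.01, gain=0.5):
+         q_before = quantize(latent_w)
+         latent_w = latent_w - lr * grad
+         q_after = quantize(latent_w)
+         tr = (q_before != q_after).float().mean().item()   # observed transition rate
+         # simple multiplicative correction pushing the observed TR toward the target
+         lr = lr * (1.0 + gain * (target_tr - tr) / max(target_tr, 1e-8))
+         return latent_w, lr
+
+     w, lr = torch.randn(1000) * 0.1, 1e-3
+     w, lr = tr_scheduled_step(w, torch.randn(1000) * 0.01, lr)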
+
+
+
+
+ + ♻ ☆ EvaGaussians: Event Stream Assisted Gaussian Splatting from Blurry + Images + + +
+ 3D Gaussian Splatting (3D-GS) has demonstrated exceptional capabilities in 3D +scene reconstruction and novel view synthesis. However, its training heavily +depends on high-quality, sharp images and accurate camera poses. Fulfilling +these requirements can be challenging in non-ideal real-world scenarios, where +motion-blurred images are commonly encountered in high-speed moving cameras or +low-light environments that require long exposure times. To address these +challenges, we introduce Event Stream Assisted Gaussian Splatting +(EvaGaussians), a novel approach that integrates event streams captured by an +event camera to assist in reconstructing high-quality 3D-GS from blurry images. +Capitalizing on the high temporal resolution and dynamic range offered by the +event camera, we leverage the event streams to explicitly model the formation +process of motion-blurred images and guide the deblurring reconstruction of +3D-GS. By jointly optimizing the 3D-GS parameters and recovering camera motion +trajectories during the exposure time, our method can robustly facilitate the +acquisition of high-fidelity novel views with intricate texture details. We +comprehensively evaluated our method and compared it with previous +state-of-the-art deblurring rendering methods. Both qualitative and +quantitative comparisons demonstrate that our method surpasses existing +techniques in restoring fine details from blurry images and producing +high-fidelity novel views. + +
+
+ comment: Project Page: https://www.falcary.com/EvaGaussians/ +
+
+
+
+
+ + ♻ ☆ NitroFusion: High-Fidelity Single-Step Diffusion through Dynamic + Adversarial Training + + +
+ We introduce NitroFusion, a fundamentally different approach to single-step +diffusion that achieves high-quality generation through a dynamic adversarial +framework. While one-step methods offer dramatic speed advantages, they +typically suffer from quality degradation compared to their multi-step +counterparts. Just as a panel of art critics provides comprehensive feedback by +specializing in different aspects like composition, color, and technique, our +approach maintains a large pool of specialized discriminator heads that +collectively guide the generation process. Each discriminator group develops +expertise in specific quality aspects at different noise levels, providing +diverse feedback that enables high-fidelity one-step generation. Our framework +combines: (i) a dynamic discriminator pool with specialized discriminator +groups to improve generation quality, (ii) strategic refresh mechanisms to +prevent discriminator overfitting, and (iii) global-local discriminator heads +for multi-scale quality assessment, and unconditional/conditional training for +balanced generation. Additionally, our framework uniquely supports flexible +deployment through bottom-up refinement, allowing users to dynamically choose +between 1-4 denoising steps with the same model for direct quality-speed +trade-offs. Through comprehensive experiments, we demonstrate that NitroFusion +significantly outperforms existing single-step methods across multiple +evaluation metrics, particularly excelling in preserving fine details and +global consistency. + +
+
+
+
+
+ + ♻ ☆ I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt + Generation for Text-Guided Multi-Mask Inpainting WACV 2025 + + +
+ Inpainting focuses on filling missing or corrupted regions of an image to +blend seamlessly with its surrounding content and style. While conditional +diffusion models have proven effective for text-guided inpainting, we introduce +the novel task of multi-mask inpainting, where multiple regions are +simultaneously inpainted using distinct prompts. Furthermore, we design a +fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate +multi-mask prompts automatically using corrupted images as inputs. These models +can generate helpful and detailed prompt suggestions for filling the masked +regions. The generated prompts are then fed to Stable Diffusion, which is +fine-tuned for the multi-mask inpainting problem using rectified +cross-attention, enforcing prompts onto their designated regions for filling. +Experiments on digitized paintings from WikiArt and the Densely Captioned +Images dataset demonstrate that our pipeline delivers creative and accurate +inpainting results. Our code, data, and trained models are available at +https://cilabuniba.github.io/i-dream-my-painting. + +
+
+ comment: Accepted at WACV 2025 +
+
+
+
+
+ + ♻ ☆ Evolutive Rendering Models + + +
+ The landscape of computer graphics has undergone significant transformations +with the recent advances of differentiable rendering models. These rendering +models often rely on heuristic designs that may not fully align with the final +rendering objectives. We address this gap by pioneering \textit{evolutive +rendering models}, a methodology where rendering models possess the ability to +evolve and adapt dynamically throughout the rendering process. In particular, +we present a comprehensive learning framework that enables the optimization of +three principal rendering elements, including the gauge transformations, the +ray sampling mechanisms, and the primitive organization. Central to this +framework is the development of differentiable versions of these rendering +elements, allowing for effective gradient backpropagation from the final +rendering objectives. A detailed analysis of gradient characteristics is +performed to facilitate a stable and goal-oriented elements evolution. Our +extensive experiments demonstrate the large potential of evolutive rendering +models for enhancing the rendering performance across various domains, +including static and dynamic scene representations, generative modeling, and +texture mapping. + +
+
+ comment: Project page: https://fnzhan.com/Evolutive-Rendering-Models/ +
+
+
+
+
+ + ♻ ☆ Docling Technical Report AAAI 25 + + +
+ We introduce Docling, an easy-to-use, self-contained, MIT-licensed, +open-source toolkit for document conversion that can parse several types of +popular document formats into a unified, richly structured representation. It +is powered by state-of-the-art specialized AI models for layout analysis +(DocLayNet) and table structure recognition (TableFormer), and runs efficiently +on commodity hardware within a small resource budget. Docling is released as a +Python package and can be used as a Python API or as a CLI tool. Docling's +modular architecture and efficient document representation, known as +DoclingDocument, make it easy to implement extensions, new features, models, +and customizations. Docling has already been integrated into other popular +open-source frameworks (e.g., LlamaIndex, LangChain, spaCy), making it a +natural fit for the processing of documents and the development of high-end +applications. The open-source community has fully engaged in using, promoting, +and developing for Docling, which gathered 10k stars on GitHub in less than a +month and was reported as the No. 1 trending repository in GitHub worldwide in +November 2024. + +
+
+ comment: Submitted to AAAI 25: Workshop on Open-Source AI for Mainstream Use +
+
+
+
+
+ + ♻ ☆ Memory-efficient Continual Learning with Neural Collapse Contrastive WACV 2025 + + +
+ Contrastive learning has significantly improved representation quality, +enhancing knowledge transfer across tasks in continual learning (CL). However, +catastrophic forgetting remains a key challenge, as contrastive based methods +primarily focus on "soft relationships" or "softness" between samples, which +shift with changing data distributions and lead to representation overlap +across tasks. Recently, the newly identified Neural Collapse phenomenon has +shown promise in CL by focusing on "hard relationships" or "hardness" between +samples and fixed prototypes. However, this approach overlooks "softness", +crucial for capturing intra-class variability, and this rigid focus can also +pull old class representations toward current ones, increasing forgetting. +Building on these insights, we propose Focal Neural Collapse Contrastive +(FNC^2), a novel representation learning loss that effectively balances both +soft and hard relationships. Additionally, we introduce the Hardness-Softness +Distillation (HSD) loss to progressively preserve the knowledge gained from +these relationships across tasks. Our method outperforms state-of-the-art +approaches, particularly in minimizing memory reliance. Remarkably, even +without the use of memory, our approach rivals rehearsal-based methods, +offering a compelling solution for data privacy concerns. + +
+
+ comment: Accepted at WACV 2025 +
+
+
+
+
+ + ♻ ☆ ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for + Complicated Chart Reasoning + + +
+ Recently, many versatile Multi-modal Large Language Models (MLLMs) have +emerged continuously. However, their capacity to query information depicted in +visual charts and engage in reasoning based on the queried contents remains +under-explored. In this paper, to comprehensively and rigorously benchmark the +ability of the off-the-shelf MLLMs in the chart domain, we construct ChartX, a +multi-modal evaluation set covering 18 chart types, 7 chart tasks, 22 +disciplinary topics, and high-quality chart data. Besides, we develop ChartVLM +to offer a new perspective on handling multi-modal tasks that strongly depend +on interpretable patterns, such as reasoning tasks in the field of charts or +geometric images. We evaluate the chart-related ability of mainstream MLLMs and +our ChartVLM on the proposed ChartX evaluation set. Extensive experiments +demonstrate that ChartVLM surpasses both versatile and chart-related large +models, achieving results comparable to GPT-4V. We believe that our study can +pave the way for further exploration in creating a more comprehensive chart +evaluation set and developing more interpretable multi-modal models. Both +ChartX and ChartVLM are available at: +https://github.com/UniModal4Reasoning/ChartVLM + +
+
+ comment: Code and dataset are available for downloading at: + https://github.com/UniModal4Reasoning/ChartVLM 25 pages, 15 figures +
+
+
+
+
+ + ♻ ☆ Open-Canopy: A Country-Scale Benchmark for Canopy Height Estimation at + Very High Resolution CVPR25 + + +
+ Estimating canopy height and its changes at meter resolution from satellite +imagery is a significant challenge in computer vision with critical +environmental applications. However, the lack of open-access datasets at this +resolution hinders the reproducibility and evaluation of models. We introduce +Open-Canopy, the first open-access, country-scale benchmark for very +high-resolution (1.5 m) canopy height estimation, covering over 87,000 km$^2$ +across France with 1.5 m resolution satellite imagery and aerial LiDAR data. +Additionally, we present Open-Canopy-$\Delta$, a benchmark for canopy height +change detection between images from different years at tree level-a +challenging task for current computer vision models. We evaluate +state-of-the-art architectures on these benchmarks, highlighting significant +challenges and opportunities for improvement. Our datasets and code are +publicly available at https://github.com/fajwel/Open-Canopy. + +
+
+ comment: 25 pages, 6+6 figures, Submitted to CVPR25 +
+
+
+
+
+ + ♻ ☆ SCMM: Calibrating Cross-modal Representations for Text-Based Person + Search + + +
+ Text-Based Person Search (TBPS) is a crucial task in the Internet of Things +(IoT) domain that enables accurate retrieval of target individuals from +large-scale galleries given only a textual caption. For cross-modal TBPS +tasks, it is critical to obtain well-distributed representations in the common +embedding space to reduce the inter-modal gap. Furthermore, learning detailed +image-text correspondences is essential to discriminate similar targets and +enable fine-grained search. To address these challenges, we present a simple +yet effective method named Sew Calibration and Masked Modeling (SCMM) that +calibrates cross-modal representations by learning compact and well-aligned +embeddings. SCMM introduces two novel losses for fine-grained cross-modal +representations: a Sew calibration loss that aligns image and text features based +on textual caption quality, and a Masked Caption Modeling (MCM) loss that +establishes detailed relationships between textual and visual parts. This +dual-pronged strategy enhances feature alignment and cross-modal +correspondences, enabling accurate distinction of similar individuals while +maintaining a streamlined dual-encoder architecture for real-time inference, +which is essential for resource-limited sensors and IoT systems. Extensive +experiments on three popular TBPS benchmarks demonstrate the superiority of +SCMM, achieving 73.81%, 64.25%, and 57.35% Rank-1 accuracy on CUHK-PEDES, +ICFG-PEDES, and RSTPReID, respectively. + +
+
+ comment: 10 pages, 7 figures +
+
+
+
+
+ + ♻ ☆ Noise Self-Regression: A New Learning Paradigm to Enhance Low-Light + Images Without Task-Related Data + + +
+ Deep learning-based low-light image enhancement (LLIE) is a task of +leveraging deep neural networks to enhance the image illumination while keeping +the image content unchanged. From the perspective of training data, existing +methods complete the LLIE task driven by one of the following three data types: +paired data, unpaired data and zero-reference data. Each type of these +data-driven methods has its own advantages, e.g., zero-reference data-based +methods have very low requirements on training data and can meet human +needs in many scenarios. In this paper, we leverage pure Gaussian noise to +complete the LLIE task, which further reduces the requirements for training +data in LLIE tasks and can be used as another alternative in practical use. +Specifically, we propose Noise SElf-Regression (NoiSER), which, without access to any +task-related data, simply learns a convolutional neural network equipped with +an instance-normalization layer by taking a random noise image, +$\mathcal{N}(0,\sigma^2)$ for each pixel, as both input and output for each +training pair; the low-light image is then fed to the trained network to +predict the normal-light image. Technically, an intuitive explanation for +its effectiveness is as follows: 1) the self-regression reconstructs the +contrast between adjacent pixels of the input image, 2) the +instance-normalization layer may naturally remediate the overall +magnitude/lighting of the input image, and 3) the $\mathcal{N}(0,\sigma^2)$ +assumption for each pixel enforces the output image to follow the well-known +gray-world hypothesis when the image size is big enough. Compared to current +state-of-the-art LLIE methods with access to different task-related data, +NoiSER is highly competitive in enhancement quality, yet with a much smaller +model size, and much lower training and inference cost. Besides, NoiSER also +excels in mitigating overexposure and handling joint tasks. + +
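+ The recipe above is concrete enough to sketch end to end; the network size, noise scale, optimizer, and iteration count below are illustrative assumptions rather than the paper's settings.
+
+     import torch
+     import torch.nn as nn
+
+     # tiny CNN with instance normalization, trained purely on Gaussian noise images
+     net = nn.Sequential(
+         nn.Conv2d(1, 16, 3, padding=1), nn.InstanceNorm2d(16), nn.ReLU(),
+         nn.Conv2d(16, 16, 3, padding=1), nn.InstanceNorm2d(16), nn.ReLU(),
+         nn.Conv2d(16, 1, 3, padding=1),
+     )
+     opt = torch.optim.Adam(net.parameters(), lr=1e-3)
+
+     for _ in range(200):                              # self-regression: input == target
+         noise = 0.1 * torch.randn(8, 1, 64, 64)
+         loss = nn.functional.mse_loss(net(noise), noise)
+         opt.zero_grad(); loss.backward(); opt.step()
+
+     # at inference, a low-light image (scaled to a comparable range) would be fed
+     # to the trained network to predict the enhanced, normal-light image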
+
+
+
+
+ + ♻ ☆ Interactive Occlusion Boundary Estimation through Exploitation of + Synthetic Data + + +
+ Occlusion boundaries (OBs) geometrically localize the occlusion events in a +2D image, and contain useful information for addressing various scene +understanding problems. To advance their study, we have investigated +the following three aspects. Firstly, we have studied interactive estimation +of OBs, the first such study in the literature, and proposed an efficient +deep-network-based method using multiple-scribble intervention, named DNMMSI, +which significantly improves the performance over the state-of-the-art +fully-automatic methods. Secondly, we propose to exploit a synthetic +benchmark for training, thanks to the particularity that OBs are determined +geometrically and unambiguously from the 3D scene. To this end, we have +developed an efficient tool, named Mesh2OB, for the automatic generation of 2D +images together with their ground-truth OBs, using which we have constructed a +synthetic benchmark, named OB-FUTURE. Abundant experimental results demonstrate +that leveraging such a synthetic benchmark for training achieves promising +performance, even without the use of domain adaptation techniques. Finally, to +achieve a more compelling and robust evaluation in OB-related research, we have +created a real-world benchmark OB-LabName, consisting of 120 high-resolution +images together with their ground-truth OBs, with precision surpassing that of +previous benchmarks. We will release DNMMSI with pre-trained parameters, +Mesh2OB, OB-FUTURE, and OB-LabName to support further research. + +
+
+ comment: 11 pages, 4 figures, 8 tables +
+
+
+
+
+ + ♻ ☆ Transferring disentangled representations: bridging the gap between + synthetic and real images NeurIPS + + +
+ Developing meaningful and efficient representations that separate the +fundamental structure of the data generation mechanism is crucial in +representation learning. However, Disentangled Representation Learning has not +fully shown its potential on real images, because of correlated generative +factors, their resolution and limited access to ground truth labels. +Specifically on the latter, we investigate the possibility of leveraging +synthetic data to learn general-purpose disentangled representations applicable +to real data, discussing the effect of fine-tuning and what properties of +disentanglement are preserved after the transfer. We provide an extensive +empirical study to address these issues. In addition, we propose a new +interpretable intervention-based metric, to measure the quality of factors +encoding in the representation. Our results indicate that some level of +disentanglement, transferring a representation from synthetic to real data, is +possible and effective. + +
+
+ comment: Accepted to NeurIPS, 2024 +
+
+
+
+
+ + ♻ ☆ BodyMetric: Evaluating the Realism of Human Bodies in Text-to-Image + Generation + + +
+ Accurately generating images of human bodies from text remains a challenging +problem for state of the art text-to-image models. Commonly observed +body-related artifacts include extra or missing limbs, unrealistic poses, +blurred body parts, etc. Currently, evaluation of such artifacts relies heavily +on time-consuming human judgments, limiting the ability to benchmark models at +scale. We address this by proposing BodyMetric, a learnable metric that +predicts body realism in images. BodyMetric is trained on realism labels and +multi-modal signals including 3D body representations inferred from the input +image, and textual descriptions. In order to facilitate this approach, we +design an annotation pipeline to collect expert ratings on human body realism +leading to a new dataset for this task, namely, BodyRealism. Ablation studies +support our architectural choices for BodyMetric and the importance of +leveraging a 3D human body prior in capturing body-related artifacts in 2D +images. In comparison to concurrent metrics which evaluate general user +preference in images, BodyMetric specifically reflects body-related artifacts. +We demonstrate the utility of BodyMetric through applications that were +previously infeasible at scale. In particular, we use BodyMetric to benchmark +the generation ability of text-to-image models to produce realistic human +bodies. We also demonstrate the effectiveness of BodyMetric in ranking +generated images based on the predicted realism scores. + +
+
+
+
+
+ + ♻ ☆ Tencent Hunyuan3D-1.0: A Unified Framework for Text-to-3D and + Image-to-3D Generation + + +
+ While 3D generative models have greatly improved artists' workflows, the +existing diffusion models for 3D generation suffer from slow generation and +poor generalization. To address this issue, we propose a two-stage approach +named Hunyuan3D-1.0, including a lite version and a standard version, both of which +support text- and image-conditioned generation. In the first stage, we employ a +multi-view diffusion model that efficiently generates multi-view RGB in +approximately 4 seconds. These multi-view images capture rich details of the 3D +asset from different viewpoints, relaxing the task from single-view to +multi-view reconstruction. In the second stage, we introduce a feed-forward +reconstruction model that rapidly and faithfully reconstructs the 3D asset +given the generated multi-view images in approximately 7 seconds. The +reconstruction network learns to handle noise and inconsistency introduced by +the multi-view diffusion and leverages the available information from the +condition image to efficiently recover the 3D structure. Our framework involves +the text-to-image model, i.e., Hunyuan-DiT, making it a unified framework to +support both text- and image-conditioned 3D generation. Our standard version +has 3x more parameters than our lite version and other existing models. Our +Hunyuan3D-1.0 achieves an impressive balance between speed and quality, +significantly reducing generation time while maintaining the quality and +diversity of the produced assets. + +
+
+ comment: Technical Report; 3D Generation +
+
+
+
+
+ + ♻ ☆ MagicTailor: Component-Controllable Personalization in Text-to-Image + Diffusion Models + + +
+ Recent text-to-image models generate high-quality images from text prompts +but lack precise control over specific components within visual concepts. +Therefore, we introduce component-controllable personalization, a new task that +allows users to customize and reconfigure individual components within +concepts. This task faces two challenges: semantic pollution, where undesirable +elements distort the concept, and semantic imbalance, which leads to +disproportionate learning of the target concept and component. To address +these, we design MagicTailor, a framework that uses Dynamic Masked Degradation +to adaptively perturb unwanted visual semantics and Dual-Stream Balancing for +more balanced learning of desired visual semantics. The experimental results +show that MagicTailor outperforms existing methods in this task and enables +more personalized, nuanced, and creative image generation. + +
+
+ comment: Project page: https://correr-zhou.github.io/MagicTailor +
+
+
+
+
+ + ♻ ☆ ShadowHack: Hacking Shadows via Luminance-Color Divide and Conquer + + +
+ Shadows introduce challenges such as reduced brightness, texture +deterioration, and color distortion in images, complicating a holistic +solution. This study presents ShadowHack, a divide-and-conquer strategy that +tackles these complexities by decomposing the original task into luminance +recovery and color remedy. To brighten shadow regions and repair the corrupted +textures in the luminance space, we customize LRNet, a U-shaped network with a +rectified outreach attention module, to enhance information interaction and +recalibrate contaminated attention maps. With luminance recovered, CRNet then +leverages cross-attention mechanisms to revive vibrant colors, producing +visually compelling results. Extensive experiments on multiple datasets are +conducted to demonstrate the superiority of ShadowHack over existing +state-of-the-art solutions both quantitatively and qualitatively, highlighting +the effectiveness of our design. Our code will be made publicly available at +https://github.com/lime-j/ShadowHack + +
+
+
+
+
+ + ♻ ☆ MobileFlow: A Multimodal LLM For Mobile GUI Agent + + +
+ Currently, the integration of mobile Graphical User Interfaces (GUIs) is +ubiquitous in most people's daily lives. The ongoing evolution of +multimodal large-scale models, such as GPT-4v and Qwen-VL-Max, has significantly +bolstered the capabilities of GUI comprehension and user action analysis, +showcasing the potential of intelligent GUI assistants. However, current GUI +Agents often need to access page layout information by calling system +APIs, which may pose privacy risks. Fixing GUIs (such as mobile interfaces) to a +certain low resolution might result in the loss of fine-grained image details. +At the same time, the multimodal large models built for GUI Agents currently +have poor understanding and decision-making abilities for Chinese GUI +interfaces, making them difficult to apply to a large number of Chinese apps. +This paper introduces MobileFlow, a multimodal large language model +meticulously crafted for mobile GUI agents. Adapted from the open-source +model Qwen-VL-Chat to the GUI domain, MobileFlow contains approximately 21 +billion parameters and is equipped with novel hybrid visual encoders, enabling +variable-resolution image inputs and good support for +multilingual GUIs. By incorporating Mixture of Experts (MoE) expansions and +pioneering alignment training strategies, MobileFlow has the capacity to fully +interpret image data and comprehend user instructions for GUI interaction +tasks. Finally, MobileFlow outperforms Qwen-VL-Max and GPT-4v in terms of task +execution by GUI agents on both public and our proposed evaluation metrics, and +has been successfully deployed in real-world business contexts, proving its +effectiveness for practical applications. + +
+
+
+
+
+ + ♻ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware + Omni-Modal Perception of Long Videos + + +
+ Despite impressive advancements in video understanding, most efforts remain +limited to coarse-grained or visual-only video tasks. However, real-world +videos encompass omni-modal information (vision, audio, and speech) with a +series of events forming a cohesive storyline. The lack of multi-modal video +data with fine-grained event annotations and the high cost of manual labeling +are major obstacles to comprehensive omni-modality video perception. To address +this gap, we propose an automatic pipeline consisting of high-quality +multi-modal video filtering, semantically coherent omni-modal event boundary +detection, and cross-modal correlation-aware event captioning. In this way, we +present LongVALE, the first-ever Vision-Audio-Language Event understanding +benchmark comprising 105K omni-modal events with precise temporal boundaries +and detailed relation-aware captions within 8.4K high-quality long videos. +Further, we build a baseline that leverages LongVALE to enable video large +language models (LLMs) for omni-modality fine-grained temporal video +understanding for the first time. Extensive experiments demonstrate the +effectiveness and great potential of LongVALE in advancing comprehensive +multi-modal video understanding. + +
+
+ comment: 18 pages, 15 figures +
+
+
+
+
+ + ♻ ☆ ChromaDistill: Colorizing Monochrome Radiance Fields with Knowledge + Distillation WACV 2025 + + +
+ Colorization is a well-explored problem in the domains of image and video +processing. However, extending colorization to 3D scenes presents significant +challenges. Recent Neural Radiance Field (NeRF) and Gaussian-Splatting(3DGS) +methods enable high-quality novel-view synthesis for multi-view images. +However, the question arises: How can we colorize these 3D representations? +This work presents a method for synthesizing colorized novel views from input +grayscale multi-view images. Using image or video colorization methods to +colorize novel views from these 3D representations naively will yield output +with severe inconsistencies. We introduce a novel method to use powerful image +colorization models for colorizing 3D representations. We propose a +distillation-based method that transfers color from these networks trained on +natural images to the target 3D representation. Notably, this strategy does not +add any additional weights or computational overhead to the original +representation during inference. Extensive experiments demonstrate that our +method produces high-quality colorized views for indoor and outdoor scenes, +showcasing significant cross-view consistency advantages over baseline +approaches. Our method is agnostic to the underlying 3D representation and +easily generalizable to NeRF and 3DGS methods. Further, we validate the +efficacy of our approach in several diverse applications: 1.) Infra-Red (IR) +multi-view images and 2.) Legacy grayscale multi-view image sequences. Project +Webpage: https://val.cds.iisc.ac.in/chroma-distill.github.io/ + +
+
+ comment: WACV 2025, AI3DCC @ ICCV 2023 +
+
+
+
+
+ + ♻ ☆ SJTU:Spatial judgments in multimodal models towards unified segmentation + through coordinate detection + + +
+ Despite significant advances in vision-language understanding, implementing +image segmentation within multimodal architectures remains a fundamental +challenge in modern artificial intelligence systems. Existing vision-language +models, which primarily rely on backbone architectures or CLIP-based embedding +learning, demonstrate inherent limitations in fine-grained spatial localization +and operational capabilities. This paper introduces SJTU: Spatial Judgments in +Multimodal Models - Towards Unified Segmentation through Coordinate Detection, +a framework that leverages spatial coordinate understanding to bridge +vision-language interaction and precise segmentation, enabling accurate target +identification through natural language instructions. The framework presents an +approach for integrating segmentation techniques with vision-language models +through spatial inference in multimodal space. By utilizing normalized +coordinate detection for bounding boxes and transforming them into actionable +segmentation outputs, we establish a connection between spatial and language +representations in multimodal architectures. Experimental results demonstrate +superior performance across benchmark datasets, achieving IoU scores of 0.5958 +on COCO 2017 and 0.6758 on Pascal VOC. Testing on a single NVIDIA RTX 3090 GPU +with 512x512 resolution images yields an average inference time of 7 seconds +per image, demonstrating the framework's effectiveness in both accuracy and +practical deployability. The project code is available at +https://github.com/jw-chae/SJTU + +
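+ The coordinate-to-segmentation handoff described above amounts to rescaling normalized box predictions and passing them to a promptable segmenter; the output convention and the downstream model below are assumptions made only for illustration.
+
+     def normalized_box_to_pixels(box, width, height):
+         # box = (x1, y1, x2, y2) with coordinates in [0, 1]
+         x1, y1, x2, y2 = box
+         return (round(x1 * width), round(y1 * height),
+                 round(x2 * width), round(y2 * height))
+
+     # e.g., a box predicted by the vision-language model for a 640x480 image
+     px_box = normalized_box_to_pixels((0.12, 0.30, 0.58, 0.92), 640, 480)
+     # px_box could then serve as a box prompt to a SAM-style segmenter to obtain the mask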
+
+ comment: 15 pages, 3 figures +
+
+
+
+
+ + ♻ ☆ Pan-cancer Histopathology WSI Pre-training with Position-aware Masked + Autoencoder + + +
+ Large-scale pre-training models have promoted the development of +histopathology image analysis. However, existing self-supervised methods for +histopathology images primarily focus on learning patch features, while there +is a notable gap in the availability of pre-training models specifically +designed for WSI-level feature learning. In this paper, we propose a novel +self-supervised learning framework for pan-cancer WSI-level representation +pre-training with the designed position-aware masked autoencoder (PAMA). +Meanwhile, we propose the position-aware cross-attention (PACA) module with a +kernel reorientation (KRO) strategy and an anchor dropout (AD) mechanism. The +KRO strategy can capture the complete semantic structure and eliminate +ambiguity in WSIs, and the AD contributes to enhancing the robustness and +generalization of the model. We evaluated our method on 7 large-scale datasets +from multiple organs for pan-cancer classification tasks. The results have +demonstrated the effectiveness and generalization of PAMA in discriminative WSI +representation learning and pan-cancer WSI pre-training. The proposed method +was also compared with 8 WSI analysis methods. The experimental results have +indicated that our proposed PAMA is superior to the state-of-the-art methods. +The code and checkpoints are available at https://github.com/WkEEn/PAMA. + +
+
+
+
+
+ + ♻ ☆ Harnessing Shared Relations via Multimodal Mixup Contrastive Learning + for Multimodal Classification NeurIPS 2024 + + +
+ Deep multimodal learning has shown remarkable success by leveraging +contrastive learning to capture explicit one-to-one relations across +modalities. However, real-world data often exhibits shared relations beyond +simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive +Learning approach to capture nuanced shared relations inherent in multimodal +data. Our key contribution is a Mixup-based contrastive loss that learns robust +representations by aligning mixed samples from one modality with their +corresponding samples from other modalities thereby capturing shared relations +between them. For multimodal classification tasks, we introduce a framework +that integrates a fusion module with unimodal prediction modules for auxiliary +supervision during training, complemented by our proposed Mixup-based +contrastive loss. Through extensive experiments on diverse datasets (N24News, +ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures +shared multimodal relations and generalizes across domains. It outperforms +state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving +comparable performance on Food-101. Our work highlights the significance of +learning shared relations for robust multimodal learning, opening up promising +avenues for future research. Our code is publicly available at +https://github.com/RaghavSinghal10/M3CoL. + +
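+ One plausible realization of a Mixup-based cross-modal contrastive loss is sketched below; this InfoNCE-style formulation, the Beta mixing, and the temperature are assumptions rather than the paper's exact loss.
+
+     import torch
+     import torch.nn.functional as F
+
+     def mixup_contrastive_loss(za, zb, alpha=0.4, tau=0.07):
+         # mix samples within modality A and align them with modality B embeddings,
+         # weighting the two positive targets by the mixing coefficient
+         n = za.size(0)
+         perm = torch.randperm(n)
+         lam = torch.distributions.Beta(alpha, alpha).sample().item()
+         za_mix = lam * za + (1 - lam) * za[perm]
+         logits = F.normalize(za_mix, dim=1) @ F.normalize(zb, dim=1).t() / tau
+         targets = torch.arange(n)
+         return lam * F.cross_entropy(logits, targets) + \
+                (1 - lam) * F.cross_entropy(logits, targets[perm])
+
+     loss = mixup_contrastive_loss(torch.randn(32, 128), torch.randn(32, 128))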
+
+ comment: RK and RS contributed equally to this work, 20 Pages, 8 Figures, 9 + Tables. Another version of the paper accepted at NeurIPS 2024 Workshop on + Unifying Representations in Neural Models (UniReps) +
+
+
+
+
+ + ♻ ☆ PADetBench: Towards Benchmarking Physical Attacks against Object + Detection + + +
+ Physical attacks against object detection have gained increasing attention +due to their significant practical implications. However, conducting physical +experiments is extremely time-consuming and labor-intensive. Moreover, physical +dynamics and cross-domain transformation are challenging to strictly regulate +in the real world, leading to unaligned evaluation and comparison, severely +hindering the development of physically robust models. To accommodate these +challenges, we explore utilizing realistic simulation to thoroughly and +rigorously benchmark physical attacks with fairness under controlled physical +dynamics and cross-domain transformation. This resolves the problem of +capturing identical adversarial images that cannot be achieved in the real +world. Our benchmark includes 20 physical attack methods, 48 object detectors, +comprehensive physical dynamics, and evaluation metrics. We also provide +end-to-end pipelines for dataset generation, detection, evaluation, and further +analysis. In addition, we perform 8064 groups of evaluation based on our +benchmark, which includes both overall evaluation and further detailed ablation +studies for controlled physical dynamics. Through these experiments, we provide +in-depth analyses of physical attack performance and physical adversarial +robustness, draw valuable observations, and discuss potential directions for +future research. + Codebase: https://github.com/JiaweiLian/Benchmarking_Physical_Attack + +
+
+
+
+
+ + ♻ ☆ ControlFace: Harnessing Facial Parametric Control for Face Rigging + + +
+ Manipulation of facial images to meet specific controls such as pose, +expression, and lighting, also known as face rigging, is a complex task in +computer vision. Existing methods are limited by their reliance on image +datasets, which necessitates individual-specific fine-tuning and limits their +ability to retain fine-grained identity and semantic details, reducing +practical usability. To overcome these limitations, we introduce ControlFace, a +novel face rigging method conditioned on 3DMM renderings that enables flexible, +high-fidelity control. We employ dual-branch U-Nets: one, referred to as +FaceNet, captures identity and fine details, while the other focuses on +generation. To enhance control precision, the control mixer module encodes the +correlated features between the target-aligned control and reference-aligned +control, and a novel guidance method, reference control guidance, steers the +generation process for better control adherence. By training on a facial video +dataset, we fully utilize FaceNet's rich representations while ensuring control +adherence. Extensive experiments demonstrate ControlFace's superior performance +in identity preservation and control precision, highlighting its practicality. +Please see the project website: https://cvlab-kaist.github.io/ControlFace/. + +
+
+ comment: project website: https://cvlab-kaist.github.io/ControlFace/ +
+
+
+
+
+ + ♻ ☆ MaterialPicker: Multi-Modal Material Generation with Diffusion + Transformers + + +
+ High-quality material generation is key for virtual environment authoring and +inverse rendering. We propose MaterialPicker, a multi-modal material generator +leveraging a Diffusion Transformer (DiT) architecture, improving and +simplifying the creation of high-quality materials from text prompts and/or +photographs. Our method can generate a material based on an image crop of a +material sample, even if the captured surface is distorted, viewed at an angle +or partially occluded, as is often the case in photographs of natural scenes. +We further allow the user to specify a text prompt to provide additional +guidance for the generation. We finetune a pre-trained DiT-based video +generator into a material generator, where each material map is treated as a +frame in a video sequence. We evaluate our approach both quantitatively and +qualitatively and show that it enables more diverse material generation and +better distortion correction than previous work. + +
+
+
+
+
+ + ♻ ☆ VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction + + +
+ Large language models (LLMs) have proven effective for layout generation due +to their ability to produce structure-description languages, such as HTML or +JSON, even without access to visual information. Recently, LLM providers have +evolved these models into large vision-language models (LVLMs), which show +prominent multi-modal understanding capabilities. Then, how can we leverage +this multi-modal power for layout generation? To answer this, we propose +Visual-Aware Self-Correction LAyout GeneRation (VASCAR) for LVLM-based +content-aware layout generation. In our method, LVLMs iteratively refine their +outputs with reference to rendered layout images, which are visualized as +colored bounding boxes on poster backgrounds. In experiments, we demonstrate +the effectiveness of our method combined with Gemini: without any additional training, +VASCAR achieves state-of-the-art (SOTA) layout generation quality, outperforming +both existing layout-specific generative models and other LLM-based methods. + +
+
+
+
+
+ + ♻ ☆ Comprehensive framework for evaluation of deep neural networks in + detection and quantification of lymphoma from PET/CT images: clinical + insights, pitfalls, and observer agreement analyses + + +
+ This study addresses critical gaps in automated lymphoma segmentation from +PET/CT images, focusing on issues often overlooked in existing literature. +While deep learning has been applied for lymphoma lesion segmentation, few +studies incorporate out-of-distribution testing, raising concerns about model +generalizability across diverse imaging conditions and patient populations. We +highlight the need to compare model performance with expert human annotators, +including intra- and inter-observer variability, to understand task difficulty +better. Most approaches focus on overall segmentation accuracy but overlook +lesion-specific measures important for precise lesion detection and disease +quantification. To address these gaps, we propose a clinically relevant +framework for evaluating deep segmentation networks. Using this lesion +measure-specific evaluation, we assess the performance of four deep networks +(ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from +multi-institutional datasets, covering various lymphoma subtypes and lesion +characteristics. Beyond standard metrics like the Dice similarity coefficient, +we evaluate clinical lesion measures and their prediction errors. We also +introduce detection criteria for lesion localization and propose a new +detection Criterion 3 based on metabolic characteristics. We show that networks +perform better on large, intense lesions with higher metabolic activity. +Finally, we compare network performance to physicians via intra- and +inter-observer variability analyses, demonstrating that network errors closely +resemble those made by experts, i.e., the small and faint lesions remain +challenging for both humans and networks. This study aims to improve automated +lesion segmentation's clinical relevance, supporting better treatment decisions +for lymphoma patients. The code is available at: +https://github.com/microsoft/lymphoma-segmentation-dnn. + +
+
+ comment: 32 pages, 15 figures, 5 tables +
+
+
+
+
+ + ♻ ☆ Euler's Elastica Based Cartoon-Smooth-Texture Image Decomposition + + +
+ We propose a novel model for decomposing grayscale images into three distinct +components: the structural part, representing sharp boundaries and regions with +strong light-to-dark transitions; the smooth part, capturing soft shadows and +shades; and the oscillatory part, characterizing textures and noise. To capture +the homogeneous structures, we introduce a combination of $L^0$-gradient and +curvature regularization on level lines. This new regularization term enforces +strong sparsity on the image gradient while reducing the undesirable staircase +effects as well as preserving the geometry of contours. For the smoothly +varying component, we utilize the $L^2$-norm of the Laplacian that favors +isotropic smoothness. To capture the oscillation, we use the inverse Sobolev +seminorm. To solve the associated minimization problem, we design an efficient +operator-splitting algorithm. Our algorithm effectively addresses the +challenging non-convex non-smooth problem by separating it into sub-problems. +Each sub-problem can be solved either directly using closed-form solutions or +efficiently using the Fast Fourier Transform (FFT). We provide systematic +experiments, including ablation and comparison studies, to analyze our model's +behaviors and demonstrate its effectiveness as well as efficiency. + +
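+ To make the three regularizers above concrete, the decomposition can be read
+as a minimization of the following schematic form (the weights
+$\alpha,\beta,\gamma,\eta$ and the exact elastica term are notational
+assumptions, not the paper's precise functional):
+$$\min_{u+v+w=f}\;\alpha\,\|\nabla u\|_{0}\;+\;\beta\!\int_{\Omega}\kappa(u)^{2}\,|\nabla u|\,dx\;+\;\gamma\,\|\Delta v\|_{L^{2}}^{2}\;+\;\eta\,\|w\|_{H^{-1}}^{2},$$
+where $u$, $v$, $w$ are the structural, smooth, and oscillatory parts of the
+input $f$, and $\kappa(u)$ denotes the curvature of the level lines of $u$.
+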
+
+
+
+
+ + ♻ ☆ Local Curvature Smoothing with Stein's Identity for Efficient Score + Matching NeurIPS 2024 + + +
+ The training of score-based diffusion models (SDMs) is based on score
+matching. The challenge of score matching is that it includes a computationally
+expensive Jacobian trace. While several methods have been proposed to avoid
+this computation, each has drawbacks, such as instability during training or
+learning a denoising vector field rather than the true score. We propose a
+novel score matching variant, local curvature smoothing with Stein's identity
+(LCSS). LCSS bypasses the Jacobian trace by applying Stein's identity, enabling
+both effective regularization and efficient computation. We show that LCSS
+surpasses existing methods in sample generation performance and matches the
+performance of denoising score matching, widely adopted by most SDMs, in
+evaluations such as FID, Inception score, and bits per dimension. Furthermore,
+we show that LCSS enables realistic image generation even at a high resolution
+of $1024 \times 1024$.
+
+</p>
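+ For context, the classical score-matching objective contains the Jacobian
+trace referred to above, and the Gaussian form of Stein's identity is the
+standard tool for trading such derivative terms for expectations (how LCSS
+combines the two is detailed in the paper, not reproduced here):
+$$\mathcal{J}(\theta)=\mathbb{E}_{p(\mathbf{x})}\!\left[\tfrac{1}{2}\,\lVert\mathbf{s}_\theta(\mathbf{x})\rVert_{2}^{2}+\operatorname{tr}\!\big(\nabla_{\mathbf{x}}\mathbf{s}_\theta(\mathbf{x})\big)\right],\qquad\mathbb{E}_{\mathbf{v}\sim\mathcal{N}(\mathbf{0},\sigma^{2}I)}\!\big[\mathbf{v}\,f(\mathbf{x}+\mathbf{v})\big]=\sigma^{2}\,\mathbb{E}_{\mathbf{v}}\!\big[\nabla f(\mathbf{x}+\mathbf{v})\big].$$
+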
+
+ comment: Accepted at NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Investigating Self-Supervised Image Denoising with Denaturation + + +
+ Self-supervised learning for image denoising in the presence of denatured
+noisy data is a crucial approach in machine learning. However, theoretical
+understanding of the performance of approaches that use denatured data is
+lacking. To provide a better understanding, in this paper we analyze in depth a
+self-supervised denoising algorithm that uses denatured data, through
+theoretical analysis and numerical experiments. Through the theoretical
+analysis, we show that the algorithm finds desired solutions to the
+optimization problem with the population risk, while the guarantee for the
+empirical risk depends on the hardness of the denoising task in terms of
+denaturation levels. We also conduct several experiments to investigate the
+performance of an extended algorithm in practice. The results indicate that
+training with denatured images works, and that the empirical performance aligns
+with the theoretical results. These findings suggest several directions for
+further improving self-supervised image denoising with denatured data.
+
+</p>
+
+ comment: The PDF v3 has a wrong license, while v4 has a correct license +
+
+
+
+
+ + ♻ ☆ TTT-Unet: Enhancing U-Net with Test-Time Training Layers for Biomedical + Image Segmentation + + +
+ Biomedical image segmentation is crucial for accurately diagnosing and +analyzing various diseases. However, Convolutional Neural Networks (CNNs) and +Transformers, the most commonly used architectures for this task, struggle to +effectively capture long-range dependencies due to the inherent locality of +CNNs and the computational complexity of Transformers. To address this +limitation, we introduce TTT-Unet, a novel framework that integrates Test-Time +Training (TTT) layers into the traditional U-Net architecture for biomedical +image segmentation. TTT-Unet dynamically adjusts model parameters during the +testing time, enhancing the model's ability to capture both local and +long-range features. We evaluate TTT-Unet on multiple medical imaging datasets, +including 3D abdominal organ segmentation in CT and MR images, instrument +segmentation in endoscopy images, and cell segmentation in microscopy images. +The results demonstrate that TTT-Unet consistently outperforms state-of-the-art +CNN-based and Transformer-based segmentation models across all tasks. The code +is available at https://github.com/rongzhou7/TTT-Unet. + +
+
+
+
+
+ + ♻ ☆ Local and Global Feature Attention Fusion Network for Face Recognition + + +
+ Recognition of low-quality face images remains a challenge due to invisible
+or deformed partial facial regions. For low-quality images dominated by missing
+partial facial regions, local region similarity contributes more to face
+recognition (FR). Conversely, in cases dominated by local face deformation,
+excessive attention to local regions may lead to misjudgments, while global
+features exhibit better robustness. However, most existing FR methods neglect
+the bias in feature quality of low-quality images introduced by these different
+factors. To address this issue, we propose a Local and Global Feature Attention
+Fusion (LGAF) network based on feature quality. The network adaptively
+allocates attention between local and global features according to feature
+quality and obtains more discriminative and high-quality face features through
+the complementarity of local and global information. In addition, to
+effectively obtain fine-grained information at various scales and increase the
+separability of facial features in high-dimensional space, we introduce a
+Multi-Head Multi-Scale Local Feature Extraction (MHMS) module. Experimental
+results demonstrate that LGAF achieves the best average performance on $4$
+validation sets (CFP-FP, CPLFW, AgeDB, and CALFW), and its performance on
+TinyFace and SCFace outperforms state-of-the-art (SoTA) methods.
+
+</p>
+
+
+
+
+ + ♻ ☆ Graph Canvas for Controllable 3D Scene Generation + + +
+ Spatial intelligence is foundational to AI systems that interact with the +physical world, particularly in 3D scene generation and spatial comprehension. +Current methodologies for 3D scene generation often rely heavily on predefined +datasets, and struggle to adapt dynamically to changing spatial relationships. +In this paper, we introduce GraphCanvas3D, a programmable, extensible, and +adaptable framework for controllable 3D scene generation. Leveraging in-context +learning, GraphCanvas3D enables dynamic adaptability without the need for +retraining, supporting flexible and customizable scene creation. Our framework +employs hierarchical, graph-driven scene descriptions, representing spatial +elements as graph nodes and establishing coherent relationships among objects +in 3D environments. Unlike conventional approaches, which are constrained in +adaptability and often require predefined input masks or retraining for +modifications, GraphCanvas3D allows for seamless object manipulation and scene +adjustments on the fly. Additionally, GraphCanvas3D supports 4D scene +generation, incorporating temporal dynamics to model changes over time. +Experimental results and user studies demonstrate that GraphCanvas3D enhances +usability, flexibility, and adaptability for scene generation. Our code and +models are available on the project website: +https://github.com/ILGLJ/Graph-Canvas. + +
+
+
+
+
+ + ♻ ☆ Scaling Inference-Time Search with Vision Value Model for Improved + Visual Comprehension + + +
+ Despite significant advancements in vision-language models (VLMs), effective
+approaches to enhance response quality by scaling inference-time computation
+are lacking. This capability is known to be a core step towards self-improving
+models in recent large language model studies. In this paper, we present the
+Vision Value Model (VisVM), which can guide VLM inference-time search to
+generate responses with better visual comprehension. Specifically, VisVM not
+only evaluates the generated sentence quality in the current search step, but
+also anticipates the quality of subsequent sentences that may result from the
+current step, thus providing a long-term value. In this way, VisVM steers VLMs
+away from generating sentences prone to hallucinations or insufficient detail,
+thereby producing higher quality responses. Experimental results demonstrate
+that VisVM-guided search significantly enhances VLMs' ability to generate
+descriptive captions with richer visual details and fewer hallucinations,
+compared with greedy decoding and search methods with other visual reward
+signals. Furthermore, we find that self-training the model with the
+VisVM-guided captions improves the VLM's performance across a wide range of
+multimodal benchmarks, indicating the potential for developing self-improving
+VLMs. Our value model and code are available at
+https://github.com/si0wang/VisVM.
+
+</p>
+
+
+
+
+
+
+
+ + Information Retrieval 7 + +
+
+
+ + ☆ Enhancing FKG.in: automating Indian food composition analysis + + +
+ This paper presents a novel approach to compute food composition data for +Indian recipes using a knowledge graph for Indian food (FKG.in) and LLMs. The +primary focus is to provide a broad overview of an automated food composition +analysis workflow and describe its core functionalities: nutrition data +aggregation, food composition analysis, and LLM-augmented information +resolution. This workflow aims to complement FKG.in and iteratively supplement +food composition data from verified knowledge bases. Additionally, this paper +highlights the challenges of representing Indian food and accessing food +composition data digitally. It also reviews three key sources of food +composition data: the Indian Food Composition Tables, the Indian Nutrient +Databank, and the Nutritionix API. Furthermore, it briefly outlines how users +can interact with the workflow to obtain diet-based health recommendations and +detailed food composition information for numerous recipes. We then explore the +complex challenges of analyzing Indian recipe information across dimensions +such as structure, multilingualism, and uncertainty as well as present our +ongoing work on LLM-based solutions to address these issues. The methods +proposed in this workshop paper for AI-driven knowledge curation and +information resolution are application-agnostic, generalizable, and replicable +for any domain. + +
+
+ comment: 15 pages, 3 figures, 30 references, International Conference on + Pattern Recognition 2024 - Multimedia Assisted Dietary Management Workshop +
+
+
+
+
+ + ☆ ConQRet: Benchmarking Fine-Grained Evaluation of Retrieval Augmented + Argumentation with LLM Judges + + +
+ Computational argumentation, which involves generating answers or summaries +for controversial topics like abortion bans and vaccination, has become +increasingly important in today's polarized environment. Sophisticated LLM +capabilities offer the potential to provide nuanced, evidence-based answers to +such questions through Retrieval-Augmented Argumentation (RAArg), leveraging +real-world evidence for high-quality, grounded arguments. However, evaluating +RAArg remains challenging, as human evaluation is costly and difficult for +complex, lengthy answers on complicated topics. At the same time, re-using +existing argumentation datasets is no longer sufficient, as they lack long, +complex arguments and realistic evidence from potentially misleading sources, +limiting holistic evaluation of retrieval effectiveness and argument quality. +To address these gaps, we investigate automated evaluation methods using +multiple fine-grained LLM judges, providing better and more interpretable +assessments than traditional single-score metrics and even previously reported +human crowdsourcing. To validate the proposed techniques, we introduce ConQRet, +a new benchmark featuring long and complex human-authored arguments on debated +topics, grounded in real-world websites, allowing an exhaustive evaluation +across retrieval effectiveness, argument quality, and groundedness. We validate +our LLM Judges on a prior dataset and the new ConQRet benchmark. Our proposed +LLM Judges and the ConQRet benchmark can enable rapid progress in computational +argumentation and can be naturally extended to other complex +retrieval-augmented generation tasks. + +
+
+
+
+
+ + ☆ eXpath: Explaining Knowledge Graph Link Prediction with Ontological + Closed Path Rules VLDB + + +
+ Link prediction (LP) is crucial for Knowledge Graph (KG) completion but
+commonly suffers from interpretability issues. While several methods have been
+proposed to explain embedding-based LP models, they are generally limited to
+local explanations on the KG and are deficient in providing human-interpretable
+semantics. Based on real-world observations of the characteristics of KGs from
+multiple domains, we propose to explain LP models in KGs with path-based
+explanations. An integrated framework, namely eXpath, is introduced, which
+incorporates the concept of relation paths with ontological closed path rules
+to enhance both the efficiency and effectiveness of LP interpretation. Notably,
+the eXpath explanations can be fused with other single-link explanation
+approaches to achieve a better overall solution. Extensive experiments across
+benchmark datasets and LP models demonstrate that introducing eXpath can boost
+the quality of resulting explanations by about 20% on two key metrics and
+reduce the required explanation time by 61.4%, in comparison to the best
+existing method. Case studies further highlight eXpath's ability to provide
+more semantically meaningful explanations through path-based evidence.
+
+</p>
+
+ comment: 13 pages, 5 figures. Submitted to PVLDB volume 18 on 2024-12-01
+</p>
+
+
+
+
+ + ☆ Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval + with Semantic Guidance NeurIPS 2024 + + +
+ Modern music retrieval systems often rely on fixed representations of user +preferences, limiting their ability to capture users' diverse and uncertain +retrieval needs. To address this limitation, we introduce Diff4Steer, a novel +generative retrieval framework that employs lightweight diffusion models to +synthesize diverse seed embeddings from user queries that represent potential +directions for music exploration. Unlike deterministic methods that map user +query to a single point in embedding space, Diff4Steer provides a statistical +prior on the target modality (audio) for retrieval, effectively capturing the +uncertainty and multi-faceted nature of user preferences. Furthermore, +Diff4Steer can be steered by image or text inputs, enabling more flexible and +controllable music discovery combined with nearest neighbor search. Our +framework outperforms deterministic regression methods and LLM-based generative +retrieval baseline in terms of retrieval and ranking metrics, demonstrating its +effectiveness in capturing user preferences, leading to more diverse and +relevant recommendations. Listening examples are available at +tinyurl.com/diff4steer. + +
+
+ comment: NeurIPS 2024 Creative AI Track +
+
+
+
+
+ + ♻ ☆ Towards Boosting LLMs-driven Relevance Modeling with Progressive + Retrieved Behavior-augmented Prompting COLING 2025 + + +
+ Relevance modeling is a critical component for enhancing user experience in
+search engines, with the primary objective of identifying items that align with
+users' queries. Traditional models rely only on the semantic congruence between
+queries and items to ascertain relevance. However, this approach represents
+merely one aspect of the relevance judgement, and is insufficient in isolation.
+Even powerful Large Language Models (LLMs) still cannot accurately judge the
+relevance of a query and an item from a semantic perspective. To augment
+LLMs-driven relevance modeling, this study proposes leveraging user
+interactions recorded in search logs to yield insights into users' implicit
+search intentions. The challenge lies in effectively prompting LLMs to capture
+dynamic search intentions, which poses several obstacles in real-world
+relevance scenarios, i.e., the absence of domain-specific knowledge, the
+inadequacy of an isolated prompt, and the prohibitive costs associated with
+deploying LLMs. In response, we propose ProRBP, a novel Progressive Retrieved
+Behavior-augmented Prompting framework for effectively integrating search
+scenario-oriented knowledge with LLMs. Specifically, we retrieve user-driven
+behavior neighbors from the daily search logs to obtain timely domain-specific
+knowledge, selecting candidates that users consider to meet their expectations.
+Then, we guide LLMs for relevance modeling by employing advanced prompting
+techniques that progressively improve the outputs of the LLMs, followed by a
+progressive aggregation with comprehensive consideration of diverse aspects.
+For online serving, we have developed an industrial application framework
+tailored for the deployment of LLMs in relevance modeling. Experiments on
+real-world industry data and online A/B testing demonstrate that our proposal
+achieves promising performance.
+
+</p>
+
+ comment: Accepted By COLING 2025 +
+
+
+
+
+ + ♻ ☆ TPRF: A Transformer-based Pseudo-Relevance Feedback Model for Efficient + and Effective Retrieval + + +
+ This paper considers Pseudo-Relevance Feedback (PRF) methods for dense +retrievers in a resource constrained environment such as that of cheap cloud +instances or embedded systems (e.g., smartphones and smartwatches), where +memory and CPU are limited and GPUs are not present. For this, we propose a +transformer-based PRF method (TPRF), which has a much smaller memory footprint +and faster inference time compared to other deep language models that employ +PRF mechanisms, with a marginal effectiveness loss. TPRF learns how to +effectively combine the relevance feedback signals from dense passage +representations. Specifically, TPRF provides a mechanism for modelling +relationships and weights between the query and the relevance feedback signals. +The method is agnostic to the specific dense representation used and thus can +be generally applied to any dense retriever. + +
+
+
+
+
+ + ♻ ☆ All-in-One: Heterogeneous Interaction Modeling for Cold-Start Rating + Prediction + + +
+ Cold-start rating prediction is a fundamental problem in recommender systems
+that has been extensively studied. Many methods have been proposed that exploit
+explicit relations among existing data, such as collaborative filtering, social
+recommendations and heterogeneous information networks, to alleviate the data
+insufficiency issue for cold-start users and items. However, the explicit
+relations constructed based on data between different roles may be unreliable
+and irrelevant, which limits the performance ceiling of the specific
+recommendation task. Motivated by this, in this paper, we propose a flexible
+framework dubbed heterogeneous interaction rating network (HIRE). HIRE does not
+solely rely on the pre-defined interaction pattern or the manually constructed
+heterogeneous information network. Instead, we devise a Heterogeneous
+Interaction Module (HIM) to jointly model the heterogeneous interactions and
+directly infer the important interactions via the observed data. In the
+experiments, we evaluate our model under three cold-start settings on three
+real-world datasets. The experimental results show that HIRE outperforms other
+baselines by a large margin. Furthermore, we visualize the inferred
+interactions of HIRE to confirm the contribution of our model.
+
+</p>
+
+ comment: 14 pages, 9 figures +
+
+
+
+
+
+
+
+ + Machine Learning 150 + +
+
+
+ + ☆ Stag-1: Towards Realistic 4D Driving Simulation with Video Generation + Model + + +
+ 4D driving simulation is essential for developing realistic autonomous +driving simulators. Despite advancements in existing methods for generating +driving scenes, significant challenges remain in view transformation and +spatial-temporal dynamic modeling. To address these limitations, we propose a +Spatial-Temporal simulAtion for drivinG (Stag-1) model to reconstruct +real-world scenes and design a controllable generative network to achieve 4D +simulation. Stag-1 constructs continuous 4D point cloud scenes using +surround-view data from autonomous vehicles. It decouples spatial-temporal +relationships and produces coherent keyframe videos. Additionally, Stag-1 +leverages video generation models to obtain photo-realistic and controllable 4D +driving simulation videos from any perspective. To expand the range of view +generation, we train vehicle motion videos based on decomposed camera poses, +enhancing modeling capabilities for distant scenes. Furthermore, we reconstruct +vehicle camera trajectories to integrate 3D points across consecutive views, +enabling comprehensive scene understanding along the temporal dimension. +Following extensive multi-level scene training, Stag-1 can simulate from any +desired viewpoint and achieve a deep understanding of scene evolution under +static spatial-temporal conditions. Compared to existing methods, our approach +shows promising performance in multi-view scene consistency, background +coherence, and accuracy, and contributes to the ongoing advancements in +realistic autonomous driving simulation. Code: https://github.com/wzzheng/Stag. + +
+
+ comment: Code is available at: https://github.com/wzzheng/Stag +
+
+
+
+
+ + ☆ Sparse autoencoders reveal selective remapping of visual concepts during + adaptation + + +
+ Adapting foundation models for specific purposes has become a standard +approach to build machine learning systems for downstream applications. Yet, it +is an open question which mechanisms take place during adaptation. Here we +develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named +PatchSAE, to extract interpretable concepts at granular levels (e.g. shape, +color, or semantics of an object) and their patch-wise spatial attributions. We +explore how these concepts influence the model output in downstream image +classification tasks and investigate how recent state-of-the-art prompt-based +adaptation techniques change the association of model inputs to these concepts. +While activations of concepts slightly change between adapted and non-adapted +models, we find that the majority of gains on common adaptation tasks can be +explained with the existing concepts already present in the non-adapted +foundation model. This work provides a concrete framework to train and use SAEs +for Vision Transformers and provides insights into explaining adaptation +mechanisms. + +
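+ A minimal sketch of the kind of sparse autoencoder described above, applied
+to patch-token activations (the dictionary width, sparsity penalty, and
+training details are assumptions, not the exact PatchSAE recipe):
+<pre>
+import torch
+import torch.nn as nn
+
+class PatchSparseAutoencoder(nn.Module):
+    """Encode patch activations into an overcomplete, sparse concept basis."""
+    def __init__(self, d_model=768, d_dict=16384):
+        super().__init__()
+        self.enc = nn.Linear(d_model, d_dict)
+        self.dec = nn.Linear(d_dict, d_model)
+
+    def forward(self, x):                      # x: (batch * patches, d_model)
+        z = torch.relu(self.enc(x))            # sparse concept activations
+        return self.dec(z), z
+
+sae = PatchSparseAutoencoder()
+x = torch.randn(32, 768)                       # e.g. CLIP ViT patch tokens
+recon, z = sae(x)
+loss = nn.functional.mse_loss(recon, x) + 1e-3 * z.abs().mean()  # recon + L1
+</pre>
+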
+
+ comment: A demo is available at github.com/dynamical-inference/patchsae +
+
+
+
+
+ + ☆ APOLLO: SGD-like Memory, AdamW-level Performance + + +
+ Large language models (LLMs) are notoriously memory-intensive during +training, particularly with the popular AdamW optimizer. This memory burden +necessitates using more or higher-end GPUs or reducing batch sizes, limiting +training scalability and throughput. To address this, various memory-efficient +optimizers have been proposed to reduce optimizer memory usage. However, they +face critical challenges: (i) reliance on costly SVD operations; (ii) +significant performance trade-offs compared to AdamW; and (iii) still +substantial optimizer memory overhead to maintain competitive performance. + In this work, we identify that AdamW's learning rate adaptation rule can be +effectively coarsened as a structured learning rate update. Based on this +insight, we propose Approximated Gradient Scaling for Memory-Efficient LLM +Optimization (APOLLO), which approximates learning rate scaling using an +auxiliary low-rank optimizer state based on pure random projection. This +structured learning rate update rule makes APOLLO highly tolerant to further +memory reductions while delivering comparable pre-training performance. Even +its rank-1 variant, APOLLO-Mini, achieves superior pre-training performance +compared to AdamW with SGD-level memory costs. + Extensive experiments demonstrate that the APOLLO series performs on-par with +or better than AdamW, while achieving greater memory savings by nearly +eliminating the optimization states of AdamW. These savings provide significant +system-level benefits: (1) Enhanced Throughput: 3x throughput on an 8xA100-80GB +setup compared to AdamW by supporting 4x larger batch sizes. (2) Improved Model +Scalability: Pre-training LLaMA-13B with naive DDP on A100-80GB GPUs without +system-level optimizations. (3) Low-End GPU Friendly Pre-training: Pre-training +LLaMA-7B on a single GPU using less than 12 GB of memory with weight +quantization. + +
+
+ comment: Preprint +
+
+
+
+
+ + ☆ Chimera: Accurate retrosynthesis prediction by ensembling models with + diverse inductive biases + + +
+ Planning and conducting chemical syntheses remains a major bottleneck in the
+discovery of functional small molecules, and prevents fully leveraging
+generative AI for molecular inverse design. While early work has shown that
+ML-based retrosynthesis models can predict reasonable routes, their low
+accuracy for less frequent, yet important reactions has been pointed out. As
+multi-step search algorithms are limited to reactions suggested by the
+underlying model, the applicability of those tools is inherently constrained by
+the accuracy of retrosynthesis prediction. Inspired by how chemists use
+different strategies to ideate reactions, we propose Chimera: a framework for
+building highly accurate reaction models that combine predictions from diverse
+sources with complementary inductive biases using a learning-based ensembling
+strategy. We instantiate the framework with two newly developed models, which
+already by themselves achieve state of the art in their categories. Through
+experiments across several orders of magnitude in data scale and time-splits,
+we show Chimera outperforms all major models by a large margin, owing both to
+the good individual performance of its constituents and to the scalability of
+our ensembling strategy. Moreover, we find that PhD-level organic chemists
+prefer predictions from Chimera over baselines in terms of quality. Finally, we
+transfer the largest-scale checkpoint to an internal dataset from a major
+pharmaceutical company, showing robust generalization under distribution shift.
+With the new dimension that our framework unlocks, we anticipate further
+acceleration in the development of even more accurate models.
+
+</p>
+
+
+
+
+ + ☆ Reinforcement Learning: An Overview + + +
+ This manuscript gives a big-picture, up-to-date overview of the field of +(deep) reinforcement learning and sequential decision making, covering +value-based RL, policy-gradient methods, model-based methods, and various other +topics (including a very brief discussion of RL+LLMs). + +
+
+
+
+
+ + ☆ Extrapolated Urban View Synthesis Benchmark + + +
+ Photorealistic simulators are essential for the training and evaluation of +vision-centric autonomous vehicles (AVs). At their core is Novel View Synthesis +(NVS), a crucial capability that generates diverse unseen viewpoints to +accommodate the broad and continuous pose distribution of AVs. Recent advances +in radiance fields, such as 3D Gaussian Splatting, achieve photorealistic +rendering at real-time speeds and have been widely used in modeling large-scale +driving scenes. However, their performance is commonly evaluated using an +interpolated setup with highly correlated training and test views. In contrast, +extrapolation, where test views largely deviate from training views, remains +underexplored, limiting progress in generalizable simulation technology. To +address this gap, we leverage publicly available AV datasets with multiple +traversals, multiple vehicles, and multiple cameras to build the first +Extrapolated Urban View Synthesis (EUVS) benchmark. Meanwhile, we conduct +quantitative and qualitative evaluations of state-of-the-art Gaussian Splatting +methods across different difficulty levels. Our results show that Gaussian +Splatting is prone to overfitting to training views. Besides, incorporating +diffusion priors and improving geometry cannot fundamentally improve NVS under +large view changes, highlighting the need for more robust approaches and +large-scale training. We have released our data to help advance self-driving +and urban robotics simulation technology. + +
+
+ comment: Project page: https://ai4ce.github.io/EUVS-Benchmark/ +
+
+
+
+
+ + ☆ From classical techniques to convolution-based models: A review of + object detection algorithms + + +
+ Object detection is a fundamental task in computer vision and image +understanding, with the goal of identifying and localizing objects of interest +within an image while assigning them corresponding class labels. Traditional +methods, which relied on handcrafted features and shallow models, struggled +with complex visual data and showed limited performance. These methods combined +low-level features with contextual information and lacked the ability to +capture high-level semantics. Deep learning, especially Convolutional Neural +Networks (CNNs), addressed these limitations by automatically learning rich, +hierarchical features directly from data. These features include both semantic +and high-level representations essential for accurate object detection. This +paper reviews object detection frameworks, starting with classical computer +vision methods. We categorize object detection approaches into two groups: (1) +classical computer vision techniques and (2) CNN-based detectors. We compare +major CNN models, discussing their strengths and limitations. In conclusion, +this review highlights the significant advancements in object detection through +deep learning and identifies key areas for further research to improve +performance. + +
+
+
+
+
+ + ☆ Uncertainty Quantification for Transformer Models for Dark-Pattern + Detection + + +
+ The opaque nature of transformer-based models, particularly in applications +susceptible to unethical practices such as dark-patterns in user interfaces, +requires models that integrate uncertainty quantification to enhance trust in +predictions. This study focuses on dark-pattern detection, deceptive design +choices that manipulate user decisions, undermining autonomy and consent. We +propose a differential fine-tuning approach implemented at the final +classification head via uncertainty quantification with transformer-based +pre-trained models. Employing a dense neural network (DNN) head architecture as +a baseline, we examine two methods capable of quantifying uncertainty: +Spectral-normalized Neural Gaussian Processes (SNGPs) and Bayesian Neural +Networks (BNNs). These methods are evaluated on a set of open-source +foundational models across multiple dimensions: model performance, variance in +certainty of predictions and environmental impact during training and inference +phases. Results demonstrate that integrating uncertainty quantification +maintains performance while providing insights into challenging instances +within the models. Moreover, the study reveals that the environmental impact +does not uniformly increase with the incorporation of uncertainty +quantification techniques. The study's findings demonstrate that uncertainty +quantification enhances transparency and provides measurable confidence in +predictions, improving the explainability and clarity of black-box models. This +facilitates informed decision-making and mitigates the influence of +dark-patterns on user interfaces. These results highlight the importance of +incorporating uncertainty quantification techniques in developing machine +learning models, particularly in domains where interpretability and +trustworthiness are critical. + +
+
+
+
+
+ + ☆ Enhancing Foundation Models for Time Series Forecasting via + Wavelet-based Tokenization + + +
+ How to best develop foundational models for time series forecasting remains +an important open question. Tokenization is a crucial consideration in this +effort: what is an effective discrete vocabulary for a real-valued sequential +input? To address this question, we develop WaveToken, a wavelet-based +tokenizer that allows models to learn complex representations directly in the +space of time-localized frequencies. Our method first scales and decomposes the +input time series, then thresholds and quantizes the wavelet coefficients, and +finally pre-trains an autoregressive model to forecast coefficients for the +forecast horizon. By decomposing coarse and fine structures in the inputs, +wavelets provide an eloquent and compact language for time series forecasting +that simplifies learning. Empirical results on a comprehensive benchmark, +including 42 datasets for both in-domain and zero-shot settings, show that +WaveToken: i) provides better accuracy than recently proposed foundation models +for forecasting while using a much smaller vocabulary (1024 tokens), and +performs on par or better than modern deep learning models trained specifically +on each dataset; and ii) exhibits superior generalization capabilities, +achieving the best average rank across all datasets for three complementary +metrics. In addition, we show that our method can easily capture complex +temporal patterns of practical relevance that are challenging for other recent +pre-trained models, including trends, sparse spikes, and non-stationary time +series with varying frequencies evolving over time. + +
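+ A minimal sketch of the scale-decompose-threshold-quantize recipe described
+above (requires the PyWavelets package; the wavelet choice, threshold, and
+uniform binning are assumptions, not the exact WaveToken scheme):
+<pre>
+import numpy as np
+import pywt  # PyWavelets
+
+def wavelet_tokenize(series, wavelet="db4", level=3, vocab=1024, thresh=0.05):
+    x = (series - series.mean()) / (series.std() + 1e-8)          # scale
+    coeffs = np.concatenate(pywt.wavedec(x, wavelet, level=level))
+    coeffs = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)      # threshold
+    edges = np.linspace(coeffs.min(), coeffs.max(), vocab - 1)    # uniform bins
+    return np.digitize(coeffs, edges)                             # ids in [0, vocab-1]
+
+tokens = wavelet_tokenize(np.sin(np.linspace(0, 20, 256)))        # token sequence
+</pre>
+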
+
+ comment: 25 pages, 15 figures +
+
+
+
+
+ + ☆ CompCap: Improving Multimodal Large Language Models with Composite + Captions + + +
+ How well can Multimodal Large Language Models (MLLMs) understand composite +images? Composite images (CIs) are synthetic visuals created by merging +multiple visual elements, such as charts, posters, or screenshots, rather than +being captured directly by a camera. While CIs are prevalent in real-world +applications, recent MLLM developments have primarily focused on interpreting +natural images (NIs). Our research reveals that current MLLMs face significant +challenges in accurately understanding CIs, often struggling to extract +information or perform complex reasoning based on these images. We find that +existing training data for CIs are mostly formatted for question-answer tasks +(e.g., in datasets like ChartQA and ScienceQA), while high-quality +image-caption datasets, critical for robust vision-language alignment, are only +available for NIs. To bridge this gap, we introduce Composite Captions +(CompCap), a flexible framework that leverages Large Language Models (LLMs) and +automation tools to synthesize CIs with accurate and detailed captions. Using +CompCap, we curate CompCap-118K, a dataset containing 118K image-caption pairs +across six CI types. We validate the effectiveness of CompCap-118K by +supervised fine-tuning MLLMs of three sizes: xGen-MM-inst.-4B and +LLaVA-NeXT-Vicuna-7B/13B. Empirical results show that CompCap-118K +significantly enhances MLLMs' understanding of CIs, yielding average gains of +1.7%, 2.0%, and 2.9% across eleven benchmarks, respectively. + +
+
+
+
+
+ + ☆ Physics-informed reduced order model with conditional neural fields NeurIPS 2024 + + +
+ This study presents the conditional neural fields for reduced-order modeling +(CNF-ROM) framework to approximate solutions of parametrized partial +differential equations (PDEs). The approach combines a parametric neural ODE +(PNODE) for modeling latent dynamics over time with a decoder that reconstructs +PDE solutions from the corresponding latent states. We introduce a +physics-informed learning objective for CNF-ROM, which includes two key +components. First, the framework uses coordinate-based neural networks to +calculate and minimize PDE residuals by computing spatial derivatives via +automatic differentiation and applying the chain rule for time derivatives. +Second, exact initial and boundary conditions (IC/BC) are imposed using +approximate distance functions (ADFs) [Sukumar and Srivastava, CMAME, 2022]. +However, ADFs introduce a trade-off as their second- or higher-order +derivatives become unstable at the joining points of boundaries. To address +this, we introduce an auxiliary network inspired by [Gladstone et al., NeurIPS +ML4PS workshop, 2022]. Our method is validated through parameter extrapolation +and interpolation, temporal extrapolation, and comparisons with analytical +solutions. + +
+
+ comment: 7 pages, 2 figures, NeurIPS 2024 Workshop on Machine Learning and the + Physical Sciences +
+
+
+
+
+ + ☆ Transformers Meet Relational Databases + + +
+ Transformer models have continuously expanded into all machine learning +domains convertible to the underlying sequence-to-sequence representation, +including tabular data. However, while ubiquitous, this representation +restricts their extension to the more general case of relational databases. In +this paper, we introduce a modular neural message-passing scheme that closely +adheres to the formal relational model, enabling direct end-to-end learning of +tabular Transformers from database storage systems. We address the challenges +of appropriate learning data representation and loading, which are critical in +the database setting, and compare our approach against a number of +representative models from various related fields across a significantly wide +range of datasets. Our results demonstrate a superior performance of this newly +proposed class of neural architectures. + +
+
+
+
+
+ + ☆ ColonNet: A Hybrid Of DenseNet121 And U-NET Model For Detection And + Segmentation Of GI Bleeding + + +
+ This study presents an integrated deep learning model for automatic detection
+and classification of gastrointestinal bleeding in frames extracted from
+Wireless Capsule Endoscopy (WCE) videos. The dataset was released as part of
+the Auto-WCBleedGen Challenge Version V2 hosted by the MISAHUB team. Our model
+attained the highest performance among the 75 teams that took part in this
+competition. It efficiently utilizes CNN-based models, i.e., DenseNet and
+U-Net, to detect and segment bleeding and non-bleeding areas in this complex,
+real-world dataset. The model achieves an overall accuracy of 80%, which can
+help a skilled doctor carry out further diagnostics.
+
+</p>
+
+
+
+
+ + ☆ Global Optimization with A Power-Transformed Objective and Gaussian + Smoothing + + +
+ We propose a novel method that solves global optimization problems in two
+steps: (1) perform an (exponential) power-$N$ transformation of the
+not-necessarily-differentiable objective function $f$ to obtain $f_N$, and (2)
+optimize the Gaussian-smoothed $f_N$ with stochastic approximations. Under mild
+conditions on $f$, for any $\delta>0$, we prove that with a sufficiently large
+power $N_\delta$, this method converges to a solution in the
+$\delta$-neighborhood of $f$'s global maximum point. The convergence rate is
+$O(d^2\sigma^4\varepsilon^{-2})$, which is faster than both the standard and
+single-loop homotopy methods. Extensive experiments show that our method
+requires significantly fewer iterations than competing algorithms to produce a
+high-quality solution.
+
+</p>
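+ A toy sketch of the two-step scheme (the exponential form of the power-$N$
+transform, the step sizes, and the normalized update are illustrative
+assumptions, not the paper's exact algorithm or its convergence-rate setting):
+<pre>
+import numpy as np
+
+def smoothed_power_ascent(f, x0, N=2.0, sigma=0.5, lr=0.5,
+                          n_samples=128, n_iters=300, seed=0):
+    rng = np.random.default_rng(seed)
+    x = np.asarray(x0, dtype=float)
+    f_N = lambda z: np.exp(N * f(z))                   # step (1): power transform
+    for it in range(n_iters):
+        u = rng.standard_normal((n_samples, x.size))
+        vals = np.array([f_N(x + sigma * ui) for ui in u])
+        g = (vals[:, None] * u).mean(axis=0) / sigma   # grad of Gaussian-smoothed f_N
+        step = lr / np.sqrt(it + 1)                    # decaying stochastic step
+        x = x + step * g / (np.linalg.norm(g) + 1e-12) # normalized for robustness
+    return x
+
+f = lambda z: -np.abs(z).sum()                         # non-smooth toy; max at 0
+print(smoothed_power_ascent(f, x0=[2.0, -1.5]))        # ends up near the origin
+</pre>
+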
+
+
+
+
+ + ☆ One-shot Federated Learning via Synthetic Distiller-Distillate + Communication NeurIPS 2024 + + +
+ One-shot Federated Learning (FL) is a powerful technology facilitating
+collaborative training of machine learning models in a single round of
+communication. While its superiority lies in communication efficiency and
+privacy preservation compared to iterative FL, one-shot FL often compromises
+model performance. Prior research has primarily focused on employing data-free
+knowledge distillation to optimize data generators and ensemble models for
+better aggregating local knowledge into the server model. However, these
+methods typically struggle with data heterogeneity, where inconsistent local
+data distributions can cause teachers to provide misleading knowledge.
+Additionally, they may encounter scalability issues with complex datasets due
+to inherent two-step information loss: first, during local training (from data
+to model), and second, when transferring knowledge to the server model (from
+model to inverted data). In this paper, we propose FedSD2C, a novel and
+practical one-shot FL framework designed to address these challenges. FedSD2C
+introduces a distiller to synthesize informative distillates directly from
+local data to reduce information loss and proposes sharing synthetic
+distillates instead of inconsistent local models to tackle data heterogeneity.
+Our empirical results demonstrate that FedSD2C consistently outperforms other
+one-shot FL methods on more complex and real datasets, achieving up to
+2.6$\times$ the performance of the best baseline. Code:
+https://github.com/Carkham/FedSD2C
+
+</p>
+
+ comment: Accepted by NeurIPS 2024 +
+
+
+
+
+ + ☆ LinVT: Empower Your Image-level Large Language Model to Understand + Videos + + +
+ Large Language Models (LLMs) have been widely used in various tasks,
+motivating us to develop an LLM-based assistant for videos. Instead of training
+from scratch, we propose a module to transform arbitrary well-trained
+image-based LLMs into video-LLMs (after being trained on video data). To better
+adapt image-LLMs for processing videos, we introduce two design principles:
+linear transformation to preserve the original visual-language alignment and
+representative information condensation from redundant video content. Guided by
+these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT),
+which enables existing image-LLMs to understand videos. We benchmark LinVT with
+six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL,
+showcasing the high compatibility of LinVT. LinVT-based LLMs achieve
+state-of-the-art performance across various video benchmarks, illustrating the
+effectiveness of LinVT in multi-modal video understanding.
+
+</p>
+
+
+
+
+ + ☆ Privacy Drift: Evolving Privacy Concerns in Incremental Learning + + +
+ In the evolving landscape of machine learning (ML), Federated Learning (FL) +presents a paradigm shift towards decentralized model training while preserving +user data privacy. This paper introduces the concept of ``privacy drift", an +innovative framework that parallels the well-known phenomenon of concept drift. +While concept drift addresses the variability in model accuracy over time due +to changes in the data, privacy drift encapsulates the variation in the leakage +of private information as models undergo incremental training. By defining and +examining privacy drift, this study aims to unveil the nuanced relationship +between the evolution of model performance and the integrity of data privacy. +Through rigorous experimentation, we investigate the dynamics of privacy drift +in FL systems, focusing on how model updates and data distribution shifts +influence the susceptibility of models to privacy attacks, such as membership +inference attacks (MIA). Our results highlight a complex interplay between +model accuracy and privacy safeguards, revealing that enhancements in model +performance can lead to increased privacy risks. We provide empirical evidence +from experiments on customized datasets derived from CIFAR-100 (Canadian +Institute for Advanced Research, 100 classes), showcasing the impact of data +and concept drift on privacy. This work lays the groundwork for future research +on privacy-aware machine learning, aiming to achieve a delicate balance between +model accuracy and data privacy in decentralized environments. + +
+
+ comment: 6 pages, 7 figures, Accepted in IEEE ICNC 25 +
+
+
+
+
+ + ☆ Variational Encoder-Decoders for Learning Latent Representations of + Physical Systems + + +
+ We present a deep-learning Variational Encoder-Decoder (VED) framework for +learning data-driven low-dimensional representations of the relationship +between high-dimensional parameters of a physical system and the system's +high-dimensional observable response. The framework consists of two deep +learning-based probabilistic transformations: An encoder mapping parameters to +latent codes and a decoder mapping latent codes to the observable response. The +hyperparameters of these transformations are identified by maximizing a +variational lower bound on the log-conditional distribution of the observable +response given parameters. To promote the disentanglement of latent codes, we +equip this variational loss with a penalty on the off-diagonal entries of the +aggregate distribution covariance of codes. This regularization penalty +encourages the pushforward of a standard Gaussian distribution of latent codes +to approximate the marginal distribution of the observable response. + Using the proposed framework we successfully model the hydraulic pressure +response at observation wells of a groundwater flow model as a function of its +discrete log-hydraulic transmissivity field. Compared to the canonical +correlation analysis encoding, the VED model achieves a lower-dimensional +latent representation, with as low as $r = 50$ latent dimensions without a +significant loss of reconstruction accuracy. We explore the impact of +regularization on model performance, finding that KL-divergence and covariance +regularization improve feature disentanglement in latent space while +maintaining reconstruction accuracy. Furthermore, we evaluate the generative +capabilities of the regularized model by decoding random Gaussian noise, +revealing that tuning both $\beta$ and $\lambda$ parameters enhances the +quality of the generated observable response data. + +
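+ Schematically, the training objective described above can be written as a
+variational lower bound plus an off-diagonal covariance penalty (the symbols
+$\mathbf{m}$ for parameters, $\mathbf{d}$ for the observable response, and the
+squared-penalty form are notational assumptions):
+$$\mathcal{L}(\phi,\theta)=\mathbb{E}_{q_{\phi}(\mathbf{z}\mid\mathbf{m})}\big[\log p_{\theta}(\mathbf{d}\mid\mathbf{z})\big]-\beta\,D_{\mathrm{KL}}\!\big(q_{\phi}(\mathbf{z}\mid\mathbf{m})\,\|\,\mathcal{N}(\mathbf{0},I)\big)-\lambda\sum_{i\neq j}\big[\mathrm{Cov}_{\mathrm{agg}}(\mathbf{z})\big]_{ij}^{2},$$
+where $\mathrm{Cov}_{\mathrm{agg}}(\mathbf{z})$ is the covariance of the
+aggregate distribution of latent codes, and $\beta$, $\lambda$ are the two
+regularization weights whose tuning is discussed above.
+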
+
+
+
+
+ + ☆ Towards Understanding the Role of Sharpness-Aware Minimization + Algorithms for Out-of-Distribution Generalization + + +
+ Recently, sharpness-aware minimization (SAM) has emerged as a promising +method to improve generalization by minimizing sharpness, which is known to +correlate well with generalization ability. Since the original proposal of SAM, +many variants of SAM have been proposed to improve its accuracy and efficiency, +but comparisons have mainly been restricted to the i.i.d. setting. In this +paper we study SAM for out-of-distribution (OOD) generalization. First, we +perform a comprehensive comparison of eight SAM variants on zero-shot OOD +generalization, finding that the original SAM outperforms the Adam baseline by +$4.76\%$ and the strongest SAM variants outperform the Adam baseline by +$8.01\%$ on average. We then provide an OOD generalization bound in terms of +sharpness for this setting. Next, we extend our study of SAM to the related +setting of gradual domain adaptation (GDA), another form of OOD generalization +where intermediate domains are constructed between the source and target +domains, and iterative self-training is done on intermediate domains, to +improve the overall target domain error. In this setting, our experimental +results demonstrate that the original SAM outperforms the baseline of Adam on +each of the experimental datasets by $0.82\%$ on average and the strongest SAM +variants outperform Adam by $1.52\%$ on average. We then provide a +generalization bound for SAM in the GDA setting. Asymptotically, this +generalization bound is no better than the one for self-training in the +literature of GDA. This highlights a further disconnection between the +theoretical justification for SAM versus its empirical performance, with recent +work finding that low sharpness alone does not account for all of SAM's +generalization benefits. For future work, we provide several potential avenues +for obtaining a tighter analysis for SAM in the OOD setting. + +
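+ For reference, the update rule of the original SAM discussed above, shown on
+a toy quadratic loss (the learning rate, perturbation radius $\rho$, and toy
+problem are illustrative choices, not the paper's experimental setup):
+<pre>
+import numpy as np
+
+def sam_step(w, grad_fn, lr=0.1, rho=0.01):
+    g = grad_fn(w)
+    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to a local worst case
+    return w - lr * grad_fn(w + eps)             # descend with the perturbed gradient
+
+A = np.diag([1.0, 5.0])                          # loss L(w) = 0.5 * w^T A w
+grad_fn = lambda w: A @ w
+w = np.array([1.0, 1.0])
+for _ in range(200):
+    w = sam_step(w, grad_fn)
+print(w)                                         # settles near the minimum at 0
+</pre>
+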
+
+ comment: 25 pages +
+
+
+
+
+ + ☆ A Differentially Private Kaplan-Meier Estimator for Privacy-Preserving + Survival Analysis + + +
+ This paper presents a differentially private approach to Kaplan-Meier +estimation that achieves accurate survival probability estimates while +safeguarding individual privacy. The Kaplan-Meier estimator is widely used in +survival analysis to estimate survival functions over time, yet applying it to +sensitive datasets, such as clinical records, risks revealing private +information. To address this, we introduce a novel algorithm that applies +time-indexed Laplace noise, dynamic clipping, and smoothing to produce a +privacy-preserving survival curve while maintaining the cumulative structure of +the Kaplan-Meier estimator. By scaling noise over time, the algorithm accounts +for decreasing sensitivity as fewer individuals remain at risk, while dynamic +clipping and smoothing prevent extreme values and reduce fluctuations, +preserving the natural shape of the survival curve. + Our results, evaluated on the NCCTG lung cancer dataset, show that the +proposed method effectively lowers root mean squared error (RMSE) and enhances +accuracy across privacy budgets ($\epsilon$). At $\epsilon = 10$, the algorithm +achieves an RMSE as low as 0.04, closely approximating non-private estimates. +Additionally, membership inference attacks reveal that higher $\epsilon$ values +(e.g., $\epsilon \geq 6$) significantly reduce influential points, particularly +at higher thresholds, lowering susceptibility to inference attacks. These +findings confirm that our approach balances privacy and utility, advancing +privacy-preserving survival analysis. + +
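+ An illustrative sketch of the mechanism described above: a Kaplan-Meier
+product estimate with time-indexed Laplace noise, clipping to $[0,1]$, and a
+monotone smoothing pass (the noise schedule shown is schematic and is not the
+paper's calibrated $\epsilon$-DP accounting):
+<pre>
+import numpy as np
+
+def dp_kaplan_meier(times, events, eps=1.0, seed=0):
+    rng = np.random.default_rng(seed)
+    times, events = np.asarray(times, float), np.asarray(events, int)
+    grid = np.sort(np.unique(times[events == 1]))        # event times
+    surv, curve = 1.0, []
+    for t in grid:
+        at_risk = np.sum(times >= t)
+        deaths = np.sum((times == t) & (events == 1))
+        surv *= 1.0 - deaths / at_risk                    # Kaplan-Meier product
+        # time-indexed scale tied to the risk-set size (illustrative only)
+        noise = rng.laplace(scale=1.0 / (eps * max(at_risk, 1)))
+        curve.append(np.clip(surv + noise, 0.0, 1.0))     # dynamic clipping
+    return grid, np.minimum.accumulate(np.array(curve))   # enforce non-increasing curve
+
+t = np.array([5, 8, 8, 12, 16, 23, 27, 30, 33, 40.0])
+e = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])
+print(dp_kaplan_meier(t, e, eps=2.0))
+</pre>
+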
+
+
+
+
+ + ☆ A text-to-tabular approach to generate synthetic patient data using LLMs + + +
+ Access to large-scale high-quality healthcare databases is key to accelerate +medical research and make insightful discoveries about diseases. However, +access to such data is often limited by patient privacy concerns, data sharing +restrictions and high costs. To overcome these limitations, synthetic patient +data has emerged as an alternative. However, synthetic data generation (SDG) +methods typically rely on machine learning (ML) models trained on original +data, leading back to the data scarcity problem. We propose an approach to +generate synthetic tabular patient data that does not require access to the +original data, but only a description of the desired database. We leverage +prior medical knowledge and in-context learning capabilities of large language +models (LLMs) to generate realistic patient data, even in a low-resource +setting. We quantitatively evaluate our approach against state-of-the-art SDG +models, using fidelity, privacy, and utility metrics. Our results show that +while LLMs may not match the performance of state-of-the-art models trained on +the original data, they effectively generate realistic patient data with +well-preserved clinical correlations. An ablation study highlights key elements +of our prompt contributing to high-quality synthetic patient data generation. +This approach, which is easy to use and does not require original data or +advanced ML skills, is particularly valuable for quickly generating +custom-designed patient data, supporting project implementation and providing +educational resources. + +
+
+ comment: 12 pages, 2 figures, 3 tables +
+
+
+
+
+ + ☆ Navigating Shortcuts, Spurious Correlations, and Confounders: From + Origins via Detection to Mitigation + + +
+ Shortcuts, also described as Clever Hans behavior, spurious correlations, or +confounders, present a significant challenge in machine learning and AI, +critically affecting model generalization and robustness. Research in this +area, however, remains fragmented across various terminologies, hindering the +progress of the field as a whole. Consequently, we introduce a unifying +taxonomy of shortcut learning by providing a formal definition of shortcuts and +bridging the diverse terms used in the literature. In doing so, we further +establish important connections between shortcuts and related fields, including +bias, causality, and security, where parallels exist but are rarely discussed. +Our taxonomy organizes existing approaches for shortcut detection and +mitigation, providing a comprehensive overview of the current state of the +field and revealing underexplored areas and open challenges. Moreover, we +compile and classify datasets tailored to study shortcut learning. Altogether, +this work provides a holistic perspective to deepen understanding and drive the +development of more effective strategies for addressing shortcuts in machine +learning. + +
+
+
+
+
+ + ☆ LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style + Conditioned Image Generation + + +
+ Recent advancements in image generation models have enabled personalized +image creation with both user-defined subjects (content) and styles. Prior +works achieved personalization by merging corresponding low-rank adaptation +parameters (LoRAs) through optimization-based methods, which are +computationally demanding and unsuitable for real-time use on +resource-constrained devices like smartphones. To address this, we introduce +LoRA.rar, a method that not only improves image quality but also achieves a +remarkable speedup of over $4000\times$ in the merging process. LoRA.rar +pre-trains a hypernetwork on a diverse set of content-style LoRA pairs, +learning an efficient merging strategy that generalizes to new, unseen +content-style pairs, enabling fast, high-quality personalization. Moreover, we +identify limitations in existing evaluation metrics for content-style quality +and propose a new protocol using multimodal large language models (MLLM) for +more accurate assessment. Our method significantly outperforms the current +state of the art in both content and style fidelity, as validated by MLLM +assessments and human evaluations. + +
+
+ comment: 17 pages, 20 figures +
+
+
+
+
+ + ☆ Explingo: Explaining AI Predictions using Large Language Models + + +
+ Explanations of machine learning (ML) model predictions generated by +Explainable AI (XAI) techniques such as SHAP are essential for people using ML +outputs for decision-making. We explore the potential of Large Language Models +(LLMs) to transform these explanations into human-readable, narrative formats +that align with natural communication. We address two key research questions: +(1) Can LLMs reliably transform traditional explanations into high-quality +narratives? and (2) How can we effectively evaluate the quality of narrative +explanations? To answer these questions, we introduce Explingo, which consists +of two LLM-based subsystems, a Narrator and Grader. The Narrator takes in ML +explanations and transforms them into natural-language descriptions. The Grader +scores these narratives on a set of metrics including accuracy, completeness, +fluency, and conciseness. + Our experiments demonstrate that LLMs can generate high-quality narratives +that achieve high scores across all metrics, particularly when guided by a +small number of human-labeled and bootstrapped examples. We also identified +areas that remain challenging, in particular for effectively scoring narratives +in complex domains. The findings from this work have been integrated into an +open-source tool that makes narrative explanations available for further +applications. + +
+
+ comment: To be presented in the 2024 IEEE International Conference on Big Data + (IEEE BigData) +
+
+
+
+
+ + ☆ Effective Rank and the Staircase Phenomenon: New Insights into Neural + Network Training Dynamics + + +
+ In recent years, deep learning, powered by neural networks, has achieved +widespread success in solving high-dimensional problems, particularly those +with low-dimensional feature structures. This success stems from their ability +to identify and learn low dimensional features tailored to the problems. +Understanding how neural networks extract such features during training +dynamics remains a fundamental question in deep learning theory. In this work, +we propose a novel perspective by interpreting the neurons in the last hidden +layer of a neural network as basis functions that represent essential features. +To explore the linear independence of these basis functions throughout the deep +learning dynamics, we introduce the concept of 'effective rank'. Our extensive +numerical experiments reveal a notable phenomenon: the effective rank increases +progressively during the learning process, exhibiting a staircase-like pattern, +while the loss function concurrently decreases as the effective rank rises. We +refer to this observation as the 'staircase phenomenon'. Specifically, for deep +neural networks, we rigorously prove the negative correlation between the loss +function and effective rank, demonstrating that the lower bound of the loss +function decreases with increasing effective rank. Therefore, to achieve a +rapid descent of the loss function, it is critical to promote the swift growth +of effective rank. Ultimately, we evaluate existing advanced learning +methodologies and find that these approaches can quickly achieve a higher +effective rank, thereby avoiding redundant staircase processes and accelerating +the rapid decline of the loss function. + +
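+ As a reference point, one widely used notion of effective rank is the
+exponential of the entropy of the normalized singular values of the last hidden
+layer's activation matrix (the paper's own definition may differ in detail):
+<pre>
+import numpy as np
+
+def effective_rank(features):
+    # features: (num_samples, num_neurons) activations of the last hidden layer
+    s = np.linalg.svd(features, compute_uv=False)
+    p = s / s.sum()
+    p = p[p > 0]
+    return float(np.exp(-(p * np.log(p)).sum()))
+
+# e.g. log effective_rank(hidden_activations) after each epoch to observe the
+# staircase-like growth described above
+</pre>
+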
+
+
+
+
+ + ☆ The Polynomial Stein Discrepancy for Assessing Moment Convergence + + +
+ We propose a novel method for measuring the discrepancy between a set of +samples and a desired posterior distribution for Bayesian inference. Classical +methods for assessing sample quality like the effective sample size are not +appropriate for scalable Bayesian sampling algorithms, such as stochastic +gradient Langevin dynamics, that are asymptotically biased. Instead, the gold +standard is to use the kernel Stein Discrepancy (KSD), which is itself not +scalable given its quadratic cost in the number of samples. The KSD and its +faster extensions also typically suffer from the curse-of-dimensionality and +can require extensive tuning. To address these limitations, we develop the +polynomial Stein discrepancy (PSD) and an associated goodness-of-fit test. +While the new test is not fully convergence-determining, we prove that it +detects differences in the first r moments in the Bernstein-von Mises limit. We +empirically show that the test has higher power than its competitors in several +examples, and at a lower computational cost. Finally, we demonstrate that the +PSD can assist practitioners to select hyper-parameters of Bayesian sampling +algorithms more efficiently than competitors. + +
+
+ comment: 17 pages, 14 figures +
+
+
+
+
+ + ☆ How to Squeeze An Explanation Out of Your Model + + +
+ Deep learning models are widely used nowadays for their reliability in +performing various tasks. However, they do not typically provide the reasoning +behind their decisions, which is a significant drawback, particularly for more +sensitive areas such as biometrics, security and healthcare. The most commonly +used approaches to provide interpretability create visual attention heatmaps of +regions of interest on an image based on the model's gradient backpropagation. +Although this is a viable approach, current methods are targeted toward image +settings and default/standard deep learning models, meaning that they require +significant adaptations to work on video/multi-modal settings and custom +architectures. This paper proposes an approach for interpretability that is +model-agnostic, based on a novel use of the Squeeze and Excitation (SE) block +that creates visual attention heatmaps. By including an SE block prior to the +classification layer of any model, we are able to retrieve the most influential +features via SE vector manipulation, one of the key components of the SE block. +Our results show that this new SE-based interpretability can be applied to +various models in image and video/multi-modal settings, namely biometrics of +facial features with CelebA and behavioral biometrics using Active Speaker +Detection datasets. Furthermore, our proposal does not compromise model +performance on the original task, and has competitive results with current +interpretability approaches in state-of-the-art object datasets, highlighting +its robustness to varying data beyond the biometric context. +
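+
+ A minimal sketch of placing a Squeeze-and-Excitation block before a classifier head and reading channel influence from the excitation vector (the paper's exact heatmap construction is not reproduced; SEHead and its sizes are assumptions):
+
+import torch
+import torch.nn as nn
+
+class SEHead(nn.Module):
+    def __init__(self, channels: int, num_classes: int, reduction: int = 16):
+        super().__init__()
+        self.squeeze = nn.AdaptiveAvgPool2d(1)
+        self.excite = nn.Sequential(
+            nn.Linear(channels, channels // reduction), nn.ReLU(),
+            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
+        )
+        self.classifier = nn.Linear(channels, num_classes)
+
+    def forward(self, feat):
+        b, c, _, _ = feat.shape
+        w = self.excite(self.squeeze(feat).view(b, c))   # per-channel weights
+        pooled = self.squeeze(feat * w.view(b, c, 1, 1)).view(b, c)
+        return self.classifier(pooled), w                # w ranks influence
+
+head = SEHead(channels=64, num_classes=2)
+logits, weights = head(torch.randn(1, 64, 7, 7))
+print(weights.topk(5).indices)   # indices of the most influential channels
+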
+
+
+
+
+ + ☆ Learning Hidden Physics and System Parameters with Deep Operator + Networks + + +
+ Big data is transforming scientific progress by enabling the discovery of +novel models, enhancing existing frameworks, and facilitating precise +uncertainty quantification, while advancements in scientific machine learning +complement this by providing powerful tools to solve inverse problems to +identify the complex systems where traditional methods falter due to sparse or +noisy data. We introduce two innovative neural operator frameworks tailored for +discovering hidden physics and identifying unknown system parameters from +sparse measurements. The first framework integrates a popular neural operator, +DeepONet, and a physics-informed neural network to capture the relationship +between sparse data and the underlying physics, enabling the accurate discovery +of a family of governing equations. The second framework focuses on system +parameter identification, leveraging a DeepONet pre-trained on sparse sensor +measurements to initialize a physics-constrained inverse model. Both frameworks +excel in handling limited data and preserving physical consistency. +Benchmarking on the Burgers' equation and reaction-diffusion system +demonstrates state-of-the-art performance, achieving average $L_2$ errors of +$\mathcal{O}(10^{-2})$ for hidden physics discovery and absolute errors of +$\mathcal{O}(10^{-3})$ for parameter identification. These results underscore +the frameworks' robustness, efficiency, and potential for solving complex +scientific problems with minimal observational data. + +
+
+
+
+
+ + ☆ Dirac-Equation Signal Processing: Physics Boosts Topological Machine + Learning + + +
+ Topological signals are variables or features associated with both nodes and +edges of a network. Recently, in the context of Topological Machine Learning, +great attention has been devoted to signal processing of such topological +signals. Most of the previous topological signal processing algorithms treat +node and edge signals separately and work under the hypothesis that the true +signal is smooth and/or well approximated by a harmonic eigenvector of the +Hodge-Laplacian, which may be violated in practice. Here we propose +Dirac-equation signal processing, a framework for efficiently reconstructing +true signals on nodes and edges, also if they are not smooth or harmonic, by +processing them jointly. The proposed physics-inspired algorithm is based on +the spectral properties of the topological Dirac operator. It leverages the +mathematical structure of the topological Dirac equation to boost the +performance of the signal processing algorithm. We discuss how the relativistic +dispersion relation obeyed by the topological Dirac equation can be used to +assess the quality of the signal reconstruction. Finally, we demonstrate the +improved performance of the algorithm with respect to previous algorithms. +Specifically, we show that Dirac-equation signal processing can also be used +efficiently if the true signal is a non-trivial linear combination of more than +one eigenstate of the Dirac equation, as it generally occurs for real signals. + +
+
+ comment: 14 pages, 7 figures +
+
+
+
+
+ + ☆ Robust Computation with Intrinsic Heterogeneity + + +
+ Intrinsic within-type neuronal heterogeneity is a ubiquitous feature of +biological systems, with well-documented computational advantages. Recent works +in machine learning have incorporated such diversities by optimizing neuronal +parameters alongside synaptic connections and demonstrated state-of-the-art +performance across common benchmarks. However, this performance gain comes at +the cost of significantly higher computational costs, imposed by a larger +parameter space. Furthermore, it is unclear how the neuronal parameters, +constrained by the biophysics of their surroundings, are globally orchestrated +to minimize top-down errors. To address these challenges, we postulate that +neurons are intrinsically diverse, and investigate the computational +capabilities of such heterogeneous neuronal parameters. Our results show that +intrinsic heterogeneity, viewed as a fixed quenched disorder, often +substantially improves performance across hundreds of temporal tasks. Notably, +smaller but heterogeneous networks outperform larger homogeneous networks, +despite consuming less data. We elucidate the underlying mechanisms driving +this performance boost and illustrate its applicability to both rate and +spiking dynamics. Moreover, our findings demonstrate that heterogeneous +networks are highly resilient to severe alterations in their recurrent synaptic +hyperparameters, and even recurrent connections removal does not compromise +performance. The remarkable effectiveness of heterogeneous networks with small +sizes and relaxed connectivity is particularly relevant for the neuromorphic +community, which faces challenges due to device-to-device variability. +Furthermore, understanding the mechanism of robust computation with +heterogeneity also benefits neuroscientists and machine learners. + +
+
+ comment: 29 pages, 15 figures +
+
+
+
+
+ + ☆ Transformers Can Navigate Mazes With Multi-Step Prediction + + +
+ Despite their remarkable success in language modeling, transformers trained +to predict the next token in a sequence struggle with long-term planning. This +limitation is particularly evident in tasks requiring foresight to plan +multiple steps ahead such as maze navigation. The standard next single token +prediction objective, however, offers no explicit mechanism to predict multiple +steps ahead - or revisit the path taken so far. Consequently, in this work we +study whether explicitly predicting multiple steps ahead (and backwards) can +improve transformers' maze navigation. We train parameter-matched transformers +from scratch, under identical settings, to navigate mazes of varying types and +sizes with standard next token prediction and MLM-U, an objective explicitly +predicting multiple steps ahead and backwards. We find that MLM-U considerably +improves transformers' ability to navigate mazes compared to standard next +token prediction across maze types and complexities. We also find MLM-U +training is 4x more sample efficient and converges 2x faster in terms of GPU +training hours relative to next token training. Finally, for more complex mazes +we find MLM-U benefits from scaling to larger transformers. Remarkably, we find +transformers trained with MLM-U outperform larger transformers trained with +next token prediction using additional supervision from A* search traces. We +hope these findings underscore the promise of learning objectives to advance +transformers' capacity for long-term planning. + +
+
+ comment: 20 pages, 15 figures +
+
+
+
+
+ + ☆ Generating Rectifiable Measures through Neural Networks + + +
+ We derive universal approximation results for the class of (countably) +$m$-rectifiable measures. Specifically, we prove that $m$-rectifiable measures +can be approximated as push-forwards of the one-dimensional Lebesgue measure on +$[0,1]$ using ReLU neural networks with arbitrarily small approximation error +in terms of Wasserstein distance. What is more, the weights in the networks +under consideration are quantized and bounded and the number of ReLU neural +networks required to achieve an approximation error of $\varepsilon$ is no +larger than $2^{b(\varepsilon)}$ with +$b(\varepsilon)=\mathcal{O}(\varepsilon^{-m}\log^2(\varepsilon))$. This result +improves Lemma IX.4 in Perekrestenko et al. as it shows that the rate at which +$b(\varepsilon)$ tends to infinity as $\varepsilon$ tends to zero equals the +rectifiability parameter $m$, which can be much smaller than the ambient +dimension. We extend this result to countably $m$-rectifiable measures and show +that this rate still equals the rectifiability parameter $m$ provided that, +among other technical assumptions, the measure decays exponentially on the +individual components of the countably $m$-rectifiable support set. + +
+
+
+
+
+ + ☆ Integrating Semantic Communication and Human Decision-Making into an + End-to-End Sensing-Decision Framework + + +
+ As early as 1949, Weaver defined communication in a very broad sense to +include all procedures by which one mind or technical system can influence +another, thus establishing the idea of semantic communication. With the recent +success of machine learning in expert assistance systems where sensed +information is wirelessly provided to a human to assist task execution, the +need to design effective and efficient communications has become increasingly +apparent. In particular, semantic communication aims to convey the meaning +behind the sensed information relevant for Human Decision-Making (HDM). +Regarding the interplay between semantic communication and HDM, many questions +remain, such as how to model the entire end-to-end sensing-decision-making +process, how to design semantic communication for the HDM and which information +should be provided to the HDM. To address these questions, we propose to +integrate semantic communication and HDM into one probabilistic end-to-end +sensing-decision framework that bridges communications and psychology. In our +interdisciplinary framework, we model the human through a HDM process, allowing +us to explore how feature extraction from semantic communication can best +support human decision-making. In this sense, our study provides new insights +for the design/interaction of semantic communication with models of HDM. Our +initial analysis shows how semantic communication can balance the level of +detail with human cognitive capabilities while demanding less bandwidth, power, +and latency. + +
+
+
+
+
+ + ☆ ReF-LDM: A Latent Diffusion Model for Reference-based Face Image + Restoration NeurIPS 2024 + + +
+ While recent works on blind face image restoration have successfully produced +impressive high-quality (HQ) images with abundant details from low-quality (LQ) +input images, the generated content may not accurately reflect the real +appearance of a person. To address this problem, incorporating well-shot +personal images as additional reference inputs could be a promising strategy. +Inspired by the recent success of the Latent Diffusion Model (LDM), we propose +ReF-LDM, an adaptation of LDM designed to generate HQ face images conditioned +on one LQ image and multiple HQ reference images. Our model integrates an +effective and efficient mechanism, CacheKV, to leverage the reference images +during the generation process. Additionally, we design a timestep-scaled +identity loss, enabling our LDM-based model to focus on learning the +discriminating features of human faces. Lastly, we construct FFHQ-Ref, a +dataset consisting of 20,405 high-quality (HQ) face images with corresponding +reference images, which can serve as both training and evaluation data for +reference-based face restoration models. + +
+
+ comment: NeurIPS 2024, project page + https://chiweihsiao.github.io/refldm.github.io/ +
+
+
+
+
+ + ☆ Mixed Blessing: Class-Wise Embedding guided Instance-Dependent Partial + Label Learning KDD 2025 + + +
+ In partial label learning (PLL), every sample is associated with a candidate +label set comprising the ground-truth label and several noisy labels. The +conventional PLL assumes the noisy labels are randomly generated +(instance-independent), while in practical scenarios, the noisy labels are +always instance-dependent and are highly related to the sample features, +leading to the instance-dependent partial label learning (IDPLL) problem. +Instance-dependent noisy labels are a double-edged sword. On one hand, they may +promote model training as the noisy labels can depict the sample to some +extent. On the other hand, they bring high label ambiguity as the noisy labels +are nearly indistinguishable from the ground-truth label. To leverage the +nuances of IDPLL effectively, for the first time we create class-wise +embeddings for each sample, which allow us to explore the relationship of +instance-dependent noisy labels, i.e., the class-wise embeddings in the +candidate label set should have high similarity, while the class-wise +embeddings between the candidate label set and the non-candidate label set +should have high dissimilarity. Moreover, to reduce the high label ambiguity, +we introduce the concept of class prototypes containing global feature +information to disambiguate the candidate label set. Extensive experimental +comparisons with twelve methods on six benchmark data sets, including four +fine-grained data sets, demonstrate the effectiveness of the proposed method. +The code implementation is publicly available at +https://github.com/Yangfc-ML/CEL. +
+
+ comment: Accepted by KDD 2025 +
+
+
+
+
+ + ☆ Backdooring Outlier Detection Methods: A Novel Attack Approach + + +
+ There have been several efforts in backdoor attacks, but these have primarily +focused on the closed-set performance of classifiers (i.e., classification). +This has left a gap in addressing the threat to classifiers' open-set +performance, referred to as outlier detection in the literature. Reliable +outlier detection is crucial for deploying classifiers in critical real-world +applications such as autonomous driving and medical image analysis. First, we +show that existing backdoor attacks fall short in affecting the open-set +performance of classifiers, as they have been specifically designed to confuse +intra-closed-set decision boundaries. In contrast, an effective backdoor attack +for outlier detection needs to confuse the decision boundary between the closed +and open sets. Motivated by this, in this study, we propose BATOD, a novel +Backdoor Attack targeting the Outlier Detection task. Specifically, we design +two categories of triggers to shift inlier samples to outliers and vice versa. +We evaluate BATOD using various real-world datasets and demonstrate its +superior ability to degrade the open-set performance of classifiers compared to +previous attacks, both before and after applying defenses. + +
+
+
+
+
+ + ☆ Prompt Transfer for Dual-Aspect Cross Domain Cognitive Diagnosis + + +
+ Cognitive Diagnosis (CD) aims to evaluate students' cognitive states based on +their interaction data, enabling downstream applications such as exercise +recommendation and personalized learning guidance. However, existing methods +often struggle with accuracy drops in cross-domain cognitive diagnosis (CDCD), +a practical yet challenging task. While some efforts have explored +exercise-aspect CDCD, such as cross-subject scenarios, they fail to address the +broader dual-aspect nature of CDCD, encompassing both student- and +exercise-aspect variations. This diversity creates significant challenges in +developing a scenario-agnostic framework. To address these gaps, we propose +PromptCD, a simple yet effective framework that leverages soft prompt transfer +for cognitive diagnosis. PromptCD is designed to adapt seamlessly across +diverse CDCD scenarios, introducing PromptCD-S for student-aspect CDCD and +PromptCD-E for exercise-aspect CDCD. Extensive experiments on real-world +datasets demonstrate the robustness and effectiveness of PromptCD, consistently +achieving superior performance across various CDCD scenarios. Our work offers a +unified and generalizable approach to CDCD, advancing both theoretical and +practical understanding in this critical domain. The implementation of our +framework is publicly available at +https://github.com/Publisher-PromptCD/PromptCD. +
+
+
+
+
+ + ☆ Noise Matters: Diffusion Model-based Urban Mobility Generation with + Collaborative Noise Priors + + +
+ With global urbanization, the focus on sustainable cities has largely grown, +driving research into equity, resilience, and urban planning, which often +relies on mobility data. The rise of web-based apps and mobile devices has +provided valuable user data for mobility-related research. However, real-world +mobility data is costly and raises privacy concerns. To protect privacy while +retaining key features of real-world movement, the demand for synthetic data +has steadily increased. Recent advances in diffusion models have shown great +potential for mobility trajectory generation due to their ability to model +randomness and uncertainty. However, existing approaches often directly apply +independent and identically distributed (i.i.d.) noise sampling from image generation +techniques, which fail to account for the spatiotemporal correlations and +social interactions that shape urban mobility patterns. In this paper, we +propose CoDiffMob, a diffusion method for urban mobility generation with +collaborative noise priors. We emphasize the critical role of noise in +diffusion models for generating mobility data. By leveraging both individual +movement characteristics and population-wide dynamics, we construct novel +collaborative noise priors that provide richer and more informative guidance +throughout the generation process. Extensive experiments demonstrate the +superiority of our method, with generated data accurately capturing both +individual preferences and collective patterns, achieving an improvement of +over 32\%. Furthermore, it can effectively replace web-derived mobility data to +better support downstream applications, while safeguarding user privacy and +fostering a more secure and ethical web. This highlights its tremendous +potential for applications in sustainable city-related research. +
+
+
+
+
+ + ☆ Power Plant Detection for Energy Estimation using GIS with Remote + Sensing, CNN & Vision Transformers + + +
+ In this research, we propose a hybrid model for power plant detection to +assist energy estimation applications, by pipelining GIS (Geographical +Information Systems) having Remote Sensing capabilities with CNN (Convolutional +Neural Networks) and ViT (Vision Transformers). Our proposed approach enables +real-time analysis with multiple data types on a common map via the GIS, +entails feature-extraction abilities due to the CNN, and captures long-range +dependencies through the ViT. This hybrid approach is found to enhance +classification, thus helping in the monitoring and operational management of +power plants; hence assisting energy estimation and sustainable energy planning +in the future. It exemplifies adequate deployment of machine learning methods +in conjunction with domain-specific approaches to enhance performance. + +
+
+
+
+
+ + ☆ Frontier Models are Capable of In-context Scheming + + +
+ Frontier models are increasingly trained and deployed as autonomous agents. +One safety concern is that AI agents might covertly pursue misaligned goals, +hiding their true capabilities and objectives - also known as scheming. We +study whether models have the capability to scheme in pursuit of a goal that we +provide in-context and instruct the model to strongly follow. We evaluate +frontier models on a suite of six agentic evaluations where models are +instructed to pursue goals and are placed in environments that incentivize +scheming. Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini +1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. +They recognize scheming as a viable strategy and readily engage in such +behavior. For example, models strategically introduce subtle mistakes into +their responses, attempt to disable their oversight mechanisms, and even +exfiltrate what they believe to be their model weights to external servers. +Additionally, this deceptive behavior proves persistent. When o1 has engaged in +scheming, it maintains its deception in over 85% of follow-up questions and +often remains deceptive in multi-turn interrogations. Analysis of the models' +chains-of-thought reveals that models explicitly reason about these deceptive +strategies, providing evidence that the scheming behavior is not accidental. +Surprisingly, we also find rare instances where models engage in scheming when +only given a goal, without being strongly nudged to pursue it. We observe cases +where Claude 3.5 Sonnet strategically underperforms in evaluations in pursuit +of being helpful, a goal that was acquired during training rather than +in-context. Our findings demonstrate that frontier models now possess +capabilities for basic in-context scheming, making the potential of AI agents +to engage in scheming behavior a concrete rather than theoretical concern. +
+
+
+
+
+ + ☆ Causal discovery with endogenous context variables + + +
+ Causal systems often exhibit variations of the underlying causal mechanisms +between the variables of the system. Often, these changes are driven by +different environments or internal states in which the system operates, and we +refer to context variables as those variables that indicate this change in +causal mechanisms. An example are the causal relations in soil +moisture-temperature interactions and their dependence on soil moisture +regimes: Dry soil triggers a dependence of soil moisture on latent heat, while +environments with wet soil do not feature such a feedback, making it a +context-specific property. Crucially, a regime or context variable such as soil +moisture need not be exogenous and can be influenced by the dynamical system +variables - precipitation can make a dry soil wet - leading to joint systems +with endogenous context variables. In this work we investigate the assumptions +for constraint-based causal discovery of context-specific information in +systems with endogenous context variables. We show that naive approaches such +as learning different regime graphs on masked data, or pooling all data, can +lead to uninformative results. We propose an adaptive constraint-based +discovery algorithm and give a detailed discussion on the connection to +structural causal models, including sufficiency assumptions, which allow to +prove the soundness of our algorithm and to interpret the results causally. +Numerical experiments demonstrate the performance of the proposed method over +alternative baselines, but they also unveil current limitations of our method. + +
+
+
+
+
+ + ☆ Putting the Iterative Training of Decision Trees to the Test on a + Real-World Robotic Task + + +
+ In previous research, we developed methods to train decision trees (DT) as +agents for reinforcement learning tasks, based on deep reinforcement learning +(DRL) networks. The samples from which the DTs are built, use the environment's +state as features and the corresponding action as label. To solve the +nontrivial task of selecting samples, which on one hand reflect the DRL agent's +capabilities of choosing the right action but on the other hand also cover +enough state space to generalize well, we developed an algorithm to iteratively +train DTs. + In this short paper, we apply this algorithm to a real-world implementation +of a robotic task for the first time. Real-world tasks pose additional +challenges compared to simulations, such as noise and delays. The task consists +of a physical pendulum attached to a cart, which moves on a linear track. By +movements to the left and to the right, the pendulum is to be swung in the +upright position and balanced in the unstable equilibrium. Our results +demonstrate the applicability of the algorithm to real-world tasks by +generating a DT whose performance matches the performance of the DRL agent, +while consisting of fewer parameters. This research could be a starting point +for distilling DTs from DRL agents to obtain transparent, lightweight models +for real-world reinforcement learning tasks. + +
+
+ comment: 5 pages, 4 figures +
+
+
+
+
+ + ☆ Gla-AI4BioMed at RRG24: Visual Instruction-tuned Adaptation for + Radiology Report Generation ACL 2024 + + +
+ We introduce a radiology-focused visual language model designed to generate +radiology reports from chest X-rays. Building on previous findings that large +language models (LLMs) can acquire multimodal capabilities when aligned with +pretrained vision encoders, we demonstrate similar potential with chest X-ray +images. This integration enhances the ability of the model to understand and +describe chest X-ray images. Our model combines an image encoder with a +fine-tuned LLM based on the Vicuna-7B architecture, enabling it to generate +different sections of a radiology report with notable accuracy. The training +process involves a two-stage approach: (i) initial alignment of chest X-ray +features with the LLM, followed by (ii) fine-tuning for radiology report +generation. +
+
+ comment: Accepted by BioNLP@ACL 2024 +
+
+
+
+
+ + ☆ Bed-Attached Vibration Sensor System: A Machine Learning Approach for + Fall Detection in Nursing Homes + + +
+ The increasing shortage of nursing staff and the acute risk of falls in +nursing homes pose significant challenges for the healthcare system. This study +presents the development of an automated fall detection system integrated into +care beds, aimed at enhancing patient safety without compromising privacy +through wearables or video monitoring. Mechanical vibrations transmitted +through the bed frame are processed using a short-time Fourier transform, +enabling robust classification of distinct human fall patterns with a +convolutional neural network. Challenges pertaining to the quantity and +diversity of the data are addressed, proposing the generation of additional +data with a specific emphasis on enhancing variation. While the model shows +promising results in distinguishing fall events from noise using lab data, +further testing in real-world environments is recommended for validation and +improvement. Despite limited available data, the proposed system shows the +potential for an accurate and rapid response to falls, mitigating health +implications, and addressing the needs of an aging population. This case study +was performed as part of the ZIM Project. Further research on sensors enhanced +by artificial intelligence will be continued in the ShapeFuture Project. + +
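+
+ A rough sketch of the described pipeline: a short-time Fourier transform of a bed-frame vibration signal feeding a small CNN classifier (fall vs. no fall). The sampling rate, window length, and network layout are placeholder assumptions:
+
+import numpy as np
+import torch
+import torch.nn as nn
+from scipy.signal import stft
+
+fs = 1000                                    # assumed sampling rate in Hz
+signal = np.random.randn(5 * fs)             # 5 s of (dummy) vibration data
+_, _, Z = stft(signal, fs=fs, nperseg=256)   # time-frequency representation
+spec = np.log1p(np.abs(Z)).astype(np.float32)
+
+cnn = nn.Sequential(
+    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
+    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
+    nn.Flatten(), nn.Linear(16, 2),          # two classes: fall / no fall
+)
+logits = cnn(torch.from_numpy(spec)[None, None])
+print(logits.shape)                          # torch.Size([1, 2])
+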
+
+
+
+
+ + ☆ Probing the contents of semantic representations from text, behavior, + and brain data using the psychNorms metabase + + +
+ Semantic representations are integral to natural language processing, +psycholinguistics, and artificial intelligence. Although often derived from +internet text, recent years have seen a rise in the popularity of +behavior-based (e.g., free associations) and brain-based (e.g., fMRI) +representations, which promise improvements in our ability to measure and model +human representations. We carry out the first systematic evaluation of the +similarities and differences between semantic representations derived from +text, behavior, and brain data. Using representational similarity analysis, we +show that word vectors derived from behavior and brain data encode information +that differs from their text-derived cousins. Furthermore, drawing on our +psychNorms metabase, alongside an interpretability method that we call +representational content analysis, we find that, in particular, behavior +representations capture unique variance on certain affective, agentic, and +socio-moral dimensions. We thus establish behavior as an important complement +to text for capturing human representations and behavior. These results are +broadly relevant to research aimed at learning human-aligned semantic +representations, including work on evaluating and aligning large language +models. + +
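+
+ A minimal representational similarity analysis (RSA) sketch in the spirit of the comparison described above: correlate the pairwise-similarity matrices of word vectors from two sources (the random vectors and the Spearman choice are illustrative assumptions):
+
+import numpy as np
+from scipy.stats import spearmanr
+
+def rsa(vectors_a: np.ndarray, vectors_b: np.ndarray) -> float:
+    # Each input: (num_words, dim) embeddings for the same word list.
+    sim_a = np.corrcoef(vectors_a)           # word-by-word similarity matrices
+    sim_b = np.corrcoef(vectors_b)
+    iu = np.triu_indices_from(sim_a, k=1)    # compare upper triangles only
+    rho, _ = spearmanr(sim_a[iu], sim_b[iu])
+    return float(rho)
+
+rng = np.random.default_rng(0)
+text_vecs = rng.normal(size=(100, 300))      # e.g. text-derived embeddings
+behavior_vecs = rng.normal(size=(100, 50))   # e.g. free-association embeddings
+print(rsa(text_vecs, behavior_vecs))
+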
+
+ comment: 13 pages, 5 figures, 2 tables +
+
+
+
+
+ + ☆ Video Decomposition Prior: A Methodology to Decompose Videos into Layers ICLR + + +
+ In the evolving landscape of video enhancement and editing methodologies, a +majority of deep learning techniques often rely on extensive datasets of +observed input and ground truth sequence pairs for optimal performance. Such +reliance often falters when acquiring data becomes challenging, especially in +tasks like video dehazing and relighting, where replicating identical motions +and camera angles in both corrupted and ground truth sequences is complicated. +Moreover, these conventional methodologies perform best when the test +distribution closely mirrors the training distribution. Recognizing these +challenges, this paper introduces a novel video decomposition prior +`\texttt{VDP}' framework which derives inspiration from professional video +editing practices. Our methodology does not mandate task-specific external data +corpus collection; instead, it pivots to utilizing the motion and appearance of the +input video. The \texttt{VDP} framework decomposes a video sequence into a set of +multiple RGB layers and associated opacity levels. These layers are then +manipulated individually to obtain the desired results. We address tasks such +as video object segmentation, dehazing, and relighting. Moreover, we introduce +a novel logarithmic video decomposition formulation for video relighting tasks, +setting a new benchmark over the existing methodologies. We observe the +property of relighting emerge as we optimize for our novel relighting +decomposition formulation. We evaluate our approach on standard video datasets +like DAVIS, REVIDE, \& SDSD and show qualitative results on a diverse array of +internet videos. Project Page - +https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video +results. +
+
+ comment: Project Page - + https://www.cs.umd.edu/~gauravsh/video_decomposition/index.html for video + results. Extended version of ICLR publication +
+
+
+
+
+ + ☆ Continuous Video Process: Modeling Videos as Continuous + Multi-Dimensional Processes for Video Prediction CVPR + + +
+ Diffusion models have made significant strides in image generation, mastering +tasks such as unconditional image synthesis, text-image translation, and +image-to-image conversions. However, their capability falls short in the realm +of video prediction, mainly because they treat videos as a collection of +independent images, relying on external constraints such as temporal attention +mechanisms to enforce temporal coherence. In our paper, we introduce a novel +model class that treats video as a continuous multi-dimensional process rather +than a series of discrete frames. We also report a reduction of 75\% in the sampling +steps required to sample a new frame, thus making our framework more efficient +during inference. Through extensive experimentation, we establish +state-of-the-art performance in video prediction, validated on benchmark +datasets including KTH, BAIR, Human3.6M, and UCF101. Navigate to the project +page https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results. +
+
+ comment: Navigate to the project page + https://www.cs.umd.edu/~gauravsh/cvp/supp/website.html for video results. + Extended version of published CVPR paper +
+
+
+
+
+ + ☆ Achieving Group Fairness through Independence in Predictive Process + Monitoring + + +
+ Predictive process monitoring focuses on forecasting future states of ongoing +process executions, such as predicting the outcome of a particular case. In +recent years, the application of machine learning models in this domain has +garnered significant scientific attention. When using historical execution +data, which may contain biases or exhibit unfair behavior, these biases may be +encoded into the trained models. Consequently, when such models are deployed to +make decisions or guide interventions for new cases, they risk perpetuating +this unwanted behavior. This work addresses group fairness in predictive +process monitoring by investigating independence, i.e., ensuring predictions are +unaffected by sensitive group membership. We explore independence through +metrics for demographic parity such as $\Delta$DP, as well as recently +introduced, threshold-independent distribution-based alternatives. +Additionally, we propose composite loss functions consisting of binary +cross-entropy and a distribution-based (Wasserstein) loss to train models that +balance predictive performance and fairness, and allow for customizable +trade-offs. The effectiveness of both the fairness metrics and the composite +loss functions is validated through a controlled experimental setup. +
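+
+ A toy sketch of a composite training loss in the spirit described above: binary cross-entropy plus a 1D Wasserstein-style penalty between the score distributions of two sensitive groups, with a tunable trade-off weight (the paper's exact formulation may differ; all names below are assumptions):
+
+import torch
+import torch.nn.functional as F
+
+def wasserstein_1d(a, b):
+    # Empirical W1 between two 1D score samples via sorted quantile matching.
+    n = min(a.numel(), b.numel())
+    qa = torch.sort(a)[0][torch.linspace(0, a.numel() - 1, n).long()]
+    qb = torch.sort(b)[0][torch.linspace(0, b.numel() - 1, n).long()]
+    return (qa - qb).abs().mean()
+
+def composite_loss(scores, labels, group, fairness_weight=1.0):
+    bce = F.binary_cross_entropy_with_logits(scores, labels.float())
+    probs = torch.sigmoid(scores)
+    fair = wasserstein_1d(probs[group == 0], probs[group == 1])
+    return bce + fairness_weight * fair
+
+scores = torch.randn(32, requires_grad=True)
+labels = torch.randint(0, 2, (32,))
+group = torch.randint(0, 2, (32,))
+loss = composite_loss(scores, labels, group)
+loss.backward()                               # gradients flow to the scores
+print(loss.item())
+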
+
+ comment: Preprint +
+
+
+
+
+ + ☆ Learning High-Degree Parities: The Crucial Role of the Initialization + + +
+ Parities have become a standard benchmark for evaluating learning algorithms. +Recent works show that regular neural networks trained by gradient descent can +efficiently learn degree $k$ parities on uniform inputs for constant $k$, but +fail to do so when $k$ and $d-k$ grow with $d$ (here $d$ is the ambient +dimension). However, the case where $k=d-O_d(1)$ (almost-full parities), +including the degree $d$ parity (the full parity), has remained unsettled. This +paper shows that for gradient descent on regular neural networks, learnability +depends on the initial weight distribution. On one hand, the discrete +Rademacher initialization enables efficient learning of almost-full parities, +while on the other hand, its Gaussian perturbation with large enough constant +standard deviation $\sigma$ prevents it. The positive result for almost-full +parities is shown to hold up to $\sigma=O(d^{-1})$, pointing to questions about +a sharper threshold phenomenon. Unlike statistical query (SQ) learning, where a +singleton function class like the full parity is trivially learnable, our +negative result applies to a fixed function and relies on an initial gradient +alignment measure of potential broader relevance to neural networks learning. + +
+
+
+
+
+ + ☆ DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling + + +
+ Large language models (LLMs) have made dialogue one of the central modes of +human-machine interaction, leading to the accumulation of vast amounts of +conversation logs and increasing demand for dialogue generation. A +conversational life-cycle spans from the Prelude through the Interlocution to +the Epilogue, encompassing various elements. Despite the existence of numerous +dialogue-related studies, there is a lack of benchmarks that encompass +comprehensive dialogue elements, hindering precise modeling and systematic +evaluation. To bridge this gap, we introduce an innovative research task +$\textbf{D}$ialogue $\textbf{E}$lement $\textbf{MO}$deling, including +$\textit{Element Awareness}$ and $\textit{Dialogue Agent Interaction}$, and +propose a novel benchmark, $\textbf{DEMO}$, designed for a comprehensive +dialogue modeling and assessment. Inspired by imitation learning, we further +build the agent which possesses the adept ability to model dialogue elements +based on the DEMO benchmark. Extensive experiments indicate that existing LLMs +still exhibit considerable potential for enhancement, and our DEMO agent has +superior performance in both in-domain and out-of-domain tasks. + +
+
+ comment: We release the code and data at https://github.com/MozerWang/DEMO +
+
+
+
+
+ + ☆ EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation + + +
+ Multimodal large language models (MLLMs) have achieved remarkable progress on +various visual question answering and reasoning tasks leveraging instruction +fine-tuning on specific datasets. They can also learn from preference data +annotated by humans to enhance their reasoning ability and mitigate +hallucinations. Most preference data is generated by the model itself. +However, existing methods require high-quality critical labels, which are +costly and rely on human or proprietary models like GPT-4V. In this work, we +propose Enhancing Alignment in MLLMs via Critical Observation (EACO), which +aligns MLLMs by self-generated preference data using only 5k images +economically. Our approach begins with collecting and refining a Scoring +Evaluation Instruction-tuning dataset to train a critical evaluation model, +termed the Critic. This Critic observes model responses across multiple +dimensions, selecting preferred and non-preferred outputs for refined Direct +Preference Optimization (DPO) tuning. To further enhance model performance, we +employ an additional supervised fine-tuning stage after preference tuning. EACO +reduces the overall hallucinations by 65.6% on HallusionBench and improves the +reasoning ability by 21.8% on MME-Cognition. EACO achieves an 8.5% improvement +over LLaVA-v1.6-Mistral-7B across multiple benchmarks. Remarkably, EACO also +shows the potential of this critical ability in open-source MLLMs, demonstrating that +EACO is a viable path to boost the competence of MLLMs. +
+
+ comment: 19 pages +
+
+
+
+
+ + ☆ Mitigating Instance-Dependent Label Noise: Integrating Self-Supervised + Pretraining with Pseudo-Label Refinement + + +
+ Deep learning models rely heavily on large volumes of labeled data to achieve +high performance. However, real-world datasets often contain noisy labels due +to human error, ambiguity, or resource constraints during the annotation +process. Instance-dependent label noise (IDN), where the probability of a label +being corrupted depends on the input features, poses a significant challenge +because it is more prevalent and harder to address than instance-independent +noise. In this paper, we propose a novel hybrid framework that combines +self-supervised learning using SimCLR with iterative pseudo-label refinement to +mitigate the effects of IDN. The self-supervised pre-training phase enables the +model to learn robust feature representations without relying on potentially +noisy labels, establishing a noise-agnostic foundation. Subsequently, we employ +an iterative training process with pseudo-label refinement, where confidently +predicted samples are identified through a multistage approach and their labels +are updated to improve label quality progressively. We evaluate our method on +the CIFAR-10 and CIFAR-100 datasets augmented with synthetic instance-dependent +noise at varying noise levels. Experimental results demonstrate that our +approach significantly outperforms several state-of-the-art methods, +particularly under high noise conditions, achieving notable improvements in +classification accuracy and robustness. Our findings suggest that integrating +self-supervised learning with iterative pseudo-label refinement offers an +effective strategy for training deep neural networks on noisy datasets +afflicted by instance-dependent label noise. + +
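+
+ A small sketch of one pseudo-label refinement round as described above: samples predicted with high confidence by the current model have their (possibly noisy) labels replaced (the threshold and the fake predictions are illustrative assumptions):
+
+import numpy as np
+
+def refine_labels(probs, labels, threshold=0.95):
+    # probs: (num_samples, num_classes) softmax outputs of the current model.
+    confidence = probs.max(axis=1)
+    predictions = probs.argmax(axis=1)
+    confident = confidence >= threshold
+    refined = labels.copy()
+    refined[confident] = predictions[confident]   # overwrite suspect labels
+    return refined, confident
+
+rng = np.random.default_rng(0)
+probs = rng.dirichlet(np.ones(10) * 0.1, size=1000)   # peaked fake predictions
+noisy_labels = rng.integers(0, 10, size=1000)
+new_labels, mask = refine_labels(probs, noisy_labels)
+print(int(mask.sum()), "labels refined this round")
+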
+
+
+
+
+ + ☆ AI-Driven Non-Invasive Detection and Staging of Steatosis in Fatty Liver + Disease Using a Novel Cascade Model and Information Fusion Techniques + + +
+ Non-alcoholic fatty liver disease (NAFLD) is one of the most widespread liver +disorders on a global scale, posing a significant threat of progressing to more +severe conditions like nonalcoholic steatohepatitis (NASH), liver fibrosis, +cirrhosis, and hepatocellular carcinoma. Diagnosing and staging NAFLD presents +challenges due to its non-specific symptoms and the invasive nature of liver +biopsies. Our research introduces a novel artificial intelligence cascade model +employing ensemble learning and feature fusion techniques. We developed a +non-invasive, robust, and reliable diagnostic artificial intelligence tool that +utilizes anthropometric and laboratory parameters, facilitating early detection +and intervention in NAFLD progression. Our novel artificial intelligence model +achieved an 86% accuracy rate for the NASH steatosis staging task (non-NASH, +steatosis grade 1, steatosis grade 2, and steatosis grade 3) and an impressive +96% AUC-ROC for distinguishing between NASH (steatosis grade 1, grade 2, and +grade 3) and non-NASH cases, outperforming current state-of-the-art models. This +notable improvement in diagnostic performance underscores the potential +application of artificial intelligence in the early diagnosis and treatment of +NAFLD, leading to better patient outcomes and a reduced healthcare burden +associated with advanced liver disease. +
+
+
+
+
+ + ☆ Nonmyopic Global Optimisation via Approximate Dynamic Programming + + +
+ Unconstrained global optimisation aims to optimise expensive-to-evaluate +black-box functions without gradient information. Bayesian optimisation, one of +the most well-known techniques, typically employs Gaussian processes as +surrogate models, leveraging their probabilistic nature to balance exploration +and exploitation. However, Gaussian processes become computationally +prohibitive in high-dimensional spaces. Recent alternatives, based on inverse +distance weighting (IDW) and radial basis functions (RBFs), offer competitive, +computationally lighter solutions. Despite their efficiency, both traditional +global and Bayesian optimisation strategies suffer from the myopic nature of +their acquisition functions, which focus solely on immediate improvement +neglecting future implications of the sequential decision making process. +Nonmyopic acquisition functions devised for the Bayesian setting have shown +promise in improving long-term performance. Yet, their use in deterministic +strategies with IDW and RBF remains unexplored. In this work, we introduce +novel nonmyopic acquisition strategies tailored to IDW- and RBF-based global +optimisation. Specifically, we develop dynamic programming-based paradigms, +including rollout and multi-step scenario-based optimisation schemes, to enable +lookahead acquisition. These methods optimise a sequence of query points over a +horizon (instead of only at the next step) by predicting the evolution of the +surrogate model, inherently managing the exploration-exploitation trade-off in +a systematic way via optimisation techniques. The proposed approach represents +a significant advance in extending nonmyopic acquisition principles, previously +confined to Bayesian optimisation, to the deterministic framework. Empirical +results on synthetic and hyperparameter tuning benchmark problems demonstrate +that these nonmyopic methods outperform conventional myopic approaches. + +
+
+ comment: 31 pages, 4 figures, 2 tables, submitted to Springer Computational + Optimization and Applications +
+
+
+
+
+ + ☆ MSECG: Incorporating Mamba for Robust and Efficient ECG Super-Resolution + + +
+ Electrocardiogram (ECG) signals play a crucial role in diagnosing +cardiovascular diseases. To reduce power consumption in wearable or portable +devices used for long-term ECG monitoring, super-resolution (SR) techniques +have been developed, enabling these devices to collect and transmit signals at +a lower sampling rate. In this study, we propose MSECG, a compact neural +network model designed for ECG SR. MSECG combines the strength of the recurrent +Mamba model with convolutional layers to capture both local and global +dependencies in ECG waveforms, allowing for the effective reconstruction of +high-resolution signals. We also assess the model's performance in real-world +noisy conditions by utilizing ECG data from the PTB-XL database and noise data +from the MIT-BIH Noise Stress Test Database. Experimental results show that +MSECG outperforms two contemporary ECG SR models under both clean and noisy +conditions while using fewer parameters, offering a more powerful and robust +solution for long-term ECG monitoring applications. + +
+
+ comment: 5 pages, 3 figures +
+
+
+
+
+ + ☆ MTSpark: Enabling Multi-Task Learning with Spiking Neural Networks for + Generalist Agents + + +
+ Currently, state-of-the-art RL methods excel in single-task settings, but +they still struggle to generalize across multiple tasks due to catastrophic +forgetting challenges, where previously learned tasks are forgotten as new +tasks are introduced. This multi-task learning capability is significantly +important for generalist agents, where adaptation features are highly required +(e.g., autonomous robots). On the other hand, Spiking Neural Networks (SNNs) +have emerged as alternative energy-efficient neural network algorithms due to +their sparse spike-based operations. Toward this, we propose MTSpark, a novel +methodology to enable multi-task RL using spiking networks. Specifically, +MTSpark develops a Deep Spiking Q-Network (DSQN) with active dendrites and +dueling structure by leveraging task-specific context signals. In this design, +each neuron computes task-dependent activations that dynamically modulate +inputs, forming specialized sub-networks for each task. Moreover, this +bioplausible network model also benefits from SNNs, enhancing energy efficiency +and making the model suitable for hardware implementation. Experimental results +show that our MTSpark effectively learns multiple tasks with higher +performance compared to the state-of-the-art. Specifically, MTSpark +successfully achieves high scores in three Atari games (i.e., Pong: -5.4, +Breakout: 0.6, and Enduro: 371.2), reaching human-level performance (i.e., +Pong: -3, Breakout: 31, and Enduro: 368), which state-of-the-art methods struggle to +achieve. In addition, our MTSpark also shows better accuracy in image +classification tasks than the state-of-the-art. These results highlight the +potential of our MTSpark methodology to develop generalist agents that can +learn multiple tasks by leveraging both RL and SNN concepts. +
+
+ comment: 9 pages, 10 figures, 5 tables +
+
+
+
+
+ + ☆ eXpath: Explaining Knowledge Graph Link Prediction with Ontological + Closed Path Rules VLDB + + +
+ Link prediction (LP) is crucial for Knowledge Graphs (KG) completion but +commonly suffers from interpretability issues. While several methods have been +proposed to explain embedding-based LP models, they are generally limited to +local explanations on KG and are deficient in providing human interpretable +semantics. Based on real-world observations of the characteristics of KGs from +multiple domains, we propose to explain LP models in KG with path-based +explanations. An integrated framework, namely eXpath, is introduced which +incorporates the concept of relation path with ontological closed path rules to +enhance both the efficiency and effectiveness of LP interpretation. Notably, +the eXpath explanations can be fused with other single-link explanation +approaches to achieve a better overall solution. Extensive experiments across +benchmark datasets and LP models demonstrate that introducing eXpath can boost +the quality of resulting explanations by about 20% on two key metrics and +reduce the required explanation time by 61.4%, in comparison to the best +existing method. Case studies further highlight eXpath's ability to provide +more semantically meaningful explanations through path-based evidence. + +
+
+ comment: 13 pages, 5 figures. Submitted to PVLDB volume 18 on 2024-12-01 +
+
+
+
+
+ + ☆ Using Machine Learning to Discover Parsimonious and + Physically-Interpretable Representations of Catchment-Scale Rainfall-Runoff + Dynamics + + +
+ Despite the excellent real-world predictive performance of modern machine +learning (ML) methods, many scientists remain hesitant to discard traditional +physical-conceptual (PC) approaches due mainly to their relative +interpretability, which contributes to credibility during decision-making. In +this context, a currently underexplored aspect of ML is how to develop +minimally-optimal representations that can facilitate better insight regarding +system functioning. Regardless of how this is achieved, it is arguably true +that parsimonious representations better support the advancement of scientific +understanding. Our own view is that ML-based modeling of geoscientific systems +should be based in the use of computational units that are fundamentally +interpretable by design. + This paper continues our exploration of how the strengths of ML can be +exploited in the service of better understanding via scientific investigation. +Here, we use the Mass Conserving Perceptron (MCP) as the fundamental +computational unit in a generic network architecture consisting of nodes +arranged in series and parallel to explore several generic and important issues +related to the use of observational data for constructing input-state-output +models of dynamical systems. In the context of lumped catchment modeling, we +show that physical interpretability and excellent predictive performance can +both be achieved using a relatively parsimonious distributed-state +multiple-flow-path network with context-dependent gating and information +sharing across the nodes, suggesting that MCP-based modeling can play a +significant role in application of ML to geoscientific investigation. + +
+
+ comment: 73 pages, 4 tables, 13 figures; 11 tables and 11 figures in + supplementary materials +
+
+
+
+
+ + ☆ Maximizing Alignment with Minimal Feedback: Efficiently Learning Rewards + for Visuomotor Robot Policy Alignment + + +
+ Visuomotor robot policies, increasingly pre-trained on large-scale datasets, +promise significant advancements across robotics domains. However, aligning +these policies with end-user preferences remains a challenge, particularly when +the preferences are hard to specify. While reinforcement learning from human +feedback (RLHF) has become the predominant mechanism for alignment in +non-embodied domains like large language models, it has not seen the same +success in aligning visuomotor policies due to the prohibitive amount of human +feedback required to learn visual reward functions. To address this limitation, +we propose Representation-Aligned Preference-based Learning (RAPL), an +observation-only method for learning visual rewards from significantly less +human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback +on fine-tuning pre-trained vision encoders to align with the end-user's visual +representation and then constructs a dense visual reward via feature matching +in this aligned representation space. We first validate RAPL through simulation +experiments in the X-Magical benchmark and Franka Panda robotic manipulation, +demonstrating that it can learn rewards aligned with human preferences, more +efficiently uses preference data, and generalizes across robot embodiments. +Finally, our hardware experiments align pre-trained Diffusion Policies for +three object manipulation tasks. We find that RAPL can fine-tune these policies +with 5x less real human preference data, taking the first step towards +minimizing human feedback while maximizing visuomotor robot policy alignment. + +
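+
+ A minimal sketch of a feature-matching visual reward as described above: the reward is the negative distance between encoder features of the current observation and a goal observation in an aligned representation space (the encoder below is a stand-in, not the fine-tuned encoder used in the paper):
+
+import torch
+import torch.nn as nn
+
+encoder = nn.Sequential(                        # placeholder vision encoder
+    nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
+    nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
+    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
+)
+
+def visual_reward(obs, goal):
+    with torch.no_grad():
+        phi_obs, phi_goal = encoder(obs), encoder(goal)
+    return -torch.norm(phi_obs - phi_goal, dim=-1)   # denser as obs nears goal
+
+obs = torch.randn(1, 3, 64, 64)
+goal = torch.randn(1, 3, 64, 64)
+print(visual_reward(obs, goal))
+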
+
+ comment: Submitted to IJRR, this paper is an extended journal version of the + conference paper arXiv:2310.07932 with new results and discussion. arXiv + admin note: substantial text overlap with arXiv:2310.07932 +
+
+
+
+
+ + ☆ Wavelet Diffusion Neural Operator + + +
+ Simulating and controlling physical systems described by partial differential +equations (PDEs) are crucial tasks across science and engineering. Recently, +diffusion generative models have emerged as a competitive class of methods for +these tasks due to their ability to capture long-term dependencies and model +high-dimensional states. However, diffusion models typically struggle with +handling system states with abrupt changes and generalizing to higher +resolutions. In this work, we propose Wavelet Diffusion Neural Operator (WDNO), +a novel PDE simulation and control framework that enhances the handling of +these complexities. WDNO comprises two key innovations. Firstly, WDNO performs +diffusion-based generative modeling in the wavelet domain for the entire +trajectory to handle abrupt changes and long-term dependencies effectively. +Secondly, to address the issue of poor generalization across different +resolutions, which is one of the fundamental tasks in modeling physical +systems, we introduce multi-resolution training. We validate WDNO on five +physical systems, including 1D advection equation, three challenging physical +systems with abrupt changes (1D Burgers' equation, 1D compressible +Navier-Stokes equation and 2D incompressible fluid), and a real-world dataset +ERA5, which demonstrates superior performance on both simulation and control +tasks over state-of-the-art methods, with significant improvements in long-term +and detail prediction accuracy. Remarkably, in the challenging context of the +2D high-dimensional and indirect control task aimed at reducing smoke leakage, +WDNO reduces the leakage by 33.2% compared to the second-best baseline. + +
+
+
+
+
+ + ☆ WRF-GS: Wireless Radiation Field Reconstruction with 3D Gaussian + Splatting + + +
+ Wireless channel modeling plays a pivotal role in designing, analyzing, and +optimizing wireless communication systems. Nevertheless, developing an +effective channel modeling approach has been a longstanding challenge. This +issue has been escalated due to the denser network deployment, larger antenna +arrays, and wider bandwidth in 5G and beyond networks. To address this +challenge, we put forth WRF-GS, a novel framework for channel modeling based on +wireless radiation field (WRF) reconstruction using 3D Gaussian splatting. +WRF-GS employs 3D Gaussian primitives and neural networks to capture the +interactions between the environment and radio signals, enabling efficient WRF +reconstruction and visualization of the propagation characteristics. The +reconstructed WRF can then be used to synthesize the spatial spectrum for +comprehensive wireless channel characterization. Notably, with a small number +of measurements, WRF-GS can synthesize new spatial spectra within milliseconds +for a given scene, thereby enabling latency-sensitive applications. +Experimental results demonstrate that WRF-GS outperforms existing methods for +spatial spectrum synthesis, such as ray tracing and other deep-learning +approaches. Moreover, WRF-GS achieves superior performance in the channel state +information prediction task, surpassing existing methods by a significant +margin of more than 2.43 dB. + +
+
+ comment: accepted to the IEEE International Conference on Computer + Communications (INFOCOM 2025) +
+
+
+
+
+ + ☆ CCS: Continuous Learning for Customized Incremental Wireless Sensing + Services + + +
+ Wireless sensing has made significant progress in tasks such as action
+recognition, vital sign estimation, and pose estimation. After over a decade
+of work, wireless sensing currently stands at a tipping point, transitioning
+from proof-of-concept systems to large-scale deployment. We envision a future
+service scenario where wireless sensing service providers distribute sensing
+models to users. During usage, users might request new sensing capabilities.
+For example, if someone is away from home on a business trip or vacation for an
+extended period, they may want a new sensing capability that can detect falls
+in elderly parents or grandparents and promptly alert them. In this paper, we
+propose CCS (continuous customized service), enabling model updates on users'
+local computing resources without data transmission to the service providers.
+To address catastrophic forgetting in model updates, where updating model
+parameters to implement new capabilities leads to the loss of existing ones, we
+design knowledge distillation and weight alignment modules. These modules
+enable the sensing model to acquire new capabilities while retaining the
+existing ones. We conducted extensive experiments on the large-scale XRF55
+dataset across Wi-Fi, millimeter-wave radar, and RFID modalities to simulate
+scenarios where four users sequentially introduced new customized demands. The
+results affirm that CCS excels in continuous model services across all the
+above wireless modalities, significantly outperforming existing approaches like
+OneFi.
+
+&#13;
+
+ comment: 9 pages,8 figures +
+
+
+
+
+ + ☆ Rethinking Time Series Forecasting with LLMs via Nearest Neighbor + Contrastive Learning + + +
+ Adapting Large Language Models (LLMs) that are extensively trained on abundant
+text data, and customizing the input prompt to enable time series forecasting,
+has received considerable attention. While recent work has shown great
+potential for adapting the learned prior of LLMs, formulating the prompt to
+finetune LLMs remains challenging, as the prompt should be aligned with the
+time series data. Additionally, current approaches do not effectively leverage
+word token embeddings, which embody the rich representation space learned by
+LLMs. This emphasizes the need for a robust approach to formulating the prompt
+that utilizes the word token embeddings while effectively representing the
+characteristics of the time series. To address these challenges, we propose
+NNCL-TLLM: Nearest Neighbor Contrastive Learning for Time series forecasting
+via LLMs. First, we generate time series compatible text prototypes such that
+each text prototype represents both word token embeddings in its neighborhood
+and time series characteristics via end-to-end finetuning. Next, we draw
+inspiration from Nearest Neighbor Contrastive Learning to formulate the prompt
+while obtaining the top-$k$ nearest neighbor time series compatible text
+prototypes. We then fine-tune the layer normalization and positional embeddings
+of the LLM, keeping the other layers intact, reducing the trainable parameters
+and decreasing the computational cost. Our comprehensive experiments
+demonstrate that NNCL-TLLM outperforms state-of-the-art methods in few-shot
+forecasting while achieving competitive or superior performance in long-term
+and short-term forecasting tasks.
+
+&#13;
+
+
+
+
+ + ☆ Direct Quantized Training of Language Models with Stochastic Rounding + + +
+ Although recent quantized Large Language Models (LLMs), such as BitNet, have +paved the way for significant reduction in memory usage during deployment with +binary or ternary weights, training these models still demands substantial +memory footprints. This is partly because high-precision (i.e., unquantized) +weight matrices required for straight-through estimation must be maintained +throughout the whole training process. To address this, we explore the +potential of directly updating the quantized low-precision weight matrices +without relying on the straight-through estimator during backpropagation, +thereby saving memory usage during training. Specifically, we employ a +stochastic rounding technique to minimize information loss caused by the use of +low-bit weights throughout training. Experimental results on our +LLaMA-structured models indicate that (1) training with only low-precision +weights is feasible even when they are constrained to ternary values, (2) +extending the bit width to 8 bits results in only a 5% loss degradation +compared to BitNet b1.58 while offering the potential for reduced memory usage +during training, and (3) our models can also perform inference using ternary +weights, showcasing their flexibility in deployment. + +
+
+ comment: work in progress +
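+
+ A minimal NumPy sketch of the unbiased stochastic rounding step described
+above; the mean-absolute-value scaling is an illustrative assumption rather
+than the paper's exact quantization scheme.
+
+    import numpy as np
+
+    def stochastic_round_ternary(w, rng):
+        # Scale weights into [-1, 1], then round each entry down or up to the
+        # nearest ternary level in {-1, 0, 1} with probability equal to its
+        # fractional part, so the rounding is unbiased in expectation.
+        scale = np.abs(w).mean() + 1e-8          # illustrative scaling choice
+        x = np.clip(w / scale, -1.0, 1.0)
+        lower = np.floor(x)
+        rounded = lower + (rng.random(x.shape) < (x - lower))
+        return rounded.astype(np.int8), scale
+
+    rng = np.random.default_rng(0)
+    q, s = stochastic_round_ternary(0.02 * rng.normal(size=(4, 4)), rng)
+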
+
+
+
+
+ + ☆ Slicing Vision Transformer for Flexible Inference NeurIPS 2024 + + +
+ Vision Transformers (ViTs) are known for their scalability. In this work, we
+aim to scale down a ViT to fit an environment with dynamically changing
+resource constraints. We observe that smaller ViTs are intrinsically
+sub-networks of a larger ViT at different widths. Thus, we propose a general
+framework, named Scala, that enables a single network to represent multiple
+smaller ViTs with flexible inference capability, which aligns with the inherent
+design of ViTs to vary in width. Concretely, Scala activates several subnets
+during training, introduces Isolated Activation to disentangle the smallest
+sub-network from other subnets, and leverages Scale Coordination to ensure each
+sub-network receives simplified, steady, and accurate learning objectives.
+Comprehensive empirical validations on different tasks demonstrate that with
+only one-shot training, Scala learns slimmable representations without
+modifying the original ViT structure and matches the performance of Separate
+Training. Compared with the prior art, Scala achieves an average improvement of
+1.6% on ImageNet-1K with fewer parameters.
+
+&#13;
+
+ comment: Accepted by NeurIPS 2024 +
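+
+ A toy PyTorch sketch of the width-slicing idea, where the leading rows and
+columns of one weight matrix serve as the corresponding layer of a narrower
+sub-network; Isolated Activation and Scale Coordination are not shown, and the
+class below is an illustration rather than Scala's actual implementation.
+
+    import torch
+    import torch.nn as nn
+    import torch.nn.functional as F
+
+    class SlicedLinear(nn.Linear):
+        # One weight tensor shared by several widths: at ratio r the layer uses
+        # only the first r-fraction of its output features, and as many input
+        # features as the (possibly narrowed) activation provides.
+        def forward(self, x, ratio=1.0):
+            out_f = max(1, int(self.out_features * ratio))
+            w = self.weight[:out_f, :x.shape[-1]]
+            b = self.bias[:out_f] if self.bias is not None else None
+            return F.linear(x, w, b)
+
+    layer = SlicedLinear(64, 128)
+    full = layer(torch.randn(2, 64))               # shape (2, 128)
+    half = layer(torch.randn(2, 32), ratio=0.5)    # narrow subnet: shape (2, 64)
+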
+
+
+
+
+ + ☆ Differentially Private Random Feature Model + + +
+ Designing privacy-preserving machine learning algorithms has received great +attention in recent years, especially in the setting when the data contains +sensitive information. Differential privacy (DP) is a widely used mechanism for +data analysis with privacy guarantees. In this paper, we produce a +differentially private random feature model. Random features, which were +proposed to approximate large-scale kernel machines, have been used to study +privacy-preserving kernel machines as well. We consider the over-parametrized +regime (more features than samples) where the non-private random feature model +is learned via solving the min-norm interpolation problem, and then we apply +output perturbation techniques to produce a private model. We show that our +method preserves privacy and derive a generalization error bound for the +method. To the best of our knowledge, we are the first to consider +privacy-preserving random feature models in the over-parametrized regime and +provide theoretical guarantees. We empirically compare our method with other +privacy-preserving learning methods in the literature as well. Our results show +that our approach is superior to the other methods in terms of generalization +performance on synthetic data and benchmark data sets. Additionally, it was +recently observed that DP mechanisms may exhibit and exacerbate disparate +impact, which means that the outcomes of DP learning algorithms vary +significantly among different groups. We show that both theoretically and +empirically, random features have the potential to reduce disparate impact, and +hence achieve better fairness. + +
+
+ comment: Submitted to an IEEE journal +
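+
+ A small NumPy sketch of the pipeline the abstract describes: random Fourier
+features in the over-parameterized regime, a min-norm interpolating fit, and
+output perturbation. The noise scale below is a placeholder and carries no
+privacy guarantee; in the paper it is calibrated to the solution's sensitivity
+for a target privacy level.
+
+    import numpy as np
+
+    rng = np.random.default_rng(0)
+
+    # More random features (D) than samples (n): the over-parameterized regime.
+    n, d, D = 50, 5, 400
+    X = rng.normal(size=(n, d))
+    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
+    W = rng.normal(size=(d, D))
+    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
+    Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)
+
+    # Min-norm interpolating coefficients, then output perturbation.
+    coef = np.linalg.pinv(Phi) @ y
+    coef_private = coef + 0.05 * rng.normal(size=D)   # placeholder noise scale
+    print(np.mean((Phi @ coef_private - y) ** 2))     # error of the private model
+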
+
+
+
+
+ + ☆ NLP-ADBench: NLP Anomaly Detection Benchmark SC + + +
+ Anomaly detection (AD) is a critical machine learning task with diverse +applications in web systems, including fraud detection, content moderation, and +user behavior analysis. Despite its significance, AD in natural language +processing (NLP) remains underexplored, limiting advancements in detecting +anomalies in text data such as harmful content, phishing attempts, or spam +reviews. In this paper, we introduce NLP-ADBench, the most comprehensive +benchmark for NLP anomaly detection (NLP-AD), comprising eight curated datasets +and evaluations of nineteen state-of-the-art algorithms. These include three +end-to-end methods and sixteen two-step algorithms that apply traditional +anomaly detection techniques to language embeddings generated by +bert-base-uncased and OpenAI's text-embedding-3-large models. + Our results reveal critical insights and future directions for NLP-AD. +Notably, no single model excels across all datasets, highlighting the need for +automated model selection. Moreover, two-step methods leveraging +transformer-based embeddings consistently outperform specialized end-to-end +approaches, with OpenAI embeddings demonstrating superior performance over BERT +embeddings. By releasing NLP-ADBench at +https://github.com/USC-FORTIS/NLP-ADBench, we provide a standardized framework +for evaluating NLP-AD methods, fostering the development of innovative +approaches. This work fills a crucial gap in the field and establishes a +foundation for advancing NLP anomaly detection, particularly in the context of +improving the safety and reliability of web-based systems. + +
+
+ comment: The project is available at https://github.com/USC-FORTIS/NLP-ADBench +
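+
+ A compact sketch of the two-step recipe the benchmark evaluates: embed the
+texts with a language model, then run an off-the-shelf detector on the
+embeddings. The embed function below is a random stand-in; the benchmark itself
+uses bert-base-uncased or OpenAI's text-embedding-3-large.
+
+    import numpy as np
+    from sklearn.ensemble import IsolationForest
+
+    def embed(texts):
+        # Placeholder for a real sentence-embedding model (assumption).
+        rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
+        return rng.normal(size=(len(texts), 768))
+
+    train = ["a normal product review", "another ordinary review", "a routine comment"]
+    test = ["yet another ordinary review", "FREE $$$ click this link now!!!"]
+
+    detector = IsolationForest(random_state=0).fit(embed(train))
+    anomaly_scores = -detector.score_samples(embed(test))  # higher = more anomalous
+    print(anomaly_scores)
+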
+
+
+
+
+ + ☆ DPGIIL: Dirichlet Process-Deep Generative Model-Integrated Incremental + Learning for Clustering in Transmissibility-based Online Structural Anomaly + Detection + + +
+ Clustering based on vibration responses, such as transmissibility functions +(TFs), is promising in structural anomaly detection, but most existing +approaches struggle with determining the optimal cluster number and handling +high-dimensional streaming data, while their shallow structures also make them +sensitive to manually-engineered feature quality. To bridge this gap, this work +proposes the Dirichlet process-deep generative model-integrated incremental +learning (DPGIIL) for clustering by combining the advantages of deep generative +models (DGMs) in representation learning and the Dirichlet process mixture +model (DPMM) in identifying distinct patterns in observed data. By introducing +a DPMM prior into the latent space of DGMs, DPGIIL automatically captures +dissimilarities in extracted latent representations, enabling both generative +modeling and clustering. Within the context of variational Bayesian inference, +a lower bound on the log marginal likelihood of DPGIIL, tighter than the +evidence lower bound given sufficient training data, is derived analytically, +which enables the joint optimization of DGM and DPMM parameters, thereby +allowing the DPMM to regularize the DGM's feature extraction process. +Additionally, a greedy split-merge scheme-based coordinate ascent variational +inference method is devised to accelerate the optimization. The summary +statistics of the DPMM, along with the network parameters, are used to retain +information about previous data for incremental learning. Notably, this study +uses variational autoencoder (VAE) within DPGIIL as an illustrative example, +while this framework is adaptable to other DGMs. Two case studies show that the +proposed method outperforms some state-of-the-art approaches in structural +anomaly detection and clustering, while also dynamically generating new +clusters to indicate the emergence of new structural conditions for online +monitoring. + +
+
+ comment: 48 pages,9 figures,6 tables,submitted to Advanced Engineering + Informatics +
+
+
+
+
+ + ☆ Anomaly Detection and Classification in Knowledge Graphs + + +
+ Anomalies such as redundant, inconsistent, contradictory, and deficient values
+in a Knowledge Graph (KG) are unavoidable, as these graphs are often curated
+manually or extracted using machine learning and natural language processing
+techniques. Therefore, anomaly detection is a task that can enhance the quality
+of KGs. In this paper, we propose SEKA (SEeking Knowledge graph Anomalies), an
+unsupervised approach for the detection of abnormal triples and entities in
+KGs. SEKA can help improve the correctness of a KG whilst retaining its
+coverage. We propose the Corroborative Path Rank Algorithm (CPRA), an efficient
+adaptation of the Path Rank Algorithm (PRA) that is customized to detect
+anomalies in KGs. Furthermore, we present TAXO (TAXOnomy of anomaly types in
+KGs), a taxonomy of possible anomaly types that can occur in a KG. This
+taxonomy provides a classification of the anomalies discovered by SEKA, with an
+extensive discussion of possible data quality issues in a KG. We evaluate both
+approaches using the four real-world KGs YAGO-1, KBpedia, Wikidata, and DSKG to
+demonstrate the ability of SEKA and TAXO to outperform the baselines.
+
+&#13;
+
+
+
+
+ + ☆ IterNorm: Fast Iterative Normalization + + +
+ Transformer-based large language models are memory-bound models whose
+operation relies on a large amount of data that is marginally reused. Thus, the
+data movement between a host and an accelerator likely dictates the total
+wall-clock time. Layer normalization is one of the key workloads in the
+transformer model, following each multi-head attention and feed-forward network
+block. To reduce data movement, layer normalization needs to be performed on
+the same chip as the matrix-matrix multiplication engine. To this end, we
+introduce an iterative L2-normalization method for 1D input (IterNorm),
+ensuring fast convergence to the steady-state solution within five iteration
+steps and high precision, outperforming the fast inverse square root algorithm
+in six out of nine cases for FP32 and five out of nine for BFloat16 across the
+embedding lengths used in the OPT models. Implemented in 32/28nm CMOS, the
+IterNorm macro normalizes $d$-dimensional vectors, where $64 \leq d \leq 1024$,
+with a latency of 112-227 cycles at 100MHz/1.05V.
+
+&#13;
+
+ comment: Design, Automation & Test in Europe Conference 2025 +
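+
+ The flavor of the method can be sketched with a Newton-style inverse square
+root iteration that normalizes a vector using only multiplications and
+additions; the seeding and step count below are illustrative and not the exact
+iteration proposed in the paper.
+
+    import numpy as np
+
+    def iterative_l2_normalize(x, steps=6):
+        # Approximate y = 1/sqrt(s) for s = ||x||^2 with Newton-Raphson, seeded
+        # from the floating-point exponent of s so that a handful of iterations
+        # reaches high precision.
+        s = float(np.dot(x, x))
+        mant, exp = np.frexp(s)                 # s = mant * 2**exp, mant in [0.5, 1)
+        y = np.ldexp(1.0, -(exp // 2))          # rough 1/sqrt(s) from the exponent
+        for _ in range(steps):
+            y = y * (1.5 - 0.5 * s * y * y)     # quadratically convergent update
+        return x * y
+
+    print(iterative_l2_normalize(np.array([3.0, 4.0])))   # approx [0.6, 0.8]
+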
+
+
+
+
+ + ☆ A Temporally Correlated Latent Exploration for Reinforcement Learning + + +
+ Efficient exploration remains one of the longstanding problems of deep +reinforcement learning. Instead of depending solely on extrinsic rewards from +the environments, existing methods use intrinsic rewards to enhance +exploration. However, we demonstrate that these methods are vulnerable to Noisy +TV and stochasticity. To tackle this problem, we propose Temporally Correlated +Latent Exploration (TeCLE), which is a novel intrinsic reward formulation that +employs an action-conditioned latent space and temporal correlation. The +action-conditioned latent space estimates the probability distribution of +states, thereby avoiding the assignment of excessive intrinsic rewards to +unpredictable states and effectively addressing both problems. Whereas previous +works inject temporal correlation for action selection, the proposed method +injects it for intrinsic reward computation. We find that the injected temporal +correlation determines the exploratory behaviors of agents. Various experiments +show that the environment where the agent performs well depends on the amount +of temporal correlation. To the best of our knowledge, the proposed TeCLE is +the first approach to consider the action conditioned latent space and temporal +correlation for curiosity-driven exploration. We prove that the proposed TeCLE +can be robust to the Noisy TV and stochasticity in benchmark environments, +including Minigrid and Stochastic Atari. + +
+
+
+
+
+
+ ☆ Towards counterfactual fairness through auxiliary variables
+
+
+&#13;
+ The challenge of balancing fairness and predictive accuracy in machine +learning models, especially when sensitive attributes such as race, gender, or +age are considered, has motivated substantial research in recent years. +Counterfactual fairness ensures that predictions remain consistent across +counterfactual variations of sensitive attributes, which is a crucial concept +in addressing societal biases. However, existing counterfactual fairness +approaches usually overlook intrinsic information about sensitive features, +limiting their ability to achieve fairness while simultaneously maintaining +performance. To tackle this challenge, we introduce EXOgenous Causal reasoning +(EXOC), a novel causal reasoning framework motivated by exogenous variables. It +leverages auxiliary variables to uncover intrinsic properties that give rise to +sensitive attributes. Our framework explicitly defines an auxiliary node and a +control node that contribute to counterfactual fairness and control the +information flow within the model. Our evaluation, conducted on synthetic and +real-world datasets, validates EXOC's superiority, showing that it outperforms +state-of-the-art approaches in achieving counterfactual fairness. + +
+
+ comment: arXiv admin note: text overlap with arXiv:2307.08232 by other authors +
+
+
+
+
+ + ☆ DAWN-SI: Data-Aware and Noise-Informed Stochastic Interpolation for + Solving Inverse Problems + + +
+ Inverse problems, which involve estimating parameters from incomplete or +noisy observations, arise in various fields such as medical imaging, +geophysics, and signal processing. These problems are often ill-posed, +requiring regularization techniques to stabilize the solution. In this work, we +employ $\textit{Stochastic Interpolation}$ (SI), a generative framework that +integrates both deterministic and stochastic processes to map a simple +reference distribution, such as a Gaussian, to the target distribution. Our +method $\textbf{DAWN-SI}$: $\textbf{D}$ata-$\textbf{AW}$are and +$\textbf{N}$oise-informed $\textbf{S}$tochastic $\textbf{I}$nterpolation +incorporates data and noise embedding, allowing the model to access +representations about the measured data explicitly and also account for noise +in the observations, making it particularly robust in scenarios where data is +noisy or incomplete. By learning a time-dependent velocity field, SI not only +provides accurate solutions but also enables uncertainty quantification by +generating multiple plausible outcomes. Unlike pre-trained diffusion models, +which may struggle in highly ill-posed settings, our approach is trained +specifically for each inverse problem and adapts to varying noise levels. We +validate the effectiveness and robustness of our method through extensive +numerical experiments on tasks such as image deblurring and tomography. + +
+
+ comment: 20 pages, 11 figures, 6 tables +
+
+
+
+
+ + ☆ Short-term Streamflow and Flood Forecasting based on Graph Convolutional + Recurrent Neural Network and Residual Error Learning + + +
+ Accurate short-term streamflow and flood forecasting are critical for +mitigating river flood impacts, especially given the increasing climate +variability. Machine learning-based streamflow forecasting relies on large +streamflow datasets derived from rating curves. Uncertainties in rating curve +modeling could introduce errors to the streamflow data and affect the +forecasting accuracy. This study proposes a streamflow forecasting method that +addresses these data errors, enhancing the accuracy of river flood forecasting +and flood modeling, thereby reducing flood-related risk. A convolutional +recurrent neural network is used to capture spatiotemporal patterns, coupled +with residual error learning and forecasting. The neural network outperforms +commonly used forecasting models over 1-6 hours of forecasting horizons, and +the residual error learners can further correct the residual errors. This +provides a more reliable tool for river flood forecasting and climate +adaptation in this critical 1-6 hour time window for flood risk mitigation +efforts. + +
+
+
+
+
+ + ☆ Measuring Goal-Directedness NeurIPS 2024 + + +
+ We define maximum entropy goal-directedness (MEG), a formal measure of +goal-directedness in causal models and Markov decision processes, and give +algorithms for computing it. Measuring goal-directedness is important, as it is +a critical element of many concerns about harm from AI. It is also of +philosophical interest, as goal-directedness is a key aspect of agency. MEG is +based on an adaptation of the maximum causal entropy framework used in inverse +reinforcement learning. It can measure goal-directedness with respect to a +known utility function, a hypothesis class of utility functions, or a set of +random variables. We prove that MEG satisfies several desiderata and +demonstrate our algorithms with small-scale experiments. + +
+
+ comment: Accepted to the 38th Conference on Neural Information Processing + Systems (NeurIPS 2024) +
+
+
+
+
+ + ☆ Ltri-LLM: Streaming Long Context Inference for LLMs with Training-Free + Dynamic Triangular Attention Pattern + + +
+ The quadratic computational complexity of the attention mechanism in current +Large Language Models (LLMs) renders inference with long contexts prohibitively +expensive. To address this challenge, various approaches aim to retain critical +portions of the context to optimally approximate Full Attention (FA) through +Key-Value (KV) compression or Sparse Attention (SA), enabling the processing of +virtually unlimited text lengths in a streaming manner. However, these methods +struggle to achieve performance levels comparable to FA, particularly in +retrieval tasks. In this paper, our analysis of attention head patterns reveals +that LLMs' attention distributions show strong local correlations, naturally +reflecting a chunking mechanism for input context. We propose Ltri-LLM +framework, which divides KVs into spans, stores them in an offline index, and +retrieves the relevant KVs into memory for various queries. Experimental +results on popular long text benchmarks show that Ltri-LLM can achieve +performance close to FA while maintaining efficient, streaming-based inference. + +
+
+
+
+
+ + ☆ Latent Space Characterization of Autoencoder Variants + + +
+ Understanding the latent spaces learned by deep learning models is crucial in +exploring how they represent and generate complex data. Autoencoders (AEs) have +played a key role in the area of representation learning, with numerous +regularization techniques and training principles developed not only to enhance +their ability to learn compact and robust representations, but also to reveal +how different architectures influence the structure and smoothness of the +lower-dimensional non-linear manifold. We strive to characterize the structure +of the latent spaces learned by different autoencoders including convolutional +autoencoders (CAEs), denoising autoencoders (DAEs), and variational +autoencoders (VAEs) and how they change with the perturbations in the input. By +characterizing the matrix manifolds corresponding to the latent spaces, we +provide an explanation for the well-known observation that the latent spaces of +CAE and DAE form non-smooth manifolds, while that of VAE forms a smooth +manifold. We also map the points of the matrix manifold to a Hilbert space +using distance preserving transforms and provide an alternate view in terms of +the subspaces generated in the Hilbert space as a function of the distortion in +the input. The results show that the latent manifolds of CAE and DAE are +stratified with each stratum being a smooth product manifold, while the +manifold of VAE is a smooth product manifold of two symmetric positive definite +matrices and a symmetric positive semi-definite matrix. + +
+
+ comment: 8 pages, 6 figures, and 1 table +
+
+
+
+
+ + ☆ GABAR: Graph Attention-Based Action Ranking for Relational Policy + Learning + + +
+ We propose a novel approach to learn relational policies for classical +planning based on learning to rank actions. We introduce a new graph +representation that explicitly captures action information and propose a Graph +Neural Network architecture augmented with Gated Recurrent Units (GRUs) to +learn action rankings. Our model is trained on small problem instances and +generalizes to significantly larger instances where traditional planning +becomes computationally expensive. Experimental results across standard +planning benchmarks demonstrate that our action-ranking approach achieves +generalization to significantly larger problems than those used in training. + +
+
+ comment: 6 Pages, 1 figure +
+
+
+
+
+ + ☆ Machine learning algorithms to predict the risk of rupture of + intracranial aneurysms: a systematic review + + +
+ Purpose: Subarachnoid haemorrhage is a potentially fatal consequence of
+intracranial aneurysm rupture; however, it is difficult to predict whether
+aneurysms will rupture. Prophylactic treatment of an intracranial aneurysm also
+involves risk, hence identifying rupture-prone aneurysms is of substantial
+clinical importance. This systematic review aims to evaluate the performance of
+machine learning algorithms for predicting intracranial aneurysm rupture risk.
+ Methods: MEDLINE, Embase, Cochrane Library and Web of Science were searched
+until December 2023. Studies incorporating any machine learning algorithm to
+predict the risk of rupture of an intracranial aneurysm were included. Risk of
+bias was assessed using the Prediction Model Risk of Bias Assessment Tool
+(PROBAST). PROSPERO registration: CRD42023452509. Results: Out of 10,307
+records screened, 20 studies met the eligibility criteria for this review,
+incorporating a total of 20,286 aneurysm cases. The machine learning models
+reported accuracies ranging from 0.66 to 0.90. The models were compared to
+current clinical standards in six studies and gave mixed results. Most studies
+posed high or unclear risks of bias and concerns for applicability, limiting
+the inferences that can be drawn from them. There was insufficient homogeneous
+data for a meta-analysis.
+ Conclusions: Machine learning can be applied to predict the risk of rupture
+for intracranial aneurysms. However, the evidence does not comprehensively
+demonstrate superiority to existing practice, limiting its role as a clinical
+adjunct. Further prospective multicentre studies of recent machine learning
+tools are needed to establish clinical validity before they are implemented in
+the clinic.
+
+&#13;
+
+ comment: Clin Neuroradiol (2024) +
+
+
+
+
+ + ☆ DHIL-GT: Scalable Graph Transformer with Decoupled Hierarchy Labeling + + +
+ Graph Transformer (GT) has recently emerged as a promising neural network +architecture for learning graph-structured data. However, its global attention +mechanism with quadratic complexity concerning the graph scale prevents wider +application to large graphs. While current methods attempt to enhance GT +scalability by altering model architecture or encoding hierarchical graph data, +our analysis reveals that these models still suffer from the computational +bottleneck related to graph-scale operations. In this work, we target the GT +scalability issue and propose DHIL-GT, a scalable Graph Transformer that +simplifies network learning by fully decoupling the graph computation to a +separate stage in advance. DHIL-GT effectively retrieves hierarchical +information by exploiting the graph labeling technique, as we show that the +graph label hierarchy is more informative than plain adjacency by offering +global connections while promoting locality, and is particularly suitable for +handling complex graph patterns such as heterophily. We further design subgraph +sampling and positional encoding schemes for precomputing model input on top of +graph labels in an end-to-end manner. The training stage thus favorably removes +graph-related computations, leading to ideal mini-batch capability and GPU +utilization. Notably, the precomputation and training processes of DHIL-GT +achieve complexities linear to the number of graph edges and nodes, +respectively. Extensive experiments demonstrate that DHIL-GT is efficient in +terms of computational boost and mini-batch capability over existing scalable +Graph Transformer designs on large-scale benchmarks, while achieving top-tier +effectiveness on both homophilous and heterophilous graphs. + +
+
+
+
+
+ + ☆ Generative Humanization for Therapeutic Antibodies + + +
+ Antibody therapies have been employed to address some of today's most +challenging diseases, but must meet many criteria during drug development +before reaching a patient. Humanization is a sequence optimization strategy +that addresses one critical risk called immunogenicity - a patient's immune +response to the drug - by making an antibody more "human-like" in the absence +of a predictive lab-based test for immunogenicity. However, existing +humanization strategies generally yield very few humanized candidates, which +may have degraded biophysical properties or decreased drug efficacy. Here, we +re-frame humanization as a conditional generative modeling task, where +humanizing mutations are sampled from a language model trained on human +antibody data. We describe a sampling process that incorporates models of +therapeutic attributes, such as antigen binding affinity, to obtain candidate +sequences that have both reduced immunogenicity risk and maintained or improved +therapeutic properties, allowing this algorithm to be readily embedded into an +iterative antibody optimization campaign. We demonstrate in silico and in lab +validation that in real therapeutic programs our generative humanization method +produces diverse sets of antibodies that are both (1) highly-human and (2) have +favorable therapeutic properties, such as improved binding to target antigens. + +
+
+
+
+
+ + ☆ An Experimental Evaluation of Imputation Models for Spatial-Temporal + Traffic Data + + +
+ Traffic data imputation is a critical preprocessing step in intelligent
+transportation systems, enabling advanced transportation services. Despite
+significant advancements in this field, selecting the most suitable model for
+practical applications remains challenging due to three key issues: 1)
+incomplete consideration of missing patterns, which describe how data is lost
+along the spatial and temporal dimensions, 2) the lack of testing on
+standardized datasets, and 3) insufficient evaluations. To this end, we first
+propose practice-oriented taxonomies for missing patterns and imputation
+models, systematically identifying all possible forms of real-world traffic
+data loss and analyzing the characteristics of existing models. Furthermore, we
+introduce a unified benchmarking pipeline to comprehensively evaluate 10
+representative models across various missing patterns and rates. This work aims
+to provide a holistic understanding of traffic data imputation research and
+serve as a practical guideline.
+
+&#13;
+
+
+
+
+ + ♻ ☆ Conformal Prediction for Class-wise Coverage via Augmented Label Rank + Calibration + + +
+ Conformal prediction (CP) is an emerging uncertainty quantification framework +that allows us to construct a prediction set to cover the true label with a +pre-specified marginal or conditional probability. Although the valid coverage +guarantee has been extensively studied for classification problems, CP often +produces large prediction sets which may not be practically useful. This issue +is exacerbated for the setting of class-conditional coverage on imbalanced +classification tasks with many and/or imbalanced classes. This paper proposes +the Rank Calibrated Class-conditional CP (RC3P) algorithm to reduce the +prediction set sizes to achieve class-conditional coverage, where the valid +coverage holds for each class. In contrast to the standard class-conditional CP +(CCP) method that uniformly thresholds the class-wise conformity score for each +class, the augmented label rank calibration step allows RC3P to selectively +iterate this class-wise thresholding subroutine only for a subset of classes +whose class-wise top-k error is small. We prove that agnostic to the classifier +and data distribution, RC3P achieves class-wise coverage. We also show that +RC3P reduces the size of prediction sets compared to the CCP method. +Comprehensive experiments on multiple real-world datasets demonstrate that RC3P +achieves class-wise coverage and 26.25% reduction in prediction set sizes on +average. + +
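+
+ For context, the uniform class-wise thresholding that RC3P refines can be
+written in a few lines; this sketch is the standard class-conditional CP
+baseline described above, not the rank-calibrated variant itself.
+
+    import numpy as np
+
+    def classwise_thresholds(cal_probs, cal_labels, alpha=0.1):
+        # For each class c, threshold the conformity score 1 - p_c at the
+        # finite-sample-corrected (1 - alpha) quantile over calibration
+        # examples whose true label is c.
+        n_classes = cal_probs.shape[1]
+        thr = np.ones(n_classes)
+        for c in range(n_classes):
+            scores = 1.0 - cal_probs[cal_labels == c, c]
+            if scores.size:
+                q = min(np.ceil((scores.size + 1) * (1 - alpha)) / scores.size, 1.0)
+                thr[c] = np.quantile(scores, q)
+        return thr
+
+    def prediction_sets(test_probs, thr):
+        # Include class c whenever its conformity score is below the class threshold.
+        return [np.flatnonzero(1.0 - p <= thr) for p in test_probs]
+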
+
+
+
+
+ + ♻ ☆ Entity-based Reinforcement Learning for Autonomous Cyber Defence CCS 2024 + + +
+ A significant challenge for autonomous cyber defence is ensuring a defensive +agent's ability to generalise across diverse network topologies and +configurations. This capability is necessary for agents to remain effective +when deployed in dynamically changing environments, such as an enterprise +network where devices may frequently join and leave. Standard approaches to +deep reinforcement learning, where policies are parameterised using a +fixed-input multi-layer perceptron (MLP) expect fixed-size observation and +action spaces. In autonomous cyber defence, this makes it hard to develop +agents that generalise to environments with network topologies different from +those trained on, as the number of nodes affects the natural size of the +observation and action spaces. To overcome this limitation, we reframe the +problem of autonomous network defence using entity-based reinforcement +learning, where the observation and action space of an agent are decomposed +into a collection of discrete entities. This framework enables the use of +policy parameterisations specialised in compositional generalisation. We train +a Transformer-based policy on the Yawning Titan cyber-security simulation +environment and test its generalisation capabilities across various network +topologies. We demonstrate that this approach significantly outperforms an +MLP-based policy when training across fixed-size networks of varying +topologies, and matches performance when training on a single network. We also +demonstrate the potential for zero-shot generalisation to networks of a +different size to those seen in training. These findings highlight the +potential for entity-based reinforcement learning to advance the field of +autonomous cyber defence by providing more generalisable policies capable of +handling variations in real-world network environments. + +
+
+ comment: Material also appearing in the proceedings of the 1st International + Workshop on Autonomous Cybersecurity at ACM CCS 2024 +
+
+
+
+
+ + ♻ ☆ Fast Tree-Field Integrators: From Low Displacement Rank to Topological + Transformers NeurIPS 2024 + + +
+ We present a new class of fast polylog-linear algorithms based on the theory +of structured matrices (in particular low displacement rank) for integrating +tensor fields defined on weighted trees. Several applications of the resulting +fast tree-field integrators (FTFIs) are presented, including (a) approximation +of graph metrics with tree metrics, (b) graph classification, (c) modeling on +meshes, and finally (d) Topological Transformers (TTs) (Choromanski et al., +2022) for images. For Topological Transformers, we propose new relative +position encoding (RPE) masking mechanisms with as few as three extra learnable +parameters per Transformer layer, leading to 1.0-1.5%+ accuracy gains. +Importantly, most of FTFIs are exact methods, thus numerically equivalent to +their brute-force counterparts. When applied to graphs with thousands of nodes, +those exact algorithms provide 5.7-13x speedups. We also provide an extensive +theoretical analysis of our methods. + +
+
+ comment: NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ On the Generalization of Preference Learning with DPO + + +
+ Large language models (LLMs) have demonstrated remarkable capabilities but +often struggle to align with human preferences, leading to harmful or +undesirable outputs. Preference learning, which trains models to distinguish +between preferred and non-preferred responses based on human feedback, has +become a crucial component for ensuring that LLMs align with human values. +Despite the widespread adoption in real-world systems, a thorough theoretical +understanding of the generalization guarantees for these models remain lacking. +This paper bridges that gap by introducing a new theoretical framework to +analyze the generalization guarantees of models trained with direct preference +optimization (DPO). While existing generalization theory often focuses on +overparameterized models achieving near-optimal loss or models independent of +the training process, our framework rigorously assesses how well models +generalize after a finite number of gradient steps, reflecting real-world LLM +training practices. By analyzing the reward margin associated with each sample +and its trajectory throughout training, we can effectively bound the +generalization error. We derive learning guarantees showing that, under +specific conditions, models trained with DPO can correctly discern preferred +responses on unseen data with high probability. These insights are empirically +validated on contemporary LLMs, underscoring the practical relevance of our +theoretical findings. + +
+
+
+
+
+ + ♻ ☆ The Intelligible and Effective Graph Neural Additive Networks + + +
+ Graph Neural Networks (GNNs) have emerged as the predominant approach for +learning over graph-structured data. However, most GNNs operate as black-box +models and require post-hoc explanations, which may not suffice in high-stakes +scenarios where transparency is crucial. In this paper, we present a GNN that +is interpretable by design. Our model, Graph Neural Additive Network (GNAN), is +a novel extension of the interpretable class of Generalized Additive Models, +and can be visualized and fully understood by humans. GNAN is designed to be +fully interpretable, offering both global and local explanations at the feature +and graph levels through direct visualization of the model. These +visualizations describe exactly how the model uses the relationships between +the target variable, the features, and the graph. We demonstrate the +intelligibility of GNANs in a series of examples on different tasks and +datasets. In addition, we show that the accuracy of GNAN is on par with +black-box GNNs, making it suitable for critical applications where transparency +is essential, alongside high accuracy. + +
+
+
+
+
+ + ♻ ☆ Differentiable Weightless Neural Networks + + +
+ We introduce the Differentiable Weightless Neural Network (DWN), a model +based on interconnected lookup tables. Training of DWNs is enabled by a novel +Extended Finite Difference technique for approximate differentiation of binary +values. We propose Learnable Mapping, Learnable Reduction, and Spectral +Regularization to further improve the accuracy and efficiency of these models. +We evaluate DWNs in three edge computing contexts: (1) an FPGA-based hardware +accelerator, where they demonstrate superior latency, throughput, energy +efficiency, and model area compared to state-of-the-art solutions, (2) a +low-power microcontroller, where they achieve preferable accuracy to XGBoost +while subject to stringent memory constraints, and (3) ultra-low-cost chips, +where they consistently outperform small models in both accuracy and projected +hardware area. DWNs also compare favorably against leading approaches for +tabular datasets, with higher average rank. Overall, our work positions DWNs as +a pioneering solution for edge-compatible high-throughput neural networks. + +
+
+
+
+
+ + ♻ ☆ A Practitioner's Guide to Continual Multimodal Pretraining NeurIPS + 2024 + + +
+ Multimodal foundation models serve numerous applications at the intersection +of vision and language. Still, despite being pretrained on extensive data, they +become outdated over time. To keep models updated, research into continual +pretraining mainly explores scenarios with either (1) infrequent, +indiscriminate updates on large-scale new data, or (2) frequent, sample-level +updates. However, practical model deployment often operates in the gap between +these two limit cases, as real-world applications often demand adaptation to +specific subdomains, tasks or concepts -- spread over the entire, varying life +cycle of a model. In this work, we complement current perspectives on continual +pretraining through a research test bed as well as provide comprehensive +guidance for effective continual model updates in such scenarios. We first +introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with +realistic compute constraints and practical deployment requirements, +constructed over 63 datasets with diverse visual and semantic coverage. Using +FoMo-in-Flux, we explore the complex landscape of practical continual +pretraining through multiple perspectives: (1) A data-centric investigation of +data mixtures and stream orderings that emulate real-world deployment +situations, (2) a method-centric investigation ranging from simple fine-tuning +and traditional continual learning strategies to parameter-efficient updates +and model merging, (3) meta learning rate schedules and mechanistic design +choices, and (4) the influence of model and compute scaling. Together, our +insights provide a practitioner's guide to continual multimodal pretraining for +real-world deployment. Our benchmark and code is here: +https://github.com/ExplainableML/fomo_in_flux. + +
+
+ comment: Technical Report. 52 pages. Shorter version published at the NeurIPS + 2024 Dataset & Benchmarks track +
+
+
+
+
+ + ♻ ☆ Voronoi Candidates for Bayesian Optimization + + +
+ Bayesian optimization (BO) offers an elegant approach for efficiently +optimizing black-box functions. However, acquisition criteria demand their own +challenging inner-optimization, which can induce significant overhead. Many +practical BO methods, particularly in high dimension, eschew a formal, +continuous optimization of the acquisition function and instead search +discretely over a finite set of space-filling candidates. Here, we propose to +use candidates which lie on the boundary of the Voronoi tessellation of the +current design points, so they are equidistant to two or more of them. We +discuss strategies for efficient implementation by directly sampling the +Voronoi boundary without explicitly generating the tessellation, thus +accommodating large designs in high dimension. On a battery of test problems +optimized via Gaussian processes with expected improvement, our proposed +approach significantly improves the execution time of a multi-start continuous +search without a loss in accuracy. + +
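+
+ One simple way to realize "sampling the Voronoi boundary without generating
+the tessellation" is to shoot a random ray from a design point and bisect to
+the location where the nearest design point changes; the sketch below shows
+that basic idea only, without the box constraints or the exact sampler used in
+the paper.
+
+    import numpy as np
+
+    def voronoi_boundary_candidate(X, rng, n_bisect=30):
+        n, d = X.shape
+        i = rng.integers(n)
+        u = rng.normal(size=d)
+        u /= np.linalg.norm(u)
+        nearest = lambda p: np.argmin(np.linalg.norm(X - p, axis=1))
+
+        lo, hi = 0.0, 1.0
+        while nearest(X[i] + hi * u) == i:      # grow until the ray leaves cell i
+            hi *= 2.0
+            if hi > 1e6:                        # unbounded cell: give up on this ray
+                return X[i] + u
+        for _ in range(n_bisect):
+            mid = 0.5 * (lo + hi)
+            lo, hi = (mid, hi) if nearest(X[i] + mid * u) == i else (lo, mid)
+        return X[i] + 0.5 * (lo + hi) * u       # (approximately) equidistant point
+
+    rng = np.random.default_rng(0)
+    X = rng.uniform(size=(20, 2))               # current design
+    candidates = np.array([voronoi_boundary_candidate(X, rng) for _ in range(5)])
+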
+
+
+
+
+ + ♻ ☆ The Vizier Gaussian Process Bandit Algorithm + + +
+ Google Vizier has performed millions of optimizations and accelerated +numerous research and production systems at Google, demonstrating the success +of Bayesian optimization as a large-scale service. Over multiple years, its +algorithm has been improved considerably, through the collective experiences of +numerous research efforts and user feedback. In this technical report, we +discuss the implementation details and design choices of the current default +algorithm provided by Open Source Vizier. Our experiments on standardized +benchmarks reveal its robustness and versatility against well-established +industry baselines on multiple practical modes. + +
+
+ comment: Google DeepMind Technical Report. Code can be found in + https://github.com/google/vizier +
+
+
+
+
+ + ♻ ☆ PAC Privacy Preserving Diffusion Models + + +
+ Data privacy protection is garnering increased attention among researchers.
+Diffusion models (DMs), particularly with strict differential privacy, can
+potentially produce images with both high privacy and visual quality. However,
+challenges arise, such as ensuring robust protection when privatizing specific
+data attributes, an area where current models often fall short. To address
+these challenges, we introduce the PAC Privacy Preserving Diffusion Model,
+which leverages diffusion principles and ensures Probably Approximately Correct
+(PAC) privacy. We enhance privacy protection by integrating private classifier
+guidance into the Langevin sampling process. Additionally, recognizing the gap
+in measuring the privacy of models, we have developed a novel metric to gauge
+privacy levels. Our model, assessed with this new metric and supported by
+Gaussian matrix computations for the PAC bound, has shown superior performance
+in privacy protection over existing leading private generative models according
+to benchmark tests.
+
+&#13;
+
+ comment: arXiv admin note: text overlap with arXiv:2210.03458 by other authors +
+
+
+
+
+ + ♻ ☆ Modular Duality in Deep Learning + + +
+ An old idea in optimization theory says that since the gradient is a dual +vector it may not be subtracted from the weights without first being mapped to +the primal space where the weights reside. We take this idea seriously in this +paper and construct such a duality map for general neural networks. Our map, +which we call modular dualization, forms a unifying theoretical basis for +training algorithms that are a) fast and b) scalable. Modular dualization +involves first assigning operator norms to layers based on the semantics of +each layer, and then using these layerwise norms to recursively induce a +duality map on the weight space of the full neural architecture. We conclude by +deriving GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers +-- the latter two methods are based on a rectangular Newton-Schulz iteration +(Kovarik, 1970; Bj\"orck & Bowie, 1971). A variant of our methods was used to +set speed records for training NanoGPT. Overall, we hope that our theory of +modular duality will yield a next generation of fast and scalable optimizers +for general neural architectures. + +
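+
+ The rectangular Newton-Schulz iteration mentioned at the end is easy to
+sketch; the cubic coefficients below are the classical choice, not necessarily
+the tuned variant used in the speed-record runs, and the Frobenius pre-scaling
+is a common convergence safeguard.
+
+    import numpy as np
+
+    def newton_schulz_orthogonalize(G, steps=12):
+        # Drive the singular values of G toward 1 (approximating the polar
+        # factor U V^T of G) using only matrix multiplications.
+        X = G / (np.linalg.norm(G) + 1e-12)   # spectral norm <= Frobenius norm
+        for _ in range(steps):
+            X = 1.5 * X - 0.5 * X @ X.T @ X
+        return X
+
+    G = np.random.default_rng(0).normal(size=(64, 32))
+    Q = newton_schulz_orthogonalize(G)
+    print(np.abs(Q.T @ Q - np.eye(32)).max())  # near zero
+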
+
+
+
+
+ + ♻ ☆ Leveraging Skills from Unlabeled Prior Data for Efficient Online + Exploration + + +
+ Unsupervised pretraining has been transformative in many supervised domains. +However, applying such ideas to reinforcement learning (RL) presents a unique +challenge in that fine-tuning does not involve mimicking task-specific data, +but rather exploring and locating the solution through iterative +self-improvement. In this work, we study how unlabeled prior trajectory data +can be leveraged to learn efficient exploration strategies. While prior data +can be used to pretrain a set of low-level skills, or as additional off-policy +data for online RL, it has been unclear how to combine these ideas effectively +for online exploration. Our method SUPE (Skills from Unlabeled Prior data for +Exploration) demonstrates that a careful combination of these ideas compounds +their benefits. Our method first extracts low-level skills using a variational +autoencoder (VAE), and then pseudo-relabels unlabeled trajectories using an +optimistic reward model, transforming prior data into high-level, task-relevant +examples. Finally, SUPE uses these transformed examples as additional +off-policy data for online RL to learn a high-level policy that composes +pretrained low-level skills to explore efficiently. We empirically show that +SUPE reliably outperforms prior strategies, successfully solving a suite of +long-horizon, sparse-reward tasks. Code: https://github.com/rail-berkeley/supe. + +
+
+ comment: 32 pages, 19 figures +
+
+
+
+
+ + ♻ ☆ Evaluation of post-hoc interpretability methods in time-series + classification + + +
+ Post-hoc interpretability methods are critical tools to explain
+neural-network results. Several post-hoc methods have emerged in recent years,
+but when applied to a given task, they produce different results, raising the
+question of which method is the most suitable to provide correct post-hoc
+interpretability. To understand the performance of each method, quantitative
+evaluation of interpretability methods is essential. However, currently
+available frameworks have several drawbacks that hinder the adoption of
+post-hoc interpretability methods, especially in high-risk sectors. In this
+work, we propose a framework with quantitative metrics to assess the
+performance of existing post-hoc interpretability methods, in particular for
+time series classification. We show that several drawbacks identified in the
+literature are addressed, namely dependence on human judgement, retraining, and
+shift in the data distribution when occluding samples. We additionally design a
+synthetic dataset with known discriminative features and tunable complexity.
+The proposed methodology and quantitative metrics can be used to understand the
+reliability of interpretability method results obtained in practical
+applications. In turn, they can be embedded within operational workflows in
+critical fields that require accurate interpretability results, e.g., for
+regulatory policies.
+
+&#13;
+
+ comment: New version to match published version in Nature Machine Intelligence +
+
+
+
+
+ + ♻ ☆ Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues + + +
+ Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and +DeltaNet have emerged as efficient alternatives to Transformers in large +language modeling, offering linear scaling with sequence length and improved +training efficiency. However, LRNNs struggle to perform state-tracking which +may impair performance in tasks such as code evaluation or tracking a chess +game. Even parity, the simplest state-tracking task, which non-linear RNNs like +LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et +al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity +stems from restricting the value range of their diagonal state-transition +matrices to $[0, 1]$ and that incorporating negative values can resolve this +issue. We extend this result to non-diagonal LRNNs, which have recently shown +promise in models such as DeltaNet. We prove that finite precision LRNNs with +state-transition matrices having only positive eigenvalues cannot solve parity, +while complex eigenvalues are needed to count modulo $3$. Notably, we also +prove that LRNNs can learn any regular language when their state-transition +matrices are products of identity minus vector outer product matrices, each +with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that +extending the eigenvalue range of models like Mamba and DeltaNet to include +negative values not only enables them to solve parity but consistently improves +their performance on state-tracking tasks. Furthermore, pre-training LRNNs with +an extended eigenvalue range for language modeling achieves comparable +performance and stability while showing promise on code and math data. Our work +enhances the expressivity of modern LRNNs, broadening their applicability +without changing the cost of training or inference. + +
+
+ comment: Main changes: Correction to Theorem 1 and 2 (we excluded from the + only if condition complex eigenvalues with modulus strictly less than one). + Correction to point 3 of Proposition 3 +
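+
+ The parity point is easy to see concretely: a one-dimensional linear RNN whose
+input-dependent transition is -1 on a one-bit and +1 on a zero-bit tracks
+parity exactly, which is impossible when transitions are confined to [0, 1]. A
+tiny sketch:
+
+    def parity_linear_rnn(bits):
+        # h_t = a(x_t) * h_{t-1} with a(1) = -1, a(0) = +1 and h_0 = 1;
+        # the sign of the final state encodes the parity of the input.
+        h = 1.0
+        for x in bits:
+            h *= -1.0 if x else 1.0
+        return int(h < 0)
+
+    assert parity_linear_rnn([1, 0, 1, 1]) == 1   # three ones -> odd parity
+    assert parity_linear_rnn([1, 1, 0]) == 0      # two ones -> even parity
+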
+
+
+
+
+ + ♻ ☆ The Score-Difference Flow for Implicit Generative Modeling + + +
+ Implicit generative modeling (IGM) aims to produce samples of synthetic data +matching the characteristics of a target data distribution. Recent work (e.g. +score-matching networks, diffusion models) has approached the IGM problem from +the perspective of pushing synthetic source data toward the target distribution +via dynamical perturbations or flows in the ambient space. In this direction, +we present the score difference (SD) between arbitrary target and source +distributions as a flow that optimally reduces the Kullback-Leibler divergence +between them. We apply the SD flow to convenient proxy distributions, which are +aligned if and only if the original distributions are aligned. We demonstrate +the formal equivalence of this formulation to denoising diffusion models under +certain conditions. We also show that the training of generative adversarial +networks includes a hidden data-optimization sub-problem, which induces the SD +flow under certain choices of loss function when the discriminator is optimal. +As a result, the SD flow provides a theoretical link between model classes that +individually address the three challenges of the "generative modeling trilemma" +-- high sample quality, mode coverage, and fast sampling -- thereby setting the +stage for a unified approach. + +
+
+ comment: 25 pages, 5 figures, 4 tables. Updated, lightly revised version of a + paper originally published in Transactions on Machine Learning Research + (TMLR) +
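+
+ A one-dimensional toy of the SD flow in which both distributions are Gaussian,
+so the evolving source score is available in closed form; in actual implicit
+generative modeling the scores would be estimated, and the parameters below are
+arbitrary.
+
+    import numpy as np
+
+    mu_t, sd_t = 3.0, 0.5        # target N(3, 0.5^2)
+    m, s = -2.0, 1.5             # current (source) N(-2, 1.5^2)
+
+    rng = np.random.default_rng(0)
+    x = rng.normal(m, s, size=5000)
+    dt = 1e-3
+    for _ in range(5000):
+        # Velocity = score of the target minus score of the current distribution.
+        v = -(x - mu_t) / sd_t**2 + (x - m) / s**2
+        x = x + dt * v
+        m += dt * (mu_t - m) / sd_t**2           # induced mean dynamics
+        s += dt * (1.0 / s - s / sd_t**2)        # induced std-dev dynamics
+
+    print(round(x.mean(), 2), round(x.std(), 2))  # drifts toward (3.0, 0.5)
+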
+
+
+
+
+ + ♻ ☆ An end-to-end attention-based approach for learning on graphs + + +
+ There has been a recent surge in transformer-based architectures for learning +on graphs, mainly motivated by attention as an effective learning mechanism and +the desire to supersede handcrafted operators characteristic of message passing +schemes. However, concerns over their empirical effectiveness, scalability, and +complexity of the pre-processing steps have been raised, especially in relation +to much simpler graph neural networks that typically perform on par with them +across a wide range of benchmarks. To tackle these shortcomings, we consider +graphs as sets of edges and propose a purely attention-based approach +consisting of an encoder and an attention pooling mechanism. The encoder +vertically interleaves masked and vanilla self-attention modules to learn an +effective representations of edges, while allowing for tackling possible +misspecifications in input graphs. Despite its simplicity, the approach +outperforms fine-tuned message passing baselines and recently proposed +transformer-based methods on more than 70 node and graph-level tasks, including +challenging long-range benchmarks. Moreover, we demonstrate state-of-the-art +performance across different tasks, ranging from molecular to vision graphs, +and heterophilous node classification. The approach also outperforms graph +neural networks and transformers in transfer learning settings, and scales much +better than alternatives with a similar performance level or expressive power. + +
+
+
+
+
+ + ♻ ☆ GaussianFormer-2: Probabilistic Gaussian Superposition for Efficient 3D + Occupancy Prediction + + +
+ 3D semantic occupancy prediction is an important task for robust +vision-centric autonomous driving, which predicts fine-grained geometry and +semantics of the surrounding scene. Most existing methods leverage dense +grid-based scene representations, overlooking the spatial sparsity of the +driving scenes. Although 3D semantic Gaussian serves as an object-centric +sparse alternative, most of the Gaussians still describe the empty region with +low efficiency. To address this, we propose a probabilistic Gaussian +superposition model which interprets each Gaussian as a probability +distribution of its neighborhood being occupied and conforms to probabilistic +multiplication to derive the overall geometry. Furthermore, we adopt the exact +Gaussian mixture model for semantics calculation to avoid unnecessary +overlapping of Gaussians. To effectively initialize Gaussians in non-empty +region, we design a distribution-based initialization module which learns the +pixel-aligned occupancy distribution instead of the depth of surfaces. We +conduct extensive experiments on nuScenes and KITTI-360 datasets and our +GaussianFormer-2 achieves state-of-the-art performance with high efficiency. +Code: https://github.com/huang-yh/GaussianFormer. + +
+
+ comment: Code is available at: https://github.com/huang-yh/GaussianFormer +
+
+
+
+
+ + ♻ ☆ EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online + Scene Understanding + + +
+ 3D occupancy prediction provides a comprehensive description of the +surrounding scenes and has become an essential task for 3D perception. Most +existing methods focus on offline perception from one or a few views and cannot +be applied to embodied agents which demands to gradually perceive the scene +through progressive embodied exploration. In this paper, we formulate an +embodied 3D occupancy prediction task to target this practical scenario and +propose a Gaussian-based EmbodiedOcc framework to accomplish it. We initialize +the global scene with uniform 3D semantic Gaussians and progressively update +local regions observed by the embodied agent. For each update, we extract +semantic and structural features from the observed image and efficiently +incorporate them via deformable cross-attention to refine the regional +Gaussians. Finally, we employ Gaussian-to-voxel splatting to obtain the global +3D occupancy from the updated 3D Gaussians. Our EmbodiedOcc assumes an unknown +(i.e., uniformly distributed) environment and maintains an explicit global +memory of it with 3D Gaussians. It gradually gains knowledge through the local +refinement of regional Gaussians, which is consistent with how humans +understand new scenes through embodied exploration. We reorganize an +EmbodiedOcc-ScanNet benchmark based on local annotations to facilitate the +evaluation of the embodied 3D occupancy prediction task. Experiments +demonstrate that our EmbodiedOcc outperforms existing local prediction methods +and accomplishes the embodied occupancy prediction with high accuracy and +strong expandability. Code: https://github.com/YkiWu/EmbodiedOcc. + +
+
+ comment: Code: https://github.com/YkiWu/EmbodiedOcc +
+
+
+
+
+ + ♻ ☆ xLSTM: Extended Long Short-Term Memory + + +
+ In the 1990s, the constant error carousel and gating were introduced as the +central ideas of the Long Short-Term Memory (LSTM). Since then, LSTMs have +stood the test of time and contributed to numerous deep learning success +stories, in particular they constituted the first Large Language Models (LLMs). +However, the advent of the Transformer technology with parallelizable +self-attention at its core marked the dawn of a new era, outpacing LSTMs at +scale. We now raise a simple question: How far do we get in language modeling +when scaling LSTMs to billions of parameters, leveraging the latest techniques +from modern LLMs, but mitigating known limitations of LSTMs? Firstly, we +introduce exponential gating with appropriate normalization and stabilization +techniques. Secondly, we modify the LSTM memory structure, obtaining: (i) sLSTM +with a scalar memory, a scalar update, and new memory mixing, (ii) mLSTM that +is fully parallelizable with a matrix memory and a covariance update rule. +Integrating these LSTM extensions into residual block backbones yields xLSTM +blocks that are then residually stacked into xLSTM architectures. Exponential +gating and modified memory structures boost xLSTM capabilities to perform +favorably when compared to state-of-the-art Transformers and State Space +Models, both in performance and scaling. + +
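A much-simplified NumPy sketch of a single sLSTM-style step with exponential gating and a normalizer state, to make the recurrence above concrete; the stabilizer trick, the memory mixing via recurrent weight matrices, and the mLSTM matrix memory are all omitted, and the weight layout is an assumption rather than the paper's code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slstm_step(x_t, h_prev, c_prev, n_prev, W, R):
    """One simplified sLSTM-style step: exponential input/forget gates,
    a cell state c_t, and a normalizer state n_t that rescales the output."""
    i_t = np.exp(W["i"] @ x_t + R["i"] * h_prev)    # exponential input gate
    f_t = np.exp(W["f"] @ x_t + R["f"] * h_prev)    # exponential forget gate
    z_t = np.tanh(W["z"] @ x_t + R["z"] * h_prev)   # cell input
    o_t = sigmoid(W["o"] @ x_t + R["o"] * h_prev)   # output gate
    c_t = f_t * c_prev + i_t * z_t                  # cell state update
    n_t = f_t * n_prev + i_t                        # normalizer state update
    h_t = o_t * (c_t / n_t)                         # normalized hidden state
    return h_t, c_t, n_t
```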
+
+ comment: Code available at https://github.com/NX-AI/xlstm +
+
+
+
+
+ + ♻ ☆ Stochastic Primal-Dual Three Operator Splitting Algorithm with Extension + to Equivariant Regularization-by-Denoising + + +
+ In this work, we propose a stochastic primal-dual three-operator splitting +algorithm (TOS-SPDHG) for solving a class of convex three-composite +optimization problems. Our proposed scheme is a direct three-operator splitting +extension of the SPDHG algorithm [Chambolle et al. 2018]. We provide +theoretical convergence analysis showing an ergodic $O(1/K)$ convergence rate, and +demonstrate the effectiveness of our approach in imaging inverse problems. +Moreover, we further propose TOS-SPDHG-RED and TOS-SPDHG-eRED, which utilize +the regularization-by-denoising (RED) framework to leverage pretrained deep +denoising networks as priors. +
+
+
+
+
+ + ♻ ☆ Another look at inference after prediction + + +
+ Prediction-based (PB) inference is increasingly used in applications where +the outcome of interest is difficult to obtain, but its predictors are readily +available. Unlike traditional inference, PB inference performs statistical +inference using a partially observed outcome and a set of covariates by +leveraging a prediction of the outcome generated from a machine learning (ML) +model. Motwani and Witten (2023) recently revisited two innovative PB inference +approaches for ordinary least squares. They found that the method proposed by +Wang et al. (2020) yields a consistent estimator for the association of +interest when the ML model perfectly captures the underlying regression +function. Conversely, the prediction-powered inference (PPI) method proposed by +Angelopoulos et al. (2023) yields valid inference regardless of the model's +accuracy. In this paper, we study the statistical efficiency of the PPI +estimator. Our analysis reveals that a more efficient estimator, proposed 25 +years ago by Chen and Chen (2000), can be obtained by simply adding a weight to +the PPI estimator. We also contextualize PB inference with methods from the +economics and statistics literature dating back to the 1960s. Our extensive +theoretical and numerical analyses indicate that the Chen and Chen (CC) +estimator offers a balance between robustness to ML model specification and +statistical efficiency, making it the preferred choice for use in practice. + +
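To make the weighting idea tangible, here is a small NumPy sketch for estimating a mean: the plain PPI combination corresponds to lam = 1, and choosing the weight lam to reduce the variance of the combination captures the flavor of the more efficient estimator the abstract attributes to Chen and Chen (2000). The function and synthetic data below are purely illustrative, not the paper's estimator.

```python
import numpy as np

def weighted_ppi_mean(y_lab, yhat_lab, yhat_unlab, lam=1.0):
    """Prediction-based estimate of a population mean: lam * mean of the
    predictions on unlabeled data plus a bias correction from labeled data;
    lam = 1 recovers the plain PPI combination."""
    rectifier = np.mean(y_lab - lam * yhat_lab)
    return lam * np.mean(yhat_unlab) + rectifier

# toy usage with synthetic labeled/unlabeled data
rng = np.random.default_rng(0)
y_lab = rng.normal(1.0, 1.0, 200)
yhat_lab = y_lab + rng.normal(0.0, 0.5, 200)                  # imperfect ML predictions
yhat_unlab = rng.normal(1.0, 1.0, 5000) + rng.normal(0.0, 0.5, 5000)
print(weighted_ppi_mean(y_lab, yhat_lab, yhat_unlab, lam=0.8))
```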
+
+
+
+
+ + ♻ ☆ Probabilistic Language-Image Pre-Training + + +
+ Vision-language models (VLMs) embed aligned image-text pairs into a joint +space but often rely on deterministic embeddings, assuming a one-to-one +correspondence between images and texts. This oversimplifies real-world +relationships, which are inherently many-to-many, with multiple captions +describing a single image and vice versa. We introduce Probabilistic +Language-Image Pre-training (ProLIP), the first probabilistic VLM pre-trained +on a billion-scale image-text dataset using only probabilistic objectives, +achieving a strong zero-shot capability (e.g., 74.6% ImageNet zero-shot +accuracy with ViT-B/16). ProLIP efficiently estimates uncertainty by an +"uncertainty token" without extra parameters. We also introduce a novel +inclusion loss that enforces distributional inclusion relationships between +image-text pairs and between original and masked inputs. Experiments +demonstrate that, by leveraging uncertainty estimates, ProLIP benefits +downstream tasks and aligns with intuitive notions of uncertainty, e.g., +shorter texts being more uncertain and more general inputs including specific +ones. Utilizing text uncertainties, we further improve ImageNet accuracy from +74.6% to 75.8% (under a few-shot setting), supporting the practical advantages +of our probabilistic approach. The code is available at +https://github.com/naver-ai/prolip + +
+
+ comment: Code: https://github.com/naver-ai/prolip HuggingFace Hub: + https://huggingface.co/collections/SanghyukChun/prolip-6712595dfc87fd8597350291 + 31 pages, 4.29 MB +
+
+
+
+
+ + ♻ ☆ Remaining-data-free Machine Unlearning by Suppressing Sample + Contribution + + +
+ Machine unlearning (MU) aims to forget data from a well-trained model, which is +practically important due to the ``right to be forgotten''. The unlearned model +should approach the retrained model, where the forgetting data are not involved +in the training process and hence do not contribute to the retrained model. +Considering the forgetting data's absence during retraining, we argue that +unlearning should withdraw their contribution from the pre-trained model. The +challenge is how to quantify and detach a sample's contribution to the dynamic +learning process using only the pre-trained model, when tracing the learning +process is impractical. We first show theoretically that a sample's +contribution during the process is reflected in the learned model's sensitivity +to it. We then design a practical method, namely MU-Mis (Machine +Unlearning by Minimizing input sensitivity), to suppress the contribution of +the forgetting data. Experimental results demonstrate that MU-Mis can unlearn +effectively and efficiently without utilizing the remaining data. It is the +first time that a remaining-data-free method can outperform state-of-the-art +(SoTA) unlearning methods that utilize the remaining data. +
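A hedged PyTorch sketch of the core idea (suppressing the model's sensitivity to the forgetting samples by minimizing an input-gradient norm); the exact objective, which outputs are differentiated, and the function name are assumptions rather than the paper's code.

```python
import torch

def input_sensitivity_loss(model, x_forget):
    """Penalize the norm of the input gradient of the summed model output
    on forgetting samples, so their contribution is suppressed."""
    x = x_forget.clone().requires_grad_(True)
    out = model(x)
    grads = torch.autograd.grad(out.sum(), x, create_graph=True)[0]
    per_sample = grads.pow(2).flatten(start_dim=1).sum(dim=1)
    return per_sample.mean()
```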
+
+
+
+
+ + ♻ ☆ Demystifying Higher-Order Graph Neural Networks + + +
+ Higher-order graph neural networks (HOGNNs) and the related architectures +from Topological Deep Learning are an important class of GNN models that +harness polyadic relations between vertices beyond plain edges. They have been +used to eliminate issues such as over-smoothing or over-squashing, to +significantly enhance the accuracy of GNN predictions, to improve the +expressiveness of GNN architectures, and for numerous other goals. A plethora +of HOGNN models have been introduced, and they come with diverse neural +architectures, and even with different notions of what the "higher-order" +means. This richness makes it very challenging to appropriately analyze and +compare HOGNN models, and to decide in what scenario to use specific ones. To +alleviate this, we first design an in-depth taxonomy and a blueprint for +HOGNNs. This facilitates designing models that maximize performance. Then, we +use our taxonomy to analyze and compare the available HOGNN models. The +outcomes of our analysis are synthesized in a set of insights that help to +select the most beneficial GNN model in a given scenario, and a comprehensive +list of challenges and opportunities for further research into more powerful +HOGNNs. + +
+
+
+
+
+ + ♻ ☆ Dreaming Learning NeurIPS 2024 + + +
+ Incorporating novelties into deep learning systems remains a challenging +problem. Introducing new information to a machine learning system can interfere +with previously stored data and potentially alter the global model paradigm, +especially when dealing with non-stationary sources. In such cases, traditional +approaches based on validation error minimization offer limited advantages. To +address this, we propose a training algorithm inspired by Stuart Kauffman's +notion of the Adjacent Possible. This novel training methodology explores new +data spaces during the learning phase. It predisposes the neural network to +smoothly accept and integrate data sequences with different statistical +characteristics than expected. The maximum distance compatible with such +inclusion depends on a specific parameter: the sampling temperature used in the +explorative phase of the present method. This algorithm, called Dreaming +Learning, anticipates potential regime shifts over time, enhancing the neural +network's responsiveness to non-stationary events that alter statistical +properties. To assess the advantages of this approach, we apply this +methodology to unexpected statistical changes in Markov chains and +non-stationary dynamics in textual sequences. We demonstrated its ability to +improve the auto-correlation of generated textual sequences by $\sim 29\%$ and +enhance the velocity of loss convergence by $\sim 100\%$ in the case of a +paradigm shift in Markov chains. + +
+
+ comment: Accepted at the NeurIPS 2024 workshop on Intrinsically Motivated + Open-ended Learning +
+
+
+
+
+ + ♻ ☆ Leveraging Bi-Focal Perspectives and Granular Feature Integration for + Accurate Reliable Early Alzheimer's Detection + + +
+ Alzheimer's disease (AD) is the most common neurodegenerative disease, +diagnosed annually in millions of patients. Current clinical practice still +faces challenges in the exact diagnosis and classification of AD from +neuroimaging data. Traditional CNNs can extract a good amount of low-level +information in an image but fail to capture high-level, minuscule structures, +which is a significant challenge in detecting AD from MRI scans. To overcome +this, we propose a novel Granular Feature Integration method that combines +information extraction at different scales with an efficient information flow, +enabling the model to capture both broad and fine-grained features +simultaneously. We also propose a Bi-Focal Perspective mechanism to highlight +the subtle neurofibrillary tangles and amyloid plaques in the MRI scans, +ensuring that critical pathological markers are accurately identified. Our +model achieved an F1-Score of 99.31%, precision of 99.24%, and recall of +99.51%. These scores show that our model performs significantly better than +existing state-of-the-art (SOTA) CNNs. +
+
+ comment: 14 pages, 12 figures, 6 tables +
+
+
+
+
+ + ♻ ☆ MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal + Large Language Models + + +
+ Despite the superior capabilities of Multimodal Large Language Models (MLLMs) +across diverse tasks, they still face significant trustworthiness challenges. +Yet, current literature on the assessment of trustworthy MLLMs remains limited, +lacking a holistic evaluation to offer thorough insights into future +improvements. In this work, we establish MultiTrust, the first comprehensive +and unified benchmark on the trustworthiness of MLLMs across five primary +aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark +employs a rigorous evaluation strategy that addresses both multimodal risks and +cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. +Extensive experiments with 21 modern MLLMs reveal some previously unexplored +trustworthiness issues and risks, highlighting the complexities introduced by +the multimodality and underscoring the necessity for advanced methodologies to +enhance their reliability. For instance, typical proprietary models still +struggle with the perception of visually confusing images and are vulnerable to +multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to +disclose privacy in text and reveal ideological and cultural biases even when +paired with irrelevant images in inference, indicating that the multimodality +amplifies the internal risks from base LLMs. Additionally, we release a +scalable toolbox for standardized trustworthiness research, aiming to +facilitate future advancements in this important field. Code and resources are +publicly available at: https://multi-trust.github.io/. + +
+
+ comment: 100 pages, 84 figures, 33 tables +
+
+
+
+
+ + ♻ ☆ LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing + Layer Execution Order + + +
+ Due to their architecture and how they are trained, artificial neural +networks are typically not robust toward pruning or shuffling layers at test +time. However, such properties would be desirable for different applications, +such as distributed neural network architectures where the order of execution +cannot be guaranteed or parts of the network can fail during inference. In this +work, we address these issues through a number of training approaches for +vision transformers whose most important component is randomizing the execution +order of attention modules at training time. With our proposed approaches, +vision transformers are capable of adapting to arbitrary layer execution orders at +test time, assuming one tolerates a reduction (about 20\%) in accuracy at the +same model size. We analyse the feature representations of our trained models +as well as how each layer contributes to the model's prediction based on its +position during inference. Our analysis shows that layers learn to contribute +differently based on their position in the network. Finally, we layer-prune our +models at test time and find that their performance declines gracefully. Code +available at https://github.com/matfrei/layershuffle. +
+
+
+
+
+ + ♻ ☆ Old Optimizer, New Norm: An Anthology + + +
+ Deep learning optimizers are often motivated through a mix of convex and +approximate second-order theory. We select three such methods -- Adam, Shampoo +and Prodigy -- and argue that each method can instead be understood as a +squarely first-order method without convexity assumptions. In fact, after +switching off exponential moving averages, each method is equivalent to +steepest descent under a particular norm. By generalizing this observation, we +chart a new design space for training algorithms. Different operator norms +should be assigned to different tensors based on the role that the tensor plays +within the network. For example, while linear and embedding layers may have the +same weight space of $\mathbb{R}^{m\times n}$, these layers play different +roles and should be assigned different norms. We hope that this idea of +carefully metrizing the neural architecture might lead to more stable, scalable +and indeed faster training. + +
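One concrete instance of the "steepest descent under a norm" reading: with its exponential moving averages switched off, Adam reduces to sign descent, i.e. steepest descent under the vector infinity-norm. A minimal PyTorch step for illustration (not the paper's code):

```python
import torch

@torch.no_grad()
def sign_descent_step(params, lr=1e-3):
    """Adam without EMAs: update each parameter by the sign of its gradient,
    which is steepest descent under the infinity-norm."""
    for p in params:
        if p.grad is not None:
            p.add_(p.grad.sign(), alpha=-lr)
```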
+
+
+
+
+ + ♻ ☆ Cross-modal semantic segmentation for indoor environmental perception + using single-chip millimeter-wave radar raw data + + +
+ In the context of firefighting and rescue operations, a cross-modal semantic +segmentation model based on a single-chip millimeter-wave (mmWave) radar for +indoor environmental perception is proposed and discussed. To efficiently +obtain high-quality labels, an automatic label generation method utilizing +LiDAR point clouds and occupancy grid maps is introduced. The proposed +segmentation model is based on U-Net. A spatial attention module is +incorporated, which enhances the performance of the model. The results +demonstrate that cross-modal semantic segmentation provides a more intuitive +and accurate representation of indoor environments. Unlike traditional methods, +the model's segmentation performance is minimally affected by azimuth. Although +performance declines with increasing distance, this can be mitigated by a +well-designed model. Additionally, it was found that using raw ADC data as +input is ineffective; compared to RA tensors, RD tensors are more suitable for +the proposed model. +
+
+ comment: 5291 words, 17 pages, 11 figures +
+
+
+
+
+ + ♻ ☆ Memorization With Neural Nets: Going Beyond the Worst Case + + +
+ In practice, deep neural networks are often able to easily interpolate their +training data. To understand this phenomenon, many works have aimed to quantify +the memorization capacity of a neural network architecture: the largest number +of points such that the architecture can interpolate any placement of these +points with any assignment of labels. For real-world data, however, one +intuitively expects the presence of a benign structure so that interpolation +already occurs at a smaller network size than suggested by memorization +capacity. In this paper, we investigate interpolation by adopting an +instance-specific viewpoint. We introduce a simple randomized algorithm that, +given a fixed finite data set with two classes, with high probability +constructs an interpolating three-layer neural network in polynomial time. The +required number of parameters is linked to geometric properties of the two +classes and their mutual arrangement. As a result, we obtain guarantees that +are independent of the number of samples and hence move beyond worst-case +memorization capacity bounds. We verify our theoretical result with numerical +experiments and additionally investigate the effectiveness of the algorithm on +MNIST and CIFAR-10. + +
+
+ comment: The current version of the manuscript has been accepted to Journal of + Machine Learning Research +
+
+
+
+
+ + ♻ ☆ LLM-ABBA: Understanding time series via symbolic approximation + + +
+ The success of large language models (LLMs) for time series has been +demonstrated in previous work. Utilizing a symbolic time series representation, +one can efficiently bridge the gap between LLMs and time series. However, the +remaining challenge is to exploit the semantic information hidden in time +series by using symbols or existing tokens of LLMs, while aligning the +embedding space of LLMs according to the hidden information of time series. The +symbolic time series approximation (STSA) method called adaptive Brownian +bridge-based symbolic aggregation (ABBA) shows outstanding efficacy in +preserving salient time series features by modeling time series patterns in +terms of amplitude and period while using existing tokens of LLMs. + In this paper, we introduce a method, called LLM-ABBA, that integrates ABBA +into large language models for various downstream time series tasks. By +symbolizing time series, LLM-ABBA compares favorably to the recent +state-of-the-art (SOTA) in UCR and three medical time series classification +tasks. Meanwhile, a fixed-polygonal chain trick in ABBA is introduced to +avoid obvious drifting during prediction tasks by significantly mitigating +the effects of cumulative error arising from misused symbols during the +transition from symbols to numerical values. In time series regression tasks, +LLM-ABBA achieves the new SOTA on Time Series Extrinsic Regression (TSER) +benchmarks. LLM-ABBA also shows competitive prediction capability compared to +recent SOTA time series prediction results. We believe this framework can also +seamlessly extend to other time series tasks. +
+
+
+
+
+ + ♻ ☆ An Evolved Universal Transformer Memory + + +
+ Prior methods propose to offset the escalating costs of modern foundation +models by dropping specific parts of their contexts with hand-designed rules, +while attempting to preserve their original performance. We overcome this +trade-off with Neural Attention Memory Models (NAMMs), introducing a learned +network for memory management that improves both the performance and efficiency +of transformers. We evolve NAMMs atop pre-trained transformers to provide +different latent contexts focusing on the most relevant information for +individual layers and attention heads. NAMMs are universally applicable to any +model using self-attention as they condition exclusively on the values in the +produced attention matrices. Learning NAMMs on a small set of problems, we +achieve substantial performance improvements across multiple long-context +benchmarks while cutting the model's input contexts up to a fraction of the +original sizes. We show the generality of our conditioning enables zero-shot +transfer of NAMMs trained only on language to entirely new transformer +architectures even across input modalities, with their benefits carrying over +to vision and reinforcement learning. + +
+
+ comment: Preprint, under submission. Source code is available at + https://github.com/SakanaAI/evo-memory +
+
+
+
+
+ + ♻ ☆ QuickDrop: Efficient Federated Unlearning by Integrated Dataset + Distillation + + +
+ Federated Unlearning (FU) aims to delete specific training data from an ML +model trained using Federated Learning (FL). We introduce QuickDrop, an +efficient and original FU method that utilizes dataset distillation (DD) to +accelerate unlearning and drastically reduces computational overhead compared +to existing approaches. In QuickDrop, each client uses DD to generate a compact +dataset representative of the original training dataset, called a distilled +dataset, and uses this compact dataset during unlearning. To unlearn specific +knowledge from the global model, QuickDrop has clients execute Stochastic +Gradient Ascent with samples from the distilled datasets, thus significantly +reducing computational overhead compared to conventional FU methods. We further +increase the efficiency of QuickDrop by ingeniously integrating DD into the FL +training process. By reusing the gradient updates produced during FL training +for DD, the overhead of creating distilled datasets becomes close to +negligible. Evaluations on three standard datasets show that, with comparable +accuracy guarantees, QuickDrop reduces the duration of unlearning by 463.8x +compared to model retraining from scratch and 65.1x compared to existing FU +approaches. We also demonstrate the scalability of QuickDrop with 100 clients +and show its effectiveness while handling multiple unlearning operations. + +
+
+ comment: Accepted by Middleware 2024 +
+
+
+
+
+ + ♻ ☆ LLM-Enhanced Bayesian Optimization for Efficient Analog Layout + Constraint Generation + + +
+ Analog layout synthesis faces significant challenges due to its dependence on +manual processes, considerable time requirements, and performance instability. +Current Bayesian Optimization (BO)-based techniques for analog layout +synthesis, despite their potential for automation, suffer from slow convergence +and extensive data needs, limiting their practical application. This paper +presents the LLANA framework, a novel approach that leverages Large +Language Models (LLMs) to enhance BO by exploiting the few-shot learning +abilities of LLMs for more efficient generation of analog design-dependent +parameter constraints. Experimental results demonstrate that LLANA not +only achieves performance comparable to state-of-the-art (SOTA) BO methods but +also enables a more effective exploration of the analog circuit design space, +thanks to LLM's superior contextual understanding and learning efficiency. The +code is available at https://github.com/dekura/LLANA. +
+
+
+
+
+ + ♻ ☆ Hallucination Detection in LLMs: Fast and Memory-Efficient Fine-Tuned + Models + + +
+ Uncertainty estimation is a necessary component when implementing AI in +high-risk settings, such as autonomous cars, medicine, or insurance. Large +Language Models (LLMs) have seen a surge in popularity in recent years, but +they are subject to hallucinations, which may cause serious harm in high-risk +settings. Despite their success, LLMs are expensive to train and run: they need +a large amount of computation and memory, preventing the use of ensembling +methods in practice. In this work, we present a novel method that allows for +fast and memory-friendly training of LLM ensembles. We show that the resulting +ensembles can detect hallucinations and are a viable approach in practice as +only one GPU is needed for training and inference. +
+
+ comment: 6 pages, 3 figures +
+
+
+
+
+ + ♻ ☆ SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules + + +
+ Text-to-image (T2I) generation using diffusion models has become a +blockbuster service in today's AI cloud. A production T2I service typically +involves a serving workflow where a base diffusion model is augmented with +various "add-on" modules, notably ControlNet and LoRA, to enhance image +generation control. Compared to serving the base model alone, these add-on +modules introduce significant loading and computational overhead, resulting in +increased latency. In this paper, we present SwiftDiffusion, a system that +efficiently serves a T2I workflow through a holistic approach. SwiftDiffusion +decouples ControlNet from the base model and deploys it as a separate, +independently scaled service on dedicated GPUs, enabling ControlNet caching, +parallelization, and sharing. To mitigate the high loading overhead of LoRA +serving, SwiftDiffusion employs a bounded asynchronous LoRA loading (BAL) +technique, allowing LoRA loading to overlap with the initial base model +execution by up to k steps without compromising image quality. Furthermore, +SwiftDiffusion optimizes base model execution with a novel latent parallelism +technique. Collectively, these designs enable SwiftDiffusion to outperform the +state-of-the-art T2I serving systems, achieving up to 7.8x latency reduction +and 1.6x throughput improvement in serving SDXL models on H800 GPUs, without +sacrificing image quality. +
+
+
+
+
+ + ♻ ☆ What can we learn from quantum convolutional neural networks? + + +
+ Quantum machine learning (QML) shows promise for analyzing quantum data. A +notable example is the use of quantum convolutional neural networks (QCNNs), +implemented as specific types of quantum circuits, to recognize phases of +matter. In this approach, ground states of many-body Hamiltonians are prepared +to form a quantum dataset and classified in a supervised manner using only a +few labeled examples. However, this type of dataset and model differs +fundamentally from typical QML paradigms based on feature maps and +parameterized circuits. In this study, we demonstrate how models utilizing +quantum data can be interpreted through hidden feature maps, where physical +features are implicitly embedded via ground-state feature maps. By analyzing +selected examples previously explored with QCNNs, we show that high performance +in quantum phase recognition comes from generating a highly effective basis set +with sharp features at critical points. The learning process adapts the +measurement to create sharp decision boundaries. Our analysis highlights +improved generalization when working with quantum data, particularly in the +limited-shots regime. Furthermore, translating these insights into the domain +of quantum scientific machine learning, we demonstrate that ground-state +feature maps can be applied to fluid dynamics problems, expressing shock wave +solutions with good generalization and proven trainability. + +
+
+ comment: 15 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Under the Hood of Tabular Data Generation Models: Benchmarks with + Extensive Tuning + + +
+ The ability to train generative models that produce realistic, safe and +useful tabular data is essential for data privacy, imputation, oversampling, +explainability or simulation. However, generating tabular data is not +straightforward due to its heterogeneity, non-smooth distributions, complex +dependencies and imbalanced categorical features. Although diverse methods have +been proposed in the literature, there is a need for a unified evaluation, +under the same conditions, on a variety of datasets. This study addresses this +need by fully considering the optimization of: hyperparameters, feature +encodings, and architectures. We investigate the impact of dataset-specific +tuning on five recent model families for tabular data generation through an +extensive benchmark on 16 datasets. These datasets vary in terms of size (an +average of 80,000 rows), data types, and domains. We also propose a reduced +search space for each model that allows for quick optimization, achieving +nearly equivalent performance at a significantly lower cost. Our benchmark +demonstrates that, for most models, large-scale dataset-specific tuning +substantially improves performance compared to the original configurations. +Furthermore, we confirm that diffusion-based models generally outperform other +models on tabular data. However, this advantage is not significant when the +entire tuning and training process is restricted to the same GPU budget. + +
+
+
+
+
+ + ♻ ☆ 2-Rectifications are Enough for Straight Flows: A Theoretical Insight + into Wasserstein Convergence + + +
+ Diffusion models have emerged as a powerful tool for image generation and +denoising. Typically, generative models learn a trajectory between the starting +noise distribution and the target data distribution. Recently, Liu et al. +(2023b) designed a novel alternative generative model, Rectified Flow (RF), +which aims to learn straight flow trajectories from noise to data using a +sequence of convex optimization problems with close ties to optimal transport. +If the trajectory is curved, one must use many Euler discretization steps or +novel strategies, such as exponential integrators, to achieve a satisfactory +generation quality. In contrast, RF has been shown to theoretically straighten +the trajectory through successive rectifications, reducing the number of +function evaluations (NFEs) while sampling. It has also been shown empirically +that RF may improve the straightness in two rectifications if one can solve the +underlying optimization problem within a sufficiently small error. In this +paper, we make two key theoretical contributions: 1) we provide the first +theoretical analysis of the Wasserstein distance between the sampling +distribution of RF and the target distribution. Our error rate is characterized +by the number of discretization steps and a new formulation of +straightness stronger than that in the original work. 2) under a mild +regularity assumption, we show that for a rectified flow from a Gaussian to any +general target distribution with finite first moment (e.g. mixture of +Gaussians), two rectifications are sufficient to achieve a straight flow, which +is in line with the previous empirical findings. Additionally, we also present +empirical results on both simulated and real datasets to validate our +theoretical findings. +
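For intuition, a minimal PyTorch sketch of a rectified-flow training step: interpolate between a noise sample and a data sample and regress a velocity network onto the straight-line direction. The network signature is an assumption, and the reflow (re-coupling) step is omitted; reflow would repeat the same loss on pairs formed from samples generated by the previous flow.

```python
import torch

def rectified_flow_loss(v_model, x0, x1):
    """One training step of a rectified flow: x0 ~ noise, x1 ~ data.
    The regression target is the straight-line velocity x1 - x0."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))  # per-sample time in [0, 1)
    xt = (1 - t) * x0 + t * x1                            # linear interpolation
    target = x1 - x0
    return ((v_model(xt, t.flatten()) - target) ** 2).mean()
```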
+
+ comment: 28 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ Memory-efficient Continual Learning with Neural Collapse Contrastive WACV 2025 + + +
+ Contrastive learning has significantly improved representation quality, +enhancing knowledge transfer across tasks in continual learning (CL). However, +catastrophic forgetting remains a key challenge, as contrastive based methods +primarily focus on "soft relationships" or "softness" between samples, which +shift with changing data distributions and lead to representation overlap +across tasks. Recently, the newly identified Neural Collapse phenomenon has +shown promise in CL by focusing on "hard relationships" or "hardness" between +samples and fixed prototypes. However, this approach overlooks "softness", +crucial for capturing intra-class variability, and this rigid focus can also +pull old class representations toward current ones, increasing forgetting. +Building on these insights, we propose Focal Neural Collapse Contrastive +(FNC^2), a novel representation learning loss that effectively balances both +soft and hard relationships. Additionally, we introduce the Hardness-Softness +Distillation (HSD) loss to progressively preserve the knowledge gained from +these relationships across tasks. Our method outperforms state-of-the-art +approaches, particularly in minimizing memory reliance. Remarkably, even +without the use of memory, our approach rivals rehearsal-based methods, +offering a compelling solution for data privacy concerns. + +
+
+ comment: Accepted at WACV 2025 +
+
+
+
+
+ + ♻ ☆ Iterative Methods for Vecchia-Laplace Approximations for Latent Gaussian + Process Models + + +
+ Latent Gaussian process (GP) models are flexible probabilistic non-parametric +function models. Vecchia approximations are accurate approximations for GPs to +overcome computational bottlenecks for large data, and the Laplace +approximation is a fast method with asymptotic convergence guarantees to +approximate marginal likelihoods and posterior predictive distributions for +non-Gaussian likelihoods. Unfortunately, the computational complexity of +combined Vecchia-Laplace approximations grows faster than linearly in the +sample size when used in combination with direct solver methods such as the +Cholesky decomposition. Computations with Vecchia-Laplace approximations can +thus become prohibitively slow precisely when the approximations are usually +the most accurate, i.e., on large data sets. In this article, we present +iterative methods to overcome this drawback. Among other things, we introduce +and analyze several preconditioners, derive new convergence results, and +propose novel methods for accurately approximating predictive variances. We +analyze our proposed methods theoretically and in experiments with simulated +and real-world data. In particular, we obtain a speed-up of an order of +magnitude compared to Cholesky-based calculations and a threefold increase in +prediction accuracy in terms of the continuous ranked probability score +compared to a state-of-the-art method on a large satellite data set. All +methods are implemented in a free C++ software library with high-level Python +and R packages. + +
+
+
+
+
+ + ♻ ☆ Consistent Spectral Clustering in Hyperbolic Spaces + + +
+ Clustering, as an unsupervised technique, plays a pivotal role in various +data analysis applications. Among clustering algorithms, Spectral Clustering on +Euclidean Spaces has been extensively studied. However, with the rapid +evolution of data complexity, Euclidean Space is proving to be inefficient for +representing and learning algorithms. Although Deep Neural Networks on +hyperbolic spaces have gained recent traction, clustering algorithms or +non-deep machine learning models on non-Euclidean Spaces remain underexplored. +In this paper, we propose a spectral clustering algorithm on Hyperbolic Spaces +to address this gap. Hyperbolic Spaces offer advantages in representing complex +data structures like hierarchical and tree-like structures, which cannot be +embedded efficiently in Euclidean Spaces. Our proposed algorithm replaces the +Euclidean Similarity Matrix with an appropriate Hyperbolic Similarity Matrix, +demonstrating improved efficiency compared to clustering in Euclidean Spaces. +Our contributions include the development of the spectral clustering algorithm +on Hyperbolic Spaces and the proof of its weak consistency. We show that our +algorithm converges at least as fast as Spectral Clustering on Euclidean +Spaces. To illustrate the efficacy of our approach, we present experimental +results on the Wisconsin Breast Cancer Dataset, highlighting the superior +performance of Hyperbolic Spectral Clustering over its Euclidean counterpart. +This work opens up avenues for utilizing non-Euclidean Spaces in clustering +algorithms, offering new perspectives for handling complex data structures and +improving clustering efficiency. + +
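A minimal scikit-learn sketch of the core idea above: build the similarity matrix from hyperbolic (Poincare-ball) distances instead of Euclidean ones and run standard spectral clustering on it. The Gaussian kernel and its bandwidth are illustrative assumptions, and the inputs are assumed to lie inside the unit ball.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance in the Poincare ball model of hyperbolic space."""
    num = np.sum((u - v) ** 2)
    den = (1.0 - np.sum(u * u)) * (1.0 - np.sum(v * v)) + eps
    return np.arccosh(1.0 + 2.0 * num / den)

def hyperbolic_spectral_clustering(X, n_clusters=2, sigma=1.0):
    n = X.shape[0]
    D = np.array([[poincare_distance(X[i], X[j]) for j in range(n)] for i in range(n)])
    S = np.exp(-D ** 2 / (2.0 * sigma ** 2))   # hyperbolic similarity matrix
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed")
    return model.fit_predict(S)
```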
+
+ comment: Currently under review +
+
+
+
+
+ + ♻ ☆ Hybrid deep additive neural networks + + +
+ Traditional neural networks (multi-layer perceptrons) have become an +important tool in data science due to their success across a wide range of +tasks. However, their performance is sometimes unsatisfactory, and they often +require a large number of parameters, primarily due to their reliance on the +linear combination structure. Meanwhile, additive regression has been a popular +alternative to linear regression in statistics. In this work, we introduce +novel deep neural networks that incorporate the idea of additive regression. +Our neural networks share architectural similarities with Kolmogorov-Arnold +networks but are based on simpler yet flexible activation and basis functions. +Additionally, we introduce several hybrid neural networks that combine this +architecture with that of traditional neural networks. We derive their +universal approximation properties and demonstrate their effectiveness through +simulation studies and a real-data application. The numerical results indicate +that our neural networks generally achieve better performance than traditional +neural networks while using fewer parameters. + +
+
+ comment: 30 pages, 10 figures +
+
+
+
+
+ + ♻ ☆ An Efficient Loop and Clique Coarsening Algorithm for Graph + Classification + + +
+ Graph Transformers (GTs) have made remarkable achievements in graph-level +tasks. However, most existing works regard graph structures as a form of +guidance or bias for enhancing node representations, which focuses on +node-central perspectives and lacks explicit representations of edges and +structures. One natural question arises as to whether we can leverage a +hypernode to represent some structures. Through experimental analysis, we +explore the feasibility of this assumption. Based on our findings, we propose +an efficient Loop and Clique Coarsening algorithm with linear complexity for +Graph Classification (LCC4GC) on GT architecture. Specifically, we build three +unique views, original, coarsening, and conversion, to learn a thorough +structural representation. We compress loops and cliques via hierarchical +heuristic graph coarsening and restrict them with well-designed constraints, +which builds the coarsening view to learn high-level interactions between +structures. We also introduce line graphs for edge embeddings and switch to +edge-central perspective to alleviate the impact of coarsening reduction. +Experiments on eight real-world datasets demonstrate the improvements of LCC4GC +over 31 baselines from various architectures. + +
+
+
+
+
+ + ♻ ☆ NeuroNAS: A Framework for Energy-Efficient Neuromorphic + Compute-in-Memory Systems using Hardware-Aware Spiking Neural Architecture + Search + + +
+ Spiking Neural Networks (SNNs) have demonstrated capabilities for solving +diverse machine learning tasks with ultra-low power/energy consumption. To +maximize the performance and efficiency of SNN inference, Compute-in-Memory +(CIM) hardware accelerators with emerging device technologies (e.g., RRAM) have +been employed. However, SNN architectures are typically developed without +considering constraints from the application and the underlying CIM hardware, +thereby hindering SNNs from reaching their full potential in accuracy and +efficiency. To address this, we propose NeuroNAS, a novel framework for +developing energy-efficient neuromorphic CIM systems using a hardware-aware +spiking neural architecture search (NAS), i.e., by quickly finding an SNN +architecture that offers high accuracy under the given constraints (e.g., +memory, area, latency, and energy consumption). NeuroNAS employs the following +key steps: (1) optimizing SNN operations to enable efficient NAS, (2) employing +quantization to minimize the memory footprint, (3) developing an SNN +architecture that facilitates effective learning, and (4) devising a +systematic hardware-aware search algorithm to meet the constraints. Compared to +the state-of-the-art, NeuroNAS with 8-bit weight precision quickly finds SNNs +that maintain high accuracy, achieving up to 6.6x search-time speed-ups, up to +92% area savings, 1.2x latency speed-ups, and 84% energy savings across the +CIFAR-10, CIFAR-100, and TinyImageNet-200 datasets, while the state-of-the-art +methods fail to meet all constraints at once. In this manner, NeuroNAS +enables efficient design automation in developing energy-efficient neuromorphic +CIM systems for diverse ML-based applications. +
+
+ comment: 7 pages, 13 figures, 1 table +
+
+
+
+
+ + ♻ ☆ Learning Partial Differential Equations with Deep Parallel Neural + Operator + + +
+ In recent years, solving partial differential equations has shifted the focus +of neural network research from finite-dimensional Euclidean spaces to +generalized function spaces. A novel methodology is to learn an operator as a +means of approximating the mapping between input and output function spaces. +Currently, researchers have proposed a variety of operator architectures. +Nevertheless, the majority of these architectures adopt an iterative update +scheme, whereby a single operator is learned from the same function space. In +practical physical science problems, the numerical solutions of partial +differential equations are complex, and a serial single operator is unable to +accurately approximate the intricate mapping between input and output. We +therefore propose a deep parallel operator model (DPNO) for efficiently and +accurately solving partial differential equations. DPNO employs convolutional +neural networks to extract local features and map data into distinct latent +spaces, and designs a parallel block of double Fourier neural operators to +address the iterative error problem. DPNO approximates complex mappings between +inputs and outputs by learning multiple operators in different latent spaces in +parallel blocks. DPNO achieved the best performance on five of the benchmark +datasets, with an average improvement of 10.5\%, and ranked second on one +dataset. +
+
+
+
+
+ + ♻ ☆ Generative Modelling of Structurally Constrained Graphs NeurIPS 2024 + + +
+ Graph diffusion models have emerged as state-of-the-art techniques in graph +generation; yet, integrating domain knowledge into these models remains +challenging. Domain knowledge is particularly important in real-world +scenarios, where invalid generated graphs hinder deployment in practical +applications. Unconstrained and conditioned graph diffusion models fail to +guarantee such domain-specific structural properties. We present ConStruct, a +novel framework that enables graph diffusion models to incorporate hard +constraints on specific properties, such as planarity or acyclicity. Our +approach ensures that the sampled graphs remain within the domain of graphs +that satisfy the specified property throughout the entire trajectory in both +the forward and reverse processes. This is achieved by introducing an +edge-absorbing noise model and a new projector operator. ConStruct demonstrates +versatility across several structural and edge-deletion invariant constraints +and achieves state-of-the-art performance for both synthetic benchmarks and +attributed real-world datasets. For example, by incorporating planarity +constraints in digital pathology graph datasets, the proposed method +outperforms existing baselines, improving data validity by up to 71.1 +percentage points. + +
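A standalone networkx sketch of the projector idea only: tentatively add each proposed edge and keep it only if the structural constraint (planarity, in this example) still holds, so every intermediate graph stays inside the constrained domain. ConStruct's actual projector operates inside the diffusion reverse process together with an edge-absorbing noise model; this snippet is just for intuition.

```python
import networkx as nx

def project_planar(graph: nx.Graph, candidate_edges):
    """Keep only candidate edges whose insertion preserves planarity."""
    for u, v in candidate_edges:
        graph.add_edge(u, v)
        if not nx.check_planarity(graph)[0]:
            graph.remove_edge(u, v)   # reject constraint-violating edges
    return graph
```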
+
+ comment: NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware + Omni-Modal Perception of Long Videos + + +
+ Despite impressive advancements in video understanding, most efforts remain +limited to coarse-grained or visual-only video tasks. However, real-world +videos encompass omni-modal information (vision, audio, and speech) with a +series of events forming a cohesive storyline. The lack of multi-modal video +data with fine-grained event annotations and the high cost of manual labeling +are major obstacles to comprehensive omni-modality video perception. To address +this gap, we propose an automatic pipeline consisting of high-quality +multi-modal video filtering, semantically coherent omni-modal event boundary +detection, and cross-modal correlation-aware event captioning. In this way, we +present LongVALE, the first-ever Vision-Audio-Language Event understanding +benchmark comprising 105K omni-modal events with precise temporal boundaries +and detailed relation-aware captions within 8.4K high-quality long videos. +Further, we build a baseline that leverages LongVALE to enable video large +language models (LLMs) for omni-modality fine-grained temporal video +understanding for the first time. Extensive experiments demonstrate the +effectiveness and great potential of LongVALE in advancing comprehensive +multi-modal video understanding. + +
+
+ comment: 18 pages, 15 figures +
+
+
+
+
+ + ♻ ☆ Automated Federated Pipeline for Parameter-Efficient Fine-Tuning of + Large Language Models + + +
+ Recently, there has been a surge in the development of advanced intelligent +generative content (AIGC), especially large language models (LLMs). However, +for many downstream tasks, it is necessary to fine-tune LLMs using private +data. While federated learning offers a promising privacy-preserving solution +to LLM fine-tuning, the substantial size of an LLM, combined with high +computational and communication demands, makes it hard to apply to downstream +tasks. More importantly, private edge servers often possess varying computing +and network resources in real-world scenarios, introducing additional +complexities to LLM fine-tuning. To tackle these problems, we design and +implement an automated federated pipeline, named FedPipe, to fine-tune LLMs +with minimal training cost but without adding any inference latency. FedPipe +firstly identifies the weights to be fine-tuned based on their contributions to +the LLM training. It then configures a low-rank adapter for each selected +weight to train local low-rank adapters on an edge server, and aggregate local +adapters of all edge servers to fine-tune the whole LLM. Finally, it +appropriately quantizes the parameters of LLM to reduce memory space according +to the requirements of edge servers. Extensive experiments demonstrate that +FedPipe expedites the model training and achieves higher accuracy than +state-of-the-art benchmarks. + +
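A small PyTorch sketch of the adapter aggregation step mentioned above, written as a plain weighted average (FedAvg-style) over the low-rank adapter tensors collected from edge servers; FedPipe's actual weighting, adapter selection, and quantization steps are not reproduced here.

```python
import torch

def aggregate_lora_adapters(adapters, weights=None):
    """adapters: list of {param_name: tensor} dicts holding only LoRA A/B matrices."""
    n = len(adapters)
    weights = weights if weights is not None else [1.0 / n] * n
    agg = {name: torch.zeros_like(t) for name, t in adapters[0].items()}
    for w, adapter in zip(weights, adapters):
        for name, t in adapter.items():
            agg[name] += w * t          # weighted average across edge servers
    return agg
```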
+
+ comment: 15 pages, 16 figures +
+
+
+
+
+ + ♻ ☆ Deep Learning and Machine Learning: Advancing Big Data Analytics and + Management with Design Patterns + + +
+ This book, Design Patterns in Machine Learning and Deep Learning: Advancing +Big Data Analytics Management, presents a comprehensive study of essential +design patterns tailored for large-scale machine learning and deep learning +applications. The book explores the application of classical software +engineering patterns, Creational, Structural, Behavioral, and Concurrency +Patterns, to optimize the development, maintenance, and scalability of big data +analytics systems. Through practical examples and detailed Python +implementations, it bridges the gap between traditional object-oriented design +patterns and the unique demands of modern data analytics environments. Key +design patterns such as Singleton, Factory, Observer, and Strategy are analyzed +for their impact on model management, deployment strategies, and team +collaboration, providing invaluable insights into the engineering of efficient, +reusable, and flexible systems. This volume is an essential resource for +developers, researchers, and engineers aiming to enhance their technical +expertise in both machine learning and software design. + +
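As a flavor of the patterns the book covers, here is a minimal Python Singleton in a model-management setting: a single shared registry of loaded models so every pipeline component reuses the same instances. This is a generic illustration, not an excerpt from the book.

```python
class ModelRegistry:
    """Singleton registry of loaded models shared across a pipeline."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._models = {}
        return cls._instance

    def register(self, name, model):
        self._models[name] = model

    def get(self, name):
        return self._models[name]

# usage: every call returns the same underlying registry
ModelRegistry().register("classifier", object())
assert ModelRegistry().get("classifier") is ModelRegistry().get("classifier")
```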
+
+ comment: 138pages +
+
+
+
+
+ + ♻ ☆ PADetBench: Towards Benchmarking Physical Attacks against Object + Detection + + +
+ Physical attacks against object detection have gained increasing attention +due to their significant practical implications. However, conducting physical +experiments is extremely time-consuming and labor-intensive. Moreover, physical +dynamics and cross-domain transformation are challenging to strictly regulate +in the real world, leading to unaligned evaluation and comparison, severely +hindering the development of physically robust models. To accommodate these +challenges, we explore utilizing realistic simulation to thoroughly and +rigorously benchmark physical attacks with fairness under controlled physical +dynamics and cross-domain transformation. This resolves the problem of +capturing identical adversarial images that cannot be achieved in the real +world. Our benchmark includes 20 physical attack methods, 48 object detectors, +comprehensive physical dynamics, and evaluation metrics. We also provide +end-to-end pipelines for dataset generation, detection, evaluation, and further +analysis. In addition, we perform 8064 groups of evaluation based on our +benchmark, which includes both overall evaluation and further detailed ablation +studies for controlled physical dynamics. Through these experiments, we provide +in-depth analyses of physical attack performance and physical adversarial +robustness, draw valuable observations, and discuss potential directions for +future research. + Codebase: https://github.com/JiaweiLian/Benchmarking_Physical_Attack + +
+
+
+
+
+ + ♻ ☆ EM Distillation for One-step Diffusion Models NeurIPS 2024 + + +
+ While diffusion models can learn complex distributions, sampling requires a +computationally expensive iterative process. Existing distillation methods +enable efficient sampling, but have notable limitations, such as performance +degradation with very few sampling steps, reliance on training data access, or +mode-seeking optimization that may fail to capture the full distribution. We +propose EM Distillation (EMD), a maximum likelihood-based approach that +distills a diffusion model to a one-step generator model with minimal loss of +perceptual quality. Our approach is derived through the lens of +Expectation-Maximization (EM), where the generator parameters are updated using +samples from the joint distribution of the diffusion teacher prior and inferred +generator latents. We develop a reparametrized sampling scheme and a noise +cancellation technique that together stabilizes the distillation process. We +further reveal an interesting connection of our method with existing methods +that minimize mode-seeking KL. EMD outperforms existing one-step generative +methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares +favorably with prior work on distilling text-to-image diffusion models. + +
+
+ comment: NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized + Bilevel Optimization + + +
+ This paper studies decentralized bilevel optimization, in which multiple +agents collaborate to solve problems involving nested optimization structures +with neighborhood communications. Most existing literature primarily utilizes +gradient tracking to mitigate the influence of data heterogeneity, without +exploring other well-known heterogeneity-correction techniques such as EXTRA or +Exact Diffusion. Additionally, these studies often employ identical +decentralized strategies for both upper- and lower-level problems, neglecting +to leverage distinct mechanisms across different levels. To address these +limitations, this paper proposes SPARKLE, a unified Single-loop Primal-dual +AlgoRithm frameworK for decentraLized bilEvel optimization. SPARKLE offers the +flexibility to incorporate various heterogeneity-correction strategies into the +algorithm. Moreover, SPARKLE allows for different strategies to solve upper- +and lower-level problems. We present a unified convergence analysis for +SPARKLE, applicable to all its variants, with state-of-the-art convergence +rates compared to existing decentralized bilevel algorithms. Our results +further reveal that EXTRA and Exact Diffusion are more suitable for +decentralized bilevel optimization, and using mixed strategies in bilevel +algorithms brings more benefits than relying solely on gradient tracking. +
+
+ comment: 73 pages, the Thirty-Eighth Annual Conference on Neural Information + Processing Systems (2024) +
+
+
+
+
+ + ♻ ☆ A Simple Data Augmentation for Feature Distribution Skewed Federated + Learning + + +
+ Federated Learning (FL) facilitates collaborative learning among multiple +clients in a distributed manner and preserves data privacy. However, +its performance inevitably degrades with non-Independent and Identically +Distributed (non-IID) data. In this paper, we focus on the feature distribution +skewed FL scenario, a common non-IID situation in real-world applications where +data from different clients exhibit varying underlying distributions. This +variation leads to feature shift, which is a key issue of this scenario. While +previous works have made notable progress, few pay attention to the data +itself, i.e., the root of this issue. The primary goal of this paper is to +mitigate feature shift from the perspective of data. To this end, we propose a +simple yet remarkably effective input-level data augmentation method, namely +FedRDN, which randomly injects the statistical information of the local +distributions from the entire federation into the client's data. This helps +improve the generalization of local feature representations, +thereby mitigating feature shift. Moreover, our FedRDN is a plug-and-play +component, which can be seamlessly integrated into the data augmentation flow +with only a few lines of code. Extensive experiments on several datasets show +that the performance of various representative FL methods can be further +improved by integrating our FedRDN, demonstrating its effectiveness, strong +compatibility and generalizability. Code will be released. +
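A hedged NumPy sketch of the input-level augmentation idea: perturb a local sample with channel statistics drawn at random from those shared across the federation. The mixing rule and coefficient are illustrative assumptions; FedRDN's exact injection scheme may differ.

```python
import numpy as np

def fedrdn_augment(x, federation_stats, lam=0.1, rng=None):
    """federation_stats: list of (mean, std) statistics, one entry per client."""
    rng = rng or np.random.default_rng()
    mu, sigma = federation_stats[rng.integers(len(federation_stats))]
    noise = rng.normal(mu, sigma, size=x.shape)   # sample from another client's statistics
    return (1.0 - lam) * x + lam * noise          # lightly mix it into the local sample
```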
+
+ comment: 11 pages, 3 figures +
+
+
+
+
+ + ♻ ☆ Robot Learning with Super-Linear Scaling + + +
+ Scaling robot learning requires data collection pipelines that scale +favorably with human effort. In this work, we propose Crowdsourcing and +Amortizing Human Effort for Real-to-Sim-to-Real (CASHER), a pipeline for scaling +up data collection and learning in simulation where the performance scales +superlinearly with human effort. The key idea is to crowdsource digital twins +of real-world scenes using 3D reconstruction and collect large-scale data in +simulation, rather than in the real world. Data collection in simulation is +initially driven by RL, bootstrapped with human demonstrations. As the training +of a generalist policy progresses across environments, its generalization +capabilities can be used to replace human effort with model-generated +demonstrations. This results in a pipeline where behavioral data is collected +in simulation with continually reducing human effort. We show that CASHER +demonstrates zero-shot and few-shot scaling laws on three real-world tasks +across diverse scenarios. We show that CASHER enables fine-tuning of +pre-trained policies to a target scenario using a video scan without any +additional human effort. See our project website: +https://casher-robot-learning.github.io/CASHER/ +
+
+
+
+
+ + ♻ ☆ Graph Neural Networks for Job Shop Scheduling Problems: A Survey + + +
+ Job shop scheduling problems (JSSPs) represent a critical and challenging +class of combinatorial optimization problems. Recent years have witnessed a +rapid increase in the application of graph neural networks (GNNs) to solve +JSSPs, albeit lacking a systematic survey of the relevant literature. This +paper aims to thoroughly review prevailing GNN methods for different types of +JSSPs and the closely related flow-shop scheduling problems (FSPs), especially +those leveraging deep reinforcement learning (DRL). We begin by presenting the +graph representations of various JSSPs, followed by an introduction to the most +commonly used GNN architectures. We then review current GNN-based methods for +each problem type, highlighting key technical elements such as graph +representations, GNN architectures, GNN tasks, and training algorithms. +Finally, we summarize and analyze the advantages and limitations of GNNs in +solving JSSPs and provide potential future research opportunities. We hope this +survey can motivate and inspire innovative approaches for more powerful +GNN-based approaches in tackling JSSPs and other scheduling problems. + +
+
+ comment: Accepted by Computers & Operations Research +
+
+
+
+
+ + ♻ ☆ FAMES: Fast Approximate Multiplier Substitution for Mixed-Precision + Quantized DNNs--Down to 2 Bits! + + +
+ A widely-used technique in designing energy-efficient deep neural network +(DNN) accelerators is quantization. Recent progress in this direction has +reduced the bitwidths used in DNN down to 2. Meanwhile, many prior works apply +approximate multipliers (AppMuls) in designing DNN accelerators to lower their +energy consumption. Unfortunately, these works still assume a bitwidth much +larger than 2, which falls far behind the state-of-the-art in quantization area +and even challenges the meaningfulness of applying AppMuls in DNN accelerators, +since a high-bitwidth AppMul consumes much more energy than a low-bitwidth +exact multiplier! Thus, an important problem to study is: Can approximate +multipliers be effectively applied to quantized DNN models with very low +bitwidths? In this work, we give an affirmative answer to this question and +present a systematic solution that achieves the answer: FAMES, a fast +approximate multiplier substitution method for mixed-precision DNNs. Our +experiments demonstrate an average 28.67% energy reduction on state-of-the-art +mixed-precision quantized models with bitwidths as low as 2 bits and accuracy +losses kept under 1%. Additionally, our approach is up to 300x faster than +previous genetic algorithm-based methods. + +
+
+
+
+
+ + ♻ ☆ URVFL: Undetectable Data Reconstruction Attack on Vertical Federated + Learning NDSS 2025 + + +
+ Launching effective malicious attacks in vertical federated learning (VFL) presents unique challenges: 1) +given the distributed nature of clients' data features and models, +each client rigorously guards its privacy and prohibits direct querying, +complicating any attempts to steal data; 2) existing malicious attacks alter +the underlying VFL training task, and are hence easily detected by comparing +the received gradients with the ones received in honest training. To overcome +these challenges, we develop URVFL, a novel attack strategy that evades current +detection mechanisms. The key idea is to integrate a discriminator with an +auxiliary classifier that takes full advantage of the label information and +generates malicious gradients for the victim clients: on one hand, label +information helps to better characterize embeddings of samples from distinct +classes, yielding an improved reconstruction performance; on the other hand, +computing malicious gradients with label information better mimics the honest +training, making the malicious gradients indistinguishable from the honest +ones, and the attack much more stealthy. Our comprehensive experiments +demonstrate that URVFL significantly outperforms existing attacks, and +successfully circumvents SOTA detection methods for malicious attacks. +Additional ablation studies and evaluations on defenses further underscore the +robustness and effectiveness of URVFL. Our code will be available at +https://github.com/duanyiyao/URVFL. + +
+
+ comment: Accepted by NDSS 2025 +
+
+
+
+
+ + ♻ ☆ Does Deep Active Learning Work in the Wild? + + +
+ Deep active learning (DAL) methods have shown significant improvements in +sample efficiency compared to simple random sampling. While these studies are +valuable, they nearly always assume that optimal DAL hyperparameter (HP) +settings are known in advance, or optimize the HPs through repeating DAL +several times with different HP settings. Here, we argue that in real-world +settings, or in the wild, there is significant uncertainty regarding good HPs, +and their optimization contradicts the premise of using DAL (i.e., we require +labeling efficiency). In this study, we evaluate the performance of eleven +modern DAL methods on eight benchmark problems as we vary a key HP shared by +all methods: the pool ratio. Despite adjusting only one HP, our results +indicate that eight of the eleven DAL methods sometimes underperform relative +to simple random sampling and some frequently perform worse. Only three methods +always outperform random sampling (albeit narrowly), and we find that these +methods all utilize diversity to select samples - a relatively simple +criterion. Our findings reveal the limitations of existing DAL methods when +deployed in the wild, and present this as an important new open problem in the +field. + +
+
+
+
+
+ + ♻ ☆ A Water Efficiency Dataset for African Data Centers NeurIPS 2024 + + +
+ AI computing and data centers consume a large amount of freshwater, both +directly for cooling and indirectly for electricity generation. While most +attention has been paid to developed countries such as the U.S., this paper +presents the first-of-its-kind dataset that combines nation-level weather and +electricity generation data to estimate water usage efficiency for data centers +in 41 African countries across five different climate regions. We also use our +dataset to evaluate and estimate the water consumption of inference on two +large language models (i.e., Llama-3-70B and GPT-4) in 11 selected African +countries. Our findings show that writing a 10-page report using Llama-3-70B +could consume about 0.7 liters of water, while the water consumption +by GPT-4 for the same task may go up to about 60 liters. For writing a +medium-length email of 120-200 words, Llama-3-70B and GPT-4 could consume about +0.13 liters and 3 liters of water, respectively. Interestingly, given +the same AI model, 8 out of the 11 selected African countries consume less +water than the global average, mainly because of lower water intensities for +electricity generation. However, water consumption can be substantially higher +in some African countries with a steppe climate than the U.S. and global +averages, prompting more attention when deploying AI computing in these +countries. Our dataset is publicly available on Hugging Face: +https://huggingface.co/datasets/masterlion/WaterEfficientDatasetForAfricanCountries/tree/main + +
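+ As a rough illustration of how such per-task figures can be assembled, one can multiply the
+electricity used by a request by the on-site cooling water and the off-site water embedded in
+electricity generation. The sketch below uses purely hypothetical placeholder values, not numbers
+from the paper or the dataset:
+
+# Back-of-the-envelope water estimate for one LLM task (all values hypothetical).
+energy_kwh = 0.3       # assumed electricity for generating a 10-page report
+onsite_wue = 0.5       # liters of cooling water per kWh at the data center
+offsite_wif = 1.8      # liters per kWh embedded in electricity generation
+water_liters = energy_kwh * (onsite_wue + offsite_wif)
+print(f"~{water_liters:.2f} liters of freshwater")  # ~0.69 liters with these inputs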
+
+ comment: Accepted by NeurIPS 2024 Workshop on Tackling Climate Change with + Machine Learning +
+
+
+
+
+ + ♻ ☆ Comprehensive framework for evaluation of deep neural networks in + detection and quantification of lymphoma from PET/CT images: clinical + insights, pitfalls, and observer agreement analyses + + +
+ This study addresses critical gaps in automated lymphoma segmentation from +PET/CT images, focusing on issues often overlooked in existing literature. +While deep learning has been applied for lymphoma lesion segmentation, few +studies incorporate out-of-distribution testing, raising concerns about model +generalizability across diverse imaging conditions and patient populations. We +highlight the need to compare model performance with expert human annotators, +including intra- and inter-observer variability, to understand task difficulty +better. Most approaches focus on overall segmentation accuracy but overlook +lesion-specific measures important for precise lesion detection and disease +quantification. To address these gaps, we propose a clinically relevant +framework for evaluating deep segmentation networks. Using this lesion +measure-specific evaluation, we assess the performance of four deep networks +(ResUNet, SegResNet, DynUNet, and SwinUNETR) across 611 cases from +multi-institutional datasets, covering various lymphoma subtypes and lesion +characteristics. Beyond standard metrics like the Dice similarity coefficient, +we evaluate clinical lesion measures and their prediction errors. We also +introduce detection criteria for lesion localization and propose a new +detection Criterion 3 based on metabolic characteristics. We show that networks +perform better on large, intense lesions with higher metabolic activity. +Finally, we compare network performance to physicians via intra- and +inter-observer variability analyses, demonstrating that network errors closely +resemble those made by experts, i.e., the small and faint lesions remain +challenging for both humans and networks. This study aims to improve automated +lesion segmentation's clinical relevance, supporting better treatment decisions +for lymphoma patients. The code is available at: +https://github.com/microsoft/lymphoma-segmentation-dnn. + +
+
+ comment: 32 pages, 15 figures, 5 tables +
+
+
+
+
+ + ♻ ☆ Local Curvature Smoothing with Stein's Identity for Efficient Score + Matching NeurIPS 2024 + + +
+ The training of score-based diffusion models (SDMs) is based on score +matching. The challenge of score matching is that it includes a computationally +expensive Jacobian trace. While several methods have been proposed to avoid +this computation, each has drawbacks, such as instability during training or +recasting the objective as learning a denoising vector field rather than the +true score. We propose a novel score matching variant, local curvature +smoothing with Stein's identity (LCSS). LCSS bypasses the Jacobian trace by +applying Stein's identity, enabling effective regularization and efficient +computation. We show that LCSS surpasses existing methods in sample generation +performance and matches the performance of denoising score matching, widely +adopted by most SDMs, in evaluations such as FID, Inception score, and bits per +dimension. Furthermore, we show that LCSS enables realistic image generation +even at a high resolution of $1024 \times 1024$. + +
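+ For context on the cost the paper removes: the vanilla score-matching objective contains a
+Jacobian trace, which in practice is often handled with a stochastic Hutchinson-style estimator, as
+in the PyTorch sketch below. This illustrates the expensive term that LCSS bypasses; it is not the
+LCSS objective itself, and score_net is a placeholder model mapping inputs to scores.
+
+import torch
+
+def score_matching_loss_hutchinson(score_net, x, n_probes=1):
+    # Vanilla score matching: E[ tr(ds(x)/dx) + 0.5 * ||s(x)||^2 ],
+    # with the trace estimated as E_v[ v^T (ds/dx) v ] using random probes.
+    x = x.requires_grad_(True)
+    s = score_net(x)                                   # (B, D) estimated scores
+    norm_term = 0.5 * (s ** 2).sum(dim=1)
+    trace_est = torch.zeros(x.shape[0], device=x.device)
+    for _ in range(n_probes):
+        v = torch.randn_like(x)                        # Gaussian (or Rademacher) probe
+        (vjp,) = torch.autograd.grad(s, x, grad_outputs=v, create_graph=True)
+        trace_est = trace_est + (v * vjp).sum(dim=1)   # ~ v^T J v per sample
+    return (norm_term + trace_est / n_probes).mean()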
+
+ comment: Accepted at NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Investigating Self-Supervised Image Denoising with Denaturation + + +
+ Self-supervised learning is a crucial approach for image denoising when the +noisy data are also subject to denaturation. However, theoretical understanding +of how methods that use denatured data perform is lacking. To provide a better +understanding of this approach, in this paper we analyze a self-supervised +denoising algorithm that uses denatured data in depth, through theoretical +analysis and numerical experiments. The theoretical analysis shows that the +algorithm finds desired solutions to the optimization problem with the +population risk, while the guarantee for the empirical risk depends on the +hardness of the denoising task in terms of denaturation levels. We also conduct +several experiments to investigate the performance of an extended algorithm in +practice. The results indicate that training the algorithm with denatured +images works, and that the empirical performance aligns with the theoretical +results. These results suggest several insights for further improving +self-supervised image denoising methods that use denatured data. + +
+
+ comment: PDF v3 has an incorrect license; v4 has the correct license +
+
+
+
+
+ + ♻ ☆ Fast Sampling via Discrete Non-Markov Diffusion Models with + Predetermined Transition Time NeurIPS 2024 + + +
+ Discrete diffusion models have emerged as powerful tools for high-quality +data generation. Despite their success in discrete spaces, such as text +generation tasks, the acceleration of discrete diffusion models remains +under-explored. In this paper, we propose discrete non-Markov diffusion models +(DNDM), which naturally induce the predetermined transition time set. This +enables a training-free sampling algorithm that significantly reduces the +number of function evaluations (i.e., calls to the neural network), making the +sampling process much faster. Furthermore, we study the transition from finite +to infinite step sampling, offering new insights into bridging the gap between +discrete and continuous-time processes for discrete diffusion models. Extensive +experiments on natural language generation and machine translation tasks +demonstrate the superior performance of our method in terms of both generation +speed and sample quality compared to existing methods for discrete diffusion +models. + +
+
+ comment: 36 pages, 5 figures, 13 tables. In NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Matching the Statistical Query Lower Bound for $k$-Sparse Parity + Problems with Sign Stochastic Gradient Descent NeurIPS 2024 + + +
+ The $k$-sparse parity problem is a classical problem in computational +complexity and algorithmic theory, serving as a key benchmark for understanding +computational classes. In this paper, we solve the $k$-sparse parity problem +with sign stochastic gradient descent, a variant of stochastic gradient descent +(SGD) on two-layer fully-connected neural networks. We demonstrate that this +approach can efficiently solve the $k$-sparse parity problem on a +$d$-dimensional hypercube ($k\leq O(\sqrt{d})$) with a sample complexity of +$\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, matching the established +$\Omega(d^{k})$ lower bounds of Statistical Query (SQ) models. Our theoretical +analysis begins by constructing a good neural network capable of correctly +solving the $k$-parity problem. We then demonstrate how a trained neural +network with sign SGD can effectively approximate this good network, solving +the $k$-parity problem with small statistical errors. To the best of our +knowledge, this is the first result that matches the SQ lower bound for solving +$k$-sparse parity problem using gradient-based methods. + +
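+ A toy sketch of the training setup described above: a two-layer fully-connected network trained
+with sign SGD (each parameter moves by the sign of its gradient) on a k-sparse parity over the
+hypercube. The dimensions, width, loss, and learning rate are illustrative choices, not the paper's.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+def sign_sgd_step(model, loss, lr):
+    # Sign SGD: move each parameter by lr times the sign of its gradient.
+    model.zero_grad()
+    loss.backward()
+    with torch.no_grad():
+        for p in model.parameters():
+            p -= lr * p.grad.sign()
+
+d, k, width = 20, 3, 64
+net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))
+x = torch.randint(0, 2, (4096, d)).float() * 2 - 1       # inputs on the hypercube {-1, +1}^d
+y = x[:, :k].prod(dim=1, keepdim=True)                   # parity of the first k coordinates
+for step in range(2000):
+    loss = F.mse_loss(net(x), y)                         # a squared loss; other losses also work
+    sign_sgd_step(net, loss, lr=1e-2)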
+
+ comment: 37 pages, 7 figures, 3 tables. In NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ SimMLP: Training MLPs on Graphs without Supervision + + +
+ Graph Neural Networks (GNNs) have demonstrated their effectiveness in various +graph learning tasks, yet their reliance on neighborhood aggregation during +inference poses challenges for deployment in latency-sensitive applications, +such as real-time financial fraud detection. To address this limitation, recent +studies have proposed distilling knowledge from teacher GNNs into student +Multi-Layer Perceptrons (MLPs) trained on node content, aiming to accelerate +inference. However, these approaches often inadequately explore structural +information when inferring unseen nodes. To this end, we introduce SimMLP, a +Self-supervised framework for learning MLPs on graphs, designed to fully +integrate rich structural information into MLPs. Notably, SimMLP is the first +MLP-learning method that can achieve equivalence to GNNs in the optimal case. +The key idea is to employ self-supervised learning to align the representations +encoded by graph context-aware GNNs and neighborhood dependency-free MLPs, +thereby fully integrating the structural information into MLPs. We provide a +comprehensive theoretical analysis, demonstrating the equivalence between +SimMLP and GNNs based on mutual information and inductive bias, highlighting +SimMLP's advanced structural learning capabilities. Additionally, we conduct +extensive experiments on 20 benchmark datasets, covering node classification, +link prediction, and graph classification, to showcase SimMLP's superiority +over state-of-the-art baselines, particularly in scenarios involving unseen +nodes (e.g., inductive and cold-start node classification) where structural +insights are crucial. Our codes are available at: +https://github.com/Zehong-Wang/SimMLP. + +
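+ The core idea of distilling structure into an MLP can be sketched as aligning node embeddings
+from a structure-aware GNN with embeddings from a structure-free MLP via a self-supervised loss, so
+that only the MLP is needed at inference. The toy GNN, cosine alignment loss, and random graph below
+are illustrative; this is not SimMLP's exact architecture or objective.
+
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+
+class MeanAggGNN(nn.Module):
+    # Minimal 1-layer GNN: mean-aggregate neighbor features, then project.
+    def __init__(self, in_dim, hid_dim):
+        super().__init__()
+        self.proj = nn.Linear(in_dim, hid_dim)
+    def forward(self, x, adj):                 # adj: (N, N) dense adjacency
+        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
+        return self.proj(adj @ x / deg)
+
+def alignment_loss(x, adj, gnn, mlp):
+    # Align structure-aware (GNN) and structure-free (MLP) node embeddings.
+    z_gnn = F.normalize(gnn(x, adj), dim=-1)
+    z_mlp = F.normalize(mlp(x), dim=-1)
+    return (1 - (z_gnn * z_mlp).sum(dim=-1)).mean()      # cosine alignment per node
+
+# At inference only the MLP is used, so no neighborhood aggregation is needed.
+N, D, H = 100, 32, 64
+x, adj = torch.randn(N, D), (torch.rand(N, N) > 0.9).float()
+gnn, mlp = MeanAggGNN(D, H), nn.Sequential(nn.Linear(D, H), nn.ReLU(), nn.Linear(H, H))
+loss = alignment_loss(x, adj, gnn, mlp)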
+
+ comment: New Version: arXiv:2412.03864 +
+
+
+
+
+ + ♻ ☆ Scaling Inference-Time Search with Vision Value Model for Improved + Visual Comprehension + + +
+ Despite significant advancements in vision-language models (VLMs), effective +approaches for enhancing response quality by scaling inference-time computation +are still lacking. This capability is known to be a core step towards +self-improving models in recent large language model studies. In this +paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time +search to generate responses with better visual comprehension. Specifically, +VisVM not only evaluates the generated sentence quality in the current search +step, but also anticipates the quality of subsequent sentences that may result +from the current step, thus providing a long-term value. In this way, VisVM +steers VLMs away from generating sentences prone to hallucinations or +insufficient detail, thereby producing higher quality responses. Experimental +results demonstrate that VisVM-guided search significantly enhances VLMs' +ability to generate descriptive captions with richer visual details and fewer +hallucinations, compared with greedy decoding and search methods with other +visual reward signals. Furthermore, we find that self-training the model with +the VisVM-guided captions improves the VLM's performance across a wide range of +multimodal benchmarks, indicating the potential for developing self-improving +VLMs. Our value model and code are available at +https://github.com/si0wang/VisVM. + +
+
+
+
+
+
+
+
+ + Multimedia 5 + +
+
+
+ + ☆ LinVT: Empower Your Image-level Large Language Model to Understand + Videos + + +
+ Large Language Models (LLMs) have been widely used in various tasks, +motivating us to develop an LLM-based assistant for videos. Instead of training +from scratch, we propose a module to transform arbitrary well-trained +image-based LLMs into video-LLMs (after being trained on video data). To better +adapt image-LLMs for processing videos, we introduce two design principles: +linear transformation to preserve the original visual-language alignment and +representative information condensation from redundant video content. Guided by +these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), +which enables existing image-LLMs to understand videos. We benchmark LinVT with +six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, +showcasing the high compatibility of LinVT. LinVT-based LLMs achieve +state-of-the-art performance across various video benchmarks, illustrating the +effectiveness of LinVT in multi-modal video understanding. + +
+
+
+
+
+ + ☆ SMIC: Semantic Multi-Item Compression based on CLIP dictionary + + +
+ Semantic compression, a compression scheme where the distortion metric, +typically MSE, is replaced with semantic fidelity metrics, is becoming +increasingly popular. Most recent semantic compression schemes rely on the +foundation model CLIP. In this work, we extend such a scheme to image +collection compression, where inter-item redundancy is taken into account +during the coding phase. For that purpose, we first show that CLIP's latent +space allows for easy semantic additions and subtractions. From this property, +we define a dictionary-based multi-item codec that outperforms state-of-the-art +generative codecs in terms of compression rate, around $10^{-5}$ BPP per image, +while not sacrificing semantic fidelity. We also show that the learned +dictionary is semantic in nature and acts as a projector for the +semantic content of images. + +
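+ One plausible way to picture a dictionary-based multi-item codec over CLIP embeddings is to
+represent each image embedding by coarsely quantized coefficients over a small set of shared atoms,
+as sketched below. The dictionary here is random rather than learned, and the coefficient coding is
+illustrative; this is not the paper's actual codec.
+
+import numpy as np
+
+rng = np.random.default_rng(0)
+D, K = 512, 64                                   # embedding dim, number of dictionary atoms
+dictionary = rng.standard_normal((K, D))
+dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)
+
+def encode(clip_embedding, n_bits=4):
+    # Least-squares coefficients over the shared atoms, then coarse quantization.
+    coeffs, *_ = np.linalg.lstsq(dictionary.T, clip_embedding, rcond=None)
+    scale = np.abs(coeffs).max() + 1e-8
+    levels = 2 ** (n_bits - 1)
+    q = np.clip(np.round(coeffs / scale * levels), -levels, levels - 1)
+    return q.astype(np.int8), scale
+
+def decode(q, scale, n_bits=4):
+    coeffs = q.astype(np.float32) / (2 ** (n_bits - 1)) * scale
+    return coeffs @ dictionary                   # reconstructed embedding in CLIP space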
+
+ comment: 12 pages, 14 figures, 3 tables, journal paper, preprint +
+
+
+
+
+ + ☆ Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval + with Semantic Guidance NeurIPS 2024 + + +
+ Modern music retrieval systems often rely on fixed representations of user +preferences, limiting their ability to capture users' diverse and uncertain +retrieval needs. To address this limitation, we introduce Diff4Steer, a novel +generative retrieval framework that employs lightweight diffusion models to +synthesize diverse seed embeddings from user queries that represent potential +directions for music exploration. Unlike deterministic methods that map user +query to a single point in embedding space, Diff4Steer provides a statistical +prior on the target modality (audio) for retrieval, effectively capturing the +uncertainty and multi-faceted nature of user preferences. Furthermore, +Diff4Steer can be steered by image or text inputs, enabling more flexible and +controllable music discovery combined with nearest neighbor search. Our +framework outperforms deterministic regression methods and LLM-based generative +retrieval baseline in terms of retrieval and ranking metrics, demonstrating its +effectiveness in capturing user preferences, leading to more diverse and +relevant recommendations. Listening examples are available at +tinyurl.com/diff4steer. + +
+
+ comment: NeurIPS 2024 Creative AI Track +
+
+
+
+
+ + ♻ ☆ LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware + Omni-Modal Perception of Long Videos + + +
+ Despite impressive advancements in video understanding, most efforts remain +limited to coarse-grained or visual-only video tasks. However, real-world +videos encompass omni-modal information (vision, audio, and speech) with a +series of events forming a cohesive storyline. The lack of multi-modal video +data with fine-grained event annotations and the high cost of manual labeling +are major obstacles to comprehensive omni-modality video perception. To address +this gap, we propose an automatic pipeline consisting of high-quality +multi-modal video filtering, semantically coherent omni-modal event boundary +detection, and cross-modal correlation-aware event captioning. In this way, we +present LongVALE, the first-ever Vision-Audio-Language Event understanding +benchmark comprising 105K omni-modal events with precise temporal boundaries +and detailed relation-aware captions within 8.4K high-quality long videos. +Further, we build a baseline that leverages LongVALE to enable video large +language models (LLMs) for omni-modality fine-grained temporal video +understanding for the first time. Extensive experiments demonstrate the +effectiveness and great potential of LongVALE in advancing comprehensive +multi-modal video understanding. + +
+
+ comment: 18 pages, 15 figures +
+
+
+
+
+ + ♻ ☆ TopoCode: Topologically Informed Error Detection and Correction in + Communication Systems + + +
+ Traditional error detection and correction codes focus on bit-level fidelity, +which is insufficient for emerging technologies like eXtended Reality (XR) and +holographic communications requiring high-data-rate, low-latency systems. +Bit-level metrics cannot comprehensively evaluate Quality-of-Service (QoS) in +these scenarios. This letter proposes TopoCode which leverages Topological Data +Analysis (TDA) and persistent homology to encode topological information for +message-level error detection and correction. It introduces minimal redundancy +while enabling effective data reconstruction, especially in low Signal-to-Noise +Ratio (SNR) conditions. TopoCode offers a promising approach to meet the +demands of next-generation communication systems prioritizing semantic accuracy +and message-level integrity. + +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Computation and Language 78 + +
+
+
+ + ☆ SWEPO: Simultaneous Weighted Preference Optimization for Group + Contrastive Alignment + + +
+ We introduce Simultaneous Weighted Preference Optimization (SWEPO), a novel +extension of Direct Preference Optimization (DPO) designed to accommodate +multiple dynamically chosen positive and negative responses for each query. +SWEPO employs a weighted group contrastive loss, assigning weights to responses +based on their deviation from the mean reward score. This approach effectively +prioritizes responses that are significantly better or worse than the average, +enhancing optimization. Our theoretical analysis demonstrates that +simultaneously considering multiple preferences reduces alignment bias, +resulting in more robust alignment. Additionally, we provide insights into the +training dynamics of our loss function and a related function, InfoNCA. +Empirical validation on the UltraFeedback dataset establishes SWEPO as +state-of-the-art, with superior performance in downstream evaluations using the +AlpacaEval dataset. + +
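+ One plausible form of a weighted group contrastive objective over several responses to the same
+query is sketched below: responses above the mean reward act as positives, and each response is
+weighted by how far its reward deviates from the mean. The weighting scheme, temperature, and beta
+are assumptions for illustration, not the exact SWEPO loss.
+
+import torch
+
+def weighted_group_contrastive_loss(logps, rewards, beta=0.1, tau=1.0):
+    # logps:   (G,) policy log-probs of G responses to one query
+    # rewards: (G,) scalar reward scores for those responses
+    dev = rewards - rewards.mean()
+    pos = dev > 0                                         # above-average responses as positives
+    w = torch.softmax(dev.abs() / tau, dim=0)             # larger deviation -> larger weight
+    scores = beta * logps + torch.log(w + 1e-12)
+    log_z = torch.logsumexp(scores, dim=0)                # all responses
+    log_pos = torch.logsumexp(scores[pos], dim=0)         # positive responses only
+    return -(log_pos - log_z)                             # push probability mass toward positives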
+
+
+
+
+ + BigDocs: An Open and Permissively-Licensed Dataset for Training + Multimodal Models on Document and Code Tasks + + +
+ Multimodal AI has the potential to significantly enhance +document-understanding tasks, such as processing receipts, understanding +workflows, extracting data from documents, and summarizing reports. Code +generation tasks that require long-structured outputs can also be enhanced by +multimodality. Despite this, their use in commercial applications is often +constrained by limited access to training data and by restrictive licensing, which +hinders open access. To address these limitations, we introduce BigDocs-7.5M, a +high-quality, open-access dataset comprising 7.5 million multimodal documents +across 30 tasks. We use an efficient data curation process to ensure our data +is high-quality and license-permissive. Our process emphasizes accountability, +responsibility, and transparency through filtering rules, traceable metadata, +and careful content analysis. Additionally, we introduce BigDocs-Bench, a +benchmark suite with 10 novel tasks where we create datasets that reflect +real-world use cases involving reasoning over Graphical User Interfaces (GUI) +and code generation from images. Our experiments show that training with +BigDocs-Bench improves average performance by up to 25.8% over closed-source +GPT-4o in document reasoning and structured output tasks such as +Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a +preference for outputs from models trained on BigDocs over GPT-4o. This +suggests that BigDocs can help both academics and the open-source community +utilize and improve AI tools to enhance multimodal capabilities and document +reasoning. The project is hosted at https://bigdocs.github.io . + +
+
+ comment: The project is hosted at https://bigdocs.github.io +
+
+
+
+
+ + ☆ Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization + + +
+ Neural networks often favor shortcut heuristics based on surface-level +patterns. As one example, language models (LMs) behave like n-gram models early +in training. However, to correctly apply grammatical rules, LMs must rely on +hierarchical syntactic representations instead of n-grams. In this work, we use +case studies of English grammar to explore how latent structure in training +data drives models toward improved out-of-distribution (OOD) generalization. We +then investigate how data composition can lead to inconsistent OOD behavior +across random seeds and to unstable training dynamics. Our results show that +models stabilize in their OOD behavior only when they fully commit to either a +surface-level linear rule or a hierarchical rule. The hierarchical rule, +furthermore, is induced by grammatically complex sequences with deep embedding +structures, whereas the linear rule is induced by simpler sequences. When the +data contains a mix of simple and complex examples, potential rules compete; +each independent training run either stabilizes by committing to a single rule +or remains unstable in its OOD behavior. These conditions lead 'stable seeds' +to cluster around simple rules, forming bimodal performance distributions +across seeds. We also identify an exception to the relationship between +stability and generalization: models which memorize patterns from low-diversity +training data can overfit stably, with different rules for memorized and +unmemorized patterns. Our findings emphasize the critical role of training data +in shaping generalization patterns and how competition between data subsets +contributes to inconsistent generalization outcomes across random seeds. Code +is available at https://github.com/sunnytqin/concept_comp.git. + +
+
+
+
+
+ + ☆ Extractive Structures Learned in Pretraining Enable Generalization on + Finetuned Facts + + +
+ Pretrained language models (LMs) can generalize to implications of facts that +they are finetuned on. For example, if finetuned on ``John Doe lives in Tokyo," +LMs can correctly answer ``What language do the people in John Doe's city +speak?'' with ``Japanese''. However, little is known about the mechanisms that +enable this generalization or how they are learned during pretraining. We +introduce extractive structures as a framework for describing how components in +LMs (e.g., MLPs or attention heads) coordinate to enable this generalization. +The structures consist of informative components that store training facts as +weight changes, and upstream and downstream extractive components that query +and process the stored information to produce the correct implication. We +hypothesize that extractive structures are learned during pretraining when +encountering implications of previously known facts. This yields two +predictions: a data ordering effect where extractive structures can be learned +only if facts precede their implications, and a weight grafting effect where +extractive structures can be transferred to predict counterfactual +implications. We empirically demonstrate these phenomena in the OLMo-7b, Llama +3-8b, Gemma 2-9b, and Qwen 2-7b models. Of independent interest, our results +also indicate that fact learning can occur at both early and late layers, which +lead to different forms of generalization. + +
+
+
+
+
+ + ☆ Semantic Consistency-Based Uncertainty Quantification for Factuality in + Radiology Report Generation + + +
+ Radiology report generation (RRG) has shown great potential in assisting +radiologists by automating the labor-intensive task of report writing. While +recent advancements have improved the quality and coherence of generated +reports, ensuring their factual correctness remains a critical challenge. +Although generative medical Vision Large Language Models (VLLMs) have been +proposed to address this issue, these models are prone to hallucinations and +can produce inaccurate diagnostic information. To address these concerns, we +introduce a novel Semantic Consistency-Based Uncertainty Quantification +framework that provides both report-level and sentence-level uncertainties. +Unlike existing approaches, our method does not require modifications to the +underlying model or access to its inner state, such as output token logits, +thus serving as a plug-and-play module that can be seamlessly integrated with +state-of-the-art models. Extensive experiments demonstrate the efficacy of our +method in detecting hallucinations and enhancing the factual accuracy of +automatically generated radiology reports. By abstaining from high-uncertainty +reports, our approach improves factuality scores by $10$%, achieved by +rejecting $20$% of reports using the Radialog model on the MIMIC-CXR dataset. +Furthermore, sentence-level uncertainty flags the lowest-precision sentence in +each report with an $82.9$% success rate. + +
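+ A minimal plug-and-play sketch of consistency-based uncertainty: sample several candidate reports
+from the generator, embed them with any off-the-shelf sentence encoder, and treat low mutual
+similarity as high uncertainty (and weakly supported sentences as candidates for sentence-level
+flags). The encoder choice and scoring below are assumptions, not the paper's exact formulation.
+
+import numpy as np
+from sentence_transformers import SentenceTransformer   # any sentence encoder works here
+
+embedder = SentenceTransformer("all-MiniLM-L6-v2")
+
+def report_uncertainty(candidate_reports):
+    # Low average pairwise similarity across sampled reports -> high uncertainty.
+    emb = embedder.encode(candidate_reports, normalize_embeddings=True)
+    sim = emb @ emb.T
+    n = len(candidate_reports)                           # assumes n >= 2
+    mean_pairwise = (sim.sum() - n) / (n * (n - 1))      # exclude self-similarity
+    return 1.0 - mean_pairwise
+
+def sentence_uncertainties(report, candidate_reports):
+    # How weakly each sentence of the report is supported by the other samples.
+    sents = [s.strip() for s in report.split(".") if s.strip()]
+    sent_emb = embedder.encode(sents, normalize_embeddings=True)
+    cand_emb = embedder.encode(candidate_reports, normalize_embeddings=True)
+    support = (sent_emb @ cand_emb.T).max(axis=1)        # best match in any candidate
+    return dict(zip(sents, 1.0 - support))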
+
+
+
+
+ + ☆ Formulation of probability theory problem with subtle condition + + +
+ Problems in probability theory are among the most challenging for +students. Here, we formulate and discuss four related problems in probability +theory that proved difficult for first to fourth-year undergraduate students +whose first language was not English. These examples emphasize how crucial it +is to understand the conditions and requirements of the problems precisely +before starting to solve them. We discuss the solutions to those problems in +detail, complement them with numerical estimations, and link the conditions in +the problems to logical statements in the Python programming language. We also +tested two widely used chatbots (GPT-4o and Claude 3.5 Sonnet) by checking +their responses to these problems. + +
+
+ comment: 7 pages +
+
+
+
+
+ + ☆ Show, Don't Tell: Uncovering Implicit Character Portrayal using LLMs + + +
+ Tools for analyzing character portrayal in fiction are valuable for writers +and literary scholars in developing and interpreting compelling stories. +Existing tools, such as visualization tools for analyzing fictional characters, +primarily rely on explicit textual indicators of character attributes. However, +portrayal is often implicit, revealed through actions and behaviors rather than +explicit statements. We address this gap by leveraging large language models +(LLMs) to uncover implicit character portrayals. We start by generating a +dataset for this task with greater cross-topic similarity, lexical diversity, +and narrative lengths than existing narrative text corpora such as TinyStories +and WritingPrompts. We then introduce LIIPA (LLMs for Inferring Implicit +Portrayal for Character Analysis), a framework for prompting LLMs to uncover +character portrayals. LIIPA can be configured to use various types of +intermediate computation (character attribute word lists, chain-of-thought) to +infer how fictional characters are portrayed in the source text. We find that +LIIPA outperforms existing approaches, and is more robust to increasing +character counts (number of unique persons depicted) due to its ability to +utilize full narrative context. Lastly, we investigate the sensitivity of +portrayal estimates to character demographics, identifying a fairness-accuracy +tradeoff among methods in our LIIPA framework -- a phenomenon familiar within +the algorithmic fairness literature. Despite this tradeoff, all LIIPA variants +consistently outperform non-LLM baselines in both fairness and accuracy. Our +work demonstrates the potential benefits of using LLMs to analyze complex +characters and to better understand how implicit portrayal biases may manifest +in narrative texts. + +
+
+
+
+
+ + ☆ Give me Some Hard Questions: Synthetic Data Generation for Clinical QA ML4H 2024 + + +
+ Clinical Question Answering (QA) systems enable doctors to quickly access +patient information from electronic health records (EHRs). However, training +these systems requires significant annotated data, which is limited due to the +expertise needed and the privacy concerns associated with clinical data. This +paper explores generating Clinical QA data using large language models (LLMs) +in a zero-shot setting. We find that naive prompting often results in easy +questions that do not reflect the complexity of clinical scenarios. To address +this, we propose two prompting strategies: 1) instructing the model to generate +questions that do not overlap with the input context, and 2) summarizing the +input record using a predefined schema to scaffold question generation. +Experiments on two Clinical QA datasets demonstrate that our method generates +more challenging questions, significantly improving fine-tuning performance +over baselines. We compare synthetic and gold data and find a gap in their +training efficacy that results from the quality of the synthetically generated +answers. + +
+
+ comment: Accepted to ML4H 2024 Findings +
+
+
+
+
+ + ☆ VisionZip: Longer is Better but Not Necessary in Vision Language Models + + +
+ Recent advancements in vision-language models have enhanced performance by +increasing the length of visual tokens, making them much longer than text +tokens and significantly raising computational costs. However, we observe that +the visual tokens generated by popular vision encoders, such as CLIP and +SigLIP, contain significant redundancy. To address this, we introduce +VisionZip, a simple yet effective method that selects a set of informative +tokens for input to the language model, reducing visual token redundancy and +improving efficiency while maintaining model performance. The proposed +VisionZip can be widely applied to image and video understanding tasks and is +well-suited for multi-turn dialogues in real-world scenarios, where previous +methods tend to underperform. Experimental results show that VisionZip +outperforms the previous state-of-the-art method by at least 5% performance +gains across nearly all settings. Moreover, our method significantly enhances +model inference speed, improving the prefilling time by 8x and enabling the +LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while +achieving better results. Furthermore, we analyze the causes of this redundancy +and encourage the community to focus on extracting better visual features +rather than merely increasing token length. Our code is available at +https://github.com/dvlab-research/VisionZip . + +
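+ The token-selection idea can be sketched as keeping only the visual tokens that score highest
+under some informativeness signal, for example the attention they receive from the [CLS] token in
+the vision encoder, before handing them to the language model. The scoring signal and keep ratio
+below are assumptions, not necessarily VisionZip's exact procedure.
+
+import torch
+
+def select_informative_tokens(visual_tokens, attn_scores, keep_ratio=0.1):
+    # visual_tokens: (B, N, D); attn_scores: (B, N) informativeness per token.
+    B, N, D = visual_tokens.shape
+    k = max(1, int(N * keep_ratio))
+    topk = attn_scores.topk(k, dim=1).indices                 # (B, k) kept token ids
+    idx = topk.unsqueeze(-1).expand(-1, -1, D)
+    return torch.gather(visual_tokens, 1, idx)                # (B, k, D) reduced tokens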
+
+ comment: 2 columns, 28 pages, 15 figures, 18 tables +
+
+
+
+
+ + ☆ Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction + + +
+ Graphical User Interfaces (GUIs) are critical to human-computer interaction, +yet automating GUI tasks remains challenging due to the complexity and +variability of visual environments. Existing approaches often rely on textual +representations of GUIs, which introduce limitations in generalization, +efficiency, and scalability. In this paper, we introduce Aguvis, a unified pure +vision-based framework for autonomous GUI agents that operates across various +platforms. Our approach leverages image-based observations, grounds +natural-language instructions to visual elements, and employs a consistent +action space to ensure cross-platform generalization. To address the +limitations of previous work, we integrate explicit planning and reasoning +within the model, enhancing its ability to autonomously navigate and interact +with complex digital environments. We construct a large-scale dataset of GUI +agent trajectories, incorporating multimodal reasoning and grounding, and +employ a two-stage training pipeline that first focuses on general GUI +grounding, followed by planning and reasoning. Through comprehensive +experiments, we demonstrate that Aguvis surpasses previous state-of-the-art +methods in both offline and real-world online scenarios, achieving, to our +knowledge, the first fully autonomous pure vision GUI agent capable of +performing tasks independently without collaboration with external +closed-source models. We open-source all datasets, models, and training +recipes to facilitate future research at https://aguvis-project.github.io/. + +
+
+ comment: https://aguvis-project.github.io/ +
+
+
+
+
+ + ☆ p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay + + +
+ Despite the remarkable performance of multimodal large language models +(MLLMs) across diverse tasks, the substantial training and inference costs +impede their advancement. The majority of computation stems from the +overwhelming volume of vision tokens processed by the transformer decoder. In +this paper, we propose to build efficient MLLMs by leveraging the +Mixture-of-Depths (MoD) mechanism, where each transformer decoder layer selects +essential vision tokens to process while skipping redundant ones. However, +integrating MoD into MLLMs is non-trivial. To address the challenges of +training and inference stability as well as limited training data, we adapt the +MoD module with two novel designs: tanh-gated weight normalization (TanhNorm) +and symmetric token reweighting (STRing). Moreover, we observe that vision +tokens exhibit higher redundancy in deeper layers and thus design a progressive +ratio decay (PRD) strategy, which gradually reduces the token retention ratio +layer by layer, employing a shifted cosine schedule. This crucial design fully +unleashes the potential of MoD, significantly boosting the efficiency and +performance of our models. To validate the effectiveness of our approach, we +conduct extensive experiments with two baseline models across 14 benchmarks. +Our model, p-MoD, matches or even surpasses the performance of the baseline +models, with only 55.6% TFLOPs and 53.8% KV cache storage during inference, and +77.7% GPU hours during training. + +
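+ A progressive ratio decay schedule of the kind described above can be sketched as a shifted
+cosine over layer depth: shallow layers keep most vision tokens, deep layers keep far fewer. The
+constants below are illustrative stand-ins, not the paper's settings.
+
+import math
+
+def progressive_ratio_decay(layer_idx, num_layers, r_max=1.0, r_min=0.3, shift=0.1):
+    # Token retention ratio for a given decoder layer, decaying with depth.
+    t = layer_idx / max(1, num_layers - 1)       # 0 at the first layer, 1 at the last
+    t = min(1.0, t + shift)                      # shift pushes the decay slightly earlier
+    cos = 0.5 * (1.0 + math.cos(math.pi * t))    # goes 1 -> 0 as t goes 0 -> 1
+    return r_min + (r_max - r_min) * cos
+
+ratios = [progressive_ratio_decay(l, 24) for l in range(24)]   # e.g., a 24-layer decoder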
+
+ comment: Technical Report; Code released at https://github.com/MCG-NJU/p-MoD +
+
+
+
+
+ + ☆ Moto: Latent Motion Token as the Bridging Language for Robot + Manipulation + + +
+ Recent developments in Large Language Models pre-trained on extensive corpora +have shown significant success in various natural language processing tasks +with minimal fine-tuning. This success offers new promise for robotics, which +has long been constrained by the high cost of action-labeled data. We ask: +given the abundant video data containing interaction-related knowledge +available as a rich "corpus", can a similar generative pre-training approach be +effectively applied to enhance robot learning? The key challenge is to identify +an effective representation for autoregressive pre-training that benefits robot +manipulation tasks. Inspired by the way humans learn new skills through +observing dynamic environments, we propose that effective robotic learning +should emphasize motion-related knowledge, which is closely tied to low-level +actions and is hardware-agnostic, facilitating the transfer of learned motions +to actual robot actions. To this end, we introduce Moto, which converts video +content into latent Motion Token sequences by a Latent Motion Tokenizer, +learning a bridging "language" of motion from videos in an unsupervised manner. +We pre-train Moto-GPT through motion token autoregression, enabling it to +capture diverse visual motion knowledge. After pre-training, Moto-GPT +demonstrates the promising ability to produce semantically interpretable motion +tokens, predict plausible motion trajectories, and assess trajectory +rationality through output likelihood. To transfer learned motion priors to +real robot actions, we implement a co-fine-tuning strategy that seamlessly +bridges latent motion token prediction and real robot control. Extensive +experiments show that the fine-tuned Moto-GPT exhibits superior robustness and +efficiency on robot manipulation benchmarks, underscoring its effectiveness in +transferring knowledge from video data to downstream visual manipulation tasks. + +
+
+ comment: Project released at: https://chenyi99.github.io/moto/ +
+
+
+
+
+ + ☆ CA-SSLR: Condition-Aware Self-Supervised Learning Representation for + Generalized Speech Processing NeurIPS + 2024 + + +
+ We introduce Condition-Aware Self-Supervised Learning Representation +(CA-SSLR), a generalist conditioning model broadly applicable to various +speech-processing tasks. Compared to standard fine-tuning methods that optimize +for downstream models, CA-SSLR integrates language and speaker embeddings from +earlier layers, making the SSL model aware of the current language and speaker +context. This approach reduces the reliance on input audio features while +preserving the integrity of the base SSLR. CA-SSLR improves the model's +capabilities and demonstrates its generality on unseen tasks with minimal +task-specific tuning. Our method employs linear modulation to dynamically +adjust internal representations, enabling fine-grained adaptability without +significantly altering the original model behavior. Experiments show that +CA-SSLR reduces the number of trainable parameters, mitigates overfitting, and +excels in under-resourced and unseen tasks. Specifically, CA-SSLR achieves a +10% relative reduction in LID errors, a 37% improvement in ASR CER on the +ML-SUPERB benchmark, and a 27% decrease in SV EER on VoxCeleb-1, demonstrating +its effectiveness. + +
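+ Linear modulation of hidden states by condition embeddings is commonly implemented FiLM-style;
+the sketch below shows the general idea, with zero-initialized scale and shift so the base SSL
+model's behavior is unchanged at the start of training. It is a generic sketch under those
+assumptions, not CA-SSLR's exact module.
+
+import torch
+import torch.nn as nn
+
+class ConditionalLinearModulation(nn.Module):
+    # FiLM-style modulation of SSL hidden states by language/speaker condition embeddings.
+    def __init__(self, hidden_dim, cond_dim):
+        super().__init__()
+        self.to_scale = nn.Linear(cond_dim, hidden_dim)
+        self.to_shift = nn.Linear(cond_dim, hidden_dim)
+        for lin in (self.to_scale, self.to_shift):      # zero init preserves base behavior
+            nn.init.zeros_(lin.weight)
+            nn.init.zeros_(lin.bias)
+
+    def forward(self, hidden, cond):
+        # hidden: (B, T, H) layer activations, cond: (B, C) condition embedding
+        scale = 1.0 + self.to_scale(cond).unsqueeze(1)
+        shift = self.to_shift(cond).unsqueeze(1)
+        return hidden * scale + shift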
+
+ comment: 38th Conference on Neural Information Processing Systems (NeurIPS + 2024) +
+
+
+
+
+ + ☆ Understanding Hidden Computations in Chain-of-Thought Reasoning + + +
+ Chain-of-Thought (CoT) prompting has significantly enhanced the reasoning +abilities of large language models. However, recent studies have shown that +models can still perform complex reasoning tasks even when the CoT is replaced +with filler(hidden) characters (e.g., "..."), leaving open questions about how +models internally process and represent reasoning steps. In this paper, we +investigate methods to decode these hidden characters in transformer models +trained with filler CoT sequences. By analyzing layer-wise representations +using the logit lens method and examining token rankings, we demonstrate that +the hidden characters can be recovered without loss of performance. Our +findings provide insights into the internal mechanisms of transformer models +and open avenues for improving interpretability and transparency in language +model reasoning. + +
+
+
+
+
+ + ☆ Establishing Task Scaling Laws via Compute-Efficient Model Ladders + + +
+ We develop task scaling laws and model ladders to predict the individual task +performance of pretrained language models (LMs) in the overtrained setting. +Standard power laws for language modeling loss cannot accurately model task +performance. Therefore, we leverage a two-step prediction approach: first use +model and data size to predict a task-specific loss, and then use this task +loss to predict task performance. We train a set of small-scale "ladder" +models, collect data points to fit the parameterized functions of the two +prediction steps, and make predictions for two target models: a 7B model +trained to 4T tokens and a 13B model trained to 5T tokens. Training the ladder +models only costs 1% of the compute used for the target models. On four +multiple-choice tasks written in ranked classification format, we can predict +the accuracy of both target models within 2 points of absolute error. We have +higher prediction error on four other tasks (average absolute error 6.9) and +find that these are often tasks with higher variance in task metrics. We also +find that using less compute to train fewer ladder models tends to deteriorate +predictions. Finally, we empirically show that our design choices and the +two-step approach lead to superior performance in establishing scaling laws. + +
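+ The two-step prediction can be read as chaining two fitted functions: (model size, tokens) ->
+task loss, then task loss -> task accuracy. The functional forms and constants below are
+illustrative stand-ins; in practice both functions are fitted on the small ladder models before
+extrapolating to the target scale.
+
+import numpy as np
+
+def predict_task_loss(n_params, n_tokens, a=4e2, alpha=0.29, b=6e3, beta=0.32, e=1.2):
+    # Step 1: a power law in model size N and training tokens D (constants illustrative).
+    return a / n_params**alpha + b / n_tokens**beta + e
+
+def predict_accuracy(task_loss, chance=0.25, k=6.0, l0=2.4):
+    # Step 2: a sigmoidal link from task loss to accuracy, bounded below by chance.
+    return chance + (1.0 - chance) / (1.0 + np.exp(k * (task_loss - l0)))
+
+loss_7b = predict_task_loss(7e9, 4e12)      # target model: 7B params, 4T tokens
+acc_7b = predict_accuracy(loss_7b)          # predicted accuracy on the task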
+
+
+
+
+ + ☆ BhashaVerse : Translation Ecosystem for Indian Subcontinent Languages + + +
+ This paper focuses on developing translation models and related applications +for 36 Indian languages, including Assamese, Awadhi, Bengali, Bhojpuri, Braj, +Bodo, Dogri, English, Konkani, Gondi, Gujarati, Hindi, Hinglish, Ho, Kannada, +Kangri, Kashmiri (Arabic and Devanagari), Khasi, Mizo, Magahi, Maithili, +Malayalam, Marathi, Manipuri (Bengali and Meitei), Nepali, Oriya, Punjabi, +Sanskrit, Santali, Sinhala, Sindhi (Arabic and Devanagari), Tamil, Tulu, +Telugu, and Urdu. Achieving this requires parallel and other types of corpora +for all 36 * 36 language pairs, addressing challenges like script variations, +phonetic differences, and syntactic diversity. For instance, languages like +Kashmiri and Sindhi, which use multiple scripts, demand script normalization +for alignment, while low-resource languages such as Khasi and Santali require +synthetic data augmentation to ensure sufficient coverage and quality. + To address these challenges, this work proposes strategies for corpus +creation by leveraging existing resources, developing parallel datasets, +generating domain-specific corpora, and utilizing synthetic data techniques. +Additionally, it evaluates machine translation across various dimensions, +including standard and discourse-level translation, domain-specific +translation, reference-based and reference-free evaluation, error analysis, and +automatic post-editing. By integrating these elements, the study establishes a +comprehensive framework to improve machine translation quality and enable +better cross-lingual communication in India's linguistically diverse ecosystem. + +
+
+
+
+
+ + ☆ Retrieval-Augmented Machine Translation with Unstructured Knowledge + + +
+ Retrieval-augmented generation (RAG) introduces additional information to +enhance large language models (LLMs). In machine translation (MT), previous +work typically retrieves in-context examples from paired MT corpora, or +domain-specific knowledge from knowledge graphs, to enhance models' MT ability. +However, a large amount of world knowledge is organized in unstructured +documents, and might not be fully paired across different languages. In this +paper, we study retrieval-augmented MT using unstructured documents. +Specifically, we build RAGtrans, the first benchmark to train and evaluate +LLMs' retrieval-augmented MT ability. RAGtrans contains 79K MT samples +collected via GPT-4o and human translators. Besides, documents from different +languages are also provided to supply the knowledge to these samples. Based on +RAGtrans, we further propose a multi-task training method to teach LLMs how to +use information from multilingual documents during their translation. The +method uses existing multilingual corpora to create auxiliary training +objectives without additional labeling requirements. Extensive experiments show +that the method improves LLMs by 1.58-3.09 BLEU and 1.00-2.03 COMET scores. + +
+
+
+
+
+ + ☆ The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for + Open-Ended Text Generation ICLR + + +
+ This paper introduces the counter-intuitive generalization results of +overfitting pre-trained large language models (LLMs) on very small datasets. In +the setting of open-ended text generation, it is well-documented that LLMs tend +to generate repetitive and dull sequences, a phenomenon that is especially +apparent when generating using greedy decoding. This issue persists even with +state-of-the-art LLMs containing billions of parameters, trained via next-token +prediction on large datasets. We find that by further fine-tuning these models +to achieve a near-zero training loss on a small set of samples -- a process we +refer to as hyperfitting -- the long-sequence generative capabilities are +greatly enhanced. Greedy decoding with these hyperfitted models even outperforms +Top-P sampling over long sequences, both in terms of diversity and human +preferences. This phenomenon extends to LLMs of various sizes, different +domains, and even autoregressive image generation. We further find this +phenomenon to be distinctly different from grokking and double descent. +Surprisingly, our experiments indicate that hyperfitted models rarely fall into +repeating sequences they were trained on, and even explicitly blocking these +sequences results in high-quality output. All hyperfitted models produce +extremely low-entropy predictions, often allocating nearly all probability to a +single token. + +
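+ Operationally, the recipe amounts to continuing ordinary next-token fine-tuning on a very small
+sample set until the training loss is close to zero. The sketch below assumes a Hugging Face-style
+causal LM (with a pad token set) and a small dataset of {"text": ...} records; hyperparameters are
+illustrative, not the paper's.
+
+import torch
+from torch.utils.data import DataLoader
+
+def hyperfit(model, tokenizer, small_dataset, lr=1e-5, max_epochs=50, target_loss=1e-3):
+    # Keep fine-tuning on the small sample set until training loss is near zero.
+    opt = torch.optim.AdamW(model.parameters(), lr=lr)
+    loader = DataLoader(small_dataset, batch_size=4, shuffle=True)
+    model.train()
+    for epoch in range(max_epochs):
+        total, steps = 0.0, 0
+        for batch in loader:
+            ids = tokenizer(batch["text"], return_tensors="pt",
+                            padding=True, truncation=True).input_ids
+            out = model(input_ids=ids, labels=ids)       # standard next-token loss
+            out.loss.backward()
+            opt.step()
+            opt.zero_grad()
+            total, steps = total + out.loss.item(), steps + 1
+        if total / steps < target_loss:                   # near-zero training loss reached
+            break
+    return model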
+
+ comment: Under review at ICLR +
+
+
+
+
+ + ☆ ALMA: Alignment with Minimal Annotation + + +
+ Recent approaches to large language model (LLM) alignment typically require +millions of human annotations or rely on external aligned models for synthetic +data generation. This paper introduces ALMA: Alignment with Minimal Annotation, +demonstrating that effective alignment can be achieved using only 9,000 labeled +examples -- less than 1% of conventional approaches. ALMA generates large +amounts of high-quality synthetic alignment data through new techniques: +diverse prompt synthesis via few-shot learning, diverse response generation +with multiple model checkpoints, and judge (reward model) enhancement through +score aggregation and self-distillation. Using only a pretrained Llama3 base +model, 5,000 SFT examples, and 4,000 judge annotations, ALMA achieves +performance close to Llama3-Instruct across diverse alignment benchmarks (e.g., +0.1% difference on AlpacaEval 2.0 score). These results are achieved with a +multi-round, self-bootstrapped data synthesis and training recipe that +continues to improve for 10 rounds, surpassing the typical 3-round ceiling of +previous methods. These results suggest that base models already possess +sufficient knowledge for effective alignment, and that synthetic data +generation methods can expose it. + +
+
+
+
+
+ + ☆ Evolutionary Pre-Prompt Optimization for Mathematical Reasoning + + +
+ Recent advancements have highlighted that large language models (LLMs), when +given a small set of task-specific examples, demonstrate remarkable +proficiency, a capability that extends to complex reasoning tasks. In +particular, the combination of few-shot learning with the chain-of-thought +(CoT) approach has been pivotal in steering models towards more logically +consistent conclusions. This paper explores the optimization of example +selection for designing effective CoT pre-prompts and shows that the choice of +the optimization algorithm, typically in favor of comparison-based methods such +as evolutionary computation, significantly enhances efficacy and feasibility. +Specifically, thanks to an optimization with limited exploitation and overfitting, +Evolutionary Pre-Prompt Optimization (EPPO) brings an improvement over the +naive few-shot approach exceeding 10 absolute points in exact match scores on +benchmark datasets such as GSM8k and MathQA. These gains are consistent across +various contexts and are further amplified when integrated with +self-consistency (SC). + +
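+ A comparison-based search over which examples form the CoT pre-prompt can be as simple as a
+(1+1)-style evolutionary loop over example subsets, as sketched below. It assumes a black-box
+evaluate(examples) fitness such as exact-match accuracy on a small validation split, and is a
+generic sketch of the idea rather than the paper's exact algorithm.
+
+import random
+
+def evolve_preprompt(candidate_pool, evaluate, k=8, generations=40, seed=0):
+    # Parent = current set of k examples; mutate by swapping one example,
+    # keep the child only if it scores at least as well (comparison-based selection).
+    rng = random.Random(seed)
+    parent = rng.sample(candidate_pool, k)
+    best_score = evaluate(parent)
+    for _ in range(generations):
+        child = list(parent)
+        unused = [c for c in candidate_pool if c not in child]
+        if not unused:
+            break
+        child[rng.randrange(k)] = rng.choice(unused)
+        score = evaluate(child)
+        if score >= best_score:
+            parent, best_score = child, score
+    return parent, best_score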
+
+
+
+
+ + ☆ Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic + + +
+ Large Language Models (LLMs) have shown impressive results in multiple +domains of natural language processing (NLP) but are mainly focused on the +English language. Recently, more LLMs have incorporated a larger proportion of +multilingual text to represent low-resource languages. In Arabic NLP, several +Arabic-centric LLMs have shown remarkable results on multiple benchmarks in the +past two years. However, most Arabic LLMs have more than 7 billion parameters, +which increases their hardware requirements and inference latency, when +compared to smaller LLMs. This paper introduces Arabic Stable LM 1.6B in a base +and chat version as a small but powerful Arabic-centric LLM. Our Arabic Stable +LM 1.6B chat model achieves impressive results on several benchmarks beating +multiple models with up to 8x the parameters. In addition, we show the benefit +of mixing in synthetic instruction tuning data by augmenting our fine-tuning +data with a large synthetic dialogue dataset. + +
+
+
+
+
+ + ☆ Representation Purification for End-to-End Speech Translation COLING 2025 + + +
+ Speech-to-text translation (ST) is a cross-modal task that involves +converting spoken language into text in a different language. Previous research +primarily focused on enhancing speech translation by facilitating knowledge +transfer from machine translation, exploring various methods to bridge the gap +between speech and text modalities. Despite substantial progress made, factors +in speech that are not relevant to translation content, such as timbre and +rhythm, often limit the efficiency of knowledge transfer. In this paper, we +conceptualize speech representation as a combination of content-agnostic and +content-relevant factors. We examine the impact of content-agnostic factors on +translation performance through preliminary experiments and observe a +significant performance deterioration when content-agnostic perturbations are +introduced to speech signals. To address this issue, we propose a +\textbf{S}peech \textbf{R}epresentation \textbf{P}urification with +\textbf{S}upervision \textbf{E}nhancement (SRPSE) framework, which excludes the +content-agnostic components within speech representations to mitigate their +negative impact on ST. Experiments on MuST-C and CoVoST-2 datasets demonstrate +that SRPSE significantly improves translation performance across all +translation directions in three settings and achieves preeminent performance +under a \textit{transcript-free} setting. + +
+
+ comment: Accepted by COLING 2025 +
+
+
+
+
+ + ☆ Aya Expanse: Combining Research Breakthroughs for a New Multilingual + Frontier + + +
+ We introduce the Aya Expanse model family, a new generation of 8B and 32B +parameter multilingual language models, aiming to address the critical +challenge of developing highly performant multilingual models that match or +surpass the capabilities of monolingual models. By leveraging several years of +research at Cohere For AI and Cohere, including advancements in data arbitrage, +multilingual preference training, and model merging, Aya Expanse sets a new +state-of-the-art in multilingual performance. Our evaluations on the +Arena-Hard-Auto dataset, translated into 23 languages, demonstrate that Aya +Expanse 8B and 32B outperform leading open-weight models in their respective +parameter classes, including Gemma 2, Qwen 2.5, and Llama 3.1, achieving up to +a 76.6% win-rate. Notably, Aya Expanse 32B outperforms Llama 3.1 70B, a model +with twice as many parameters, achieving a 54.0% win-rate. In this short +technical report, we present extended evaluation results for the Aya Expanse +model family and release their open-weights, together with a new multilingual +evaluation dataset m-ArenaHard. + +
+
+
+
+
+ + ☆ CLINICSUM: Utilizing Language Models for Generating Clinical Summaries + from Patient-Doctor Conversations + + +
+ This paper presents ClinicSum, a novel framework designed to automatically +generate clinical summaries from patient-doctor conversations. It utilizes a +two-module architecture: a retrieval-based filtering module that extracts +Subjective, Objective, Assessment, and Plan (SOAP) information from +conversation transcripts, and an inference module powered by fine-tuned +Pre-trained Language Models (PLMs), which leverage the extracted SOAP data to +generate abstracted clinical summaries. To fine-tune the PLM, we created a +training dataset consisting of 1,473 conversation-summary pairs by +consolidating two publicly available datasets, FigShare and MTS-Dialog, with +ground truth summaries validated by Subject Matter Experts (SMEs). ClinicSum's +effectiveness is evaluated through both automatic metrics (e.g., ROUGE, +BERTScore) and expert human assessments. Results show that ClinicSum +outperforms state-of-the-art PLMs, demonstrating superior precision, recall, +and F-1 scores in automatic evaluations and receiving high preference from SMEs +in human assessment, making it a robust solution for automated clinical +summarization. + +
+
+ comment: accepted at the the 2024 IEEE International Conference on Big Data + workshop Workshop on Big Data and AI for Healthcare +
+
+
+
+
+ + ☆ A History of Philosophy in Colombia through Topic Modelling + + +
+ Data-driven approaches to philosophy have emerged as a valuable tool for +studying the history of the discipline. However, most studies in this area have +focused on a limited number of journals from specific regions and subfields. We +expand the scope of this research by applying dynamic topic modelling +techniques to explore the history of philosophy in Colombia and Latin America. +Our study examines the Colombian philosophy journal Ideas y Valores, founded in +1951 and currently one of the most influential academic philosophy journals in +the region. By analyzing the evolution of topics across the journal's history, +we identify various trends and specific dynamics in philosophical discourse +within the Colombian and Latin American context. Our findings reveal that the +most prominent topics are value theory (including ethics, political philosophy, +and aesthetics), epistemology, and the philosophy of science. We also trace the +evolution of articles focusing on the historical and interpretive aspects of +philosophical texts, and we note a notable emphasis on German philosophers such +as Kant, Husserl, and Hegel on various topics throughout the journal's +lifetime. Additionally, we investigate whether articles with a historical focus +have decreased over time due to editorial pressures. Our analysis suggests no +significant decline in such articles. Finally, we propose ideas for extending +this research to other Latin American journals and suggest improvements for +natural language processing workflows in non-English languages. + +
+
+
+
+
+ + ☆ Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM + Chatbots + + +
+ I combine detection and mitigation techniques to address hallucinations in
+Large Language Models (LLMs). Mitigation is achieved in a question-answering
+Retrieval-Augmented Generation (RAG) framework, while detection is obtained by
+introducing the Negative Missing Information Scoring System (NMISS), which
+accounts for contextual relevance in responses. While RAG mitigates
+hallucinations by grounding answers in external data, NMISS refines the
+evaluation by identifying cases where traditional metrics incorrectly flag
+contextually accurate responses as hallucinations. I use Italian health news
+articles as context to evaluate LLM performance. Results show that Gemma2 and
+GPT-4 outperform the other models, with GPT-4 producing answers closely aligned
+with reference responses. Mid-tier models, such as Llama2, Llama3, and Mistral,
+benefit significantly from NMISS, highlighting their ability to provide richer
+contextual information. This combined approach offers new insights into the
+reduction and more accurate assessment of hallucinations in LLMs, with
+applications in real-world healthcare tasks and other domains.
+
+
+
+
+
+ + ☆ A Context-aware Framework for Translation-mediated Conversations + + +
+ Effective communication is fundamental to any interaction, yet challenges +arise when participants do not share a common language. Automatic translation +systems offer a powerful solution to bridge language barriers in such +scenarios, but they introduce errors that can lead to misunderstandings and +conversation breakdown. A key issue is that current systems fail to incorporate +the rich contextual information necessary to resolve ambiguities and omitted +details, resulting in literal, inappropriate, or misaligned translations. In +this work, we present a framework to improve large language model-based +translation systems by incorporating contextual information in bilingual +conversational settings. During training, we leverage context-augmented +parallel data, which allows the model to generate translations sensitive to +conversational history. During inference, we perform quality-aware decoding +with context-aware metrics to select the optimal translation from a pool of +candidates. We validate both components of our framework on two task-oriented +domains: customer chat and user-assistant interaction. Across both settings, +our framework consistently results in better translations than state-of-the-art +systems like GPT-4o and TowerInstruct, as measured by multiple automatic +translation quality metrics on several language pairs. We also show that the +resulting model leverages context in an intended and interpretable way, +improving consistency between the conveyed message and the generated +translations. + +
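+ As a rough illustration of the quality-aware decoding step described in the
+abstract above, the sketch below reranks a pool of candidate translations with
+a context-aware scoring function. The names generate_candidates and
+context_aware_score are placeholders for whatever translation model and metric
+are actually used; this is not the paper's implementation.
+
+def quality_aware_decode(source, history, generate_candidates,
+                         context_aware_score, n_candidates=8):
+    """Return the candidate translation preferred by a context-aware metric.
+
+    generate_candidates(source, history, n) -> list[str] (e.g., LLM samples)
+    context_aware_score(candidate, source, history) -> float (higher is better)
+    """
+    candidates = generate_candidates(source, history, n_candidates)
+    return max(candidates, key=lambda c: context_aware_score(c, source, history))
+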
+
+
+
+
+ + ☆ AL-QASIDA: Analyzing LLM Quality and Accuracy Systematically in + Dialectal Arabic + + +
+ Dialectal Arabic (DA) varieties are under-served by language technologies, +particularly large language models (LLMs). This trend threatens to exacerbate +existing social inequalities and limits language modeling applications, yet the +research community lacks operationalized LLM performance measurements in DA. We +present a method that comprehensively evaluates LLM fidelity, understanding, +quality, and diglossia in modeling DA. We evaluate nine LLMs in eight DA +varieties across these four dimensions and provide best practice +recommendations. Our evaluation suggests that LLMs do not produce DA as well as +they understand it, but does not suggest deterioration in quality when they do. +Further analysis suggests that current post-training can degrade DA +capabilities, that few-shot examples can overcome this and other LLM +deficiencies, and that otherwise no measurable features of input text correlate +well with LLM DA performance. + +
+
+ comment: Pre-print +
+
+
+
+
+ + ☆ If You Can't Use Them, Recycle Them: Optimizing Merging at Scale + Mitigates Performance Tradeoffs + + +
+ Model merging has shown great promise at combining expert models, but the
+benefit of merging is unclear when merging ``generalist'' models trained on
+many tasks. We explore merging in the context of large ($\sim100$B) models, by
+\textit{recycling} checkpoints that exhibit tradeoffs among different tasks.
+Such checkpoints are often created in the process of developing a frontier
+model, and many suboptimal ones are usually discarded. Given a pool of model
+checkpoints obtained from different training runs (e.g., different stages,
+objectives, hyperparameters, and data mixtures), which naturally show tradeoffs
+across different language capabilities (e.g., instruction following vs. code
+generation), we investigate whether merging can recycle such suboptimal models
+into a Pareto-optimal one. Our optimization algorithm tunes the weight of each
+checkpoint in a linear combination, resulting in a Pareto-optimal model that
+outperforms both individual models and merge-based baselines. Further analysis
+shows that good merges tend to include almost all checkpoints with non-zero
+weights, indicating that even seemingly bad initial checkpoints can contribute
+to good final merges.
+
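+ A minimal sketch of the kind of weighted linear checkpoint merging described
+in the abstract above, assuming PyTorch state dicts; the search over the
+weights themselves (which the paper performs with an optimization algorithm) is
+omitted, and the weights are simply supplied by the caller.
+
+import torch
+
+def merge_checkpoints(state_dicts, weights):
+    """Linearly combine parameter tensors from several checkpoints.
+
+    state_dicts: list of dicts mapping parameter names to tensors
+    weights: one float per checkpoint, typically summing to 1
+    """
+    assert len(state_dicts) == len(weights)
+    merged = {}
+    for name in state_dicts[0]:
+        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
+    return merged
+
+# Hypothetical usage: recycle three suboptimal checkpoints into one model.
+# ckpts = [torch.load(p, map_location="cpu") for p in ["a.pt", "b.pt", "c.pt"]]
+# merged = merge_checkpoints(ckpts, weights=[0.5, 0.3, 0.2])
+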
+
+ comment: 13 pages, 9 figures +
+
+
+
+
+ + ☆ Reducing Tool Hallucination via Reliability Alignment + + +
+ Large Language Models (LLMs) have extended their capabilities beyond language +generation to interact with external systems through tool calling, offering +powerful potential for real-world applications. However, the phenomenon of tool +hallucinations, which occur when models improperly select or misuse tools, +presents critical challenges that can lead to flawed task execution and +increased operational costs. This paper investigates the concept of reliable +tool calling and highlights the necessity of addressing tool hallucinations. We +systematically categorize tool hallucinations into two main types: tool +selection hallucination and tool usage hallucination. To mitigate these issues, +we propose a reliability-focused alignment framework that enhances the model's +ability to accurately assess tool relevance and usage. By proposing a suite of +evaluation metrics and evaluating on StableToolBench, we further demonstrate +the effectiveness of our framework in mitigating tool hallucination and +improving the overall system reliability of LLM tool calling. + +
+
+
+
+
+ + ☆ Text Change Detection in Multilingual Documents Using Image Comparison + + +
+ Document comparison typically relies on optical character recognition (OCR) +as its core technology. However, OCR requires the selection of appropriate +language models for each document and the performance of multilingual or hybrid +models remains limited. To overcome these challenges, we propose text change +detection (TCD) using an image comparison model tailored for multilingual +documents. Unlike OCR-based approaches, our method employs word-level text +image-to-image comparison to detect changes. Our model generates bidirectional +change segmentation maps between the source and target documents. To enhance +performance without requiring explicit text alignment or scaling preprocessing, +we employ correlations among multi-scale attention features. We also construct +a benchmark dataset comprising actual printed and scanned word pairs in various +languages to evaluate our model. We validate our approach using our benchmark +dataset and public benchmarks Distorted Document Images and the LRDE Document +Binarization Dataset. We compare our model against state-of-the-art semantic +segmentation and change detection models, as well as to conventional OCR-based +models. + +
+
+ comment: 15 pages, 11 figures, 6 tables; accepted at WACV 2025
+
+
+
+
+ + ☆ GRAF: Graph Retrieval Augmented by Facts for Legal Question Answering + + +
+ Pre-trained Language Models (PLMs) have shown remarkable performance in
+recent years, setting a new paradigm for NLP research and industry. The legal
+domain has received some attention from the NLP community partly due to its
+textual nature. Some tasks from this domain are represented by
+question-answering (QA) tasks. This work explores the legal domain
+Multiple-Choice QA (MCQA) for a low-resource language. The contribution of this
+work is multi-fold. We first introduce JuRO, the first openly available
+Romanian legal MCQA dataset, comprising three different examinations and a
+total of 10,836 questions. Along with this dataset, we introduce CROL, an
+organized corpus of laws comprising 93 distinct documents with their
+modifications across 763 time spans, which we leveraged in this work for
+Information Retrieval (IR) techniques. Moreover, we are the first to propose
+Law-RoG, a Knowledge Graph (KG) for the Romanian language, derived from the
+aforementioned corpus. Lastly, we propose a novel approach for MCQA, Graph
+Retrieval Augmented by Facts (GRAF), which achieves competitive results with
+generally accepted SOTA methods and even exceeds them in most settings.
+
+
+
+
+
+ + ☆ Missing Melodies: AI Music Generation and its "Nearly" Complete Omission + of the Global South + + +
+ Recent advances in generative AI have sparked renewed interest and expanded
+possibilities for music generation. However, the performance and versatility of
+these systems across musical genres are heavily influenced by the availability
+of training data. We conducted an extensive analysis of over one million hours
+of audio datasets used in AI music generation research and manually reviewed
+more than 200 papers from eleven prominent AI and music conferences and
+organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR,
+NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and
+inclusion of the musical genres of the Global South in AI research. Our
+findings reveal a stark imbalance: approximately 86% of the total dataset hours
+and over 93% of researchers focus primarily on music from the Global North.
+Moreover, although around 40% of these datasets include some form of
+non-Western music, genres from the Global South account for only 14.6% of the
+data. Furthermore, approximately 51% of the papers surveyed concentrate on
+symbolic music generation, a method that often fails to capture the cultural
+nuances inherent in music from regions such as South Asia, the Middle East, and
+Africa. As AI increasingly shapes the creation and dissemination of music, the
+significant underrepresentation of music genres in datasets and research
+presents a serious threat to global musical diversity. We also propose some
+important steps to mitigate these risks and foster a more inclusive future for
+AI-driven music generation.
+
+
+ comment: Submitted to CACM, 12 pages, 2 figures +
+
+
+
+
+ + ☆ GEITje 7B Ultra: A Conversational Model for Dutch + + +
+ Language models have rapidly evolved, predominantly focusing on English while
+often neglecting extensive pretraining in other languages. This approach has
+required initiatives to adapt powerful, English-centric models to other
+linguistic contexts through finetuning. For Dutch, one such recent endeavour is
+``GEITje'', a model originally derived from the English-based Mistral 7B.
+Building on this foundational work, the current research extends the
+capabilities of GEITje by supervised finetuning on newly created high-quality
+synthetic conversational datasets, along with an additional preference
+alignment procedure on a synthetic feedback dataset. Both the developed models
+and the created datasets are openly available.
+
+
+
+
+
+ + ☆ Automated Medical Report Generation for ECG Data: Bridging Medical Text + and Signal Processing with Deep Learning + + +
+ Recent advances in deep learning and natural language generation have +significantly improved image captioning, enabling automated, human-like +descriptions for visual content. In this work, we apply these captioning +techniques to generate clinician-like interpretations of ECG data. This study +leverages existing ECG datasets accompanied by free-text reports authored by +healthcare professionals (HCPs) as training data. These reports, while often +inconsistent, provide a valuable foundation for automated learning. We +introduce an encoder-decoder-based method that uses these reports to train +models to generate detailed descriptions of ECG episodes. This represents a +significant advancement in ECG analysis automation, with potential applications +in zero-shot classification and automated clinical decision support. + The model is tested on various datasets, including both 1- and 12-lead ECGs. +It significantly outperforms the state-of-the-art reference model by Qiu et +al., achieving a METEOR score of 55.53% compared to 24.51% achieved by the +reference model. Furthermore, several key design choices are discussed, +providing a comprehensive overview of current challenges and innovations in +this domain. + The source codes for this research are publicly available in our Git +repository https://git.zib.de/ableich/ecg-comment-generation-public + +
+
+
+
+
+ + ☆ Hostility Detection in UK Politics: A Dataset on Online Abuse Targeting + MPs + + +
+ Numerous politicians use social media platforms, particularly X, to engage +with their constituents. This interaction allows constituents to pose questions +and offer feedback but also exposes politicians to a barrage of hostile +responses, especially given the anonymity afforded by social media. They are +typically targeted in relation to their governmental role, but the comments +also tend to attack their personal identity. This can discredit politicians and +reduce public trust in the government. It can also incite anger and disrespect, +leading to offline harm and violence. While numerous models exist for detecting +hostility in general, they lack the specificity required for political +contexts. Furthermore, addressing hostility towards politicians demands +tailored approaches due to the distinct language and issues inherent to each +country (e.g., Brexit for the UK). To bridge this gap, we construct a dataset +of 3,320 English tweets spanning a two-year period manually annotated for +hostility towards UK MPs. Our dataset also captures the targeted identity +characteristics (race, gender, religion, none) in hostile tweets. We perform +linguistic and topical analyses to delve into the unique content of the UK +political data. Finally, we evaluate the performance of pre-trained language +models and large language models on binary hostility detection and multi-class +targeted identity type classification tasks. Our study offers valuable data and +insights for future research on the prevalence and nature of politics-related +hostility specific to the UK. + +
+
+
+
+
+ + ☆ M$^{3}$D: A Multimodal, Multilingual and Multitask Dataset for Grounded + Document-level Information Extraction + + +
+ Multimodal information extraction (IE) tasks have attracted increasing +attention because many studies have shown that multimodal information benefits +text information extraction. However, existing multimodal IE datasets mainly +focus on sentence-level image-facilitated IE in English text, and pay little +attention to video-based multimodal IE and fine-grained visual grounding. +Therefore, in order to promote the development of multimodal IE, we constructed +a multimodal multilingual multitask dataset, named M$^{3}$D, which has the +following features: (1) It contains paired document-level text and video to +enrich multimodal information; (2) It supports two widely-used languages, +namely English and Chinese; (3) It includes more multimodal IE tasks such as +entity recognition, entity chain extraction, relation extraction and visual +grounding. In addition, our dataset introduces an unexplored theme, i.e., +biography, enriching the domains of multimodal IE resources. To establish a +benchmark for our dataset, we propose an innovative hierarchical multimodal IE +model. This model effectively leverages and integrates multimodal information +through a Denoised Feature Fusion Module (DFFM). Furthermore, in non-ideal +scenarios, modal information is often incomplete. Thus, we designed a Missing +Modality Construction Module (MMCM) to alleviate the issues caused by missing +modalities. Our model achieved an average performance of 53.80% and 53.77% on +four tasks in English and Chinese datasets, respectively, which set a +reasonable standard for subsequent research. In addition, we conducted more +analytical experiments to verify the effectiveness of our proposed module. We +believe that our work can promote the development of the field of multimodal +IE. + +
+
+ comment: 14 pages, 9 figures, 6 tables +
+
+
+
+
+ + ☆ Exploring the Influence of Label Aggregation on Minority Voices: + Implications for Dataset Bias and Model Training + + +
+ Resolving disagreement in manual annotation typically consists of removing +unreliable annotators and using a label aggregation strategy such as majority +vote or expert opinion to resolve disagreement. These may have the side-effect +of silencing or under-representing minority but equally valid opinions. In this +paper, we study the impact of standard label aggregation strategies on minority +opinion representation in sexism detection. We investigate the quality and +value of minority annotations, and then examine their effect on the class +distributions in gold labels, as well as how this affects the behaviour of +models trained on the resulting datasets. Finally, we discuss the potential +biases introduced by each method and how they can be amplified by the models. + +
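+ As a concrete illustration of how majority-vote aggregation can silence
+minority annotators (the effect studied in the abstract above), here is a small
+sketch with made-up annotations; the labels are hypothetical, not from the
+paper's data.
+
+from collections import Counter
+
+def majority_vote(labels):
+    """Return the most frequent label; ties are broken by first occurrence."""
+    return Counter(labels).most_common(1)[0][0]
+
+# Hypothetical item: three annotators say "not_sexist", two say "sexist".
+annotations = ["not_sexist", "not_sexist", "sexist", "not_sexist", "sexist"]
+print(majority_vote(annotations))  # -> "not_sexist"
+# The 40% minority view vanishes from the gold label entirely, which is the
+# kind of under-representation the paper examines.
+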
+
+
+
+
+ + ☆ Marco-LLM: Bridging Languages via Massive Multilingual Training for + Cross-Lingual Enhancement + + +
+ Large Language Models (LLMs) have achieved remarkable progress in recent +years; however, their excellent performance is still largely limited to major +world languages, primarily English. Many LLMs continue to face challenges with +multilingual tasks, especially when it comes to low-resource languages. To +address this issue, we introduced Marco-LLM: Massive multilingual training for +cross-lingual enhancement LLM. We have collected a substantial amount of +multilingual data for several low-resource languages and conducted extensive +continual pre-training using the Qwen2 models. This effort has resulted in a +multilingual LLM named Marco-LLM. Through comprehensive evaluations on various +multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA +and many others, Marco-LLM has demonstrated substantial improvements over +state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements +in any-to-any machine translation tasks, showing the effectiveness of our +multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed to not +only perform exceptionally well in multilingual tasks, including low-resource +languages, but also maintain strong performance in English and other major +languages, closing the performance gap between high- and low-resource language +capabilities. By bridging languages, this effort demonstrates our dedication to +ensuring LLMs work accurately across various languages. + +
+
+
+
+
+ + ☆ MTMT: Consolidating Multiple Thinking Modes to Form a Thought Tree for + Strengthening LLM + + +
+ Large language models (LLMs) have shown limitations in tasks requiring +complex logical reasoning and multi-step problem-solving. To address these +challenges, researchers have employed carefully designed prompts and +flowcharts, simulating human cognitive processes to enhance LLM performance, +such as the Chain of Thought approach. In this paper, we introduce MTMT +(Multi-thinking Modes Tree), a novel method that interacts with LLMs to +construct a thought tree, simulating various advanced cognitive processes, +including but not limited to association, counterfactual thinking, task +decomposition, and comparison. By breaking down the original complex task into +simpler sub-questions, MTMT facilitates easier problem-solving for LLMs, +enabling more effective utilization of the latent knowledge within LLMs. We +evaluate the performance of MTMT under different parameter configurations, +using GPT-4o mini as the base model. Our results demonstrate that integrating +multiple modes of thinking significantly enhances the ability of LLMs to handle +complex tasks. + +
+
+
+
+
+ + ☆ Demonstration Selection for In-Context Learning via Reinforcement + Learning + + +
+ Diversity in demonstration selection is crucial for enhancing model +generalization, as it enables a broader coverage of structures and concepts. +However, constructing an appropriate set of demonstrations has remained a focal +point of research. This paper presents the Relevance-Diversity Enhanced +Selection (RDES), an innovative approach that leverages reinforcement learning +to optimize the selection of diverse reference demonstrations for text +classification tasks using Large Language Models (LLMs), especially in few-shot +prompting scenarios. RDES employs a Q-learning framework to dynamically +identify demonstrations that maximize both diversity and relevance to the +classification objective by calculating a diversity score based on label +distribution among selected demonstrations. This method ensures a balanced +representation of reference data, leading to improved classification accuracy. +Through extensive experiments on four benchmark datasets and involving 12 +closed-source and open-source LLMs, we demonstrate that RDES significantly +enhances classification accuracy compared to ten established baselines. +Furthermore, we investigate the incorporation of Chain-of-Thought (CoT) +reasoning in the reasoning process, which further enhances the model's +predictive performance. The results underscore the potential of reinforcement +learning to facilitate adaptive demonstration selection and deepen the +understanding of classification challenges. + +
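+ A small sketch of one way to compute the kind of label-distribution diversity
+score mentioned in the abstract above. The exact formulation used by RDES is
+not spelled out here, so the normalized-entropy score below is an illustrative
+assumption rather than the paper's definition.
+
+import math
+from collections import Counter
+
+def label_diversity(selected_labels):
+    """Normalized entropy of the label distribution among selected demos.
+
+    Returns a value in [0, 1]; 1 means the labels are perfectly balanced.
+    """
+    counts = Counter(selected_labels)
+    if len(counts) <= 1:
+        return 0.0
+    total = len(selected_labels)
+    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
+    return entropy / math.log(len(counts))
+
+print(label_diversity(["pos", "neg", "pos", "neg"]))  # 1.0 (balanced)
+print(label_diversity(["pos", "pos", "pos", "neg"]))  # ~0.81 (skewed)
+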
+
+
+
+
+ + ☆ MIND: Effective Incorrect Assignment Detection through a Multi-Modal + Structure-Enhanced Language Model + + +
+ The rapid growth of academic publications has exacerbated the issue of author
+name ambiguity in online digital libraries. Despite advances in name
+disambiguation algorithms, cumulative errors continue to undermine the
+reliability of academic systems. It is estimated that over 10% of paper-author
+assignments are rectified when constructing the million-scale WhoIsWho
+benchmark. Existing endeavors to detect incorrect assignments are either
+semantic-based or graph-based approaches, which fall short of making full use
+of the rich text attributes of papers and implicit structural features defined
+via the co-occurrence of paper attributes. To this end, this paper introduces a
+structure-enhanced language model that combines key structural features from
+graph-based methods with fine-grained semantic features from rich paper
+attributes to detect incorrect assignments. The proposed model is trained with
+a highly effective multi-modal multi-turn instruction tuning framework, which
+incorporates task-guided instruction tuning, text-attribute modality, and
+structural modality. Experimental results demonstrate that our model
+outperforms previous approaches, achieving top performance on the leaderboard
+of KDD Cup 2024. Our code is publicly available.
+
+
+
+
+
+ + ♻ ☆ Evaluating Large Vision-and-Language Models on Children's Mathematical + Olympiads NeurIPS 2024 + + +
+ Recent years have seen significant progress in the general-purpose problem
+solving abilities of large vision and language models (LVLMs), such as ChatGPT,
+Gemini, etc.; some of these breakthroughs even seem to enable AI models to
+outperform human abilities in varied tasks that demand higher-order cognitive
+skills. Are the current large AI models indeed capable of generalized problem
+solving as humans do? A systematic analysis of AI capabilities for joint vision
+and text reasoning, however, is missing in the current scientific literature.
+In this paper, we make an effort towards filling this gap, by evaluating
+state-of-the-art LVLMs on their mathematical and algorithmic reasoning
+abilities using visuo-linguistic problems from children's Olympiads.
+Specifically, we consider problems from the Mathematical Kangaroo (MK)
+Olympiad, which is a popular international competition targeted at children
+from grades 1-12, that tests children's deeper mathematical abilities using
+puzzles that are appropriately gauged to their age and skills. Using the
+puzzles from MK, we created a dataset, dubbed SMART-840, consisting of 840
+problems from years 2020-2024. With our dataset, we analyze the mathematical
+reasoning power of LVLMs; their responses on our puzzles offer a direct way to
+compare against those of children. Our results show that modern LVLMs do
+demonstrate increasingly powerful reasoning skills in solving problems for
+higher grades, but lack the foundations to correctly answer problems designed
+for younger children. Further analysis shows that there is no significant
+correlation between the reasoning capabilities of AI models and those of young
+children, and their capabilities appear to be based on a different type of
+reasoning than the cumulative knowledge that underlies children's mathematics
+and logic skills.
+
+
+ comment: Accepted at NeurIPS 2024 (Datasets and Benchmarks Track) +
+
+
+
+
+ + ♻ ☆ Large Language Models Must Be Taught to Know What They Don't Know NeurIPS 2024 + + +
+ When using large language models (LLMs) in high-stakes applications, we need +to know when we can trust their predictions. Some works argue that prompting +high-performance LLMs is sufficient to produce calibrated uncertainties, while +others introduce sampling methods that can be prohibitively expensive. In this +work, we first argue that prompting on its own is insufficient to achieve good +calibration and then show that fine-tuning on a small dataset of correct and +incorrect answers can create an uncertainty estimate with good generalization +and small computational overhead. We show that a thousand graded examples are +sufficient to outperform baseline methods and that training through the +features of a model is necessary for good performance and tractable for large +open-source models when using LoRA. We also investigate the mechanisms that +enable reliable LLM uncertainty estimation, finding that many models can be +used as general-purpose uncertainty estimators, applicable not just to their +own uncertainties but also the uncertainty of other models. Lastly, we show +that uncertainty estimates inform human use of LLMs in human-AI collaborative +settings through a user study. + +
+
+ comment: NeurIPS 2024 Camera Ready +
+
+
+
+
+ + ♻ ☆ Combining Autoregressive and Autoencoder Language Models for Text + Classification + + +
+ This paper presents CAALM-TC (Combining Autoregressive and Autoencoder
+Language Models for Text Classification), a novel method that enhances text
+classification by integrating autoregressive and autoencoder language models.
+Autoregressive large language models such as OpenAI's GPT, Meta's Llama or
+Microsoft's Phi offer promising prospects for content analysis practitioners,
+but they generally underperform supervised BERT-based models for text
+classification. CAALM leverages autoregressive models to generate contextual
+information based on input texts, which is then combined with the original text
+and fed into an autoencoder model for classification. This hybrid approach
+capitalizes on the extensive contextual knowledge of autoregressive models and
+the efficient classification capabilities of autoencoders. Experimental results
+on four benchmark datasets demonstrate that CAALM consistently outperforms
+existing methods, particularly in tasks with smaller datasets and more abstract
+classification objectives. The findings indicate that CAALM offers a scalable
+and effective solution for automated content analysis in social science
+research that minimizes sample size requirements.
+
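+ A minimal sketch of the hybrid setup described in the abstract above: an
+autoregressive model writes a short piece of contextual information about the
+input text, which is concatenated with the original text and passed to an
+autoencoder-style classifier. The model checkpoints and the prompt are
+placeholders chosen for illustration, not the ones used in the paper.
+
+from transformers import pipeline
+
+# Autoregressive model that generates auxiliary context (placeholder model).
+generator = pipeline("text-generation", model="gpt2")
+# Autoencoder-style model that performs the final classification (placeholder).
+classifier = pipeline("text-classification",
+                      model="distilbert-base-uncased-finetuned-sst-2-english")
+
+def caalm_style_classify(text):
+    prompt = f"Briefly describe the topic and stance of this text: {text}\n"
+    generated = generator(prompt, max_new_tokens=40)[0]["generated_text"]
+    context = generated[len(prompt):]          # keep only the new continuation
+    return classifier(text + " [CONTEXT] " + context)[0]
+
+print(caalm_style_classify("The new policy will hurt small businesses."))
+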
+
+ comment: There is an error in the figure on page 7, where the formula and
+ the representation of an autoencoder-based classifier are inconsistent and
+ may mislead readers
+
+
+
+
+ + ♻ ☆ StarVector: Generating Scalable Vector Graphics Code from Images and + Text + + +
+ Scalable Vector Graphics (SVGs) are vital for modern image rendering due to +their scalability and versatility. Previous SVG generation methods have focused +on curve-based vectorization, lacking semantic understanding, often producing +artifacts, and struggling with SVG primitives beyond path curves. To address +these issues, we introduce StarVector, a multimodal large language model for +SVG generation. It performs image vectorization by understanding image +semantics and using SVG primitives for compact, precise outputs. Unlike +traditional methods, StarVector works directly in the SVG code space, +leveraging visual understanding to apply accurate SVG primitives. To train +StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables +generalization across vectorization tasks and precise use of primitives like +ellipses, polygons, and text. We address challenges in SVG evaluation, showing +that pixel-based metrics like MSE fail to capture the unique qualities of +vector graphics. We introduce SVG-Bench, a benchmark across 10 datasets, and 3 +tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this +setup, StarVector achieves state-of-the-art performance, producing more compact +and semantically rich SVGs. + +
+
+
+
+
+ + ♻ ☆ How language models extrapolate outside the training data: A case study + in Textualized Gridworld + + +
+ Language models' ability to extrapolate learned behaviors to novel, more
+complex environments beyond their training scope remains largely unknown. This
+study introduces a path planning task in a textualized Gridworld to probe
+language models' extrapolation capabilities. We show that conventional
+approaches, including next token prediction and Chain of Thought (CoT)
+finetuning, fail to extrapolate in larger, unseen environments. Inspired by
+human cognition and dual process theory, we propose cognitive maps for path
+planning, a novel CoT framework that simulates humanlike mental
+representations. Our experiments show that cognitive maps not only enhance
+extrapolation to unseen environments but also exhibit humanlike characteristics
+through structured mental simulation and rapid adaptation. Our finding that
+these cognitive maps require specialized training schemes and cannot be induced
+through simple prompting opens up important questions about developing
+general-purpose cognitive maps in language models. Our comparison with
+exploration-based methods further illuminates the complementary strengths of
+offline planning and online exploration.
+
+
+
+
+
+ + ♻ ☆ OffensiveLang: A Community Based Implicit Offensive Language Dataset + + +
+ The widespread presence of hateful language on social media has resulted in
+adverse effects on societal well-being. As a result, addressing this issue has
+become a high priority. Hate speech and offensive language exist in both
+explicit and implicit forms, with the latter being more challenging to detect.
+Current research in this domain encounters several challenges. Firstly, the
+existing datasets primarily rely on the collection of texts containing explicit
+offensive keywords, making it challenging to capture implicitly offensive
+content that is devoid of these keywords. Secondly, common methodologies tend
+to focus solely on textual analysis, neglecting the valuable insights that
+community information can provide. In this research paper, we introduce a novel
+dataset, OffensiveLang, a community-based implicit offensive language dataset
+generated by ChatGPT 3.5 containing data for 38 different target groups.
+Despite limitations in generating offensive texts using ChatGPT due to ethical
+constraints, we present a prompt-based approach that effectively generates
+implicit offensive language. To ensure data quality, we evaluate the dataset
+with human annotators. Additionally, we employ a prompt-based zero-shot method
+with ChatGPT and compare the detection results between human annotation and
+ChatGPT annotation. We utilize existing state-of-the-art models to see how
+effective they are in detecting such language. The dataset is available here:
+https://github.com/AmitDasRup123/OffensiveLang
+
+
+
+
+
+ + ♻ ☆ Jointly Modeling Inter- & Intra-Modality Dependencies for Multi-modal + Learning NeurIPS 2024 + + +
+ Supervised multi-modal learning involves mapping multiple modalities to a +target label. Previous studies in this field have concentrated on capturing in +isolation either the inter-modality dependencies (the relationships between +different modalities and the label) or the intra-modality dependencies (the +relationships within a single modality and the label). We argue that these +conventional approaches that rely solely on either inter- or intra-modality +dependencies may not be optimal in general. We view the multi-modal learning +problem from the lens of generative models where we consider the target as a +source of multiple modalities and the interaction between them. Towards that +end, we propose inter- & intra-modality modeling (I2M2) framework, which +captures and integrates both the inter- and intra-modality dependencies, +leading to more accurate predictions. We evaluate our approach using real-world +healthcare and vision-and-language datasets with state-of-the-art models, +demonstrating superior performance over traditional methods focusing only on +one type of modality dependency. + +
+
+ comment: Accepted to NeurIPS 2024. Code available at + https://github.com/divyam3897/I2M2 +
+
+
+
+
+ + ♻ ☆ Resolving Lexical Bias in Edit Scoping with Projector Editor Networks + + +
+ Weight-preserving model editing techniques heavily rely on the scoping
+mechanism that decides when to apply an edit to the base model. These scoping
+mechanisms utilize distance functions in the representation space to ascertain
+the scope of the edit. In this work, we show that distance-based scoping
+functions grapple with lexical biases, leading to issues such as misfires with
+irrelevant prompts that share similar lexical characteristics. To address this
+problem, we introduce Projector Editor Networks for Model Editing (PENME), a
+model editing approach that employs a compact adapter with a projection network
+trained via a contrastive learning objective. We demonstrate the efficacy of
+PENME in achieving superior results while being compute-efficient and flexible
+to adapt across model architectures.
+
+
+
+
+
+ + ♻ ☆ Evaluating Numerical Reasoning in Text-to-Image Models + + +
+ Text-to-image generative models are capable of producing high-quality images +that often faithfully depict concepts described using natural language. In this +work, we comprehensively evaluate a range of text-to-image models on numerical +reasoning tasks of varying difficulty, and show that even the most advanced +models have only rudimentary numerical skills. Specifically, their ability to +correctly generate an exact number of objects in an image is limited to small +numbers, it is highly dependent on the context the number term appears in, and +it deteriorates quickly with each successive number. We also demonstrate that +models have poor understanding of linguistic quantifiers (such as "a few" or +"as many as"), the concept of zero, and struggle with more advanced concepts +such as partial quantities and fractional representations. We bundle prompts, +generated images and human annotations into GeckoNum, a novel benchmark for +evaluation of numerical reasoning. + +
+
+
+
+
+ + ♻ ☆ Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks + of Language Models + + +
+ Language Model (LM) agents for cybersecurity that are capable of autonomously
+identifying vulnerabilities and executing exploits have the potential to cause
+real-world impact. Policymakers, model providers, and researchers in the AI and
+cybersecurity communities are interested in quantifying the capabilities of
+such agents to help mitigate cyberrisk and investigate opportunities for
+penetration testing. Toward that end, we introduce Cybench, a framework for
+specifying cybersecurity tasks and evaluating agents on those tasks. We include
+40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF
+competitions, chosen to be recent, meaningful, and spanning a wide range of
+difficulties. Each task includes its own description and starter files, and is
+initialized in an environment where an agent can execute commands and observe
+outputs. Since many tasks are beyond the capabilities of existing LM agents, we
+introduce subtasks for each task, which break down a task into intermediary
+steps for a more detailed evaluation. To evaluate agent capabilities, we
+construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI
+o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini
+1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct. For the top-performing
+models (GPT-4o and Claude 3.5 Sonnet), we further investigate performance
+across 4 agent scaffolds (structured bash, action-only, pseudoterminal, and web
+search). Without subtask guidance, agents leveraging Claude 3.5 Sonnet, GPT-4o,
+OpenAI o1-preview, and Claude 3 Opus successfully solved complete tasks that
+took human teams up to 11 minutes to solve. In comparison, the most difficult
+task took human teams 24 hours and 54 minutes to solve. All code and data are
+publicly available at https://cybench.github.io.
+
+
+ comment: 151 pages, 9 figures +
+
+
+
+
+ + ♻ ☆ Scaling Laws for Post Training Quantized Large Language Models + + +
+ Generalization abilities of well-trained large language models (LLMs) are +known to scale predictably as a function of model size. In contrast to the +existence of practical scaling laws governing pre-training, the quality of LLMs +after post-training compression remains highly unpredictable, often requiring +case-by-case validation in practice. In this work, we attempted to close this +gap for post-training weight quantization of LLMs by conducting a systematic +empirical study on multiple LLM families quantized to numerous low-precision +tensor data types using popular weight quantization techniques. We identified +key scaling factors pertaining to characteristics of the local loss landscape, +based on which the performance of quantized LLMs can be reasonably well +predicted by a statistical model. + +
+
+
+
+
+ + ♻ ☆ The Semantic Hub Hypothesis: Language Models Share Semantic + Representations Across Languages and Modalities + + +
+ Modern language models can process inputs across diverse languages and
+modalities. We hypothesize that models acquire this capability through learning
+a shared representation space across heterogeneous data types (e.g., different
+languages and modalities), which places semantically similar inputs near one
+another, even if they are from different modalities/languages. We term this the
+semantic hub hypothesis, following the hub-and-spoke model from neuroscience
+(Patterson et al., 2007), which posits that semantic knowledge in the human
+brain is organized through a transmodal semantic "hub" which integrates
+information from various modality-specific "spokes" regions. We first show that
+model representations for semantically equivalent inputs in different languages
+are similar in the intermediate layers, and that this space can be interpreted
+using the model's dominant pretraining language via the logit lens. This
+tendency extends to other data types, including arithmetic expressions, code,
+and visual/audio inputs. Interventions in the shared representation space in
+one data type also predictably affect model outputs in other data types,
+suggesting that this shared representation space is not simply a vestigial
+byproduct of large-scale training on broad data, but something that is actively
+utilized by the model during input processing.
+
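+ The logit-lens reading of intermediate layers mentioned in the abstract above
+can be reproduced on a standard HuggingFace GPT-2 as a rough illustration (the
+paper studies larger models): project an intermediate hidden state through the
+final layer norm and the unembedding matrix, then inspect the top tokens.
+
+import torch
+from transformers import GPT2LMHeadModel, GPT2TokenizerFast
+
+model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
+tok = GPT2TokenizerFast.from_pretrained("gpt2")
+
+inputs = tok("The capital of France is", return_tensors="pt")
+with torch.no_grad():
+    out = model(**inputs, output_hidden_states=True)
+
+layer = 6                                       # an intermediate layer
+h = out.hidden_states[layer][0, -1]             # hidden state, last position
+logits = model.lm_head(model.transformer.ln_f(h))
+print(tok.convert_ids_to_tokens(logits.topk(5).indices.tolist()))
+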
+
+
+
+
+ + ♻ ☆ SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large + Language Models by Summarizing Training Trajectories of Small Models + + +
+ Despite the effectiveness of data selection for large language models (LLMs) +during pretraining and instruction fine-tuning phases, improving data +efficiency in supervised fine-tuning (SFT) for specialized domains poses +significant challenges due to the complexity of fine-tuning data. To bridge +this gap, we introduce an effective and scalable data selection method for SFT, +SmallToLarge (S2L), which leverages training trajectories from small models to +guide the data selection for larger models. We demonstrate through extensive +experiments that S2L significantly improves data efficiency in SFT for +mathematical problem-solving, reducing the training data to just 11% of the +original MathInstruct dataset (Yue et al., 2023) to match full dataset +performance while outperforming state-of-the-art data selection algorithms by +an average of 4.7% across 6 in- and out-domain evaluation datasets. Remarkably, +selecting only 50K data for SFT, S2L achieves a 32.7% accuracy on the most +challenging MATH (Hendrycks et al., 2021) benchmark, improving Phi-2 (Li et +al., 2023b) by 16.6%. In clinical text summarization on the MIMIC-III dataset +(Johnson et al., 2016), S2L again outperforms training on the full dataset +using only 50% of the data. Notably, S2L can perform data selection using a +reference model 40x smaller than the target model, proportionally reducing the +cost of data selection. + +
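+ A rough sketch of the trajectory-based selection idea described in the
+abstract above: record each example's loss at several checkpoints of a small
+model, cluster the trajectories, then sample evenly across clusters to build
+the SFT set for the larger model. The clustering and sampling details here are
+illustrative assumptions, not the paper's exact recipe.
+
+import numpy as np
+from sklearn.cluster import KMeans
+
+def select_by_trajectory(loss_trajectories, budget, n_clusters=10, seed=0):
+    """loss_trajectories: array [n_examples, n_checkpoints] of small-model losses."""
+    rng = np.random.default_rng(seed)
+    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
+                      n_init=10).fit_predict(loss_trajectories)
+    selected = []
+    per_cluster = budget // n_clusters
+    for c in range(n_clusters):
+        members = np.flatnonzero(clusters == c)
+        take = min(per_cluster, len(members))
+        selected.extend(rng.choice(members, size=take, replace=False))
+    return np.array(selected)
+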
+
+
+
+
+ + ♻ ☆ WaveletGPT: Wavelets Meet Large Language Models + + +
+ Large Language Models (LLMs) have ushered in a new wave of artificial +intelligence advancements impacting every scientific field and discipline. They +are trained on a simple objective: to predict the next token given the previous +context. We live in a world where most of the data around us, e.g., text, +audio, and music, has a multi-scale structure associated with it. This paper +infuses LLMs with traditional signal processing ideas, namely wavelets, during +pre-training to take advantage of the structure. Without adding \textbf{any +extra parameters} to a GPT-style LLM architecture, we achieve the same +pre-training performance almost twice as fast in text, raw audio, and symbolic +music. This is achieved by imposing a structure on intermediate embeddings. +When trained for the same number of training steps, we achieve significant +gains in performance, which is comparable to pre-training a larger neural +architecture. Our architecture allows every next token prediction access to +intermediate embeddings at different temporal resolutions in every Transformer +decoder block. This work will hopefully pave the way for incorporating +multi-rate signal processing ideas into traditional LLM pre-training. Further, +we showcase pushing model performance by improving internal structure instead +of just going after scale. + +
+
+ comment: 16 pages, 4 figures +
+
+
+
+
+ 
+ ♻ ☆ CNNSum: Exploring Long-Context Summarization with Large Language Models
+ in Chinese Novels
+ 
+ 
+ Large Language Models (LLMs) have been well-researched in many long-context
+tasks. However, due to high annotation costs, high-quality long-context summary
+datasets for training or evaluation are scarce, limiting further research. In
+this work, we introduce CNNSum, a new multi-scale Chinese long-context novel
+summarization benchmark, comprising four subsets with lengths covering
+16k\textasciitilde128k and 695 samples in total; the annotations are
+human-driven. We evaluate commercial and open-source models on CNNSum and
+conduct a detailed analysis. Based on the observations, we further conduct
+fine-tuning exploration with short-context summary data. In our study: (1)
+GPT-4o underperformed, due to excessive subjective commentary. (2) Currently,
+long-context summarization mainly relies on memory ability; small LLMs with
+stable longer context lengths are the most cost-effective. Using long data
+concatenated from short-context summaries yields a significant improvement. (3)
+Prompt templates may cause a large performance gap but can be mitigated through
+fine-tuning. (4) Fine-tuned Chat or Instruction versions may harm the Base
+model, and further fine-tuning cannot bridge the performance gap. (5) While
+models with RoPE base scaling exhibit strong extrapolation potential, their
+performance may vary significantly when combined with other interpolation
+methods and need careful selection. (6) CNNSum provides more reliable and
+insightful evaluation results than other benchmarks. We release CNNSum to
+advance research in this field.
+
+
+
+
+
+ + ♻ ☆ Context-Informed Machine Translation of Manga using Multimodal Large + Language Models COLING 2025 + + +
+ Due to the significant time and effort required for handcrafting
+translations, most manga never leave the domestic Japanese market. Automatic
+manga translation is a promising potential solution. However, it is a budding
+and underdeveloped field and presents complexities even greater than those
+found in standard translation due to the need to effectively incorporate visual
+elements into the translation process to resolve ambiguities. In this work, we
+investigate to what extent multimodal large language models (LLMs) can provide
+effective manga translation, thereby assisting manga authors and publishers in
+reaching wider audiences. Specifically, we propose a methodology that leverages
+the vision component of multimodal LLMs to improve translation quality,
+evaluate the impact of translation unit size and context length, and propose a
+token-efficient approach for manga translation. Moreover, we introduce a new
+evaluation dataset -- the first parallel Japanese-Polish manga translation
+dataset -- as part of a benchmark to be used in future research. Finally, we
+contribute an open-source software suite, enabling others to benchmark LLMs for
+manga translation. Our findings demonstrate that our proposed methods achieve
+state-of-the-art results for Japanese-English translation and set a new
+standard for Japanese-Polish.
+
+
+ comment: COLING 2025 +
+
+
+
+
+ + ♻ ☆ Unveiling Entity-Level Unlearning for Large Language Models: A + Comprehensive Analysis COLING 2025 + + +
+ Large language model unlearning has garnered increasing attention due to its +potential to address security and privacy concerns, leading to extensive +research in the field. However, much of this research has concentrated on +instance-level unlearning, specifically targeting the removal of predefined +instances containing sensitive content. This focus has left a significant gap +in the exploration of full entity-level unlearning, which is critical in +real-world scenarios such as copyright protection. To this end, we propose a +novel task of Entity-level unlearning, which aims to erase entity-related +knowledge from the target model completely. To thoroughly investigate this +task, we systematically evaluate trending unlearning algorithms, revealing that +current methods struggle to achieve effective entity-level unlearning. Then, we +further explore the factors that influence the performance of the unlearning +algorithms, identifying that knowledge coverage and the size of the forget set +play pivotal roles. Notably, our analysis also uncovers that entities +introduced through fine-tuning are more vulnerable to unlearning than +pre-trained entities. These findings collectively offer valuable insights for +advancing entity-level unlearning for LLMs. + +
+
+ comment: Accepted by COLING 2025 +
+
+
+
+
+ + ♻ ☆ CoSy: Evaluating Textual Explanations of Neurons + + +
+ A crucial aspect of understanding the complex nature of Deep Neural Networks +(DNNs) is the ability to explain learned concepts within their latent +representations. While methods exist to connect neurons to human-understandable +textual descriptions, evaluating the quality of these explanations is +challenging due to the lack of a unified quantitative approach. We introduce +CoSy (Concept Synthesis), a novel, architecture-agnostic framework for +evaluating textual explanations of latent neurons. Given textual explanations, +our proposed framework uses a generative model conditioned on textual input to +create data points representing the explanations. By comparing the neuron's +response to these generated data points and control data points, we can +estimate the quality of the explanation. We validate our framework through +sanity checks and benchmark various neuron description methods for Computer +Vision tasks, revealing significant differences in quality. + +
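+ A minimal sketch of the comparison step in the evaluation framework described
+in the abstract above: given a neuron's activations on explanation-conditioned
+synthetic images and on control images, score the explanation by how well the
+activations separate the two sets. AUROC is used here as one reasonable choice;
+the framework's exact scoring may differ, and the activation values below are
+made up.
+
+import numpy as np
+from sklearn.metrics import roc_auc_score
+
+def explanation_quality(acts_on_concept, acts_on_control):
+    """AUROC of the neuron's activation separating concept images from controls."""
+    scores = np.concatenate([acts_on_concept, acts_on_control])
+    labels = np.concatenate([np.ones(len(acts_on_concept)),
+                             np.zeros(len(acts_on_control))])
+    return roc_auc_score(labels, scores)
+
+concept = np.array([2.1, 1.8, 2.5, 1.9])   # activations on generated images
+control = np.array([0.2, 0.4, 0.1, 0.6])   # activations on control images
+print(explanation_quality(concept, control))  # close to 1.0 -> good explanation
+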
+
+ comment: 10 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ Finding Transformer Circuits with Edge Pruning NeurIPS 2024 + + +
+ The path to interpreting a language model often proceeds via analysis of +circuits -- sparse computational subgraphs of the model that capture specific +aspects of its behavior. Recent work has automated the task of discovering +circuits. Yet, these methods have practical limitations, as they rely either on +inefficient search algorithms or inaccurate approximations. In this paper, we +frame automated circuit discovery as an optimization problem and propose *Edge +Pruning* as an effective and scalable solution. Edge Pruning leverages +gradient-based pruning techniques, but instead of removing neurons or +components, it prunes the \emph{edges} between components. Our method finds +circuits in GPT-2 that use less than half the number of edges compared to +circuits found by previous methods while being equally faithful to the full +model predictions on standard circuit-finding tasks. Edge Pruning is efficient +even with as many as 100K examples, outperforming previous methods in speed and +producing substantially better circuits. It also perfectly recovers the +ground-truth circuits in two models compiled with Tracr. Thanks to its +efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale +that prior methods operate on. We use this setting for a case study comparing +the mechanisms behind instruction prompting and in-context learning. We find +two circuits with more than 99.96% sparsity that match the performance of the +full model and reveal that the mechanisms in the two settings overlap +substantially. Our case study shows that Edge Pruning is a practical and +scalable tool for interpretability and sheds light on behaviors that only +emerge in large models. + +
+
+ comment: NeurIPS 2024 (Spotlight) +
+
+
+
+
+ + ♻ A Complexity-Based Theory of Compositionality + + +
+ Compositionality is believed to be fundamental to intelligence. In humans, it +underlies the structure of thought, language, and higher-level reasoning. In +AI, compositional representations can enable a powerful form of +out-of-distribution generalization, in which a model systematically adapts to +novel combinations of known concepts. However, while we have strong intuitions +about what compositionality is, there currently exists no formal definition for +it that is measurable and mathematical. Here, we propose such a definition, +which we call representational compositionality, that accounts for and extends +our intuitions about compositionality. The definition is conceptually simple, +quantitative, grounded in algorithmic information theory, and applicable to any +representation. Intuitively, representational compositionality states that a +compositional representation satisfies three properties. First, it must be +expressive. Second, it must be possible to re-describe the representation as a +function of discrete symbolic sequences with re-combinable parts, analogous to +sentences in natural language. Third, the function that relates these symbolic +sequences to the representation, analogous to semantics in natural language, +must be simple. Through experiments on both synthetic and real world data, we +validate our definition of compositionality and show how it unifies disparate +intuitions from across the literature in both AI and cognitive science. We also +show that representational compositionality, while theoretically intractable, +can be readily estimated using standard deep learning tools. Our definition has +the potential to inspire the design of novel, theoretically-driven models that +better capture the mechanisms of compositional thought. + +
+
+
+
+
+ + ♻ ☆ Model-GLUE: Democratized LLM Scaling for A Large Model Zoo in the Wild NeurIPS 2024 + + +
+ As Large Language Models (LLMs) excel across tasks and specialized domains,
+scaling LLMs based on existing models has garnered significant attention, which
+faces the challenge of decreasing performance when combining disparate models.
+Various techniques have been proposed for the aggregation of pre-trained LLMs,
+including model merging, Mixture-of-Experts, and stacking. Despite their
+merits, a comprehensive comparison and synergistic application of them to a
+diverse model zoo is yet to be adequately addressed. In light of this research
+gap, this paper introduces Model-GLUE, a holistic LLM scaling guideline. First,
+our work starts with a benchmarking of existing LLM scaling techniques,
+especially selective merging, and variants of mixture. Utilizing the insights
+from the benchmark results, we formulate an optimal strategy for the selection
+and aggregation of a heterogeneous model zoo characterized by different
+architectures and initializations. Our methodology involves the clustering of
+mergeable models and optimal merging strategy selection, and the integration of
+clusters through a model mixture. Finally, evidenced by our experiments on a
+diverse Llama-2-based model zoo, Model-GLUE shows an average performance
+enhancement of 5.61%, achieved without additional training. Codes are available
+at: https://github.com/Model-GLUE/Model-GLUE.
+
+
+ comment: 24 pages, 4 figures, accepted to NeurIPS 2024 Datasets and Benchmarks + Track +
+
+
+
+
+ + ♻ ☆ SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving + Model Transformation + + +
+ LLM inference for popular enterprise use cases, such as summarization, RAG, +and code-generation, typically observes orders of magnitude longer prompt +lengths than generation lengths. This characteristic leads to high cost of +prefill and increased response latency. In this paper, we present SwiftKV, a +novel model transformation and distillation procedure specifically designed to +reduce the time and cost of processing prompt tokens while preserving high +quality of generated tokens. SwiftKV combines three key mechanisms: i) +SingleInputKV, which prefills later layers' KV cache using a much earlier +layer's output, allowing prompt tokens to skip much of the model computation, +ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the +memory footprint and support larger batch size for higher throughput, and iii) +a knowledge-preserving distillation procedure that can adapt existing LLMs for +SwiftKV with minimal accuracy impact and low compute and data requirement. For +Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% +and the memory requirement of the KV cache by 62.5% while incurring minimum +quality degradation across a wide range of tasks. In the end-to-end inference +serving using an optimized vLLM implementation, SwiftKV realizes up to 2x +higher aggregate throughput and 60% lower time per output token. It can achieve +a staggering 560 TFlops/GPU of normalized inference throughput, which +translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 +GPUs. Our training, inference, and model implementations are open-sourced and +can be found through +https://huggingface.co/collections/Snowflake/swiftkv-models-674f7d7474eb789e185d31cb. + +
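+ A toy sketch of the SingleInputKV idea described in the abstract above,
+assuming a simplified transformer whose layers expose key/value projections:
+during prefill, the hidden state at one early cut point is reused to populate
+the KV cache of every later layer, so prompt tokens skip those layers'
+remaining computation. This illustrates the mechanism only and is not the
+paper's implementation; the layer attributes are hypothetical.
+
+def prefill_kv_single_input(layers, hidden_states, cut_layer):
+    """Fill a per-layer KV cache; layers after cut_layer reuse its hidden state.
+
+    layers: list of objects exposing .k_proj and .v_proj attention projections
+            (hypothetical simplified transformer layers)
+    hidden_states: hidden_states[i] is the input to layer i; it only needs to
+                   be computed by a normal forward pass up to index cut_layer
+    """
+    kv_cache = []
+    for i, layer in enumerate(layers):
+        # Early layers use their own input; later layers reuse the hidden state
+        # at the cut point, so the prompt skips their attention/MLP compute.
+        h = hidden_states[min(i, cut_layer)]
+        kv_cache.append((layer.k_proj(h), layer.v_proj(h)))
+    return kv_cache
+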
+
+
+
+
+ + ♻ ☆ RARE: Retrieval-Augmented Reasoning Enhancement for Large Language + Models + + +
+ This work introduces RARE (Retrieval-Augmented Reasoning Enhancement), a +versatile extension to the mutual reasoning framework (rStar), aimed at +enhancing reasoning accuracy and factual integrity across large language models +(LLMs) for complex, knowledge-intensive tasks such as commonsense and medical +reasoning. RARE incorporates two innovative actions within the Monte Carlo Tree +Search (MCTS) framework: A6, which generates search queries based on the +initial problem statement, performs information retrieval using those queries, +and augments reasoning with the retrieved data to formulate the final answer; +and A7, which leverages information retrieval specifically for generated +sub-questions and re-answers these sub-questions with the relevant contextual +information. Additionally, a Retrieval-Augmented Factuality Scorer is proposed +to replace the original discriminator, prioritizing reasoning paths that meet +high standards of factuality. Experimental results with LLaMA 3.1 show that +RARE enables open-source LLMs to achieve competitive performance with top +open-source models like GPT-4 and GPT-4o. This research establishes RARE as a +scalable solution for improving LLMs in domains where logical coherence and +factual integrity are critical. + +
+
+ comment: 24 pages +
+
+
+
+
+ + ♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching + + +
+ Ontology matching (OM) enables semantic interoperability between different +ontologies and resolves their conceptual heterogeneity by aligning related +entities. OM systems currently have two prevailing design paradigms: +conventional knowledge-based expert systems and newer machine learning-based +predictive systems. While large language models (LLMs) and LLM agents have +revolutionised data engineering and have been applied creatively in many +domains, their potential for OM remains underexplored. This study introduces a +novel agent-powered LLM-based design paradigm for OM systems. With +consideration of several specific challenges in leveraging LLM agents for OM, +we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), +consisting of two Siamese agents for retrieval and matching, with a set of +simple OM tools. Our framework is implemented in a proof-of-concept system. +Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks +over state-of-the-art OM systems show that our system can achieve results very +close to the long-standing best performance on simple OM tasks and can +significantly improve the performance on complex and few-shot OM tasks. + +
+
+ comment: 14 pages, 13 figures, 4 tables +
+
+
+
+
+ + ♻ ☆ Molmo and PixMo: Open Weights and Open Data for State-of-the-Art + Vision-Language Models + + +
+ Today's most advanced vision-language models (VLMs) remain proprietary. The +strongest open-weight models rely heavily on synthetic data from proprietary +VLMs to achieve good performance, effectively distilling these closed VLMs into +open ones. As a result, the community has been missing foundational knowledge +about how to build performant VLMs from scratch. We present Molmo, a new family +of VLMs that are state-of-the-art in their class of openness. Our key +contribution is a collection of new datasets called PixMo, including a dataset +of highly detailed image captions for pre-training, a free-form image Q&A +dataset for fine-tuning, and an innovative 2D pointing dataset, all collected +without the use of external VLMs. The success of our approach relies on careful +modeling choices, a well-tuned training pipeline, and, most critically, the +quality of our newly collected datasets. Our best-in-class 72B model not only +outperforms others in the class of open weight and data models, but also +outperforms larger proprietary models including Claude 3.5 Sonnet, and Gemini +1.5 Pro and Flash, second only to GPT-4o based on both academic benchmarks and +on a large human evaluation. Our model weights, new datasets, and source code +are available at https://molmo.allenai.org/blog. + +
+
+ comment: Updated with ablations and more technical details +
+
+
+
+
+ + ♻ ☆ Adaptive Circuit Behavior and Generalization in Mechanistic + Interpretability + + +
+ Mechanistic interpretability aims to understand the inner workings of large
+neural networks by identifying circuits, or minimal subgraphs within the model
+that implement algorithms responsible for performing specific tasks. These
+circuits are typically discovered and analyzed using a narrowly defined prompt
+format. However, given the ability of large language models (LLMs) to
+generalize across various prompt formats for the same task, it remains unclear
+how well these circuits generalize. For instance, it is unclear whether the
+model's generalization results from reusing the same circuit components, the
+components behaving differently, or the use of entirely different components.
+In this paper, we investigate the generality of the indirect object
+identification (IOI) circuit in GPT-2 small, which is well-studied and believed
+to implement a simple, interpretable algorithm. We evaluate its performance on
+prompt variants that challenge the assumptions of this algorithm. Our findings
+reveal that the circuit generalizes surprisingly well, reusing all of its
+components and mechanisms while only adding additional input edges. Notably,
+the circuit generalizes even to prompt variants where the original algorithm
+should fail; we discover a mechanism that explains this, which we term S2
+Hacking. Our findings indicate that circuits within LLMs may be more flexible
+and general than previously recognized, underscoring the importance of studying
+circuit generalization to better understand the broader capabilities of these
+models.
+
+
+
+ comment: 10 pages, 8 figures +
+
+
+
+
+ + ♻ ☆ Lexicalization Is All You Need: Examining the Impact of Lexical + Knowledge in a Compositional QALD System + + +
+ In this paper, we examine the impact of lexicalization on Question Answering +over Linked Data (QALD). It is well known that one of the key challenges in +interpreting natural language questions with respect to SPARQL lies in bridging +the lexical gap, that is mapping the words in the query to the correct +vocabulary elements. We argue in this paper that lexicalization, that is +explicit knowledge about the potential interpretations of a word with respect +to the given vocabulary, significantly eases the task and increases the +performance of QA systems. Towards this goal, we present a compositional QA +system that can leverage explicit lexical knowledge in a compositional manner +to infer the meaning of a question in terms of a SPARQL query. We show that +such a system, given lexical knowledge, has a performance well beyond current +QA systems, achieving up to a $35.8\%$ increase in the micro $F_1$ score +compared to the best QA system on QALD-9. This shows the importance and +potential of including explicit lexical knowledge. In contrast, we show that +LLMs have limited abilities to exploit lexical knowledge, with only marginal +improvements compared to a version without lexical knowledge. This shows that +LLMs have no ability to compositionally interpret a question on the basis of +the meaning of its parts, a key feature of compositional approaches. Taken +together, our work shows new avenues for QALD research, emphasizing the +importance of lexicalization and compositionality. + +
+
+ comment: 24th International Conference on Knowledge Engineering and Knowledge + Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands +
+
+
+
+
+ + ♻ ☆ KV Shifting Attention Enhances Language Modeling + + +
+ Current large language models are mainly based on decoder-only transformers,
+which have strong in-context learning (ICL) capabilities. It is generally
+believed that an important foundation of this ICL capability is the
+induction-heads mechanism, which requires at least two layers of attention. To
+implement the model's induction ability more efficiently, we revisit the
+induction-heads mechanism and propose KV shifting attention. We theoretically
+prove that KV shifting attention reduces the model's requirements on the depth
+and width of the induction-heads mechanism. Our experimental results
+demonstrate that KV shifting attention is beneficial to learning induction
+heads and language modeling, leading to better performance or faster
+convergence from toy models to pre-trained models with more than 10B
+parameters.
+
+
+
+ comment: 22 pages +
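+ Note: one way to picture "KV shifting" is that each position's keys and values
+are mixed with their one-token-shifted copies via learnable scalars before
+attention, giving a single layer access to previous-token information. The
+sketch below is an illustrative reading of this idea, not the paper's exact
+parameterization.
+
+import torch
+
+def kv_shift(k, v, alpha, beta):
+    """Mix keys/values with their one-position-shifted copies.
+
+    k, v: tensors of shape (batch, seq, dim); alpha, beta: pairs of learnable
+    scalars. Illustrative only; the published formulation may differ.
+    """
+    k_prev = torch.roll(k, shifts=1, dims=1)
+    v_prev = torch.roll(v, shifts=1, dims=1)
+    k_prev[:, 0] = 0.0  # there is no previous token for the first position
+    v_prev[:, 0] = 0.0
+    k_shifted = alpha[0] * k + alpha[1] * k_prev
+    v_shifted = beta[0] * v + beta[1] * v_prev
+    return k_shifted, v_shifted
+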
+
+
+
+
+ + ♻ ☆ Words in Motion: Extracting Interpretable Control Vectors for Motion + Transformers + + +
+ Transformer-based models generate hidden states that are difficult to
+interpret. In this work, we aim to interpret these hidden states and control
+them at inference, with a focus on motion forecasting. We use linear probes to
+measure neural collapse towards interpretable motion features in hidden states.
+High probing accuracy implies meaningful directions and distances between
+hidden states of opposing features, which we use to fit interpretable control
+vectors for activation steering at inference. To optimize our control vectors,
+we use sparse autoencoders with fully-connected, convolutional, and MLPMixer
+layers and various activation functions. Notably, we show that enforcing
+sparsity in hidden states leads to a more linear relationship between control
+vector temperatures and forecasts. Our approach enables mechanistic
+interpretability and zero-shot generalization to unseen dataset characteristics
+with negligible computational overhead. Our implementation is available at
+https://github.com/kit-mrt/future-motion
+
+
+
+ comment: Added autoencoders with convolutional and MLPMixer layers, and
+ JumpReLU activations
+
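+ Note: a common recipe for fitting such control vectors, sketched below under
+the assumption that hidden states for two opposing feature groups are
+available, is the difference of group means, applied at inference scaled by a
+"temperature". This is a generic illustration, not the authors' exact procedure
+or their optimization with sparse autoencoders.
+
+import numpy as np
+
+def fit_control_vector(h_pos, h_neg):
+    """Difference-of-means direction between two opposing feature groups.
+
+    h_pos, h_neg: arrays of shape (n_samples, hidden_dim).
+    """
+    v = h_pos.mean(axis=0) - h_neg.mean(axis=0)
+    return v / np.linalg.norm(v)
+
+def steer(hidden, control_vector, temperature):
+    """Shift hidden states along the control direction at inference time."""
+    return hidden + temperature * control_vector
+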
+
+
+
+
+ + ♻ ☆ SCAR: Sparse Conditioned Autoencoders for Concept Detection and Steering + in LLMs NeurIPS 2024 + + +
+ Large Language Models (LLMs) have demonstrated remarkable capabilities in +generating human-like text, but their output may not be aligned with the user +or even produce harmful content. This paper presents a novel approach to detect +and steer concepts such as toxicity before generation. We introduce the Sparse +Conditioned Autoencoder (SCAR), a single trained module that extends the +otherwise untouched LLM. SCAR ensures full steerability, towards and away from +concepts (e.g., toxic content), without compromising the quality of the model's +text generation on standard evaluation benchmarks. We demonstrate the effective +application of our approach through a variety of concepts, including toxicity, +safety, and writing style alignment. As such, this work establishes a robust +framework for controlling LLM generations, ensuring their ethical and safe +deployment in real-world applications. + +
+
+ comment: Accepted at Socially Responsible Language Modelling Research (SoLaR) + Workshop at NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Hybrid-SQuAD: Hybrid Scholarly Question Answering Dataset + + +
+ Existing Scholarly Question Answering (QA) methods typically target +homogeneous data sources, relying solely on either text or Knowledge Graphs +(KGs). However, scholarly information often spans heterogeneous sources, +necessitating the development of QA systems that integrate information from +multiple heterogeneous data sources. To address this challenge, we introduce +Hybrid-SQuAD (Hybrid Scholarly Question Answering Dataset), a novel large-scale +QA dataset designed to facilitate answering questions incorporating both text +and KG facts. The dataset consists of 10.5K question-answer pairs generated by +a large language model, leveraging the KGs DBLP and SemOpenAlex alongside +corresponding text from Wikipedia. In addition, we propose a RAG-based baseline +hybrid QA model, achieving an exact match score of 69.65 on the Hybrid-SQuAD +test set. + +
+
+
+
+
+ + ♻ ☆ Quest: Query-centric Data Synthesis Approach for Long-context Scaling of + Large Language Model + + +
+ Recent advancements in large language models (LLMs) have highlighted the +importance of extending context lengths for handling complex tasks. While +traditional methods for training on long contexts often use filtered long +documents, these approaches lead to domain imbalances, limiting model +performance. To address this, techniques like random document concatenation +(Standard) and similarity-based methods (KNN, ICLM) have been developed. +However, they either sacrifice semantic coherence or diversity. To balance both +aspects, we introduce Quest, a query-centric data synthesis method aggregating +semantically relevant yet diverse documents. Quest uses a generative model to +predict potential queries for each document, grouping documents with similar +queries and keywords. Extensive experiments demonstrate Quest's superior +performance on long-context tasks, achieving remarkable results with context +lengths of up to 1M tokens and confirming its scalability across various model +sizes. + +
+
+
+
+
+ + ♻ ☆ HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced + Context Awareness and Extrapolation + + +
+ Many positional encodings (PEs) are designed to exhibit long-term decay,
+based on a long-standing inductive assumption: tokens farther away from the
+current position carry less relevant information. We argue that long-term
+decay is outdated in the era of LLMs, as LLMs are now applied to tasks
+demanding precise retrieval of in-context information from arbitrary
+positions. Firstly, we present empirical analyses on various PEs,
+demonstrating that models inherently learn attention with only a local-decay
+pattern while forming a U-shape pattern globally, contradicting the principle
+of long-term decay. Furthermore, we conduct a detailed analysis of rotary
+position encoding (RoPE, a prevalent relative positional encoding in LLMs) and
+find that the U-shape attention is caused by some learned components, which
+are also the key factor limiting RoPE's expressiveness and extrapolation.
+Inspired by these insights, we propose High-frequency rotary Position Encoding
+(HoPE). HoPE replaces the specific components in RoPE with position-independent
+ones, retaining only high-frequency signals, which also breaks the principle of
+long-term decay in theory. HoPE achieves two major advantages: (1) Without
+constraints imposed by long-term decay, contradictory factors that limit
+spontaneous attention optimization and model extrapolation performance are
+removed. (2) Components representing positions and semantics are optimized.
+These enhance the model's context awareness and extrapolation, as validated by
+extensive experiments.
+
+
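+ Note: under the standard RoPE frequency schedule, the lowest-indexed channel
+pairs rotate fastest. A hedged sketch of "retaining only high-frequency
+signals" is to apply the rotary rotation only to the first few (fastest)
+channel pairs and leave the remaining channels position-independent; the cutoff
+and exact component selection below are illustrative, not HoPE's published
+design.
+
+import torch
+
+def hope_rotate(x, positions, keep, base=10000.0):
+    """Rotary embedding applied only to the highest-frequency channel pairs.
+
+    x: (..., seq, dim) with dim even; positions: 1-D tensor of token positions.
+    Channel pairs beyond `keep` are left untouched (position-independent).
+    """
+    dim = x.shape[-1]
+    half = dim // 2
+    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
+    angles = positions[:, None].float() * freqs[None, :]      # (seq, half)
+    cos, sin = angles.cos(), angles.sin()
+    x1, x2 = x[..., :half], x[..., half:]
+    rot1 = x1 * cos - x2 * sin
+    rot2 = x1 * sin + x2 * cos
+    # Keep the rotation only on the `keep` fastest (highest-frequency) pairs.
+    out1 = torch.cat([rot1[..., :keep], x1[..., keep:]], dim=-1)
+    out2 = torch.cat([rot2[..., :keep], x2[..., keep:]], dim=-1)
+    return torch.cat([out1, out2], dim=-1)
+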
+
+
+
+
+ + ♻ ☆ ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of + Large Language Models in Real-world Scenarios COLING 2025 + + +
+ Existing evaluations of tool learning primarily focus on validating the +alignment of selected tools for large language models (LLMs) with expected +outcomes. However, these approaches rely on a limited set of scenarios where +answers can be pre-determined, diverging from genuine needs. Furthermore, a +sole emphasis on outcomes disregards the complex capabilities required for LLMs +to effectively use tools. To tackle this issue, we propose ToolEyes, a +fine-grained system tailored for the evaluation of the LLMs' tool learning +capabilities in authentic scenarios. The system meticulously examines seven +real-world scenarios, analyzing five dimensions crucial to LLMs in tool +learning: format alignment, intent comprehension, behavior planning, tool +selection, and answer organization. Additionally, ToolEyes incorporates a tool +library boasting approximately 600 tools, serving as an intermediary between +LLMs and the physical world. Evaluations involving ten LLMs across three +categories reveal a preference for specific scenarios and limited cognitive +abilities in tool learning. Intriguingly, expanding the model size even +exacerbates the hindrance to tool learning. The code and data are available at +https://github.com/Junjie-Ye/ToolEyes. + +
+
+ comment: Accepted by COLING 2025 conference +
+
+
+
+
+ + ♻ ☆ LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence + Embeddings COLING 2025 + + +
+ Sentence embedding models play a key role in various Natural Language +Processing tasks, such as in Topic Modeling, Document Clustering and +Recommendation Systems. However, these models rely heavily on parallel data, +which can be scarce for many low-resource languages, including Luxembourgish. +This scarcity results in suboptimal performance of monolingual and +cross-lingual sentence embedding models for these languages. To address this +issue, we compile a relatively small but high-quality human-generated +cross-lingual parallel dataset to train LuxEmbedder, an enhanced sentence +embedding model for Luxembourgish with strong cross-lingual capabilities. +Additionally, we present evidence suggesting that including low-resource +languages in parallel training datasets can be more advantageous for other +low-resource languages than relying solely on high-resource language pairs. +Furthermore, recognizing the lack of sentence embedding benchmarks for +low-resource languages, we create a paraphrase detection benchmark specifically +for Luxembourgish, aiming to partially fill this gap and promote further +research. + +
+
+ comment: Accepted at COLING 2025 +
+
+
+
+
+ + ♻ ☆ A Little Goes a Long Way: Efficient Long Context Training and Inference + with Partial Contexts + + +
+ Training and serving long-context large language models (LLMs) incurs +substantial overhead. To address this, two critical steps are often required: a +pretrained LLM typically undergoes a separate stage for context length +extension by training on long-context data, followed by architectural +modifications to reduce the overhead of KV cache during serving. This paper +argues that integrating length extension with a GPU-friendly KV cache reduction +architecture not only reduces training overhead during length extension, but +also achieves better long-context performance. This leads to our proposed +LongGen, which finetunes a pretrained LLM into an efficient architecture during +length extension. LongGen builds on three key insights: (1) Sparse attention +patterns, such as window attention (attending to recent tokens), attention sink +(initial ones), and blockwise sparse attention (strided token blocks) are +well-suited for building efficient long-context models, primarily due to their +GPU-friendly memory access patterns, enabling efficiency gains not just +theoretically but in practice as well. (2) It is essential for the model to +have direct access to all tokens. A hybrid architecture with 1/3 full attention +layers and 2/3 efficient ones achieves a balanced trade-off between efficiency +and long-context performance. (3) Lightweight training on 5B long-context data +is sufficient to extend the hybrid model's context length from 4K to 128K. + We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its +effectiveness across different scales. During training with 128K-long contexts, +LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%, +compared to a full-attention baseline. During inference, LongGen reduces KV +cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding +speedup. + +
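+ Note: the 1/3 full-attention, 2/3 efficient-attention hybrid described above
+can be pictured as a simple per-layer assignment, with full-attention layers
+spread evenly across depth. The placement rule below is an illustrative guess;
+the exact layout used in LongGen may differ.
+
+def hybrid_attention_plan(num_layers, full_ratio=1 / 3):
+    """Assign 'full' or 'sparse' attention to each layer (illustrative)."""
+    num_full = max(1, round(num_layers * full_ratio))
+    stride = num_layers / num_full
+    full_layers = {int(i * stride) for i in range(num_full)}
+    return ["full" if i in full_layers else "sparse" for i in range(num_layers)]
+
+# Example: a 32-layer model keeps roughly 11 full-attention layers.
+print(hybrid_attention_plan(32))
+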
+
+
+
+
+
+
+
+ + Information Retrieval 13 + +
+
+
+ + ☆ HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and + Representation Learning + + +
+ Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by +integrating external document retrieval to provide domain-specific or +up-to-date knowledge. The effectiveness of RAG depends on the relevance of +retrieved documents, which is influenced by the semantic alignment of +embeddings with the domain's specialized content. Although full fine-tuning can +align language models to specific domains, it is computationally intensive and +demands substantial data. This paper introduces Hierarchical Embedding +Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy +clustering with matrix factorization within contrastive learning to efficiently +align LLM embeddings with domain-specific content. HEAL computes +level/depth-wise contrastive losses and incorporates hierarchical penalties to +align embeddings with the underlying relationships in label hierarchies. This +approach enhances retrieval relevance and document classification, effectively +reducing hallucinations in LLM outputs. In our experiments, we benchmark and +evaluate HEAL across diverse domains, including Healthcare, Material Science, +Cyber-security, and Applied Maths. + +
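+ Note: the level/depth-wise contrastive objective can be sketched as a weighted
+sum of supervised InfoNCE-style losses, one per level of the label hierarchy,
+with deeper (finer) levels optionally weighted more heavily. Temperature,
+weighting, and the clustering that produces the level labels are illustrative
+assumptions here, not HEAL's exact formulation.
+
+import torch
+import torch.nn.functional as F
+
+def level_contrastive(emb, labels, temperature=0.07):
+    """Supervised contrastive loss for one hierarchy level (sketch)."""
+    emb = F.normalize(emb, dim=-1)
+    sim = emb @ emb.T / temperature
+    self_mask = torch.eye(len(emb), dtype=torch.bool, device=emb.device)
+    sim = sim.masked_fill(self_mask, float("-inf"))    # exclude self-pairs
+    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
+    pos = (labels[:, None] == labels[None, :]) & ~self_mask
+    denom = pos.sum(dim=1).clamp(min=1)
+    return (-(log_prob.masked_fill(~pos, 0.0)).sum(dim=1) / denom).mean()
+
+def hierarchical_loss(emb, level_labels, level_weights):
+    """Weighted sum of per-level contrastive losses over the hierarchy."""
+    return sum(w * level_contrastive(emb, y)
+               for w, y in zip(level_weights, level_labels))
+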
+
+
+
+
+ + ☆ Semantic Retrieval at Walmart KDD 2022 + + +
+ In product search, the retrieval of candidate products before re-ranking is
+more critical and challenging than in other search settings such as web
+search, especially for tail queries, which have a complex and specific search
+intent. In this paper, we present a hybrid system for e-commerce search
+deployed at Walmart that combines a traditional inverted index and
+embedding-based neural retrieval to better answer user tail queries. Our
+system significantly improved the relevance of the search engine, measured by
+both offline and online evaluations. The improvements were achieved through a
+combination of different approaches. We present a new technique to train the
+neural model at scale and describe how the system was deployed in production
+with little impact on response time. We highlight multiple learnings and
+practical tricks that were used in the deployment of this system.
+
+
+
+ comment: 9 pages, 2 figures, 10 tables, KDD 2022
+
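+ Note: a minimal sketch of hybrid candidate retrieval is to merge the result
+lists of a lexical (inverted-index) ranker and a dense (embedding) retriever,
+for example with a weighted sum of min-max normalized scores. The fusion rule
+and weights below are generic illustrations; the deployed system's exact
+blending is not specified in the abstract.
+
+def normalize(scores):
+    """Min-max normalize a {doc_id: score} dict; constant scores map to 0."""
+    if not scores:
+        return {}
+    lo, hi = min(scores.values()), max(scores.values())
+    span = (hi - lo) or 1.0
+    return {d: (s - lo) / span for d, s in scores.items()}
+
+def hybrid_merge(lexical, dense, alpha=0.5, k=100):
+    """Weighted fusion of lexical and dense retrieval scores (illustrative)."""
+    lex, dns = normalize(lexical), normalize(dense)
+    fused = {d: alpha * lex.get(d, 0.0) + (1 - alpha) * dns.get(d, 0.0)
+             for d in set(lex) | set(dns)}
+    return sorted(fused.items(), key=lambda x: x[1], reverse=True)[:k]
+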
+
+
+
+
+ + ☆ Argumentative Experience: Reducing Confirmation Bias on Controversial + Issues through LLM-Generated Multi-Persona Debates + + +
+ Large language models (LLMs) are enabling designers to give life to exciting +new user experiences for information access. In this work, we present a system +that generates LLM personas to debate a topic of interest from different +perspectives. How might information seekers use and benefit from such a system? +Can centering information access around diverse viewpoints help to mitigate +thorny challenges like confirmation bias in which information seekers +over-trust search results matching existing beliefs? How do potential biases +and hallucinations in LLMs play out alongside human users who are also fallible +and possibly biased? + Our study exposes participants to multiple viewpoints on controversial issues +via a mixed-methods, within-subjects study. We use eye-tracking metrics to +quantitatively assess cognitive engagement alongside qualitative feedback. +Compared to a baseline search system, we see more creative interactions and +diverse information-seeking with our multi-persona debate system, which more +effectively reduces user confirmation bias and conviction toward their initial +beliefs. Overall, our study contributes to the emerging design space of +LLM-based information access systems, specifically investigating the potential +of simulated personas to promote greater exposure to information diversity, +emulate collective intelligence, and mitigate bias in information seeking. + +
+
+
+
+
+ + ☆ User-item fairness tradeoffs in recommendations + + +
+ In the basic recommendation paradigm, the most (predicted) relevant item is +recommended to each user. This may result in some items receiving lower +exposure than they "should"; to counter this, several algorithmic approaches +have been developed to ensure item fairness. These approaches necessarily +degrade recommendations for some users to improve outcomes for items, leading +to user fairness concerns. In turn, a recent line of work has focused on +developing algorithms for multi-sided fairness, to jointly optimize user +fairness, item fairness, and overall recommendation quality. This induces the +question: what is the tradeoff between these objectives, and what are the +characteristics of (multi-objective) optimal solutions? Theoretically, we +develop a model of recommendations with user and item fairness objectives and +characterize the solutions of fairness-constrained optimization. We identify +two phenomena: (a) when user preferences are diverse, there is "free" item and +user fairness; and (b) users whose preferences are misestimated can be +especially disadvantaged by item fairness constraints. Empirically, we +prototype a recommendation system for preprints on arXiv and implement our +framework, measuring the phenomena in practice and showing how these phenomena +inform the design of markets with recommendation systems-intermediated +matching. + +
+
+ comment: Accepted at the Thirty-Eighth Annual Conference on Neural Information + Processing Systems +
+
+
+
+
+ + ☆ Graph-Sequential Alignment and Uniformity: Toward Enhanced + Recommendation Systems + + +
+ Graph-based and sequential methods are two popular recommendation paradigms, +each excelling in its domain but lacking the ability to leverage signals from +the other. To address this, we propose a novel method that integrates both +approaches for enhanced performance. Our framework uses Graph Neural Network +(GNN)-based and sequential recommenders as separate submodules while sharing a +unified embedding space optimized jointly. To enable positive knowledge +transfer, we design a loss function that enforces alignment and uniformity both +within and across submodules. Experiments on three real-world datasets +demonstrate that the proposed method significantly outperforms using either +approach alone and achieves state-of-the-art results. Our implementations are +publicly available at https://github.com/YuweiCao-UIC/GSAU.git. + +
+
+ comment: Under review +
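+ Note: the alignment and uniformity terms referenced above are commonly defined
+as in Wang & Isola (2020): alignment pulls matched embedding pairs (here, e.g.,
+the graph-based and sequential views of the same interaction) together, while
+uniformity spreads embeddings over the unit hypersphere. The sketch below gives
+those standard formulations; how pairs are formed within and across submodules
+is the paper's own design and is not reproduced here.
+
+import torch
+import torch.nn.functional as F
+
+def alignment_loss(x, y, alpha=2):
+    """Mean distance between matched embedding pairs on the unit sphere."""
+    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
+    return (x - y).norm(p=2, dim=1).pow(alpha).mean()
+
+def uniformity_loss(x, t=2):
+    """Log of the mean Gaussian potential over all embedding pairs."""
+    x = F.normalize(x, dim=-1)
+    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
+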
+
+
+
+
+ + ☆ PoTable: Programming Standardly on Table-based Reasoning Like a Human + Analyst + + +
+ Table-based reasoning has garnered substantial research interest,
+particularly in its integration with Large Language Models (LLMs), which have
+revolutionized the general reasoning paradigm. Numerous LLM-based studies
+introduce symbolic tools (e.g., databases, Python) as assistants to extend
+human-like abilities in structured table understanding and complex arithmetic
+computations. However, these studies fall short of fully simulating human
+cognitive behavior when using symbolic tools, as they still suffer from
+limitations of non-standard logical splits and constrained operation pools. In
+this study, we propose PoTable as a novel table-based reasoning method that
+simulates a human tabular analyst, which integrates a Python interpreter as the
+real-time executor accompanied by an LLM-based operation planner and code
+generator. Specifically, PoTable follows a human-like logical stage split and
+extends the operation pool into an open-world space without any constraints.
+Through planning and executing in each distinct stage, PoTable completes the
+entire reasoning process in a standardized way and produces superior reasoning
+results along with highly accurate, step-by-step commented, and fully
+executable programs. Accordingly, the effectiveness and explainability of
+PoTable are fully demonstrated. Extensive experiments over three evaluation
+datasets from two public benchmarks on two backbones show the outstanding
+performance of our approach. In particular, GPT-based PoTable achieves over 4%
+higher absolute accuracy than runner-ups on all evaluation datasets.
+
+
+
+ comment: 12 pages, 4 figures +
+
+
+
+
+ + ☆ Pre-train, Align, and Disentangle: Empowering Sequential Recommendation + with Large Language Models + + +
+ Sequential recommendation (SR) aims to model the sequential dependencies in
+users' historical interactions to better capture their evolving interests.
+However, existing SR approaches primarily rely on collaborative data, which
+leads to limitations such as the cold-start problem and sub-optimal
+performance. Meanwhile, despite the success of large language models (LLMs),
+their application in industrial recommender systems is hindered by high
+inference latency, inability to capture all distribution statistics, and
+catastrophic forgetting. To this end, we propose a novel Pre-train, Align, and
+Disentangle (PAD) paradigm to empower recommendation models with LLMs.
+Specifically, we first pre-train both the SR and LLM models to obtain
+collaborative and textual embeddings. Next, a characteristic
+recommendation-anchored alignment loss is proposed using multi-kernel maximum
+mean discrepancy with Gaussian kernels. Finally, a triple-experts architecture,
+consisting of aligned and modality-specific experts with disentangled
+embeddings, is fine-tuned in a frequency-aware manner. Experiments conducted on
+three public datasets demonstrate the effectiveness of PAD, showing significant
+improvements and compatibility with various SR backbone models, especially on
+cold items. The implementation code and datasets will be publicly available.
+
+
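+ Note: the alignment loss above builds on multi-kernel maximum mean discrepancy
+(MK-MMD) with Gaussian kernels. A generic, biased MK-MMD estimator is sketched
+below; the bandwidth set and the recommendation-anchoring used in PAD are not
+reproduced, so treat this as a hedged illustration only.
+
+import torch
+
+def mk_mmd(x, y, bandwidths=(1.0, 2.0, 4.0, 8.0)):
+    """Multi-kernel MMD^2 between two embedding sets x: (n, d) and y: (m, d).
+
+    Uses a sum of Gaussian kernels over several bandwidths (illustrative).
+    """
+    def kernel(a, b):
+        d2 = torch.cdist(a, b).pow(2)
+        return sum(torch.exp(-d2 / (2 * bw ** 2)) for bw in bandwidths)
+    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
+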
+
+
+
+
+ + ☆ Graph Disentangle Causal Model: Enhancing Causal Inference in Networked + Observational Data WSDM 2025 + + +
+ Estimating individual treatment effects (ITE) from observational data is a
+critical task across various domains. However, many existing works on ITE
+estimation overlook the influence of hidden confounders, which remain
+unobserved at the individual unit level. To address this limitation,
+researchers have utilized graph neural networks to aggregate neighbors'
+features to capture the hidden confounders and mitigate confounding bias by
+minimizing the discrepancy of confounder representations between the treated
+and control groups. Despite the success of these approaches, practical
+scenarios often treat all features as confounders and involve substantial
+differences in feature distributions between the treated and control groups.
+Conflating adjustment variables with confounders and enforcing strict balance
+on the confounder representations could potentially undermine the
+effectiveness of outcome prediction. To mitigate this issue, we propose a
+novel framework called the \textit{Graph Disentangle Causal model} (GDC) to
+conduct ITE estimation in the network setting. GDC utilizes a causal
+disentangle module to separate unit features into adjustment and confounder
+representations. Then we design a graph aggregation module consisting of three
+distinct graph aggregators to obtain adjustment, confounder, and
+counterfactual confounder representations. Finally, a causal constraint module
+is employed to enforce the disentangled representations as true causal
+factors. The effectiveness of our proposed method is demonstrated by
+conducting comprehensive experiments on two networked datasets.
+
+
+
+ comment: Accepted by WSDM 2025 +
+
+
+
+
+ + ☆ Learning to Hash for Recommendation: A Survey + + +
+ With the explosive growth of users and items, Recommender Systems (RS) are
+facing unprecedented challenges on both retrieval efficiency and storage cost.
+Fortunately, Learning to Hash (L2H) techniques have been shown to be a
+promising solution to address the two dilemmas, whose core idea is encoding
+high-dimensional data into compact hash codes. To this end, L2H for RS (HashRec
+for short) has recently received widespread attention to support large-scale
+recommendations. In this survey, we present a comprehensive review of current
+HashRec algorithms. Specifically, we first introduce the commonly used
+two-tower models in the recall stage and identify two search strategies
+frequently employed in L2H. Then, we categorize prior works into a two-tier
+taxonomy based on: (i) the type of loss function and (ii) the optimization
+strategy. We also introduce some commonly used evaluation metrics to measure
+the performance of HashRec algorithms. Finally, we shed light on the
+limitations of current research and outline future research directions.
+Furthermore, the summary of HashRec methods reviewed in this survey can be
+found at
+\href{https://github.com/Luo-Fangyuan/HashRec}{https://github.com/Luo-Fangyuan/HashRec}.
+
+
+
+
+
+
+ + ♻ ☆ Agent-OM: Leveraging LLM Agents for Ontology Matching + + +
+ Ontology matching (OM) enables semantic interoperability between different +ontologies and resolves their conceptual heterogeneity by aligning related +entities. OM systems currently have two prevailing design paradigms: +conventional knowledge-based expert systems and newer machine learning-based +predictive systems. While large language models (LLMs) and LLM agents have +revolutionised data engineering and have been applied creatively in many +domains, their potential for OM remains underexplored. This study introduces a +novel agent-powered LLM-based design paradigm for OM systems. With +consideration of several specific challenges in leveraging LLM agents for OM, +we propose a generic framework, namely Agent-OM (Agent for Ontology Matching), +consisting of two Siamese agents for retrieval and matching, with a set of +simple OM tools. Our framework is implemented in a proof-of-concept system. +Evaluations of three Ontology Alignment Evaluation Initiative (OAEI) tracks +over state-of-the-art OM systems show that our system can achieve results very +close to the long-standing best performance on simple OM tasks and can +significantly improve the performance on complex and few-shot OM tasks. + +
+
+ comment: 14 pages, 13 figures, 4 tables +
+
+
+
+
+ + ♻ ☆ TiM4Rec: An Efficient Sequential Recommendation Model Based on + Time-Aware Structured State Space Duality Model + + +
+ The Sequential Recommendation modeling paradigm is shifting from Transformer +to Mamba architecture, which comprises two generations: Mamba1, based on the +State Space Model (SSM), and Mamba2, based on State Space Duality (SSD). +Although SSD offers superior computational efficiency compared to SSM, it +suffers performance degradation in sequential recommendation tasks, especially +in low-dimensional scenarios that are critical for these tasks. Considering +that time-aware enhancement methods are commonly employed to mitigate +performance loss, our analysis reveals that the performance decline of SSD can +similarly be fundamentally compensated by leveraging mechanisms in time-aware +methods. Thus, we propose integrating time-awareness into the SSD framework to +address these performance issues. However, integrating current time-aware +methods, modeled after TiSASRec, into SSD faces the following challenges: 1) +the complexity of integrating these transformer-based mechanisms with the SSD +architecture, and 2) the computational inefficiency caused by the need for +dimensionality expansion of time-difference modeling. To overcome these +challenges, we introduce a novel Time-aware Structured Masked Matrix that +efficiently incorporates time-aware capabilities into SSD. Building on this, we +propose Time-Aware Mamba for Recommendation (TiM4Rec), which mitigates +performance degradation in low-dimensional SSD contexts while preserving +computational efficiency. This marks the inaugural application of a time-aware +enhancement method specifically tailored for the Mamba architecture within the +domain of sequential recommendation. Extensive experiments conducted on three +real-world datasets demonstrate the superiority of our approach. The code for +our model is accessible at https://github.com/AlwaysFHao/TiM4Rec. + +
+
+
+
+
+ + ♻ ☆ Lexicalization Is All You Need: Examining the Impact of Lexical + Knowledge in a Compositional QALD System + + +
+ In this paper, we examine the impact of lexicalization on Question Answering +over Linked Data (QALD). It is well known that one of the key challenges in +interpreting natural language questions with respect to SPARQL lies in bridging +the lexical gap, that is mapping the words in the query to the correct +vocabulary elements. We argue in this paper that lexicalization, that is +explicit knowledge about the potential interpretations of a word with respect +to the given vocabulary, significantly eases the task and increases the +performance of QA systems. Towards this goal, we present a compositional QA +system that can leverage explicit lexical knowledge in a compositional manner +to infer the meaning of a question in terms of a SPARQL query. We show that +such a system, given lexical knowledge, has a performance well beyond current +QA systems, achieving up to a $35.8\%$ increase in the micro $F_1$ score +compared to the best QA system on QALD-9. This shows the importance and +potential of including explicit lexical knowledge. In contrast, we show that +LLMs have limited abilities to exploit lexical knowledge, with only marginal +improvements compared to a version without lexical knowledge. This shows that +LLMs have no ability to compositionally interpret a question on the basis of +the meaning of its parts, a key feature of compositional approaches. Taken +together, our work shows new avenues for QALD research, emphasizing the +importance of lexicalization and compositionality. + +
+
+ comment: 24th International Conference on Knowledge Engineering and Knowledge + Management (EKAW 2024), November 26-28, 2024, Amsterdam, The Netherlands +
+
+
+
+
+ + ♻ ☆ A Survey on Point-of-Interest Recommendations Leveraging Heterogeneous + Data + + +
+ Tourism is an important application domain for recommender systems. In this +domain, recommender systems are for example tasked with providing personalized +recommendations for transportation, accommodation, points-of-interest (POIs), +etc. Among these tasks, in particular the problem of recommending POIs that are +of likely interest to individual tourists has gained growing attention in +recent years. Providing POI recommendations to tourists can however be +especially challenging due to the variability of the user's context. With the +rapid development of the Web and today's multitude of online services, vast +amounts of data from various sources have become available, and these +heterogeneous data represent a huge potential to better address the challenges +of POI recommendation problems. In this work, we provide a survey of published +research on the problem of POI recommendation between 2021 and 2023. The +literature was surveyed to identify the information types, techniques and +evaluation methods employed. Based on the analysis, it was observed that the +current research tends to focus on a relatively narrow range of information +types and there is a significant potential in improving POI recommendation by +leveraging heterogeneous data. As the first information-centric survey on POI +recommendation research, this study serves as a reference for researchers +aiming to develop increasingly accurate, personalized and context-aware POI +recommender systems. + +
+
+
+
+
+
+
+
+ + Multimedia 4 + +
+
+
+ + ☆ Feature Coding in the Era of Large Models: Dataset, Test Conditions, and + Benchmark + + +
+ Large models have achieved remarkable performance across various tasks, yet +they incur significant computational costs and privacy concerns during both +training and inference. Distributed deployment has emerged as a potential +solution, but it necessitates the exchange of intermediate information between +model segments, with feature representations serving as crucial information +carriers. To optimize information exchange, feature coding methods are applied +to reduce transmission and storage overhead. Despite its importance, feature +coding for large models remains an under-explored area. In this paper, we draw +attention to large model feature coding and make three contributions to this +field. First, we introduce a comprehensive dataset encompassing diverse +features generated by three representative types of large models. Second, we +establish unified test conditions, enabling standardized evaluation pipelines +and fair comparisons across future feature coding studies. Third, we introduce +two baseline methods derived from widely used image coding techniques and +benchmark their performance on the proposed dataset. These contributions aim to +advance the field of feature coding, facilitating more efficient large model +deployment. All source code and the dataset will be made available on GitHub. + +
+
+
+
+
+ + ♻ ☆ Compression of Higher Order Ambisonics with Multichannel RVQGAN + + +
+ A multichannel extension to the RVQGAN neural coding method is proposed and
+realized for data-driven compression of third-order Ambisonics audio. The
+input and output layers of the generator and discriminator models are modified
+to accept multiple (16) channels without increasing the model bitrate. We also
+propose a loss function that accounts for spatial perception in immersive
+reproduction, as well as transfer learning from single-channel models.
+Listening test results with 7.1.4 immersive playback show that the proposed
+extension is suitable for coding scene-based, 16-channel Ambisonics content
+with good quality at 16 kbps.
+
+
+
+
+
+
+ + ♻ ☆ Identity-Preserving Text-to-Video Generation by Frequency Decomposition + + +
+ Identity-preserving text-to-video (IPT2V) generation aims to create +high-fidelity videos with consistent human identity. It is an important task in +video generation but remains an open problem for generative models. This paper +pushes the technical frontier of IPT2V in two directions that have not been +resolved in literature: (1) A tuning-free pipeline without tedious case-by-case +finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based +control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V +model to keep human identity consistent in the generated video. Inspired by +prior findings in frequency analysis of diffusion transformers, it employs +identity-control signals in the frequency domain, where facial features can be +decomposed into low-frequency global features and high-frequency intrinsic +features. First, from a low-frequency perspective, we introduce a global facial +extractor, which encodes reference images and facial key points into a latent +space, generating features enriched with low-frequency information. These +features are then integrated into shallow layers of the network to alleviate +training challenges associated with DiT. Second, from a high-frequency +perspective, we design a local facial extractor to capture high-frequency +details and inject them into transformer blocks, enhancing the model's ability +to preserve fine-grained features. We propose a hierarchical training strategy +to leverage frequency information for identity preservation, transforming a +vanilla pre-trained video generation model into an IPT2V model. Extensive +experiments demonstrate that our frequency-aware heuristic scheme provides an +optimal control solution for DiT-based models. Thanks to this scheme, our +ConsisID generates high-quality, identity-preserving videos, making strides +towards more effective IPT2V. + +
+
+ comment: 12 pages, 8 figures, Code: https://github.com/PKU-YuanGroup/ConsisID +
+
+
+
+
+ + ♻ ☆ Memories are One-to-Many Mapping Alleviators in Talking Face Generation + + +
+ Talking face generation aims at generating photo-realistic video portraits of +a target person driven by input audio. Due to its nature of one-to-many mapping +from the input audio to the output video (e.g., one speech content may have +multiple feasible visual appearances), learning a deterministic mapping like +previous works brings ambiguity during training, and thus causes inferior +visual results. Although this one-to-many mapping could be alleviated in part +by a two-stage framework (i.e., an audio-to-expression model followed by a +neural-rendering model), it is still insufficient since the prediction is +produced without enough information (e.g., emotions, wrinkles, etc.). In this +paper, we propose MemFace to complement the missing information with an +implicit memory and an explicit memory that follow the sense of the two stages +respectively. More specifically, the implicit memory is employed in the +audio-to-expression model to capture high-level semantics in the +audio-expression shared space, while the explicit memory is employed in the +neural-rendering model to help synthesize pixel-level details. Our experimental +results show that our proposed MemFace surpasses all the state-of-the-art +results across multiple scenarios consistently and significantly. + +
+
+ comment: IEEE Transactions on Pattern Analysis and Machine Intelligence + (2024). Project page: see https://memoryface.github.io +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Information Retrieval 8 + +
+
+
+ + ☆ Freshness and Informativity Weighted Cognitive Extent and Its + Correlation with Cumulative Citation Count + + +
+ In this paper, we revisit cognitive extent, originally defined as the number
+of unique phrases in a quota. We introduce Freshness and Informativity
+Weighted Cognitive Extent (FICE), calculated based on two novel weighting
+factors, the lifetime ratio and the informativity of scientific entities. We
+model the lifetime of each scientific entity as the time-dependent document
+frequency, which is fitted by a composition of multiple Gaussian profiles. The
+lifetime ratio is then calculated as the cumulative document frequency at the
+publication time $t_0$ divided by the cumulative document frequency over its
+entire lifetime. The informativity is calculated by normalizing the document
+frequency across all scientific entities recognized in a title. Using the ACL
+Anthology, we verified the trend formerly observed in several other domains
+that the number of unique scientific entities per quota has been increasing
+gradually at a slowing rate. We found that FICE exhibits a strong correlation
+with the average cumulative citation count within a quota. Our code is
+available at
+\href{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}{https://github.com/ZiheHerzWang/Freshness-and-Informativity-Weighted-Cognitive-Extent}
+
+
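+ Note: the two weighting factors can be written out directly from the
+description above. The sketch below omits the Gaussian-mixture fit of the
+document-frequency curve and simply works from observed per-year counts, so it
+is a hedged illustration rather than the authors' implementation.
+
+import numpy as np
+
+def lifetime_ratio(doc_freq_by_year, years, t0):
+    """Cumulative document frequency up to publication year t0, divided by the
+    cumulative frequency over the entity's entire lifetime."""
+    doc_freq_by_year = np.asarray(doc_freq_by_year, dtype=float)
+    years = np.asarray(years)
+    total = doc_freq_by_year.sum()
+    return doc_freq_by_year[years <= t0].sum() / total if total else 0.0
+
+def informativity(doc_freqs):
+    """Normalize document frequencies across the entities found in one title."""
+    doc_freqs = np.asarray(doc_freqs, dtype=float)
+    total = doc_freqs.sum()
+    return doc_freqs / total if total else doc_freqs
+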
+
+
+
+
+ + ☆ YT-30M: A multi-lingual multi-category dataset of YouTube comments + + +
+ This paper introduces two large-scale multilingual comment datasets, YT-30M
+(and YT-100K), from YouTube. The analysis in this paper is performed on a
+smaller sample (YT-100K) of YT-30M. Both datasets, YT-30M (full) and YT-100K
+(a randomly selected 100K sample from YT-30M), are publicly released for
+further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments
+posted on YouTube channels belonging to various YouTube categories. Each
+comment is associated with a video ID, comment ID, commenter name, commenter
+channel ID, comment text, upvotes, original channel ID, and the category of
+the YouTube channel (e.g., 'News & Politics', 'Science & Technology', etc.).
+
+
+
+
+
+
+ + ☆ Recommender Systems for Sustainability: Overview and Research Issues + + +
+ Sustainable development goals (SDGs) are regarded as a universal call to
+action with the overall objectives of protecting the planet, ending poverty,
+and ensuring peace and prosperity for all people. In order to achieve these
+objectives, different AI technologies play a major role. Specifically,
+recommender systems can provide support for organizations and individuals to
+achieve the defined goals. Recommender systems integrate AI technologies such
+as machine learning, explainable AI (XAI), case-based reasoning, and constraint
+solving in order to find and explain user-relevant alternatives from a
+potentially large set of options. In this article, we summarize the state of
+the art in applying recommender systems to support the achievement of
+sustainable development goals. In this context, we discuss open issues for
+future research.
+
+
+
+
+
+
+ + ☆ Beyond Questions: Leveraging ColBERT for Keyphrase Search + + +
+ While question-like queries are gaining popularity and search engines' users +increasingly adopt them, keyphrase search has traditionally been the +cornerstone of web search. This query type is also prevalent in specialised +search tasks such as academic or professional search, where experts rely on +keyphrases to articulate their information needs. However, current dense +retrieval models often fail with keyphrase-like queries, primarily because they +are mostly trained on question-like ones. This paper introduces a novel model +that employs the ColBERT architecture to enhance document ranking for keyphrase +queries. For that, given the lack of large keyphrase-based retrieval datasets, +we first explore how Large Language Models can convert question-like queries +into keyphrase format. Then, using those keyphrases, we train a keyphrase-based +ColBERT ranker (ColBERTKP_QD) to improve the performance when working with +keyphrase queries. Furthermore, to reduce the training costs associated with +training the full ColBERT model, we investigate the feasibility of training +only a keyphrase query encoder while keeping the document encoder weights +static (ColBERTKP_Q). We assess our proposals' ranking performance using both +automatically generated and manually annotated keyphrases. Our results reveal +the potential of the late interaction architecture when working under the +keyphrase search scenario. + +
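+ Note: training only the keyphrase query encoder while keeping the document
+encoder static (the ColBERTKP_Q variant described above) amounts to freezing
+one tower's parameters and optimizing the other. The attribute names
+`query_encoder` and `doc_encoder` below are hypothetical stand-ins for the two
+ColBERT towers, so this is a generic PyTorch sketch rather than the authors'
+training code.
+
+import torch
+
+def freeze_document_tower(model, lr=1e-5):
+    """Freeze the document encoder and return an optimizer over the query
+    encoder's parameters only (illustrative module names)."""
+    for p in model.doc_encoder.parameters():
+        p.requires_grad = False
+    trainable = [p for p in model.query_encoder.parameters() if p.requires_grad]
+    return torch.optim.AdamW(trainable, lr=lr)
+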
+
+
+
+
+ + ☆ Enhancing Recommendation Systems with GNNs and Addressing Over-Smoothing + + +
+ This paper addresses key challenges in enhancing recommendation systems by +leveraging Graph Neural Networks (GNNs) and addressing inherent limitations +such as over-smoothing, which reduces model effectiveness as network hierarchy +deepens. The proposed approach introduces three GNN-based recommendation +models, specifically designed to mitigate over-smoothing through innovative +mechanisms like residual connections and identity mapping within the +aggregation propagation process. These modifications enable more effective +information flow across layers, preserving essential user-item interaction +details to improve recommendation accuracy. Additionally, the study emphasizes +the critical need for interpretability in recommendation systems, aiming to +provide transparent and justifiable suggestions tailored to dynamic user +preferences. By integrating collaborative filtering with GNN architectures, the +proposed models not only enhance predictive accuracy but also align +recommendations more closely with individual behaviors, adapting to nuanced +shifts in user interests. This work advances the field by tackling both +technical and user-centric challenges, contributing to the development of +robust and explainable recommendation systems capable of managing the +complexity and scale of modern online environments. + +
+
+
+
+
+ + ☆ CLAS: A Machine Learning Enhanced Framework for Exploring Large 3D + Design Datasets + + +
+ Three-dimensional (3D) objects have wide applications. Despite the growing
+interest in 3D modeling in academia and industry, designing and/or creating
+3D objects from scratch remains time-consuming and challenging. With the
+development of generative artificial intelligence (AI), designers discover a
+new way to create images for ideation. However, generative AIs are less useful
+for creating 3D objects of satisfactory quality. To allow 3D designers to
+access a wide range of 3D objects for creative activities based on their
+specific demands, we propose a machine learning (ML) enhanced framework CLAS -
+named after its four steps of capture, label, associate, and search - to enable
+fully automatic retrieval of 3D objects based on user specifications,
+leveraging existing datasets of 3D objects. CLAS provides an effective and
+efficient method for any person or organization to benefit from their existing
+but underutilized 3D datasets. In addition, CLAS may also be used to produce
+high-quality 3D object synthesis datasets for training and evaluating 3D
+generative models. As a proof of concept, we created and showcased a search
+system with a web user interface (UI) for retrieving 6,778 3D objects of
+chairs in the ShapeNet dataset powered by CLAS. In a closed-set retrieval
+setting, our retrieval method achieves a mean reciprocal rank (MRR) of 0.58,
+top-1 accuracy of 42.27%, and top-10 accuracy of 89.64%.
+
+
+
+
+
+
+ + ♻ ☆ CoRNStack: High-Quality Contrastive Data for Better Code Ranking + + +
+ Effective code retrieval plays a crucial role in advancing code generation, +bug fixing, and software maintenance, particularly as software systems increase +in complexity. While current code embedding models have demonstrated promise in +retrieving code snippets for small-scale, well-defined tasks, they often +underperform in more demanding real-world applications such as bug localization +within GitHub repositories. We hypothesize that a key issue is their reliance +on noisy and inconsistent datasets for training, which impedes their ability to +generalize to more complex retrieval scenarios. To address these limitations, +we introduce CoRNStack, a large-scale, high-quality contrastive training +dataset for code that spans multiple programming languages. This dataset is +curated using consistency filtering to eliminate noisy positives and is further +enriched with mined hard negatives, thereby facilitating more effective +learning. We demonstrate that contrastive training of embedding models using +CoRNStack leads to state-of-the-art performance across a variety of code +retrieval tasks. Furthermore, the dataset can be leveraged for training code +reranking models, a largely underexplored area compared to text reranking. Our +finetuned code reranking model significantly improves the ranking quality over +the retrieved results. Finally, by employing our code retriever and reranker +together, we demonstrate significant improvements in function localization for +GitHub issues, an important component of real-world software development. + +
+
+
+
+
+ + ♻ ☆ Mathematical Information Retrieval: Search and Question Answering + + +
+ Mathematical information is essential for technical work, but its creation, +interpretation, and search are challenging. To help address these challenges, +researchers have developed multimodal search engines and mathematical question +answering systems. This book begins with a simple framework characterizing the +information tasks that people and systems perform as we work to answer +math-related questions. The framework is used to organize and relate the other +core topics of the book, including interactions between people and systems, +representing math formulas in sources, and evaluation. We close by addressing +some key questions and presenting directions for future work. This book is +intended for students, instructors, and researchers interested in systems that +help us find and use mathematical information. + +
+
+ comment: [DRAFT] Revised (2nd) draft +
+
+
+
+
+
+
+
+ + Multimedia 7 + +
+
+
+ + ☆ Personalizing Multimodal Large Language Models for Image Captioning: An + Experimental Analysis ECCV 2024 + + +
+ The task of image captioning demands an algorithm to generate natural +language descriptions of visual inputs. Recent advancements have seen a +convergence between image captioning research and the development of Large +Language Models (LLMs) and Multimodal LLMs -- like GPT-4V and Gemini -- which +extend the capabilities of text-only LLMs to multiple modalities. This paper +investigates whether Multimodal LLMs can supplant traditional image captioning +networks by evaluating their performance on various image description +benchmarks. We explore both the zero-shot capabilities of these models and +their adaptability to different semantic domains through fine-tuning methods, +including prompt learning, prefix tuning, and low-rank adaptation. Our results +demonstrate that while Multimodal LLMs achieve impressive zero-shot +performance, fine-tuning for specific domains while maintaining their +generalization capabilities intact remains challenging. We discuss the +implications of these findings for future research in image captioning and the +development of more adaptable Multimodal LLMs. + +
+
+ comment: ECCV 2024 Workshop on Green Foundation Models +
+
+
+
+
+ + ☆ SPICE: Smart Projection Interface for Cooking Enhancement + + +
+ Tangible User Interfaces (TUI) for human-computer interaction (HCI) provide
+the user with physical representations of digital information with the aim of
+overcoming the limitations of screen-based interfaces. Although many compelling
+demonstrations of TUIs exist in the literature, there is a lack of research on
+TUIs intended for daily two-handed tasks and processes, such as cooking. In
+response to this gap, we propose SPICE (Smart Projection Interface for Cooking
+Enhancement). SPICE investigates TUIs in a kitchen setting, aiming to transform
+the recipe-following experience from simply text-based to tangibly interactive.
+SPICE includes a tracking system, an agent-based software, and vision large
+language models to create and interpret a kitchen environment where recipe
+information is projected directly onto the cooking surface. We conducted a
+comparative usability study of SPICE and text-based recipe following with 30
+participants, assessing the task difficulty, total duration, and efficiency, as
+well as user confidence and taste perception. The results indicate that SPICE
+allowed participants to perform the recipe with fewer stops and in a shorter
+time, while also improving self-reported efficiency, confidence, and taste.
+Despite this, participants self-reported no change in overall difficulty, which
+is a direction for future research. Overall, the SPICE project demonstrates the
+potential of using TUIs to improve everyday activities, paving the way for
+future research in HCI and new computing interfaces.
+
+
+
+ comment: Article submitted to IUI 2025 +
+
+
+
+
+ + ☆ Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large + Vision-Language Model via Causality Analysis WACV2025 + + +
+ Recent advancements in large vision-language models (LVLM) have significantly
+enhanced their ability to comprehend visual inputs alongside natural language.
+However, a major challenge in their real-world application is hallucination,
+where LVLMs generate non-existent visual elements, eroding user trust. The
+underlying mechanism driving this multimodal hallucination is poorly
+understood. Little research has examined whether contexts such as sky, tree,
+or grass field lead the LVLM to hallucinate a frisbee. We hypothesize that
+hidden factors, such as objects, contexts, and semantic foreground-background
+structures, induce hallucination. This study proposes a novel causal approach:
+a hallucination probing system to identify these hidden factors. By analyzing
+the causality between images, text prompts, and network saliency, we
+systematically explore interventions to block these factors. Our experimental
+findings show that a straightforward technique based on our analysis can
+significantly reduce hallucinations. Additionally, our analyses indicate the
+potential to edit network internals to minimize hallucinated outputs.
+
+
+
+ comment: Accepted by WACV2025 +
+
+
+
+
+ + ♻ ☆ PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for + Long-Term Expressive Symbolic Music Generation + + +
+ AI-based music generation has progressed significantly in recent years. +However, creating symbolic music that is both long-structured and expressive +remains a considerable challenge. In this paper, we propose PerceiverS +(Segmentation and Scale), a novel architecture designed to address this issue +by leveraging both Effective Segmentation and Multi-Scale attention mechanisms. +Our approach enhances symbolic music generation by simultaneously learning +long-term structural dependencies and short-term expressive details. By +combining cross-attention and self-attention in a Multi-Scale setting, +PerceiverS captures long-range musical structure while preserving musical +diversity. The proposed model has been evaluated using the Maestro dataset and +has demonstrated improvements in generating music of conventional length with +expressive nuances. The project demos and the generated music samples can be +accessed through the link: https://perceivers.github.io + +
+
+
+
+
+ + ♻ ☆ FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking + Portrait + + +
+ With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow-matching generative model. We shift the generative modeling from the pixel-based latent space to a learned motion latent space, enabling efficient design of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with a simple yet effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency. +
+
+ comment: Project page: https://deepbrainai-research.github.io/float/ +
+
+
+
+
+ + ♻ ☆ Once-for-All: Controllable Generative Image Compression with Dynamic + Granularity Adaption + + +
+ Although recent generative image compression methods have demonstrated impressive potential in optimizing the rate-distortion-perception trade-off, they still face the critical challenge of flexible rate adaptation to diverse compression needs and scenarios. To overcome this challenge, this paper proposes a Controllable Generative Image Compression framework, termed Control-GIC, the first capable of fine-grained bitrate adaptation across a broad spectrum while ensuring high-fidelity, general-purpose compression. Control-GIC is grounded in a VQGAN framework that encodes an image as a sequence of variable-length codes (i.e., VQ-indices), which can be losslessly compressed and exhibits a direct positive correlation with the bitrates. Drawing inspiration from classical coding principles, we correlate the information density of local image patches with their granular representations. Hence, we can flexibly determine a proper allocation of granularity for the patches to achieve dynamic adjustment of the VQ-indices, resulting in desirable compression rates. We further develop a probabilistic conditional decoder capable of retrieving previously encoded multi-granularity representations according to transmitted codes, and then reconstructing hierarchical granular features under a conditional-probability formulation, enabling more informative aggregation to improve reconstruction realism. Our experiments show that Control-GIC allows highly flexible and controllable bitrate adaptation, and the results demonstrate its superior performance over recent state-of-the-art methods. +
+
+
+
+
+ + ♻ ☆ Zero-Shot Relational Learning for Multimodal Knowledge Graphs + + +
+ Relational learning is an essential task in the domain of knowledge representation, particularly in knowledge graph completion (KGC). While relational learning in traditional single-modal settings has been extensively studied, exploring it within a multimodal KGC context presents distinct challenges and opportunities. One of the major challenges is inference on newly discovered relations without any associated training data. This zero-shot relational learning scenario poses unique requirements for multimodal KGC, i.e., utilizing multimodality to facilitate relational learning. However, existing works fail to leverage multimodal information and leave the problem unexplored. In this paper, we propose a novel end-to-end framework, consisting of three components, i.e., a multimodal learner, a structure consolidator, and a relation embedding generator, to integrate diverse multimodal information and knowledge graph structures and thereby facilitate zero-shot relational learning. Evaluation results on three multimodal knowledge graphs demonstrate the superior performance of our proposed method. +
+
+ comment: In the Proceedings of the 2024 IEEE International Conference on Big + Data (IEEE BigData 2024) +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Information Retrieval 17 + +
+
+
+ + ☆ Arctic-Embed 2.0: Multilingual Retrieval Without Compromise + + +
+ This paper presents the training methodology of Arctic-Embed 2.0, a set of open-source text embedding models built for accurate and efficient multilingual retrieval. While prior works have suffered from degraded English retrieval quality, Arctic-Embed 2.0 delivers competitive retrieval quality on multilingual and English-only benchmarks, and supports Matryoshka Representation Learning (MRL) for efficient embedding storage with significantly lower quality degradation under compression compared to alternatives. We detail the design and implementation, presenting several important open research questions that arose during model development. We conduct experiments exploring these research questions and include extensive discussion aimed at fostering further research in this field. +
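+ As an illustration of the Matryoshka Representation Learning idea mentioned above, the sketch below shows how an MRL-style embedding can be truncated to a smaller dimension and re-normalized before scoring; the dimensions and random vectors are illustrative stand-ins, not Arctic-Embed 2.0's actual configuration.
```python
import numpy as np

def truncate_and_normalize(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep only the first `dim` components and L2-normalize, which
    Matryoshka-style training is designed to make safe."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Toy stand-ins for full 768-d document and query embeddings.
rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 768)).astype(np.float32)
query = rng.normal(size=(1, 768)).astype(np.float32)

for dim in (768, 256, 64):  # smaller dims trade retrieval quality for storage
    d = truncate_and_normalize(docs, dim)
    q = truncate_and_normalize(query, dim)
    top5 = np.argsort(-(d @ q.T).ravel())[:5]
    print(dim, top5)
```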
+
+ comment: 10 pages, 5 figures, 3 tables +
+
+
+
+
+ + ☆ CAISSON: Concept-Augmented Inference Suite of Self-Organizing Neural + Networks + + +
+ We present CAISSON, a novel hierarchical approach to Retrieval-Augmented +Generation (RAG) that transforms traditional single-vector search into a +multi-view clustering framework. At its core, CAISSON leverages dual +Self-Organizing Maps (SOMs) to create complementary organizational views of the +document space, where each view captures different aspects of document +relationships through specialized embeddings. The first view processes combined +text and metadata embeddings, while the second operates on metadata enriched +with concept embeddings, enabling a comprehensive multi-view analysis that +captures both fine-grained semantic relationships and high-level conceptual +patterns. This dual-view approach enables more nuanced document discovery by +combining evidence from different organizational perspectives. To evaluate +CAISSON, we develop SynFAQA, a framework for generating synthetic financial +analyst notes and question-answer pairs that systematically tests different +aspects of information retrieval capabilities. Drawing on HotPotQA's +methodology for constructing multi-step reasoning questions, SynFAQA generates +controlled test cases where each question is paired with the set of notes +containing its ground-truth answer, progressing from simple single-entity +queries to complex multi-hop retrieval tasks involving multiple entities and +concepts. Our experimental results demonstrate substantial improvements over +both basic and enhanced RAG implementations, particularly for complex +multi-entity queries, while maintaining practical response times suitable for +interactive applications. + +
+
+ comment: 26 pages, 7 figures, 8 tables +
+
+
+
+
+ + ☆ Explainable CTR Prediction via LLM Reasoning WSDM 2025 + + +
+ Recommendation systems have become integral to modern user experiences but lack transparency in their decision-making processes. Existing explainable recommendation methods are hindered by reliance on a post-hoc paradigm, wherein explanation generators are trained independently of the underlying recommender models. This paradigm necessitates substantial human effort in data construction and raises concerns about explanation reliability. In this paper, we present ExpCTR, a novel framework that integrates large language model-based explanation generation directly into the CTR prediction process. Inspired by recent advances in reinforcement learning, we employ two carefully designed reward mechanisms: LC alignment, which ensures explanations reflect user intentions, and IC alignment, which maintains consistency with traditional ID-based CTR models. Our approach incorporates an efficient training paradigm with LoRA and a three-stage iterative process. ExpCTR circumvents the need for extensive explanation datasets while fostering synergy between CTR prediction and explanation generation. Experimental results demonstrate that ExpCTR significantly enhances both recommendation accuracy and interpretability across three real-world datasets. +
+
+ comment: WSDM 2025 +
+
+
+
+
+ + ☆ Knowledge-Enhanced Conversational Recommendation via Transformer-based + Sequential Modelling + + +
+ In conversational recommender systems (CRSs), conversations usually involve a set of items and item-related entities or attributes; e.g., a director is a related entity of a movie. These items and item-related entities are often mentioned over the course of a dialog, leading to potential sequential dependencies among them. However, most existing CRSs neglect these potential sequential dependencies. In this article, we first propose a Transformer-based sequential conversational recommendation method, named TSCR, to model the sequential dependencies in conversations and thereby improve CRSs. In TSCR, we represent conversations by items and the item-related entities, and construct user sequences to discover user preferences by considering both the mentioned items and item-related entities. Based on the constructed sequences, we deploy a Cloze task to predict the recommended items along a sequence. Meanwhile, in certain domains, knowledge graphs formed by the items and their related entities are readily available, providing various kinds of associations among them. Given that TSCR does not benefit from such knowledge graphs, we then propose a knowledge graph enhanced version of TSCR, called TSCRKG. Specifically, we leverage the knowledge graph to initialize our model TSCRKG offline, and augment the user sequence of conversations (i.e., the sequence of mentioned items and item-related entities in the conversation) with multi-hop paths in the knowledge graph. Experimental results demonstrate that our TSCR model significantly outperforms state-of-the-art baselines, and the enhanced version TSCRKG further improves recommendation performance on top of TSCR. +
+
+ comment: Accepted by ACM TOIS +
+
+
+
+
+ + ☆ Active Learning via Classifier Impact and Greedy Selection for + Interactive Image Retrieval + + +
+ Active Learning (AL) is a user-interactive approach aimed at reducing +annotation costs by selecting the most crucial examples to label. Although AL +has been extensively studied for image classification tasks, the specific +scenario of interactive image retrieval has received relatively little +attention. This scenario presents unique characteristics, including an open-set +and class-imbalanced binary classification, starting with very few labeled +samples. We introduce a novel batch-mode Active Learning framework named GAL +(Greedy Active Learning) that better copes with this application. It +incorporates a new acquisition function for sample selection that measures the +impact of each unlabeled sample on the classifier. We further embed this +strategy in a greedy selection approach, better exploiting the samples within +each batch. We evaluate our framework with both linear (SVM) and non-linear +MLP/Gaussian Process classifiers. For the Gaussian Process case, we show a +theoretical guarantee on the greedy approximation. Finally, we assess our +performance for the interactive content-based image retrieval task on several +benchmarks and demonstrate its superiority over existing approaches and common +baselines. Code is available at https://github.com/barleah/GreedyAL. + +
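+ The GAL acquisition function itself is not reproduced here; the toy sketch below only illustrates the general pattern the abstract describes, i.e., scoring each unlabeled sample by its estimated impact on the classifier and growing the batch greedily, with a linear SVM and a pessimistic pseudo-label heuristic as placeholder choices.
```python
import numpy as np
from sklearn.svm import SVC

def impact_score(clf, X_lab, y_lab, x_new):
    """Toy 'impact' score: how much the decision values on the labeled set
    would move if x_new were added, worst-cased over its unknown label."""
    base = clf.decision_function(X_lab)
    shifts = []
    for pseudo in (0, 1):
        tmp = SVC(kernel="linear").fit(np.vstack([X_lab, x_new]),
                                       np.append(y_lab, pseudo))
        shifts.append(np.abs(tmp.decision_function(X_lab) - base).mean())
    return min(shifts)

def greedy_batch(clf, X_lab, y_lab, X_pool, batch_size=3):
    chosen, pool = [], list(range(len(X_pool)))
    for _ in range(batch_size):
        scores = [impact_score(clf, X_lab, y_lab, X_pool[i:i + 1]) for i in pool]
        best = pool.pop(int(np.argmax(scores)))
        chosen.append(best)
        # Greedy step: pretend the chosen sample is labeled with the model's guess.
        X_lab = np.vstack([X_lab, X_pool[best:best + 1]])
        y_lab = np.append(y_lab, clf.predict(X_pool[best:best + 1]))
        clf = SVC(kernel="linear").fit(X_lab, y_lab)
    return chosen

rng = np.random.default_rng(0)
X_lab = np.array([[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 1.1]])
y_lab = np.array([0, 1, 0, 1])
X_pool = rng.normal(0.5, 0.5, size=(20, 2))
clf = SVC(kernel="linear").fit(X_lab, y_lab)
print(greedy_batch(clf, X_lab, y_lab, X_pool))
```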
+
+ comment: Accepted to Transactions on Machine Learning Research (TMLR) +
+
+
+
+
+ + ☆ CADMR: Cross-Attention and Disentangled Learning for Multimodal + Recommender Systems + + +
+ The increasing availability and diversity of multimodal data in recommender +systems offer new avenues for enhancing recommendation accuracy and user +satisfaction. However, these systems must contend with high-dimensional, sparse +user-item rating matrices, where reconstructing the matrix with only small +subsets of preferred items for each user poses a significant challenge. To +address this, we propose CADMR, a novel autoencoder-based multimodal +recommender system framework. CADMR leverages multi-head cross-attention +mechanisms and Disentangled Learning to effectively integrate and utilize +heterogeneous multimodal data in reconstructing the rating matrix. Our approach +first disentangles modality-specific features while preserving their +interdependence, thereby learning a joint latent representation. The multi-head +cross-attention mechanism is then applied to enhance user-item interaction +representations with respect to the learned multimodal item latent +representations. We evaluate CADMR on three benchmark datasets, demonstrating +significant performance improvements over state-of-the-art methods. + +
+
+
+
+
+ + ☆ Characterizing Information Shared by Participants to Coding Challenges: + The Case of Advent of Code + + +
+ Advent of Code (AoC from now on) is a popular coding challenge requiring participants to solve programming puzzles across a variety of skill sets and levels. AoC follows the Advent calendar, so it is an annual challenge that lasts for 25 days. AoC participants usually post their solutions on social networks and discuss them online. These challenges are interesting to study since they can highlight the adoption of new tools, the evolution of the developer community, or the technological requirements of well-known companies. For these reasons, we first create a dataset of the 2019-2021 AoC editions containing the discussion threads made on the subreddit /r/adventofcode. Then, we propose a model based on stream graphs to best study this context, where we represent its most important actors through time: participants, comments, and programming languages. Thanks to our model, we investigate user participation, adoption of new programming languages during a challenge and between two of them, and resiliency of programming languages based on a Stack Overflow survey. We find that the top-used programming languages are almost the same across the three years, pointing out their importance. Moreover, participants tend to keep the same programming language for the whole challenge, while those attending two AoCs usually switch languages in the next edition. Finally, we observe interesting results about the programming languages that are "Popular" or "Loved" according to the Stack Overflow survey. First, these are the ones adopted for the longest time in an AoC edition, thanks to which users have a high chance of reaching the end of the challenge. Second, they are the most chosen when a participant decides to change programming language during the same challenge. +
+
+ comment: 10 pages, 7 figures +
+
+
+
+
+ + ☆ CausalMob: Causal Human Mobility Prediction with LLMs-derived Human + Intentions toward Public Events KDD 2025 + + +
+ Large-scale human mobility exhibits spatial and temporal patterns that can assist policymakers in decision making. Although traditional prediction models attempt to capture these patterns, they are often disrupted by non-periodic public events, such as disasters and occasional celebrations. Since regular human mobility patterns are heavily affected by these events, estimating their causal effects is critical to accurate mobility predictions. Although news articles provide unique perspectives on these events in an unstructured format, processing them is a challenge. In this study, we propose a causality-augmented prediction model, called CausalMob, to analyze the causal effects of public events. We first utilize large language models (LLMs) to extract human intentions from news articles and transform them into features that act as causal treatments. Next, the model learns representations of spatio-temporal regional covariates from multiple data sources to serve as confounders for causal inference. Finally, we present a causal effect estimation framework to ensure event features remain independent of confounders during prediction. Based on large-scale real-world data, the experimental results show that the proposed model excels in human mobility prediction, outperforming state-of-the-art models. +
+
+ comment: Accepted by KDD 2025 +
+
+
+
+
+ + ☆ Leveraging Large Language Models for Comparative Literature + Summarization with Reflective Incremental Mechanisms + + +
+ In this paper, we introduce ChatCite, a novel method leveraging large +language models (LLMs) for generating comparative literature summaries. The +ability to summarize research papers with a focus on key comparisons between +studies is an essential task in academic research. Existing summarization +models, while effective at generating concise summaries, fail to provide deep +comparative insights. ChatCite addresses this limitation by incorporating a +multi-step reasoning mechanism that extracts critical elements from papers, +incrementally builds a comparative summary, and refines the output through a +reflective memory process. We evaluate ChatCite on a custom dataset, +CompLit-LongContext, consisting of 1000 research papers with annotated +comparative summaries. Experimental results show that ChatCite outperforms +several baseline methods, including GPT-4, BART, T5, and CoT, across various +automatic evaluation metrics such as ROUGE and the newly proposed G-Score. +Human evaluation further confirms that ChatCite generates more coherent, +insightful, and fluent summaries compared to these baseline models. Our method +provides a significant advancement in automatic literature review generation, +offering researchers a powerful tool for efficiently comparing and synthesizing +scientific research. + +
+
+ comment: 8 pages +
+
+
+
+
+ + ☆ Personalized Multimodal Large Language Models: A Survey + + +
+ Multimodal Large Language Models (MLLMs) have become increasingly important +due to their state-of-the-art performance and ability to integrate multiple +data modalities, such as text, images, and audio, to perform complex tasks with +high accuracy. This paper presents a comprehensive survey on personalized +multimodal large language models, focusing on their architecture, training +methods, and applications. We propose an intuitive taxonomy for categorizing +the techniques used to personalize MLLMs to individual users, and discuss the +techniques accordingly. Furthermore, we discuss how such techniques can be +combined or adapted when appropriate, highlighting their advantages and +underlying rationale. We also provide a succinct summary of personalization +tasks investigated in existing research, along with the evaluation metrics +commonly used. Additionally, we summarize the datasets that are useful for +benchmarking personalized MLLMs. Finally, we outline critical open challenges. +This survey aims to serve as a valuable resource for researchers and +practitioners seeking to understand and advance the development of personalized +multimodal large language models. + +
+
+
+
+
+ + ☆ Improving Sequential Recommender Systems with Online and In-store User + Behavior + + +
+ Online e-commerce platforms have been extending into in-store shopping, allowing users to keep the familiar online browsing and checkout experience while also shopping in physical stores. However, the growing movement between online and in-store channels poses a challenge for sequential recommender systems that predict future online interactions, because hybrid user behaviors (online and in-store) are not modeled holistically. The challenges are twofold. First, combining online and in-store user behavior data into a single data schema while organically supporting multiple stages of the model life cycle (pre-training, training, inference, etc.) requires a new data pipeline design. Second, online recommender systems, which rely solely on online user behavior sequences, must be redesigned to accept both online and in-store user data as input under the sequential modeling setting. To overcome the first challenge, we propose a hybrid, omnichannel data pipeline that compiles online and in-store user behavior data by caching information from diverse data sources. To address the second, we introduce a model-agnostic encoder module into the sequential recommender system that interprets in-store transactions and augments the modeling capacity for better online interaction prediction given hybrid user behavior. +
+
+ comment: 6 pages, IEEE BigData 2024 Workshop +
+
+
+
+
+ + ☆ Future of Information Retrieval Research in the Age of Generative AI + + +
+ In the fast-evolving field of information retrieval (IR), the integration of +generative AI technologies such as large language models (LLMs) is transforming +how users search for and interact with information. Recognizing this paradigm +shift at the intersection of IR and generative AI (IR-GenAI), a visioning +workshop supported by the Computing Community Consortium (CCC) was held in July +2024 to discuss the future of IR in the age of generative AI. This workshop +convened 44 experts in information retrieval, natural language processing, +human-computer interaction, and artificial intelligence from academia, +industry, and government to explore how generative AI can enhance IR and vice +versa, and to identify the major challenges and opportunities in this rapidly +advancing field. + This report contains a summary of discussions as potentially important +research topics and contains a list of recommendations for academics, industry +practitioners, institutions, evaluation campaigns, and funding agencies. + +
+
+
+
+
+ + ♻ ☆ Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented + Generation NeurIPS 2024 + + +
+ Many language models now enhance their responses with retrieval capabilities, +leading to the widespread adoption of retrieval-augmented generation (RAG) +systems. However, despite retrieval being a core component of RAG, much of the +research in this area overlooks the extensive body of work on fair ranking, +neglecting the importance of considering all stakeholders involved. This paper +presents the first systematic evaluation of RAG systems integrated with fair +rankings. We focus specifically on measuring the fair exposure of each relevant +item across the rankings utilized by RAG systems (i.e., item-side fairness), +aiming to promote equitable growth for relevant item providers. To gain a deep +understanding of the relationship between item-fairness, ranking quality, and +generation quality in the context of RAG, we analyze nine different RAG systems +that incorporate fair rankings across seven distinct datasets. Our findings +indicate that RAG systems with fair rankings can maintain a high level of +generation quality and, in many cases, even outperform traditional RAG systems, +despite the general trend of a tradeoff between ensuring fairness and +maintaining system-effectiveness. We believe our insights lay the groundwork +for responsible and equitable RAG systems and open new avenues for future +research. We publicly release our codebase and dataset at +https://github.com/kimdanny/Fair-RAG. + +
+
+ comment: Top 5 Spotlight at AFME Workshop at NeurIPS 2024 +
+
+
+
+
+ + ♻ ☆ Predictive Models in Sequential Recommendations: Bridging Performance + Laws with Data Quality Insights + + +
+ Sequential Recommendation (SR) plays a critical role in predicting users' sequential preferences. Despite its growing prominence in various industries, the increasing scale of SR models incurs substantial computational costs and unpredictability, challenging developers to manage resources efficiently. In this predicament, scaling laws have achieved significant success by examining the loss as models scale up. However, there remains a disparity between loss and model performance, which is of greater concern in practical applications. Moreover, as datasets continue to expand, they accumulate repetitive and low-value samples. In response, we introduce the Performance Law for SR models, which aims to theoretically investigate and model the relationship between model performance and data quality. Specifically, we first fit the HR and NDCG metrics to transformer-based SR models. Subsequently, we propose Approximate Entropy (ApEn) to assess data quality, presenting a more nuanced approach compared to traditional data quantity metrics. Our method enables accurate predictions across various dataset scales and model sizes, demonstrating a strong correlation in large SR models and offering insights into achieving optimal performance for any given model configuration. +
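+ Approximate Entropy is a standard regularity statistic; a minimal implementation of the usual definition is sketched below so the data-quality idea is concrete (the embedding length m, tolerance r, and toy sequences are illustrative choices, not the paper's settings).
```python
import numpy as np

def approximate_entropy(series, m=2, r=0.2):
    """ApEn of a 1-D sequence: low values indicate a regular (repetitive)
    sequence, high values a more irregular one."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    tol = r * x.std()

    def phi(m):
        # All length-m templates and their pairwise Chebyshev distances.
        templates = np.array([x[i:i + m] for i in range(n - m + 1)])
        dists = np.max(np.abs(templates[:, None, :] - templates[None, :, :]), axis=2)
        counts = (dists <= tol).sum(axis=1) / (n - m + 1)
        return np.log(counts).mean()

    return phi(m) - phi(m + 1)

# A highly repetitive interaction sequence scores lower than a noisy one.
regular = [0, 1, 0, 1] * 25
noisy = np.random.default_rng(0).integers(0, 4, size=100)
print(approximate_entropy(regular), approximate_entropy(noisy))
```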
+
+ comment: 12 pages, 5 figures +
+
+
+
+
+ + ♻ ☆ A Novel Approach to Comprehending Users' Preferences for Accurate + Personalized News Recommendation + + +
+ Personalized news recommendation aims to assist users in finding news articles that align with their interests, playing a pivotal role in mitigating users' information overload problem. Although many recent studies have pursued better personalized news recommendation, the following challenges remain underexplored: (C1) comprehending the manifold intents coupled within a news article, (C2) differentiating varying post-read preferences of news articles, and (C3) addressing the cold-start user problem. To tackle the aforementioned challenges together, in this paper, we propose a novel personalized news recommendation framework (CROWN) that employs (1) category-guided intent disentanglement for (C1), (2) consistency-based news representation for (C2), and (3) GNN-enhanced hybrid user representation for (C3). Furthermore, we incorporate category prediction into the training process of CROWN as an auxiliary task, which provides supplementary supervisory signals to enhance intent disentanglement. Extensive experiments on two real-world datasets reveal that (1) CROWN provides consistent performance improvements over ten state-of-the-art news recommendation methods and (2) the proposed strategies significantly improve the accuracy of CROWN. +
+
+ comment: 10 pages, 6 figures, 8 tables +
+
+
+
+
+ + ♻ ☆ Generalized compression and compressive search of large datasets + + +
+ The Big Data explosion has necessitated the development of search algorithms +that scale sub-linearly in time and memory. + While compression algorithms and search algorithms do exist independently, +few algorithms offer both, and those which do are domain-specific. + We present panCAKES, a novel approach to compressive search, i.e., a way to +perform $k$-NN and $\rho$-NN search on compressed data while only decompressing +a small, relevant, portion of the data. + panCAKES assumes the manifold hypothesis and leverages the low-dimensional +structure of the data to compress and search it efficiently. + panCAKES is generic over any distance function for which the distance between +two points is proportional to the memory cost of storing an encoding of one in +terms of the other. + This property holds for many widely-used distance functions, e.g. string edit +distances (Levenshtein, Needleman-Wunsch, etc.) and set dissimilarity measures +(Jaccard, Dice, etc.). + We benchmark panCAKES on a variety of datasets, including genomic, proteomic, +and set data. + We compare compression ratios to gzip, and search performance between the +compressed and uncompressed versions of the same dataset. + panCAKES achieves compression ratios close to those of gzip, while offering +sub-linear time performance for $k$-NN and $\rho$-NN search. + We conclude that panCAKES is an efficient, general-purpose algorithm for +exact compressive search on large datasets that obey the manifold hypothesis. + We provide an open-source implementation of panCAKES in the Rust programming +language. + +
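+ panCAKES itself is implemented in Rust and relies on the manifold hypothesis; the Python toy below is only a rough illustration of the compressive-search idea it describes, where points are stored compressed per cluster and only clusters whose centers are close to the query are decompressed at query time (the clustering and zlib compression here are hypothetical simplifications, not the paper's encoding scheme).
```python
import zlib
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic edit-distance DP, one of the distances panCAKES supports."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class CompressiveIndex:
    """Toy compressive k-NN index: each point is stored compressed in the
    cluster of its nearest center; search decompresses only nearby clusters."""
    def __init__(self, centers, points):
        self.centers = centers
        self.blobs = defaultdict(list)
        for p in points:
            center = min(centers, key=lambda c: levenshtein(p, c))
            self.blobs[center].append(zlib.compress(p.encode()))

    def knn(self, query, k=1, probe=1):
        near = sorted(self.centers, key=lambda c: levenshtein(query, c))[:probe]
        cands = [zlib.decompress(b).decode() for c in near for b in self.blobs[c]]
        return sorted(cands, key=lambda p: levenshtein(query, p))[:k]

idx = CompressiveIndex(["kitten", "flaw"], ["mitten", "sitting", "law", "flaws"])
print(idx.knn("fitting", k=2))
```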
+
+
+
+
+ + ♻ ☆ CPRM: A LLM-based Continual Pre-training Framework for Relevance + Modeling in Commercial Search + + +
+ Relevance modeling between queries and items stands as a pivotal component in commercial search engines, directly affecting the user experience. Given the remarkable achievements of large language models (LLMs) in various natural language processing (NLP) tasks, LLM-based relevance modeling is gradually being adopted within industrial search systems. Nevertheless, foundational LLMs lack domain-specific knowledge and do not fully exploit the potential of in-context learning. Furthermore, structured item text remains underutilized, and there is a shortage of corresponding queries and background knowledge. We thereby propose CPRM (Continual Pre-training for Relevance Modeling), a framework designed for the continual pre-training of LLMs to address these issues. Our CPRM framework includes three modules: 1) jointly pre-training on both queries and multi-field item text to enhance domain knowledge, 2) applying in-context pre-training, a novel approach where LLMs are pre-trained on a sequence of related queries or items, and 3) conducting reading comprehension on items to produce associated domain knowledge and background information (e.g., generating summaries and corresponding queries) to further strengthen LLMs. Results from offline experiments and online A/B testing demonstrate that our model achieves convincing performance compared to strong baselines. +
+
+
+
+
+
+
+
+ + Multimedia 4 + +
+
+
+ + ☆ AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand + Audio-Visual Information? + + +
+ Recently, multimodal large language models (MLLMs), such as GPT-4o, Gemini +1.5 Pro, and Reka Core, have expanded their capabilities to include vision and +audio modalities. While these models demonstrate impressive performance across +a wide range of audio-visual applications, our proposed DeafTest reveals that +MLLMs often struggle with simple tasks humans find trivial: 1) determining +which of two sounds is louder, and 2) determining which of two sounds has a +higher pitch. Motivated by these observations, we introduce AV-Odyssey Bench, a +comprehensive audio-visual benchmark designed to assess whether those MLLMs can +truly understand the audio-visual information. This benchmark encompasses 4,555 +carefully crafted problems, each incorporating text, visual, and audio +components. To successfully infer answers, models must effectively leverage +clues from both visual and audio inputs. To ensure precise and objective +evaluation of MLLM responses, we have structured the questions as +multiple-choice, eliminating the need for human evaluation or LLM-assisted +assessment. We benchmark a series of closed-source and open-source models and +summarize the observations. By revealing the limitations of current models, we +aim to provide useful insight for future dataset collection and model +development. + +
+
+ comment: Project page: https://av-odyssey.github.io/ +
+
+
+
+
+ + ☆ Copy-Move Forgery Detection and Question Answering for Remote Sensing + Image + + +
+ This paper introduces the task of Remote Sensing Copy-Move Question Answering (RSCMQA). Unlike traditional Remote Sensing Visual Question Answering (RSVQA), RSCMQA focuses on interpreting complex tampering scenarios and inferring relationships between objects. Based on the practical needs of national defense security and land resource monitoring, we have developed an accurate and comprehensive global dataset for remote sensing image copy-move question answering, named RS-CMQA-2.1M. These images were collected from 29 different regions across 14 countries. Additionally, we have refined a balanced dataset, RS-CMQA-B, to address the long-standing issue of long-tail data in the remote sensing field. Furthermore, we propose a region-discriminative guided multimodal CMQA model, which enhances the accuracy of answering questions about tampered images by leveraging prompts about the differences and connections between the source and tampered domains. Extensive experiments demonstrate that our method provides a stronger benchmark for RS-CMQA compared to general VQA and RSVQA models. Our dataset and code are available at https://github.com/shenyedepisa/RSCMQA. +
+
+ comment: 7 figs, 7 tables +
+
+
+
+
+ + ☆ It Takes Two: Real-time Co-Speech Two-person's Interaction Generation + via Reactive Auto-regressive Diffusion Model + + +
+ Conversational scenarios are very common in real-world settings, yet existing co-speech motion synthesis approaches often fall short in these contexts, where one person's audio and gestures will influence the other's responses. Additionally, most existing methods rely on offline sequence-to-sequence frameworks, which are unsuitable for online applications. In this work, we introduce an audio-driven, auto-regressive system designed to synthesize dynamic movements for two characters during a conversation. At the core of our approach is a diffusion-based full-body motion synthesis model, which is conditioned on the past states of both characters, speech audio, and a task-oriented motion trajectory input, allowing for flexible spatial control. To enhance the model's ability to learn diverse interactions, we have enriched existing two-person conversational motion datasets with more dynamic and interactive motions. We evaluate our system through multiple experiments, showing that it outperforms prior methods across a variety of tasks, including single-person and two-person co-speech motion generation, as well as interactive motion generation. To the best of our knowledge, this is the first system capable of generating interactive full-body motions for two characters from speech in an online manner. +
+
+ comment: 15 pages, 10 figures +
+
+
+
+
+ + ♻ ☆ Resource-Efficient Reference-Free Evaluation of Audio Captions + + +
+ To establish the trustworthiness of systems that automatically generate text +captions for audio, images and video, existing reference-free metrics rely on +large pretrained models which are impractical to accommodate in +resource-constrained settings. To address this, we propose some metrics to +elicit the model's confidence in its own generation. To assess how well these +metrics replace correctness measures that leverage reference captions, we test +their calibration with correctness measures. We discuss why some of these +confidence metrics align better with certain correctness measures. Further, we +provide insight into why temperature scaling of confidence metrics is +effective. Our main contribution is a suite of well-calibrated lightweight +confidence metrics for reference-free evaluation of captions in +resource-constrained settings. + +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Information Retrieval 15 + +
+
+
+ + ☆ Improving feature interactions at Pinterest under industry constraints + + +
+ Adopting advances in recommendation systems is often challenging in +industrial settings due to unique constraints. This paper aims to highlight +these constraints through the lens of feature interactions. Feature +interactions are critical for accurately predicting user behavior in +recommendation systems and online advertising. Despite numerous novel +techniques showing superior performance on benchmark datasets like Criteo, +their direct application in industrial settings is hindered by constraints such +as model latency, GPU memory limitations and model reproducibility. In this +paper, we share our learnings from improving feature interactions in +Pinterest's Homefeed ranking model under such constraints. We provide details +about the specific challenges encountered, the strategies employed to address +them, and the trade-offs made to balance performance with practical +limitations. Additionally, we present a set of learning experiments that help +guide the feature interaction architecture selection. We believe these insights +will be useful for engineers who are interested in improving their model +through better feature interaction learning. + +
+
+
+
+
+ + ☆ FGATT: A Robust Framework for Wireless Data Imputation Using Fuzzy Graph + Attention Networks and Transformer Encoders + + +
+ Missing data is a pervasive challenge in wireless networks and many other +domains, often compromising the performance of machine learning and deep +learning models. To address this, we propose a novel framework, FGATT, that +combines the Fuzzy Graph Attention Network (FGAT) with the Transformer encoder +to perform robust and accurate data imputation. FGAT leverages fuzzy rough sets +and graph attention mechanisms to capture spatial dependencies dynamically, +even in scenarios where predefined spatial information is unavailable. The +Transformer encoder is employed to model temporal dependencies, utilizing its +self-attention mechanism to focus on significant time-series patterns. A +self-adaptive graph construction method is introduced to enable dynamic +connectivity learning, ensuring the framework's applicability to a wide range +of wireless datasets. Extensive experiments demonstrate that our approach +outperforms state-of-the-art methods in imputation accuracy and robustness, +particularly in scenarios with substantial missing data. The proposed model is +well-suited for applications in wireless sensor networks and IoT environments, +where data integrity is critical. + +
+
+
+
+
+ + ☆ Down with the Hierarchy: The 'H' in HNSW Stands for "Hubs" + + +
+ Driven by recent breakthrough advances in neural representation learning, approximate near-neighbor (ANN) search over vector embeddings has emerged as a critical computational workload. With the introduction of the seminal Hierarchical Navigable Small World (HNSW) algorithm, graph-based indexes have established themselves as the overwhelmingly dominant paradigm for efficient and scalable ANN search. As the name suggests, HNSW searches a layered hierarchical graph to quickly identify neighborhoods of points similar to a given query vector. But is this hierarchy even necessary? A rigorous experimental analysis to answer this question would provide valuable insights into the nature of algorithm design for ANN search and motivate directions for future work in this increasingly crucial domain. To that end, we conduct an extensive benchmarking study covering more large-scale datasets than prior investigations of this question. We ultimately find that a flat graph retains all of the benefits of HNSW on high-dimensional datasets, with latency and recall performance essentially identical to the original algorithm but with lower memory overhead. Furthermore, we go a step further and study why the hierarchy of HNSW provides no benefit in high dimensions, hypothesizing that navigable small world graphs contain a well-connected, frequently traversed "highway" of hub nodes that maintains the same purported function as the hierarchical layers. We present compelling empirical evidence that this Hub Highway Hypothesis holds for real datasets and investigate the mechanisms by which the highway forms. The implications of this hypothesis may also provide future research directions for developing enhancements to graph-based ANN search. +
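+ For intuition, the sketch below shows the greedy best-first search that both HNSW and a single flat navigable-small-world layer ultimately rely on; the graph, vectors, and parameters are toy placeholders, and the paper's finding is that running such a search over one flat layer matches hierarchical HNSW in high dimensions.
```python
import heapq
import numpy as np

def flat_graph_search(graph, vectors, query, entry, ef=3):
    """Best-first search over a flat proximity graph: expand the closest
    unexplored node until no neighbor improves the current top-ef set."""
    dist = lambda i: float(np.linalg.norm(vectors[i] - query))
    visited = {entry}
    candidates = [(dist(entry), entry)]   # min-heap of frontier nodes
    best = [(-dist(entry), entry)]        # max-heap (negated) of top-ef results
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break                         # frontier is worse than every kept result
        for nbr in graph[node]:
            if nbr in visited:
                continue
            visited.add(nbr)
            dn = dist(nbr)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nbr))
                heapq.heappush(best, (-dn, nbr))
                if len(best) > ef:
                    heapq.heappop(best)
    return sorted((-negd, n) for negd, n in best)

vecs = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [5, 6], [6, 5]], dtype=float)
graph = {0: [1, 2, 3], 1: [0, 3], 2: [0, 4], 3: [0, 1, 4, 5], 4: [2, 3, 5], 5: [3, 4]}
print(flat_graph_search(graph, vecs, np.array([5.2, 5.1]), entry=0))
```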
+
+ comment: 10 pages +
+
+
+
+
+ + ☆ Using Large Language Models in Automatic Hint Ranking and Generation + Tasks + + +
+ The use of Large Language Models (LLMs) has increased significantly recently, +with individuals frequently interacting with chatbots to receive answers to a +wide range of questions. In an era where information is readily accessible, it +is crucial to stimulate and preserve human cognitive abilities and maintain +strong reasoning skills. This paper addresses such challenges by promoting the +use of hints as an alternative or a supplement to direct answers. We first +introduce a manually constructed hint dataset, WIKIHINT, which includes 5,000 +hints created for 1,000 questions. We then finetune open-source LLMs such as +LLaMA-3.1 for hint generation in answer-aware and answer-agnostic contexts. We +assess the effectiveness of the hints with human participants who try to answer +questions with and without the aid of hints. Additionally, we introduce a +lightweight evaluation method, HINTRANK, to evaluate and rank hints in both +answer-aware and answer-agnostic settings. Our findings show that (a) the +dataset helps generate more effective hints, (b) including answer information +along with questions generally improves hint quality, and (c) encoder-based +models perform better than decoder-based models in hint ranking. + +
+
+
+
+
+ + ☆ Multi-Facet Blending for Faceted Query-by-Example Retrieval + + +
+ With the growing demand to fit fine-grained user intents, faceted +query-by-example (QBE), which retrieves similar documents conditioned on +specific facets, has gained recent attention. However, prior approaches mainly +depend on document-level comparisons using basic indicators like citations due +to the lack of facet-level relevance datasets; yet, this limits their use to +citation-based domains and fails to capture the intricacies of facet +constraints. In this paper, we propose a multi-facet blending (FaBle) +augmentation method, which exploits modularity by decomposing and recomposing +to explicitly synthesize facet-specific training sets. We automatically +decompose documents into facet units and generate (ir)relevant pairs by +leveraging LLMs' intrinsic distinguishing capabilities; then, dynamically +recomposing the units leads to facet-wise relevance-informed document pairs. +Our modularization eliminates the need for pre-defined facet knowledge or +labels. Further, to prove the FaBle's efficacy in a new domain beyond +citation-based scientific paper retrieval, we release a benchmark dataset for +educational exam item QBE. FaBle augmentation on 1K documents remarkably +assists training in obtaining facet conditional embeddings. + +
+
+
+
+
+ + ☆ Global Estimation of Building-Integrated Facade and Rooftop Photovoltaic + Potential by Integrating 3D Building Footprint and Spatio-Temporal Datasets + + +
+ This research tackles the challenges of estimating Building-Integrated Photovoltaics (BIPV) potential across various temporal and spatial scales, accounting for different geographical climates and urban morphology. We introduce a holistic methodology for evaluating BIPV potential, integrating 3D building footprint models with diverse meteorological data sources to account for dynamic shadow effects. The approach enables the assessment of PV potential on facades and rooftops at different levels: individual buildings, urban blocks, and entire cities worldwide. Through an analysis of 120 typical cities, we highlight the importance of 3D building forms, cityscape morphology, and geographic positioning in measuring BIPV potential at various levels. In particular, our simulation study reveals that among cities with optimal facade PV performance, the average ratio of facade PV potential to rooftop PV potential is approximately 68.2%. Additionally, approximately 17.5% of the analyzed samples demonstrate even higher facade PV potential compared to rooftop installations. This finding underscores the strategic value of incorporating facade PV applications into urban sustainable energy systems. +
+
+ comment: 17 pages, 5 figures +
+
+
+
+
+ + ☆ Learning Smooth Distance Functions via Queries + + +
+ In this work, we investigate the problem of learning distance functions +within the query-based learning framework, where a learner is able to pose +triplet queries of the form: ``Is $x_i$ closer to $x_j$ or $x_k$?'' We +establish formal guarantees on the query complexity required to learn smooth, +but otherwise general, distance functions under two notions of approximation: +$\omega$-additive approximation and $(1 + \omega)$-multiplicative +approximation. For the additive approximation, we propose a global method whose +query complexity is quadratic in the size of a finite cover of the sample +space. For the (stronger) multiplicative approximation, we introduce a method +that combines global and local approaches, utilizing multiple Mahalanobis +distance functions to capture local geometry. This method has a query +complexity that scales quadratically with both the size of the cover and the +ambient space dimension of the sample space. + +
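+ A minimal sketch of the query model (not the paper's learning algorithm): the learner can only probe an unknown distance through triplet comparisons, and the global method's budget is quadratic in the size of a finite cover of the sample space; the hidden Mahalanobis distance and grid cover below are purely illustrative.
```python
import numpy as np

def triplet_oracle(dist, x_i, x_j, x_k):
    """Answers the learner's query: 'Is x_i closer to x_j or to x_k?'"""
    return "j" if dist(x_i, x_j) <= dist(x_i, x_k) else "k"

# A hidden (diagonal Mahalanobis) distance the learner never sees directly.
A = np.diag([3.0, 1.0])
hidden = lambda u, v: float(np.sqrt((u - v) @ A @ (u - v)))

# A finite cover of [-1, 1]^2; the global method's query budget scales
# quadratically with the cover size.
cover = np.array([[x, y] for x in np.linspace(-1, 1, 5) for y in np.linspace(-1, 1, 5)])
print("cover size:", len(cover), "-> query budget on the order of", len(cover) ** 2)
print(triplet_oracle(hidden, cover[0], cover[1], cover[5]))
```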
+
+ comment: 40 pages, 1 figure +
+
+
+
+
+ + ☆ Lossless and Privacy-Preserving Graph Convolution Network for Federated + Item Recommendation + + +
+ Graph neural network (GNN) has emerged as a state-of-the-art solution for +item recommendation. However, existing GNN-based recommendation methods rely on +a centralized storage of fragmented user-item interaction sub-graphs and +training on an aggregated global graph, which will lead to privacy concerns. As +a response, some recent works develop GNN-based federated recommendation +methods by exploiting decentralized and fragmented user-item sub-graphs in +order to preserve user privacy. However, due to privacy constraints, the graph +convolution process in existing federated recommendation methods is incomplete +compared with the centralized counterpart, causing a degradation of the +recommendation performance. In this paper, we propose a novel lossless and +privacy-preserving graph convolution network (LP-GCN), which fully completes +the graph convolution process with decentralized user-item interaction +sub-graphs while ensuring privacy. It is worth mentioning that its performance +is equivalent to that of the non-federated (i.e., centralized) counterpart. +Moreover, we validate its effectiveness through both theoretical analysis and +empirical studies. Extensive experiments on three real-world datasets show that +our LP-GCN outperforms the existing federated recommendation methods. The code +will be publicly available once the paper is accepted. + +
+
+
+
+
+ + ☆ Precision Profile Pollution Attack on Sequential Recommenders via + Influence Function + + +
+ Sequential recommendation approaches have demonstrated remarkable proficiency in modeling user preferences. Nevertheless, they are susceptible to profile pollution attacks (PPA), wherein items are deliberately introduced into a user's interaction history to influence the recommendation list. Since retraining the model for each polluted item is time-consuming, recent PPAs estimate item influence based on gradient directions to identify the most effective attack candidates. However, the actual item representations diverge significantly from the gradients, resulting in disparate outcomes. To tackle this challenge, we introduce an INFluence Function-based Attack approach, INFAttack, which offers a more accurate estimation of the influence of polluting items. Specifically, we calculate the modifications to the original model using the influence function when generating polluted sequences by introducing specific items. Subsequently, we choose the sequence that has been most significantly influenced to substitute the original sequence, thus promoting the target item. Comprehensive experiments conducted on five real-world datasets illustrate that INFAttack surpasses all baseline methods and consistently delivers stable attack performance for both popular and unpopular items. +
+
+
+
+
+ + ☆ Automated Extraction of Acronym-Expansion Pairs from Scientific Papers + + +
+ This project addresses challenges posed by the widespread use of +abbreviations and acronyms in digital texts. We propose a novel method that +combines document preprocessing, regular expressions, and a large language +model to identify abbreviations and map them to their corresponding expansions. +The regular expressions alone are often insufficient to extract expansions, at +which point our approach leverages GPT-4 to analyze the text surrounding the +acronyms. By limiting the analysis to only a small portion of the surrounding +text, we mitigate the risk of obtaining incorrect or multiple expansions for an +acronym. There are several known challenges in processing text with acronyms, +including polysemous acronyms, non-local and ambiguous acronyms. Our approach +enhances the precision and efficiency of NLP techniques by addressing these +issues with automated acronym identification and disambiguation. This study +highlights the challenges of working with PDF files and the importance of +document preprocessing. Furthermore, the results of this work show that neither +regular expressions nor GPT-4 alone can perform well. Regular expressions are +suitable for identifying acronyms but have limitations in finding their +expansions within the paper due to a variety of formats used for expressing +acronym-expansion pairs and the tendency of authors to omit expansions within +the text. GPT-4, on the other hand, is an excellent tool for obtaining +expansions but struggles with correctly identifying all relevant acronyms. +Additionally, GPT-4 poses challenges due to its probabilistic nature, which may +lead to slightly different results for the same input. Our algorithm employs +preprocessing to eliminate irrelevant information from the text, regular +expressions for identifying acronyms, and a large language model to help find +acronym expansions to provide the most accurate and consistent results. + +
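+ A minimal sketch of the described pipeline: a regex pass first looks for acronym-expansion pairs spelled out in the text, and an LLM is consulted only on a small window around acronyms whose expansions the regexes miss; the regexes, heuristics, and the llm callable standing in for a GPT-4 API call are illustrative assumptions, not the project's exact implementation.
```python
import re

ACRONYM_RE = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")       # e.g. NLP, GPT4
DEFINITION_RE = re.compile(r"\(([A-Z][A-Z0-9]{1,9})\)")   # "... (NLP)" style

def find_acronym_expansions(text, llm=None, window=300):
    """Regex pass first; fall back to an LLM on the surrounding window when
    an expansion is not spelled out next to the acronym. `llm` is a
    hypothetical callable(prompt) -> str standing in for a GPT-4 call."""
    pairs = {}
    for m in DEFINITION_RE.finditer(text):
        acro = m.group(1)
        # Heuristic: the preceding words whose initials spell the acronym.
        words = text[:m.start()].split()[-len(acro):]
        if "".join(w[0].upper() for w in words) == acro:
            pairs[acro] = " ".join(words)
    for acro in set(ACRONYM_RE.findall(text)) - set(pairs):
        if llm is not None:
            start = max(0, text.find(acro) - window)
            context = text[start:text.find(acro) + window]
            pairs[acro] = llm(f"Expand the acronym {acro} using only this text:\n{context}")
    return pairs

sample = "Natural Language Processing (NLP) tools parse text. NLP and OCR differ."
print(find_acronym_expansions(sample))  # {'NLP': 'Natural Language Processing'}
```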
+
+ comment: 9 pages, 1 figure +
+
+
+
+
+ + ☆ e-Fold Cross-Validation for Recommender-System Evaluation + + +
+ To combat the rising energy consumption of recommender systems, we implement a novel alternative to k-fold cross-validation. This alternative, named e-fold cross-validation, aims to minimize the number of folds to reduce power usage while keeping the reliability and robustness of the test results high. We tested our method on 5 recommender system algorithms across 6 datasets and compared it with 10-fold cross-validation. On average, e-fold cross-validation needed only 41.5% of the energy that 10-fold cross-validation would need, while its results differed by only 1.81%. We conclude that e-fold cross-validation is a promising approach with the potential to be an energy-efficient yet still reliable alternative to k-fold cross-validation. +
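+ The abstract does not spell out the exact stopping rule of e-fold cross-validation; the sketch below therefore assumes a simple convergence criterion on the running mean score, purely to illustrate how evaluating fold by fold and stopping early can save energy relative to a fixed 10 folds.
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def e_fold_cv(model, X, y, max_folds=10, tol=0.01, patience=2):
    """Evaluate fold by fold and stop once the running mean score has changed
    by less than `tol` for `patience` consecutive folds (assumed criterion)."""
    scores, stable = [], 0
    for train, test in KFold(n_splits=max_folds, shuffle=True, random_state=0).split(X):
        model.fit(X[train], y[train])
        scores.append(accuracy_score(y[test], model.predict(X[test])))
        if len(scores) >= 2:
            change = abs(np.mean(scores) - np.mean(scores[:-1]))
            stable = stable + 1 if change < tol else 0
            if stable >= patience:
                break
    return float(np.mean(scores)), len(scores)  # score and folds actually run

X, y = make_classification(n_samples=500, random_state=0)
print(e_fold_cv(LogisticRegression(max_iter=1000), X, y))
```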
+
+ comment: This preprint has not undergone peer review (when applicable) or any + post-submission improvements or corrections. The Version of Record of this + contribution is published in [TBA], and is available online at [TBA] +
+
+
+
+
+ + ♻ ☆ Using text embedding models as text classifiers with medical data + + +
+ The advent of Large Language Models (LLMs) is promising, and LLMs have been applied to numerous fields. However, it is not trivial to implement LLMs in the medical field due to the high standards for precision and accuracy. Currently, the diagnosis of medical ailments must be done by hand, as it is costly to build a sufficiently broad LLM that can diagnose a wide range of diseases. Here, we explore the use of vector databases and embedding models as a means of encoding and classifying medical text data without the need to train a new model altogether. We used various LLMs to generate the medical data, then encoded the data with a text embedding model and stored it in a vector database. We hypothesized that higher embedding dimensions coupled with descriptive data in the vector database would lead to better classifications, and we designed a robustness test to verify this hypothesis. By using vector databases and text embedding models to classify a clinician's notes on a patient presenting with a certain ailment, we showed that these tools can be successful at classifying medical text data. We found that a higher embedding dimension did indeed yield better results; however, querying with simple data in the database was optimal for performance. We have shown in this study the applicability of text embedding models and vector databases on a small scale, and our work lays the groundwork for applying these tools on a larger scale. +
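+ A minimal sketch of the described setup: labeled reference notes are embedded and stored, and a new clinician's note is classified by nearest-neighbor lookup over those embeddings; the hashed bag-of-words embedder below is a crude stand-in for the study's text embedding model and vector database.
```python
import numpy as np

def embed(texts, dim=64):
    """Stand-in embedder: hashed bag-of-words, L2-normalized. A real text
    embedding model would be used in place of this."""
    vecs = np.zeros((len(texts), dim))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            vecs[i, hash(tok) % dim] += 1.0
    norms = np.clip(np.linalg.norm(vecs, axis=1, keepdims=True), 1e-9, None)
    return vecs / norms

# "Vector database": labeled reference notes stored as embeddings.
references = {
    "persistent cough and fever": "respiratory infection",
    "chest pain radiating to the left arm": "cardiac",
    "itchy rash after new detergent": "dermatological",
}
ref_vecs = embed(list(references))
labels = list(references.values())

def classify(note):
    sims = embed([note]) @ ref_vecs.T   # cosine similarity to every reference
    return labels[int(np.argmax(sims))]

print(classify("patient reports fever with a persistent cough"))
```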
+
+ comment: 15 pages, 6 figures +
+
+
+
+
+ + ♻ ☆ RIRAG: Regulatory Information Retrieval and Answer Generation + + +
+ Regulatory documents, issued by governmental regulatory bodies, establish +rules, guidelines, and standards that organizations must adhere to for legal +compliance. These documents, characterized by their length, complexity and +frequent updates, are challenging to interpret, requiring significant +allocation of time and expertise on the part of organizations to ensure ongoing +compliance. Regulatory Natural Language Processing (RegNLP) is a +multidisciplinary field aimed at simplifying access to and interpretation of +regulatory rules and obligations. We introduce a task of generating +question-passages pairs, where questions are automatically created and paired +with relevant regulatory passages, facilitating the development of regulatory +question-answering systems. We create the ObliQA dataset, containing 27,869 +questions derived from the collection of Abu Dhabi Global Markets (ADGM) +financial regulation documents, design a baseline Regulatory Information +Retrieval and Answer Generation (RIRAG) system and evaluate it with RePASs, a +novel evaluation metric that tests whether generated answers accurately capture +all relevant obligations while avoiding contradictions. + +
+
+
+
+
+ + ♻ ☆ Unifying Multimodal Retrieval via Document Screenshot Embedding EMNLP2024 + + +
+ In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, error-prone, and lossy. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which requires no content extraction preprocessing and preserves all the information in a document (e.g., text, images, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released. +
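+ DSE encodes queries and document screenshots with a large vision-language model; the sketch below only illustrates the dense dot-product retrieval step over screenshot embeddings, using a hypothetical stand-in encoder (random projection of a downsampled image) rather than the actual model.
```python
import numpy as np

class ScreenshotEncoder:
    """Hypothetical stand-in for DSE's vision-language encoder: maps a
    screenshot (H x W x 3 array) to one dense, unit-norm vector."""
    def __init__(self, dim=128, seed=0):
        self.proj = np.random.default_rng(seed).normal(size=(3 * 32 * 32, dim))

    def encode(self, image):
        h, w, _ = image.shape
        ys = np.linspace(0, h - 1, 32).astype(int)   # downsample to a 32x32 grid
        xs = np.linspace(0, w - 1, 32).astype(int)
        patch = image[np.ix_(ys, xs)].reshape(-1) / 255.0
        v = patch @ self.proj
        return v / np.linalg.norm(v)

enc = ScreenshotEncoder()
corpus = [np.random.default_rng(i).integers(0, 256, size=(480, 640, 3)) for i in range(5)]
index = np.stack([enc.encode(img) for img in corpus])   # dense screenshot index
query_vec = enc.encode(corpus[2])                        # query encoded the same way
print(np.argsort(-(index @ query_vec))[:3])              # top-3 documents by dot product
```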
+
+ comment: EMNLP2024 main +
+
+
+
+
+ + ♻ ☆ Unveiling and Mitigating Bias in Large Language Model Recommendations: A + Path to Fairness + + +
+ Large Language Model (LLM)-based recommendation systems excel in delivering comprehensive suggestions by deeply analyzing content and user behavior. However, they often inherit biases from skewed training data, favoring mainstream content while underrepresenting diverse or non-traditional options. This study explores the interplay between bias and LLM-based recommendation systems, focusing on music, song, and book recommendations across diverse demographic and cultural groups. This paper analyzes bias in LLM-based recommendation systems across multiple models (GPT, LLaMA, and Gemini), revealing its deep and pervasive impact on outcomes. Intersecting identities and contextual factors, like socioeconomic status, further amplify biases, complicating fair recommendations across diverse groups. Our findings reveal that bias in these systems is deeply ingrained, yet even simple interventions like prompt engineering can significantly reduce it. We further propose a retrieval-augmented generation strategy to mitigate bias more effectively. Numerical experiments validate these strategies, demonstrating both the pervasive nature of bias and the impact of the proposed solutions. +
+
+
+
+
+
+
+
+ + Multimedia 7 + +
+
+
+ + ☆ HybridMQA: Exploring Geometry-Texture Interactions for Colored Mesh + Quality Assessment + + +
+ Mesh quality assessment (MQA) models play a critical role in the design, +optimization, and evaluation of mesh operation systems in a wide variety of +applications. Current MQA models, whether model-based methods using +topology-aware features or projection-based approaches working on rendered 2D +projections, often fail to capture the intricate interactions between texture +and 3D geometry. We introduce HybridMQA, a first-of-its-kind hybrid +full-reference colored MQA framework that integrates model-based and +projection-based approaches, capturing complex interactions between textural +information and 3D structures for enriched quality representations. Our method +employs graph learning to extract detailed 3D representations, which are then +projected to 2D using a novel feature rendering process that precisely aligns +them with colored projections. This enables the exploration of geometry-texture +interactions via cross-attention, producing comprehensive mesh quality +representations. Extensive experiments demonstrate HybridMQA's superior +performance across diverse datasets, highlighting its ability to effectively +leverage geometry-texture interactions for a thorough understanding of mesh +quality. Our implementation will be made publicly available. + +
+
+
+
+
+ + ☆ X-Prompt: Towards Universal In-Context Image Generation in + Auto-Regressive Vision Language Foundation Models + + +
+ In-context generation is a key component of large language models' (LLMs) +open-task generalization capability. By leveraging a few examples as context, +LLMs can perform both in-domain and out-of-domain tasks. Recent advancements in +auto-regressive vision-language models (VLMs) built upon LLMs have showcased +impressive performance in text-to-image generation. However, the potential of +in-context learning for general image generation tasks remains largely +unexplored. To address this, we introduce X-Prompt, a purely auto-regressive +large vision-language model designed to deliver competitive performance across +a wide range of both seen and unseen image generation tasks, all within a +unified in-context learning framework. X-Prompt incorporates a specialized +design that efficiently compresses valuable features from in-context examples, +supporting longer in-context token sequences and improving its ability to +generalize to unseen tasks. A unified training task for both text and image +prediction enables X-Prompt to handle general image generation with enhanced +task awareness from in-context examples. Extensive experiments validate the +model's performance across diverse seen image generation tasks and its capacity +to generalize to previously unseen tasks. + +
+
+ comment: code: https://github.com/SunzeY/X-Prompt +
+
+
+
+
+ + ☆ Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient + Object Detection + + +
+ RGB-Thermal Salient Object Detection aims to pinpoint prominent objects +within aligned pairs of visible and thermal infrared images. Traditional +encoder-decoder architectures, while designed for cross-modality feature +interactions, may not have adequately considered the robustness against noise +originating from defective modalities. Inspired by hierarchical human visual +systems, we propose the ConTriNet, a robust Confluent Triple-Flow Network +employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises +three flows: two modality-specific flows explore cues from RGB and Thermal +modalities, and a third modality-complementary flow integrates cues from both +modalities. ConTriNet presents several notable advantages. It incorporates a +Modality-induced Feature Modulator in the modality-shared union encoder to +minimize inter-modality discrepancies and mitigate the impact of defective +samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in +the separated flows enlarges the receptive field, allowing for the capture of +multi-scale contextual information. Furthermore, a Modality-aware Dynamic +Aggregation Module in the modality-complementary flow dynamically aggregates +saliency-related cues from both modality-specific flows. Leveraging the +proposed parallel triple-flow framework, we further refine saliency maps +derived from different flows through a flow-cooperative fusion strategy, +yielding a high-quality, full-resolution saliency map for the final prediction. +To evaluate the robustness and stability of our approach, we collect a +comprehensive RGB-T SOD benchmark, VT-IMAG, covering various real-world +challenging scenarios. Extensive experiments on public benchmarks and our +VT-IMAG dataset demonstrate that ConTriNet consistently outperforms +state-of-the-art competitors in both common and challenging scenarios. + +
+
+ comment: Accepted by IEEE TPAMI. Project page: + https://cser-tang-hao.github.io/contrinet.html +
+
+
+
+
+ + ☆ Long Video Diffusion Generation with Segmented Cross-Attention and + Content-Rich Video Data Curation + + +
+ We introduce Presto, a novel video diffusion model designed to generate +15-second videos with long-range coherence and rich content. Extending video +generation methods to maintain scenario diversity over long durations presents +significant challenges. To address this, we propose a Segmented Cross-Attention +(SCA) strategy, which splits hidden states into segments along the temporal +dimension, allowing each segment to cross-attend to a corresponding +sub-caption. SCA requires no additional parameters, enabling seamless +incorporation into current DiT-based architectures. To facilitate high-quality +long video generation, we build the LongTake-HD dataset, consisting of 261k +content-rich videos with scenario coherence, annotated with an overall video +caption and five progressive sub-captions. Experiments show that our Presto +achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, +outperforming existing state-of-the-art video generation methods. This +demonstrates that our proposed Presto significantly enhances content richness, +maintains long-range coherence, and captures intricate textual details. More +details are displayed on our project page: https://presto-video.github.io/. + +
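The abstract does not give SCA's implementation details; assuming it reuses a standard cross-attention layer, a minimal sketch of splitting the hidden states along time and letting each segment attend only to its own sub-caption could look like this (all shapes and the segment count are illustrative):

```python
import torch
import torch.nn as nn

batch, frames, dim, n_seg = 2, 20, 64, 5
cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

hidden = torch.randn(batch, frames, dim)          # video token stream
sub_caps = torch.randn(batch, n_seg, 8, dim)      # one 8-token embedding per sub-caption

outputs = []
for seg, chunk in enumerate(hidden.chunk(n_seg, dim=1)):   # split along the temporal axis
    ctx = sub_caps[:, seg]                                  # the matching sub-caption
    out, _ = cross_attn(query=chunk, key=ctx, value=ctx)    # same weights for every segment
    outputs.append(out)

hidden = torch.cat(outputs, dim=1)                # reassemble the full sequence
print(hidden.shape)                               # torch.Size([2, 20, 64])
```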
+
+
+
+
+ + ☆ Neuron Abandoning Attention Flow: Visual Explanation of Dynamics inside + CNN Models + + +
+ In this paper, we present a Neuron Abandoning Attention Flow (NAFlow) method +to address the open problem of visually explaining the attention evolution +dynamics inside CNNs when making their classification decisions. A novel +cascading neuron abandoning back-propagation algorithm is designed to trace the +neurons in all layers of a CNN that are involved in making its prediction, addressing the +problem of significant interference from abandoned neurons. Firstly, a +Neuron Abandoning Back-Propagation (NA-BP) module is proposed to generate +Back-Propagated Feature Maps (BPFM) by using the inverse function of the +intermediate layers of CNN models, on which the neurons not used for +decision-making are abandoned. Meanwhile, the cascading NA-BP modules calculate +the tensors of importance coefficients, which are linearly combined with the +tensors of BPFMs to form the NAFlow. Secondly, to be able to visualize +attention flow for similarity metric-based CNN models, a new channel +contribution weights module is proposed to calculate the importance +coefficients via the Jacobian matrix. The effectiveness of the proposed NAFlow is +validated on nine widely-used CNN models for various tasks of general image +classification, contrastive learning classification, few-shot image +classification, and image retrieval. + +
+
+
+
+
+ + ☆ OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows + + +
+ We introduce OmniFlow, a novel generative model designed for any-to-any +generation tasks such as text-to-image, text-to-audio, and audio-to-image +synthesis. OmniFlow advances the rectified flow (RF) framework used in +text-to-image models to handle the joint distribution of multiple modalities. +It outperforms previous any-to-any models on a wide range of tasks, such as +text-to-image and text-to-audio synthesis. Our work offers three key +contributions: First, we extend RF to a multi-modal setting and introduce a +novel guidance mechanism, enabling users to flexibly control the alignment +between different modalities in the generated outputs. Second, we propose a +novel architecture that extends the text-to-image MMDiT architecture of Stable +Diffusion 3 and enables audio and text generation. The extended modules can be +efficiently pretrained individually and merged with the vanilla text-to-image +MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design +choices of rectified flow transformers for large-scale audio and text +generation, providing valuable insights into optimizing performance across +diverse modalities. The Code will be available at +https://github.com/jacklishufan/OmniFlows. + +
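As background only (not OmniFlow's multi-modal formulation), a rectified-flow training step regresses a velocity field on straight-line interpolations between noise and data; the tiny model and tensor sizes below are arbitrary:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))

x1 = torch.randn(32, 16)           # data samples (e.g. latents)
x0 = torch.randn_like(x1)          # noise
t = torch.rand(32, 1)              # random times in [0, 1]

xt = (1 - t) * x0 + t * x1         # straight-line interpolation
target_velocity = x1 - x0          # constant velocity along that line

pred = model(torch.cat([xt, t], dim=-1))
loss = ((pred - target_velocity) ** 2).mean()
loss.backward()
print(float(loss))
```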
+
+ comment: 12 pages, 14 figures +
+
+
+
+
+ + ♻ ☆ DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with + Diffusion Autoencoder + + +
+ While recent research has made significant progress in speech-driven talking +face generation, the quality of the generated video still lags behind that of +real recordings. One reason for this is the use of handcrafted intermediate +representations like facial landmarks and 3DMM coefficients, which are designed +based on human knowledge and are insufficient to precisely describe facial +movements. Additionally, these methods require an external pretrained model for +extracting these representations, whose performance sets an upper bound on +talking face generation. To address these limitations, we propose a novel +method called DAE-Talker that leverages data-driven latent representations +obtained from a diffusion autoencoder (DAE). DAE contains an image encoder that +encodes an image into a latent vector and a DDIM image decoder that +reconstructs the image from it. We train our DAE on talking face video frames +and then extract their latent representations as the training target for a +Conformer-based speech2latent model. This allows DAE-Talker to synthesize full +video frames and produce natural head movements that align with the content of +speech, rather than relying on a predetermined head pose from a template video. +We also introduce pose modelling in speech2latent for pose controllability. +Additionally, we propose a novel method for generating continuous video frames +with the DDIM image decoder trained on individual frames, eliminating the need +for modelling the joint distribution of consecutive frames directly. Our +experiments show that DAE-Talker outperforms existing popular methods in +lip-sync, video fidelity, and pose naturalness. We also conduct ablation +studies to analyze the effectiveness of the proposed techniques and demonstrate +the pose controllability of DAE-Talker. + +
+
+ comment: Accepted to ACM Multimedia 2023 +
+
+
+
+
+
+
+
+
+ +
+
+
+ + Information Retrieval 12 + +
+
+
+ + ☆ Patent-publication pairs for the detection of knowledge transfer from + research to industry: reducing ambiguities with word embeddings and + references + + +
+ The performance of medical research can be viewed and evaluated not only from +the perspective of publication output, but also from the perspective of +economic exploitability. Patents can represent the exploitation of research +results and thus the transfer of knowledge from research to industry. In this +study, we set out to identify publication-patent pairs in order to use patents +as a proxy for the economic impact of research. To identify these pairs, we +matched scholarly publications and patents by comparing the names of authors +and inventors. To resolve the ambiguities that arise in this name-matching +process, we expanded our approach with two additional filter features, one used +to assess the similarity of text content, the other to identify common +references in the two document types. To evaluate text similarity, we extracted +and transformed technical terms from a medical ontology (MeSH) into numerical +vectors using word embeddings. We then calculated the results of the two +supporting features over an example five-year period. Furthermore, we developed +a statistical procedure which can be used to determine valid patent classes for +the domain of medicine. Our complete data processing pipeline is freely +available, from the raw data of the two document types right through to the +validated publication-patent pairs. + +
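The two disambiguation features described above (embedding similarity of MeSH terms and shared references) might be combined roughly as follows; the embedding lookup, thresholds, and decision rule are invented for illustration and are not the study's pipeline:

```python
import numpy as np

# Pretend word-embedding lookup for MeSH terms (would come from trained embeddings).
embed = {"neoplasms": np.array([0.9, 0.1, 0.0]),
         "immunotherapy": np.array([0.2, 0.8, 0.1])}

def doc_vector(mesh_terms):
    return np.mean([embed[t] for t in mesh_terms], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_valid_pair(pub, pat, sim_threshold=0.8, min_shared_refs=1):
    text_sim = cosine(doc_vector(pub["mesh"]), doc_vector(pat["mesh"]))
    shared_refs = len(pub["refs"] & pat["refs"])
    return text_sim >= sim_threshold or shared_refs >= min_shared_refs

pub = {"mesh": ["neoplasms", "immunotherapy"], "refs": {"doi:10.1000/a"}}
pat = {"mesh": ["immunotherapy"], "refs": {"doi:10.1000/a", "doi:10.1000/b"}}
print(is_valid_pair(pub, pat))
```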
+
+ comment: 16 Pages, 8 figures +
+
+
+
+
+ + ☆ QABISAR: Query-Article Bipartite Interactions for Statutory Article + Retrieval COLING 2025 + + +
+ In this paper, we introduce QABISAR, a novel framework for statutory article +retrieval, to overcome the semantic mismatch problem that arises when each +query-article pair is modeled in isolation, which makes it hard to learn representations that +can effectively capture multi-faceted information. QABISAR leverages bipartite +interactions between queries and articles to capture diverse aspects inherent +in them. Further, we employ knowledge distillation to transfer enriched query +representations from the graph network into the query bi-encoder, to capture +the rich semantics present in the graph representations, despite the absence of +graph-based supervision for unseen queries during inference. Our experiments on +a real-world expert-annotated dataset demonstrate its effectiveness. + +
+
+ comment: Accepted to COLING 2025 +
+
+
+
+
+ + ☆ Oracle-guided Dynamic User Preference Modeling for Sequential + Recommendation + + +
+ Sequential recommendation methods can capture dynamic user preferences from +user historical interactions to achieve better performance. However, most +existing methods only use past information extracted from user historical +interactions to train the models, leading to deviations in user preference +modeling. Besides past information, future information is also available during +training, which contains the ``oracle'' user preferences in the future and is +beneficial for modeling dynamic user preferences. Therefore, we propose an +oracle-guided dynamic user preference modeling method for sequential +recommendation (Oracle4Rec), which leverages future information to guide model +training on past information, aiming to learn ``forward-looking'' models. +Specifically, Oracle4Rec first extracts past and future information through two +separate encoders, then learns a forward-looking model through an +oracle-guiding module which minimizes the discrepancy between past and future +information. We also tailor a two-phase model training strategy to make the +guiding more effective. Extensive experiments demonstrate that Oracle4Rec is +superior to state-of-the-art sequential methods. Further experiments show that +Oracle4Rec can be leveraged as a generic module in other sequential +recommendation methods to improve their performance by a considerable margin. + +
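A minimal reading of the oracle-guiding idea (two encoders plus a term that pulls the past representation toward the future one) is sketched below; the encoder choice, loss weighting, and stop-gradient are assumptions rather than the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_items = 32, 1000
past_enc = nn.GRU(dim, dim, batch_first=True)
future_enc = nn.GRU(dim, dim, batch_first=True)
item_emb = nn.Embedding(n_items, dim)
predict = nn.Linear(dim, n_items)

past = item_emb(torch.randint(0, n_items, (8, 10)))    # past interactions
future = item_emb(torch.randint(0, n_items, (8, 5)))   # future interactions (training only)
next_item = torch.randint(0, n_items, (8,))

_, h_past = past_enc(past)
_, h_future = future_enc(future)

rec_loss = F.cross_entropy(predict(h_past[-1]), next_item)
guide_loss = F.mse_loss(h_past[-1], h_future[-1].detach())   # pull past toward the "oracle"
loss = rec_loss + 0.1 * guide_loss                           # 0.1 is an arbitrary weight
loss.backward()
print(float(loss))
```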
+
+
+
+
+ + ☆ Scaling New Frontiers: Insights into Large Recommendation Models + + +
+ Recommendation systems are essential for filtering data and retrieving +relevant information across various applications. Recent advancements have seen +these systems incorporate increasingly large embedding tables, scaling up to +tens of terabytes for industrial use. However, the expansion of network +parameters in traditional recommendation models has plateaued at tens of +millions, limiting further benefits from increased embedding parameters. +Inspired by the success of large language models (LLMs), a new approach has +emerged that scales network parameters using innovative structures, enabling +continued performance improvements. A significant development in this area is +Meta's generative recommendation model HSTU, which illustrates the scaling laws +of recommendation systems by expanding parameters to thousands of billions. +This new paradigm has achieved substantial performance gains in online +experiments. In this paper, we aim to enhance the understanding of scaling laws +by conducting comprehensive evaluations of large recommendation models. +Firstly, we investigate the scaling laws across different backbone +architectures of the large recommendation models. Secondly, we conduct +comprehensive ablation studies to explore the origins of these scaling laws. We +then further assess the performance of HSTU, as the representative of large +recommendation models, on complex user behavior modeling tasks to evaluate its +applicability. Notably, we also analyze its effectiveness in ranking tasks for +the first time. Finally, we offer insights into future directions for large +recommendation models. Supplementary materials for our research are available +on GitHub at https://github.com/USTC-StarTeam/Large-Recommendation-Models. + +
+
+
+
+
+ + ☆ Improving Vietnamese Legal Document Retrieval using Synthetic Data + + +
+ In the field of legal information retrieval, effective embedding-based models +are essential for accurate question-answering systems. However, the scarcity of +large annotated datasets poses a significant challenge, particularly for +Vietnamese legal texts. To address this issue, we propose a novel approach that +leverages large language models to generate high-quality, diverse synthetic +queries for Vietnamese legal passages. This synthetic data is then used to +pre-train retrieval models, specifically bi-encoder and ColBERT, which are +further fine-tuned using contrastive loss with mined hard negatives. Our +experiments demonstrate that these enhancements lead to strong improvement in +retrieval accuracy, validating the effectiveness of synthetic data and +pre-training techniques in overcoming the limitations posed by the lack of +large labeled datasets in the Vietnamese legal domain. + +
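The fine-tuning step described (contrastive loss with mined hard negatives for a bi-encoder) commonly reduces to an InfoNCE-style objective in which each query scores its positive passage against the other positives and the explicit hard negatives; a generic sketch with random tensors standing in for encoder outputs:

```python
import torch
import torch.nn.functional as F

batch, dim = 4, 32
q = F.normalize(torch.randn(batch, dim), dim=-1)          # query embeddings
pos = F.normalize(torch.randn(batch, dim), dim=-1)        # positive passages
hard_neg = F.normalize(torch.randn(batch, dim), dim=-1)   # mined hard negatives

candidates = torch.cat([pos, hard_neg], dim=0)            # (2 * batch, dim)
logits = q @ candidates.T / 0.05                          # temperature 0.05 (arbitrary)
labels = torch.arange(batch)                              # query i matches candidate i

loss = F.cross_entropy(logits, labels)
print(float(loss))
```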
+
+
+
+
+ + ☆ Needle: A Generative-AI Powered Monte Carlo Method for Answering Complex + Natural Language Queries on Multi-modal Data + + +
+ Multi-modal data, such as image data sets, often miss the detailed +descriptions that properly capture the rich information encoded in them. This +makes answering complex natural language queries a major challenge in these +domains. In particular, unlike the traditional nearest-neighbor search, where +the tuples and the query are modeled as points in a data cube, the query and +the tuples are of different natures, making the traditional query answering +solutions not directly applicable for such settings. Existing literature +addresses this challenge for image data through vector representations jointly +trained on natural language and images. This technique, however, underperforms +for complex queries due to various reasons. + This paper takes a step towards addressing this challenge by introducing a +Generative-AI (GenAI) powered Monte Carlo method that utilizes foundation +models to generate synthetic samples that capture the complexity of the natural +language query and transform it to the same space of the multi-modal data. +Following this method, we develop a system for image data retrieval and propose +practical solutions that enable leveraging future advancements in GenAI and +vector representations for improving our system's performance. Our +comprehensive experiments on various benchmark datasets verify that our system +significantly outperforms state-of-the-art techniques. + +
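A toy rendering of the Monte Carlo idea above: sample several synthetic images for the query with a generative model, embed them, and rank corpus images by their average similarity to the samples. The generator and the 64-dimensional embeddings are placeholders, not the paper's system:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_image_embedding(query: str) -> np.ndarray:
    """Stand-in for text-to-image generation followed by image embedding."""
    return rng.normal(size=64)

def score_corpus(query: str, corpus_emb: np.ndarray, n_samples: int = 8) -> np.ndarray:
    samples = np.stack([generate_image_embedding(query) for _ in range(n_samples)])
    samples /= np.linalg.norm(samples, axis=1, keepdims=True)
    corpus = corpus_emb / np.linalg.norm(corpus_emb, axis=1, keepdims=True)
    sims = corpus @ samples.T              # (n_corpus, n_samples)
    return sims.mean(axis=1)               # Monte Carlo estimate per corpus image

corpus_emb = rng.normal(size=(100, 64))    # pretend pre-computed image embeddings
scores = score_corpus("a red bicycle leaning on a fence", corpus_emb)
print(np.argsort(-scores)[:5])             # indices of the top-5 candidates
```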
+
+
+
+
+ + ♻ ☆ DUET: A Tuning-Free Device-Cloud Collaborative Parameters Generation + Framework for Efficient Device Model Generalization WWW'23 + + +
+ Device Model Generalization (DMG) is a practical yet under-investigated +research topic for on-device machine learning applications. It aims to improve +the generalization ability of pre-trained models when deployed on +resource-constrained devices, such as improving the performance of pre-trained +cloud models on smart mobiles. While many works have investigated the +data distribution shift across clouds and devices, most of them focus on model +fine-tuning on personalized data for individual devices to facilitate DMG. +Despite their promise, these approaches require on-device re-training, which +is practically infeasible due to the overfitting problem and high time delay +when performing gradient calculation on real-time data. In this paper, we argue +that the computational cost brought by fine-tuning can be rather unnecessary. +We consequently present a novel perspective on improving DMG without increasing +computational cost, i.e., device-specific parameter generation, which directly +maps data distribution to parameters. Specifically, we propose an efficient +Device-cloUd collaborative parametErs generaTion framework DUET. DUET is +deployed on a powerful cloud server that only requires the low cost of +forward propagation and low time delay of data transmission between the +device and the cloud. By doing so, DUET can rehearse the device-specific model +weight realizations conditioned on the personalized real-time data for an +individual device. Importantly, our DUET elegantly connects the cloud and +device as a 'duet' collaboration, frees DMG from fine-tuning, and enables a +faster and more accurate DMG paradigm. We conduct an extensive experimental +study of DUET on three public datasets, and the experimental results confirm +our framework's effectiveness and generalisability for different DMG tasks. + +
+
+ comment: Published on WWW'23: Proceedings of the ACM on Web Conference 2023 + (pp. 3077 - 3085) +
+
+
+
+
+ + ♻ ☆ Intelligent Model Update Strategy for Sequential Recommendation WWW'24 + + +
+ Modern online platforms are increasingly employing recommendation systems to +address information overload and improve user engagement. There is an evolving +paradigm in this research field that recommendation network learning occurs +both on the cloud and on edges with knowledge transfer in between (i.e., +edge-cloud collaboration). Recent works push this field further by enabling +edge-specific context-aware adaptivity, where model parameters are updated in +real-time based on incoming on-edge data. However, we argue that frequent data +exchanges between the cloud and edges often lead to inefficiency and waste of +communication/computation resources, as considerable parameter updates might be +redundant. To investigate this problem, we introduce Intelligent Edge-Cloud +Parameter Request Model, abbreviated as IntellectReq. + IntellectReq is designed to operate on edge, evaluating the cost-benefit +landscape of parameter requests with minimal computation and communication +overhead. We formulate this as a novel learning task, aimed at the detection of +out-of-distribution data, thereby fine-tuning adaptive communication +strategies. Further, we employ statistical mapping techniques to convert +real-time user behavior into a normal distribution, thereby employing +multi-sample outputs to quantify the model's uncertainty and thus its +generalization capabilities. Rigorous empirical validation on four +widely-adopted benchmarks evaluates our approach, evidencing a marked +improvement in the efficiency and generalizability of edge-cloud collaborative +and dynamic recommendation systems. + +
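One plausible reading of the multi-sample uncertainty test described above is a set of stochastic forward passes whose disagreement decides whether the edge requests fresh parameters from the cloud; the dropout-based sketch below is an assumption about the mechanism, not the paper's implementation:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 10))

def should_request_update(x, n_samples=10, threshold=0.05):
    model.train()                          # keep dropout active for MC sampling
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    uncertainty = probs.var(dim=0).mean().item()   # disagreement across samples
    return uncertainty > threshold         # only then pay the communication cost

x = torch.randn(32, 16)                    # recent on-edge user behaviour features
print(should_request_update(x))
```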
+
+ comment: Published on WWW'24(Oral): Proceedings of the ACM on Web Conference + 2024 (pp. 3117-3128) +
+
+
+
+
+ + ♻ ☆ Leveraging Retrieval-Augmented Generation for Persian University + Knowledge Retrieval + + +
+ This paper introduces an innovative approach using Retrieval-Augmented +Generation (RAG) pipelines with Large Language Models (LLMs) to enhance +information retrieval and query response systems for university-related +question answering. By systematically extracting data from the university's +official webpage and employing advanced prompt engineering techniques, we +generate accurate, contextually relevant responses to user queries. + We developed a comprehensive university benchmark, UniversityQuestionBench +(UQB), to rigorously evaluate our system's performance, based on common key +metrics in the field of RAG pipelines, assessing accuracy and reliability +through various metrics and real-world scenarios. Our experimental results +demonstrate significant improvements in the precision and relevance of +generated responses, enhancing user experience and reducing the time required +to obtain relevant answers. In summary, this paper presents a novel application +of RAG pipelines and LLMs, supported by a meticulously prepared university +benchmark, offering valuable insights into advanced AI techniques for academic +data retrieval and setting the stage for future research in this domain. + +
+
+ comment: 6 pages, 2 figures, 1 table, Submitted to 15th IKT conference +
+
+
+
+
+ + ♻ ☆ Efficient Data-aware Distance Comparison Operations for High-Dimensional + Approximate Nearest Neighbor Search VLDB 2025 + + +
+ High-dimensional approximate $K$ nearest neighbor search (AKNN) is a +fundamental task for various applications, including information retrieval. +Most existing algorithms for AKNN can be decomposed into two main components, +i.e., candidate generation and distance comparison operations (DCOs). While +different methods have unique ways of generating candidates, they all share the +same DCO process. In this study, we focus on accelerating the DCO process, +which dominates the time cost in most existing AKNN algorithms. To achieve this, +we propose a Data-Aware Distance Estimation approach, called DADE, which +approximates the exact distance in a lower-dimensional space. We theoretically +prove that the distance estimation in DADE is unbiased in terms of data +distribution. Furthermore, we propose an optimized estimation based on the +unbiased distance estimation formulation. In addition, we propose a hypothesis +testing approach to adaptively determine the number of dimensions needed to +estimate the exact distance with sufficient confidence. We integrate DADE into +widely-used AKNN search algorithms, e.g., IVF and HNSW, and conduct extensive +experiments to demonstrate its superiority. + +
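To make the general idea concrete (estimating a distance from a data-aware, lower-dimensional projection and deciding how many dimensions are enough), the sketch below projects onto principal directions and rescales the partial distance; the naive rescaling is a simplification, not DADE's unbiased estimator or its hypothesis test:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(10000, 128))

# "Data-aware" orthogonal basis: principal directions of the corpus.
_, _, vt = np.linalg.svd(data - data.mean(axis=0), full_matrices=False)

def estimated_sq_dist(q, x, d):
    """Estimate ||q - x||^2 from the first d projected coordinates,
    rescaled by the fraction of dimensions used (a simplification)."""
    diff = (q - x) @ vt.T                  # rotate into the principal basis
    partial = np.sum(diff[:d] ** 2)
    return partial * (len(diff) / d)

q, x = rng.normal(size=128), rng.normal(size=128)
print(np.sum((q - x) ** 2), estimated_sq_dist(q, x, d=32))
```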
+
+ comment: Accepted by VLDB 2025 +
+
+
+
+
+ + ♻ ☆ Potential Field Based Deep Metric Learning + + +
+ Deep metric learning (DML) involves training a network to learn a +semantically meaningful representation space. Many current approaches mine +n-tuples of examples and model interactions within each tuple. We present a +novel, compositional DML model inspired by electrostatic fields in physics +that, instead of operating on tuples, represents the influence of each example +(embedding) by a continuous potential field, and superposes the fields to +obtain their combined global potential field. We use attractive/repulsive +potential fields to represent interactions among embeddings from images of the +same/different classes. Contrary to typical learning methods, where mutual +influence of samples is proportional to their distance, we enforce reduction in +such influence with distance, leading to a decaying field. We show that such +decay helps improve performance on real-world datasets with large intra-class +variations and label noise. Like other proxy-based methods, we also use proxies +to succinctly represent sub-populations of examples. We evaluate our method on +three standard DML benchmarks: the Cars-196, CUB-200-2011, and SOP datasets, where +it outperforms state-of-the-art baselines. + +
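To give the decaying attractive/repulsive field a concrete shape, a toy loss could weight pulls between same-class embeddings and pushes between different-class embeddings by a kernel that shrinks with distance; the exponential decay and margin below are illustrative choices, not the paper's potential:

```python
import torch

def potential_field_loss(emb, labels, decay=1.0, margin=2.0):
    sq = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    d = torch.sqrt(sq + 1e-9)                               # pairwise distances
    eye = torch.eye(len(labels))
    same = (labels[:, None] == labels[None, :]).float() - eye
    diff = 1.0 - same - eye
    weight = torch.exp(-decay * d)                          # influence decays with distance
    attract = (same * weight * d).sum()                     # pull same-class points together
    repel = (diff * weight * torch.relu(margin - d)).sum()  # push different classes apart
    return (attract + repel) / (len(labels) * (len(labels) - 1))

emb = torch.randn(16, 8, requires_grad=True)
labels = torch.randint(0, 4, (16,))
loss = potential_field_loss(emb, labels)
loss.backward()
print(float(loss))
```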
+
+
+
+
+ + ♻ ☆ G-RAG: Knowledge Expansion in Material Science + + +
+ In the field of Material Science, effective information retrieval systems are +essential for facilitating research. Traditional Retrieval-Augmented Generation +(RAG) approaches in Large Language Models (LLMs) often encounter challenges +such as outdated information, hallucinations, limited interpretability due to +context constraints, and inaccurate retrieval. To address these issues, Graph +RAG integrates graph databases to enhance the retrieval process. Our proposed +method processes Material Science documents by extracting key entities +(referred to as MatIDs) from sentences, which are then utilized to query +external Wikipedia knowledge bases (KBs) for additional relevant information. +We implement an agent-based parsing technique to achieve a more detailed +representation of the documents. Our improved version of Graph RAG called G-RAG +further leverages a graph database to capture relationships between these +entities, improving both retrieval accuracy and contextual understanding. This +enhanced approach demonstrates significant improvements in performance for +domains that require precise information retrieval, such as Material Science. + +
+
+
+
+
+
+
+
+ + Multimedia 2 + +
+
+
+ + ♻ ☆ Separate Anything You Describe + + +
+ Language-queried audio source separation (LASS) is a new paradigm for +computational auditory scene analysis (CASA). LASS aims to separate a target +sound from an audio mixture given a natural language query, which provides a +natural and scalable interface for digital audio applications. Recent works on +LASS, despite attaining promising separation performance on specific sources +(e.g., musical instruments, limited classes of audio events), are unable to +separate audio concepts in the open domain. In this work, we introduce +AudioSep, a foundation model for open-domain audio source separation with +natural language queries. We train AudioSep on large-scale multimodal datasets +and extensively evaluate its capabilities on numerous tasks including audio +event separation, musical instrument separation, and speech enhancement. +AudioSep demonstrates strong separation performance and impressive zero-shot +generalization ability using audio captions or text labels as queries, +substantially outperforming previous audio-queried and language-queried sound +separation models. For reproducibility of this work, we will release the source +code, evaluation benchmark and pre-trained model at: +https://github.com/Audio-AGI/AudioSep. + +
+
+ comment: Code, benchmark and pre-trained models: + https://github.com/Audio-AGI/AudioSep +
+
+
+
+
+ + ♻ ☆ SongBsAb: A Dual Prevention Approach against Singing Voice Conversion + based Illegal Song Covers NDSS + + +
+ Singing voice conversion (SVC) automates song covers by converting a source +singing voice from a source singer into a new singing voice with the same +lyrics and melody as the source, but sounding as if it were covered by the target +singer of some given target singing voices. However, it raises serious concerns +about copyright and civil rights infringements. We propose SongBsAb, the first +proactive approach to tackle SVC-based illegal song covers. SongBsAb adds +perturbations to singing voices before releasing them, so that when they are +used, the SVC process is disrupted, leading to unexpected singing +voices. Perturbations are carefully crafted to (1) provide a dual prevention, +i.e., preventing the singing voice from being used as the source and target +singing voice in SVC, by proposing a gender-transformation loss and a high/low +hierarchy multi-target loss, respectively; and (2) be harmless, i.e., have no +side-effect on the enjoyment of protected songs, by refining a psychoacoustic +model-based loss with the backing track as an additional masker, a unique +accompanying element for singing voices compared to ordinary speech voices. We +also adopt a frame-level interaction reduction-based loss and encoder ensemble +to enhance the transferability of SongBsAb to unknown SVC models. We +demonstrate the prevention effectiveness, harmlessness, and robustness of +SongBsAb on five diverse and promising SVC models, using both English and +Chinese datasets, and both objective and human study-based subjective metrics. +Our work fosters an emerging research direction for mitigating illegal +automated song covers. + +
+
+ comment: In Proceedings of the 32nd Network and Distributed System Security + (NDSS) Symposium 2025 +
+
+
+
+
+
+
+ + + + + + diff --git a/index.js b/index.js new file mode 100644 index 00000000..69f5da7b --- /dev/null +++ b/index.js @@ -0,0 +1,39 @@ +/* Exapand/Collapse with TAB key */ +var expanded = false; +document.onkeydown = function (e) { + if (e.keyCode === 9) { + expanded = !expanded; + document.querySelectorAll("details").forEach(detail => detail.open = expanded); + return false; + } +}; + +/* Switch Theme */ +const toggleSwitch = document.querySelector('.theme-switch input[type="checkbox"]'); + +function switchTheme(e) { + if (e.target.checked) { + document.documentElement.setAttribute('data-theme', 'light'); + document.getElementById("theme-icon").className = "ri-sun-line"; + localStorage.setItem('theme', 'light'); //add this + } else { + document.documentElement.setAttribute('data-theme', 'dark'); + document.getElementById("theme-icon").className = "ri-moon-line"; + localStorage.setItem('theme', 'dark'); //add this + } +} + +toggleSwitch.addEventListener('change', switchTheme, false); +const currentTheme = localStorage.getItem('theme') ? localStorage.getItem('theme') : null; +if (currentTheme) { + document.documentElement.setAttribute('data-theme', currentTheme); + if (currentTheme === 'light') { + toggleSwitch.checked = true; + } +} + +const timestamp = document.getElementById("build-timestamp"); +const timestamp_local = new Date(timestamp.getAttribute("datetime")).toLocaleString(); + +const badge = document.getElementById("build-timestamp-badge"); +// badge.src = `https://img.shields.io/github/workflow/status/mlnlp-world/myarxiv/Update?=${timestamp_local}&style=for-the-badge`