Automated report
deep-diver committed Dec 18, 2024
1 parent 3cc161e commit aabbdc2
Showing 11 changed files with 99 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
date: "2024-12-17"
author: Junnan Liu
title: Are Your LLMs Capable of Stable Reasoning?
thumbnail: ""
link: https://huggingface.co/papers/2412.13147
summary: This paper introduces G-Pass@k, a new evaluation metric that measures the performance and stability of Large Language Models (LLMs) in complex reasoning tasks. It also presents LiveMathBench, a dynamic benchmark of challenging mathematical problems. The authors find that LLMs have room for improvement in their realistic reasoning capabilities, highlighting the need for more robust evaluation methods....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-17"
author: YiFan Zhang
title: 'Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.12606
summary: The Multi-Dimensional Insights benchmark is a new tool that tests large multimodal models' ability to understand and analyze images in real-world scenarios. It includes questions for different age groups and shows that these models still need to improve in meeting people's needs....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Jeffrey Cheng
title: 'Compressed Chain of Thought: Efficient Reasoning Through Dense Representations'
thumbnail: ""
link: https://huggingface.co/papers/2412.13171
summary: We propose a method called Compressed Chain-of-Thought (CCoT) to generate continuous and variable-length contemplation tokens for language models during inference. These tokens represent explicit reasoning chains and can be applied to existing models. Our method improves accuracy by allowing the models to reason over dense contentful representations, and the level of reasoning can be adjusted by controlling the number of tokens generated....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Seungwook Han
title: 'Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers'
thumbnail: ""
link: https://huggingface.co/papers/2412.12276
summary: This paper proposes a concept encoding-decoding mechanism to explain in-context learning (ICL) in transformers. The mechanism involves the model encoding different latent concepts into distinct, separable representations and building conditional decoding algorithms. The quality of concept encoding is causally related and predictive of ICL performance. The mechanism is validated across pretrained models of varying scales and through controlled finetuning....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Mark Endo
title: 'Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration'
thumbnail: ""
link: https://huggingface.co/papers/2412.13180
summary: This paper studies the issue of pruning visual tokens in Vision-Language Models and introduces a new approach called FEATHER that resolves this issue and improves performance on vision-centric localization benchmarks by more than 5 times compared to the original approach....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Samuel Teodoro
title: 'MIVE: New Design and Benchmark for Multi-Instance Video Editing'
thumbnail: ""
link: https://huggingface.co/papers/2412.12877
summary: 'This paper introduces a new video editing framework called MIVE. It has two main components: DMS to prevent editing leakage and IPR to ensure precise editing. The paper also presents a new MIVE Dataset and an evaluation metric called CIA Score. MIVE outperforms existing methods in terms of editing accuracy, faithfulness, and leakage prevention....'
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Shuting Wang
title: 'OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain'
thumbnail: ""
link: https://huggingface.co/papers/2412.13018
summary: This paper presents OmniEval, a comprehensive RAG benchmark in the financial domain that assesses both retrieval and generation performance. It includes a multi-dimensional evaluation framework, automatic data generation, and robust evaluation metrics, highlighting the performance variations of RAG systems across diverse topics and tasks. The code is open-sourced....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Yifei Zhou
title: 'Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents'
thumbnail: ""
link: https://huggingface.co/papers/2412.13194
summary: The paper proposes a system called PAE that allows foundation model agents to discover and practice skills in the real world. It uses a task proposer to suggest tasks, a thought-based agent policy to attempt those tasks, and an autonomous success evaluator to assess the results. PAE is validated on vision-based web navigation tasks and outperforms other methods with real-world human-annotated benchmarks. The open-source code and checkpoints are available at https://yanqval.github.io/PAE/....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Xuanming Zhang
title: 'Seeker: Towards Exception Safety Code Generation with Intermediate Language Agents Framework'
thumbnail: ""
link: https://huggingface.co/papers/2412.11713
summary: 'This academic paper introduces Seeker, a multi-agent framework that uses large language models to improve exception handling in code. The framework addresses three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Block, and Distorted Handling Solution. Seeker uses agents to assist LLMs in detecting, capturing, and resolving exceptions more effectively, providing valuable insights for future improvements in code reliability....'
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Manan Suri
title: 'VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation'
thumbnail: ""
link: https://huggingface.co/papers/2412.10704
summary: This paper presents VisDoMBench, a benchmark for evaluating question answering systems in multi-document settings with rich multimodal content. The paper also introduces VisDoMRAG, a novel multimodal approach that combines visual and textual retrieval augmented generation, improving accuracy and answer verifiability. The approach outperforms unimodal and long-context LLM baselines by 12-20% in end-to-end multimodal document QA....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-18"
author: Hyuhng Joon Kim
title: 'When to Speak, When to Abstain: Contrastive Decoding with Abstention'
thumbnail: ""
link: https://huggingface.co/papers/2412.12527
summary: This paper introduces Contrastive Decoding with Abstention (CDA), a method that helps large language models decide when to provide an answer and when to abstain, improving their reliability and trustworthiness....
opinion: placeholder
tags:
- ML
