Commit

Automated report
deep-diver committed Dec 6, 2024
1 parent 33db301 commit 020fc8c
Showing 29 changed files with 261 additions and 0 deletions.
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Yiheng Xu
title: 'Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction'
thumbnail: ""
link: https://huggingface.co/papers/2412.04454
summary: The paper introduces Aguvis, a unified pure vision-based framework for autonomous GUI agents that uses image-based observations and natural language to interact with digital environments. It overcomes limitations of previous work by integrating explicit planning and reasoning, and outperforms existing methods in both offline and real-world scenarios. Datasets, models, and training recipes are open-sourced....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Enshen Zhou
title: 'Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection'
thumbnail: ""
link: https://huggingface.co/papers/2412.04455
summary: We developed a new method that uses a vision-language model (VLM) to detect and prevent failures in robotic systems. The method formulates tasks as sets of constraints to be satisfied, and uses code generated by the VLM to check these constraints in real time. It also uses small geometric shapes to make the monitoring more accurate and efficient. Tests show that this method is more successful and faster than other methods, and it can be used in real-world settings and with other control sys...
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-05 Densing Law of LLMs.yaml
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Chaojun Xiao
title: Densing Law of LLMs
thumbnail: ""
link: https://huggingface.co/papers/2412.04315
summary: This paper proposes a new metric called 'capacity density' to evaluate the quality of Large Language Models (LLMs) based on their effectiveness and efficiency. The paper also identifies an empirical law, the 'densing law', which states that the capacity density of LLMs is increasing exponentially over time, doubling approximately every three months. This law can guide future LLM development by emphasizing the importance of improving capacity density to achieve optimal results with minimal comput...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Seungone Kim
title: Evaluating Language Models as Synthetic Data Generators
thumbnail: ""
link: https://huggingface.co/papers/2412.03679
summary: This paper introduces AgoraBench, a benchmark for evaluating language models' ability to generate high-quality synthetic data. By synthesizing 1.26 million training instances using 6 LMs and training 99 student models, they uncover key insights about LMs' data generation capabilities, including that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability, and that multiple intrinsic features of data quality collectively serve as better indicators. They also...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jiuhai Chen
title: 'Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion'
thumbnail: ""
link: https://huggingface.co/papers/2412.04424
summary: Florence-VL is a new family of multimodal large language models that uses a generative vision foundation model called Florence-2 to create more versatile visual representations. The model uses a depth-breadth fusion technique to combine visual features from different depths and prompts, and it is trained on a diverse set of open-source datasets. Florence-VL outperforms existing models on various multimodal and vision-centric benchmarks, and the models and training recipe are open-sourced....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jinbin Bai
title: 'HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing'
thumbnail: ""
link: https://huggingface.co/papers/2412.04280
summary: HumanEdit is a high-quality, human-rewarded dataset for instruction-based image editing that includes various types of editing instructions and is designed to align with human preferences. It is meticulously curated and includes images with masks, making it a versatile benchmark for instructional image editing datasets....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jian Han
title: 'Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis'
thumbnail: ""
link: https://huggingface.co/papers/2412.04431
summary: Infinity is a high-resolution, photorealistic image generation model that uses a bitwise token prediction framework with an infinite-vocabulary tokenizer & classifier and bitwise self-correction mechanism to improve generation capacity and details. It outperforms top-tier diffusion models and sets a new record for autoregressive text-to-image models, generating a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Haoning Wu
title: 'MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities'
thumbnail: ""
link: https://huggingface.co/papers/2412.04106
summary: 'Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: ...'
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Zehuan Huang
title: 'MV-Adapter: Multi-view Consistent Image Generation Made Easy'
thumbnail: ""
link: https://huggingface.co/papers/2412.03632
summary: This paper introduces MV-Adapter, a versatile plug-and-play adapter that enhances text-to-image models for multi-view image generation without altering the original network structure or feature space. It updates fewer parameters, enabling efficient training and preserving the prior knowledge embedded in pre-trained models, mitigating overfitting risks. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL and demonstrates adaptability and versatility, setting a new q...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jungwoo Park
title: 'Monet: Mixture of Monosemantic Experts for Transformers'
thumbnail: ""
link: https://huggingface.co/papers/2412.04139
summary: Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jaskirat Singh
title: 'Negative Token Merging: Image-based Adversarial Feature Guidance'
thumbnail: ""
link: https://huggingface.co/papers/2412.01339
summary: This paper introduces a new method called NegToMe for adversarial guidance using visual features from a reference image or other images in a batch, which helps reduce visual similarity with copyrighted content and increases output diversity without sacrificing image quality. It is simple to implement, requires only marginally higher inference time, and generalizes to different diffusion architectures....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Shufan Li
title: 'OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows'
thumbnail: ""
link: https://huggingface.co/papers/2412.01169
summary: OmniFlow is a new method for generating images, audio, or text from any combination of input modalities. It uses a multi-modal rectified flow framework to combine multiple types of data, and it outperforms previous methods on tasks like turning text into images or audio. OmniFlow also introduces a new way to control how different types of data are connected in the final output, and it can be trained on audio and text data separately before being combined with other data for fine-tuning. The method was tested thoroughly and provided insights into...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jianfeng Xiang
title: Structured 3D Latents for Scalable and Versatile 3D Generation
thumbnail: ""
link: https://huggingface.co/papers/2412.01506
summary: We propose a new method for generating versatile, high-quality 3D assets. The key is a structured latent representation of 3D information that can be decoded into different types of outputs, such as radiance fields, 3D shapes, and meshes. This method uses a sparse 3D grid and visual features from a powerful computer vision model to capture both shape and appearance information. Our 3D generation models, called rectified flow transformers, are trained on a large dataset of 500,0...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jiayuan Rao
title: Towards Universal Soccer Video Understanding
thumbnail: ""
link: https://huggingface.co/papers/2412.01820
summary: This paper presents a comprehensive multi-modal framework for soccer video understanding, including the largest multi-modal soccer dataset to date, a visual-language foundation model called MatchVision, and experiments on various downstream tasks. MatchVision demonstrates state-of-the-art performance, highlighting the superiority of the proposed data and model....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Senqiao Yang
title: 'VisionZip: Longer is Better but Not Necessary in Vision Language Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.04467
summary: VisionZip is a method that reduces visual token redundancy and improves efficiency in vision-language models by selecting informative tokens for input, outperforming previous methods across various tasks and settings while also enhancing model inference speed....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Yefei He
title: 'ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality'
thumbnail: ""
link: https://huggingface.co/papers/2412.04062
summary: ZipAR is a method that speeds up autoregressive image generation by exploiting spatial locality, generating nearby parts of an image together instead of one token at a time....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Chaoyang Wang
title: '4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion'
thumbnail: ""
link: https://huggingface.co/papers/2412.04462
summary: 4Real-Video is a new way to create 4D videos by organizing them as a grid of video frames with both time and viewpoint axes. It uses a two-stream architecture with a synchronization layer to exchange information between streams, improving inference speed, visual quality, and temporal and viewpoint consistency compared to previous methods....
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-06 A Noise is Worth Diffusion Guidance.yaml
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Donghoon Ahn
title: A Noise is Worth Diffusion Guidance
thumbnail: ""
link: https://huggingface.co/papers/2412.03895
summary: We find that small, low-magnitude, low-frequency components in the initial noise of a denoising pipeline significantly enhance the denoising process, removing the need for guidance and improving inference throughput and memory usage. We propose a new method, NoiseRefine, that replaces guidance methods with a single refinement of the initial noise, enabling high-quality image generation without guidance within the same diffusion pipeline. Our method uses efficient noise-space learning and achieves stro...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Xinghui Li
title: 'AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.04146
summary: This paper proposes a new method called AnyDressing for generating customized images of characters wearing any combination of garments based on text prompts. The method uses two networks, GarmentsNet and DressingNet, to extract garment features and generate images, respectively. The paper also introduces a new module called Garment-Specific Feature Extractor and an adaptive Dressing-Attention mechanism to improve the accuracy and quality of the generated images. The method can be easily integrat...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Wenting Zhao
title: Challenges in Trustworthy Human Evaluation of Chatbots
thumbnail: ""
link: https://huggingface.co/papers/2412.04363
summary: This paper discusses challenges in collecting high-quality human evaluations for chatbots, focusing on the impact of poor-quality votes by apathetic or adversarial annotators on open leaderboard rankings. It highlights that as few as 10% of such votes can change model rankings by up to five places, and it identifies open challenges in ensuring reliable human annotations....
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-06 Discriminative Fine-tuning of LVLMs.yaml
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Yassine Ouali
title: Discriminative Fine-tuning of LVLMs
thumbnail: ""
link: https://huggingface.co/papers/2412.04378
summary: This paper proposes a new training approach for discriminative fine-tuning of Large Vision-Language Models (LVLMs) that combines the best of contrastively-trained Vision-Language Models (VLMs) and LVLMs. The approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. The contributions include a carefully designed training/optimization framework, a parameter-efficient adaptation metho...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Shivalika Singh
title: 'Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation'
thumbnail: ""
link: https://huggingface.co/papers/2412.03304
summary: This paper examines cultural and linguistic biases in multilingual evaluation datasets, specifically focusing on MMLU. The researchers find that these biases can negatively impact model performance and rankings. They propose Global-MMLU, an improved version of MMLU with reduced biases and expanded language coverage, to address these issues....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Mingyu Xu
title: KV Shifting Attention Enhances Language Modeling
thumbnail: ""
link: https://huggingface.co/papers/2411.19574
summary: This paper introduces a new method called KV shifting attention that enhances the language modeling capabilities of large models by reducing their requirements for depth and width. The method is proven to be beneficial for learning induction heads and language modeling, leading to better performance or faster convergence from toy models to models with over 10 billion parameters....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Longtao Zheng
title: 'MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation'
thumbnail: ""
link: https://huggingface.co/papers/2412.04448
summary: This paper introduces MEMO, a new system that uses memory and emotion to create more realistic talking videos. It improves on previous methods by making the videos more consistent with the audio and better preserving the person's identity....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Lingfeng Ming
title: 'Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement'
thumbnail: ""
link: https://huggingface.co/papers/2412.04003
summary: Marco-LLM is a new multilingual LLM that has been trained on a large amount of multilingual data to improve its performance in low-resource languages. It has been evaluated on multiple benchmarks and has shown significant improvements over existing LLMs. Additionally, Marco-LLM has demonstrated substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of the approach. It is designed to work well in multilingual tasks, including low-resource languages, wh...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Junda Wu
title: 'Personalized Multimodal Large Language Models: A Survey'
thumbnail: ""
link: https://huggingface.co/papers/2412.02142
summary: This paper provides a comprehensive overview of personalized multimodal large language models, covering their architecture, training methods, and applications. It proposes a taxonomy for categorizing personalization techniques and discusses the advantages and evaluation metrics used in existing research. The survey also outlines open challenges in the field....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Wang Xiyao
title: Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
thumbnail: ""
link: https://huggingface.co/papers/2412.03704
summary: This paper introduces a new model called Vision Value Model (VisVM) that helps vision-language models (VLMs) generate better responses by guiding their search process during inference. VisVM evaluates the quality of sentences and predicts the quality of future sentences, which helps VLMs avoid generating incorrect or incomplete sentences. The paper shows that using VisVM improves the performance of VLMs on various multimodal benchmarks and can be used to train VLMs to improve themselves....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Ethan Bradley
title: 'SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction'
thumbnail: ""
link: https://huggingface.co/papers/2412.04262
summary: The paper introduces a new dataset called SynFinTabs, which contains synthetic financial tables. This dataset is meant to be used for training models to extract information from table images. The authors also introduce a model called FinTabQA that is trained on this dataset. The dataset, model, and code for generating the dataset are all available for others to use....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Jun Zhang
title: 'p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay'
thumbnail: ""
link: https://huggingface.co/papers/2412.04449
summary: We propose a new method to build more efficient multimodal language models by selectively processing only important vision tokens in each layer, and gradually reducing the number of processed tokens in deeper layers. This approach improves the efficiency and performance of our models, using less computation and storage during inference, and less training time. We validate our approach on two models and 14 benchmarks, and our model performs as well or better than the baseline models with reduced ...
opinion: placeholder
tags:
- ML
