Commit

Automated report
deep-diver committed Dec 6, 2024
1 parent 33db301 commit 020fc8c
Showing 29 changed files with 261 additions and 0 deletions.
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Yiheng Xu
title: 'Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction'
thumbnail: ""
link: https://huggingface.co/papers/2412.04454
summary: The paper introduces Aguvis, a unified pure vision-based framework for autonomous GUI agents that uses image-based observations and natural language to interact with digital environments. It overcomes limitations of previous work by integrating explicit planning and reasoning, and outperforms existing methods in both offline and real-world scenarios. Datasets, models, and training recipes are open-sourced....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Enshen Zhou
title: 'Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection'
thumbnail: ""
link: https://huggingface.co/papers/2412.04455
summary: We developed a new method that uses a vision-language model (VLM) to detect and prevent failures in robotic systems. The method formulates tasks as sets of constraints to be satisfied, and uses code generated by the VLM to check these constraints in real time. It also uses small geometric shapes to make the monitoring more accurate and efficient. Tests show that this method is more successful and faster than other methods, and it can be used in real-world settings and with other control sys...
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-05 Densing Law of LLMs.yaml
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Chaojun Xiao
title: Densing Law of LLMs
thumbnail: ""
link: https://huggingface.co/papers/2412.04315
summary: This paper proposes a new metric called 'capacity density' to evaluate the quality of Large Language Models (LLMs) based on their effectiveness and efficiency. The paper also identifies an empirical law, the 'densing law', which states that the capacity density of LLMs is increasing exponentially over time, doubling approximately every three months. This law can guide future LLM development by emphasizing the importance of improving capacity density to achieve optimal results with minimal comput...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Seungone Kim
title: Evaluating Language Models as Synthetic Data Generators
thumbnail: ""
link: https://huggingface.co/papers/2412.03679
summary: This paper introduces AgoraBench, a benchmark for evaluating language models' ability to generate high-quality synthetic data. By synthesizing 1.26 million training instances using 6 LMs and training 99 student models, they uncover key insights about LMs' data generation capabilities, including that an LM's data generation ability doesn't necessarily correlate with its problem-solving ability, and that multiple intrinsic features of data quality collectively serve as better indicators. They also...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jiuhai Chen
title: 'Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion'
thumbnail: ""
link: https://huggingface.co/papers/2412.04424
summary: Florence-VL is a new family of multimodal large language models that uses a generative vision foundation model called Florence-2 to create more versatile visual representations. The model uses a depth-breadth fusion technique to combine visual features from different depths and prompts, and it is trained on a diverse set of open-source datasets. Florence-VL outperforms existing models on various multimodal and vision-centric benchmarks, and the models and training recipe are open-sourced....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jinbin Bai
title: 'HumanEdit: A High-Quality Human-Rewarded Dataset for Instruction-based Image Editing'
thumbnail: ""
link: https://huggingface.co/papers/2412.04280
summary: HumanEdit is a high-quality, human-rewarded dataset for instruction-based image editing that includes various types of editing instructions and is designed to align with human preferences. It is meticulously curated and includes images with masks, making it a versatile benchmark for instructional image editing datasets....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jian Han
title: 'Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis'
thumbnail: ""
link: https://huggingface.co/papers/2412.04431
summary: Infinity is a high-resolution, photorealistic image generation model that uses a bitwise token prediction framework with an infinite-vocabulary tokenizer & classifier and bitwise self-correction mechanism to improve generation capacity and details. It outperforms top-tier diffusion models and sets a new record for autoregressive text-to-image models, generating a high-quality 1024x1024 image in 0.8 seconds, making it 2.6x faster than SD3-Medium....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Haoning Wu
title: 'MRGen: Diffusion-based Controllable Data Engine for MRI Segmentation towards Unannotated Modalities'
thumbnail: ""
link: https://huggingface.co/papers/2412.04106
summary: 'Medical image segmentation has recently demonstrated impressive progress with deep neural networks, yet the heterogeneous modalities and scarcity of mask annotations limit the development of segmentation models on unannotated modalities. This paper investigates a new paradigm for leveraging generative models in medical applications: controllably synthesizing data for unannotated modalities, without requiring registered data pairs. Specifically, we make the following contributions in this paper: ...'
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Zehuan Huang
title: 'MV-Adapter: Multi-view Consistent Image Generation Made Easy'
thumbnail: ""
link: https://huggingface.co/papers/2412.03632
summary: This paper introduces MV-Adapter, a versatile plug-and-play adapter that enhances text-to-image models for multi-view image generation without altering the original network structure or feature space. It updates fewer parameters, enabling efficient training and preserving the prior knowledge embedded in pre-trained models, mitigating overfitting risks. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL and demonstrates adaptability and versatility, setting a new q...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jungwoo Park
title: 'Monet: Mixture of Monosemantic Experts for Transformers'
thumbnail: ""
link: https://huggingface.co/papers/2412.04139
summary: Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jaskirat Singh
title: 'Negative Token Merging: Image-based Adversarial Feature Guidance'
thumbnail: ""
link: https://huggingface.co/papers/2412.01339
summary: This paper introduces a new method called NegToMe for adversarial guidance using visual features from a reference image or other images in a batch, which helps reduce visual similarity with copyrighted content and increases output diversity without sacrificing image quality. It is simple to implement, requires only marginally higher inference time, and generalizes to different diffusion architectures....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Shufan Li
title: 'OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows'
thumbnail: ""
link: https://huggingface.co/papers/2412.01169
summary: OmniFlow is a new method for generating images, audio, or text from any combination of input modalities. It uses a multi-modal rectified flow framework to combine multiple types of data, and it outperforms previous methods on tasks like turning text into images or audio. OmniFlow also introduces a new way to control how different types of data are connected in the final output, and it can be trained on audio and text data separately before being combined with other data for fine-tuning. The method was tested thoroughly and provided insights into...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jianfeng Xiang
title: Structured 3D Latents for Scalable and Versatile 3D Generation
thumbnail: ""
link: https://huggingface.co/papers/2412.01506
summary: We propose a new method for generating versatile, high-quality 3D assets. The key is a structured latent representation of 3D information that can be decoded into different types of outputs, such as radiance fields, 3D shapes, and meshes. This method uses a sparse 3D grid and visual features from a powerful computer vision model to capture both shape and appearance information. Our 3D generation models, called rectified flow transformers, are trained on a large dataset of 500,0...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Jiayuan Rao
title: Towards Universal Soccer Video Understanding
thumbnail: ""
link: https://huggingface.co/papers/2412.01820
summary: This paper presents a comprehensive multi-modal framework for soccer video understanding, including the largest multi-modal soccer dataset to date, a visual-language foundation model called MatchVision, and experiments on various downstream tasks. MatchVision demonstrates state-of-the-art performance, highlighting the superiority of the proposed data and model....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Senqiao Yang
title: 'VisionZip: Longer is Better but Not Necessary in Vision Language Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.04467
summary: VisionZip is a method that reduces visual token redundancy and improves efficiency in vision-language models by selecting informative tokens for input, outperforming previous methods across various tasks and settings while also enhancing model inference speed....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-05"
author: Yefei He
title: 'ZipAR: Accelerating Autoregressive Image Generation through Spatial Locality'
thumbnail: ""
link: https://huggingface.co/papers/2412.04062
summary: ZipAR is a method that speeds up autoregressive image generation by exploiting spatial locality, generating nearby parts of an image together instead of one token at a time....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Chaoyang Wang
title: '4Real-Video: Learning Generalizable Photo-Realistic 4D Video Diffusion'
thumbnail: ""
link: https://huggingface.co/papers/2412.04462
summary: 4Real-Video is a new way to create 4D videos by organizing them as a grid of video frames with both time and viewpoint axes. It uses a two-stream architecture with a synchronization layer to exchange information between streams, improving inference speed, visual quality, and temporal and viewpoint consistency compared to previous methods....
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-06 A Noise is Worth Diffusion Guidance.yaml
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Donghoon Ahn
title: A Noise is Worth Diffusion Guidance
thumbnail: ""
link: https://huggingface.co/papers/2412.03895
summary: We find that small, low-magnitude, low-frequency components in the initial noise of a denoising pipeline significantly enhance the denoising process, removing the need for guidance and improving inference throughput and memory usage. We propose a new method, NoiseRefine, that replaces guidance methods with a single refinement of the initial noise, enabling high-quality image generation without guidance within the same diffusion pipeline. Our method uses efficient noise-space learning and achieves stro...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Xinghui Li
title: 'AnyDressing: Customizable Multi-Garment Virtual Dressing via Latent Diffusion Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.04146
summary: This paper proposes a new method called AnyDressing for generating customized images of characters wearing any combination of garments based on text prompts. The method uses two networks, GarmentsNet and DressingNet, to extract garment features and generate images, respectively. The paper also introduces a new module called Garment-Specific Feature Extractor and an adaptive Dressing-Attention mechanism to improve the accuracy and quality of the generated images. The method can be easily integrat...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Wenting Zhao
title: Challenges in Trustworthy Human Evaluation of Chatbots
thumbnail: ""
link: https://huggingface.co/papers/2412.04363
summary: This paper discusses challenges in collecting high-quality human evaluations for chatbots, focusing on the impact of poor-quality votes by apathetic or adversarial annotators on open leaderboard rankings. It highlights that as few as 10% of such votes can change model rankings by up to five places, and it identifies open challenges in ensuring reliable human annotations....
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-06 Discriminative Fine-tuning of LVLMs.yaml
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Yassine Ouali
title: Discriminative Fine-tuning of LVLMs
thumbnail: ""
link: https://huggingface.co/papers/2412.04378
summary: This paper proposes a new training approach for discriminative fine-tuning of Large Vision-Language Models (LVLMs) that combines the best of contrastively-trained Vision-Language Models (VLMs) and LVLMs. The approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. The contributions include a carefully designed training/optimization framework, a parameter-efficient adaptation metho...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Shivalika Singh
title: 'Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation'
thumbnail: ""
link: https://huggingface.co/papers/2412.03304
summary: This paper examines cultural and linguistic biases in multilingual evaluation datasets, specifically focusing on MMLU. The researchers find that these biases can negatively impact model performance and rankings. They propose Global-MMLU, an improved version of MMLU with reduced biases and expanded language coverage, to address these issues....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Mingyu Xu
title: KV Shifting Attention Enhances Language Modeling
thumbnail: ""
link: https://huggingface.co/papers/2411.19574
summary: This paper introduces a new method called KV shifting attention that enhances the language modeling capabilities of large models by reducing their requirements for depth and width. The method is proven to be beneficial for learning induction heads and language modeling, leading to better performance or faster convergence from toy models to models with over 10 billion parameters....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Longtao Zheng
title: 'MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation'
thumbnail: ""
link: https://huggingface.co/papers/2412.04448
summary: This paper introduces MEMO, a new system that uses memory and emotion to create more realistic talking videos. It improves on previous methods by making the videos more consistent with the audio and better preserving the person's identity....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Lingfeng Ming
title: 'Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement'
thumbnail: ""
link: https://huggingface.co/papers/2412.04003
summary: Marco-LLM is a new multilingual LLM that has been trained on a large amount of multilingual data to improve its performance in low-resource languages. It has been evaluated on multiple benchmarks and has shown significant improvements over existing LLMs. Additionally, Marco-LLM has demonstrated substantial enhancements in any-to-any machine translation tasks, showing the effectiveness of the approach. It is designed to work well in multilingual tasks, including low-resource languages, wh...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Junda Wu
title: 'Personalized Multimodal Large Language Models: A Survey'
thumbnail: ""
link: https://huggingface.co/papers/2412.02142
summary: This paper provides a comprehensive overview of personalized multimodal large language models, covering their architecture, training methods, and applications. It proposes a taxonomy for categorizing personalization techniques and discusses the advantages and evaluation metrics used in existing research. The survey also outlines open challenges in the field....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Wang Xiyao
title: Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
thumbnail: ""
link: https://huggingface.co/papers/2412.03704
summary: This paper introduces a new model called Vision Value Model (VisVM) that helps vision-language models (VLMs) generate better responses by guiding their search process during inference. VisVM evaluates the quality of sentences and predicts the quality of future sentences, which helps VLMs avoid generating incorrect or incomplete sentences. The paper shows that using VisVM improves the performance of VLMs on various multimodal benchmarks and can be used to train VLMs to improve themselves....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Ethan Bradley
title: 'SynFinTabs: A Dataset of Synthetic Financial Tables for Information and Table Extraction'
thumbnail: ""
link: https://huggingface.co/papers/2412.04262
summary: The paper introduces a new dataset called SynFinTabs, which contains synthetic financial tables. This dataset is meant to be used for training models to extract information from table images. The authors also introduce a model called FinTabQA that is trained on this dataset. The dataset, model, and code for generating the dataset are all available for others to use....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-06"
author: Jun Zhang
title: 'p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay'
thumbnail: ""
link: https://huggingface.co/papers/2412.04449
summary: We propose a new method to build more efficient multimodal language models by selectively processing only important vision tokens in each layer, and gradually reducing the number of processed tokens in deeper layers. This approach improves the efficiency and performance of our models, using less computation and storage during inference, and less training time. We validate our approach on two models and 14 benchmarks, and our model performs as well or better than the baseline models with reduced ...
opinion: placeholder
tags:
- ML
