Automated report

deep-diver · Dec 16, 2024 · 46dc264
1 parent 9a3c6a3
commit 46dc264
Showing 17 changed files with 153 additions and 0 deletions.
diff --git a/.../2024-12-15 Apollo: An Exploration of Video Understanding in Large Multimodal Models.yaml b/.../2024-12-15 Apollo: An Exploration of Video Understanding in Large Multimodal Models.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Orr Zohar
+title: 'Apollo: An Exploration of Video Understanding in Large Multimodal Models'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.10360
+summary: This paper explores the factors that contribute to effective video understanding in Large Multimodal Models (LMMs). The researchers found that scaling consistency is a key factor in transferring design and training decisions from smaller models to larger ones. Based on these findings, they introduced Apollo, a family of LMMs that outperform existing models in video understanding tasks, including hour-long videos....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...ent/2024-12-15 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing.yaml b/...ent/2024-12-15 FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Yingying Deng
+title: 'FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.07517
+summary: This paper presents FireFlow, a method that improves the speed and accuracy of inverting and editing images using Rectified Flows. It uses a numerical solver to achieve a 3x speedup compared to existing methods, while also improving image quality and editing results. The code for FireFlow is available online....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...reeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion.yaml b/...reeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Haonan Qiu
+title: 'FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09626
+summary: FreeScale is a method that allows pre-trained visual diffusion models to generate high-quality images and videos at higher resolutions without the need for additional training or tuning. It does this by fusing information from different receptive scales and extracting desired frequency components, which helps to reduce repetitive patterns and improve the overall quality of the generated content. FreeScale has been shown to be effective for both image and video models, and it is the first method ...
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-15 GenEx: Generating an Explorable World.yaml b/current/2024-12-15 GenEx: Generating an Explorable World.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Taiming Lu
+title: 'GenEx: Generating an Explorable World'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09624
+summary: GenEx is a system that generates an entire 3D-consistent imaginative environment from a single RGB image, bringing it to life through panoramic video streams. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...nstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption.yaml b/...nstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Tiehan Fan
+title: 'InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09283
+summary: This paper introduces a new method called InstanceCap that creates more detailed and accurate video captions by breaking down videos into smaller parts and using those parts to create better descriptions. This method also improves the quality of the videos generated by text-to-video models....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-15 Large Action Models: From Inception to Implementation.yaml b/current/2024-12-15 Large Action Models: From Inception to Implementation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Lu Wang
+title: 'Large Action Models: From Inception to Implementation'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.10047
+summary: This paper introduces a comprehensive framework for developing Large Action Models (LAMs) for action generation and execution within dynamic environments. It provides a step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. The paper also identifies current limitations and discusses future research and industrial deployment directions....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-15 SCBench: A KV Cache-Centric Analysis of Long-Context Methods.yaml b/current/2024-12-15 SCBench: A KV Cache-Centric Analysis of Long-Context Methods.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-15"
+author: Yucheng Li
+title: 'SCBench: A KV Cache-Centric Analysis of Long-Context Methods'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.10319
+summary: SCBench is a benchmark for evaluating long-context methods from a KV cache perspective. It covers four categories of long-context capabilities and evaluates eight categories of long-context solutions. The findings show that sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly, and dynamic sparsity yields more expressive KV caches than static patterns....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-16 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities.yaml b/current/2024-12-16 BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Sahal Shaji Mullappilly
+title: 'BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.07769
+summary: This study introduces BiMediX2, a bilingual (Arabic-English) AI model that can understand and process text and images related to healthcare. It's trained on a large dataset of medical interactions in both languages and performs well on various medical tasks, even outperforming other models like GPT-4....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...t/2024-12-16 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers.yaml b/...t/2024-12-16 FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Yusuf Dalva
+title: 'FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09611
+summary: This paper introduces FluxSpace, a method for editing images generated by rectified flow transformers like Flux, by using the representations learned by the transformer blocks within the models. This allows for precise, attribute-specific modifications without affecting unrelated aspects of the image, and enables a wide range of image editing tasks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...Ter: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers.yaml b/...Ter: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Sarkar Snigdha Sarathi Das
+title: 'GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09722
+summary: GReaTer is a new method for optimizing prompts for smaller language models by using task loss gradients. It outperforms previous methods and can even match or surpass the performance of larger language models....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...solution Minute-Length Text-to-Video Generation with Linear Computational Complexity.yaml b/...solution Minute-Length Text-to-Video Generation with Linear Computational Complexity.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Hongjie Wang
+title: 'LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09856
+summary: A new method called LinGen is introduced that can generate high-resolution minute-length videos on a single GPU, which is a significant improvement over existing methods that can only generate videos 10-20 seconds in length due to their high computational cost. LinGen replaces a computationally intensive block with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, while the TE-branch focuses on temporal ...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...4-12-16 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation.yaml b/...4-12-16 Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Baisen Wang
+title: Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09428
+summary: This paper proposes a new method, Visuals Music Bridge (VMB), for multimodal music generation. It uses explicit bridges of text and music for multimodal alignment, and a Multimodal Music Description Model to convert visual inputs into textual descriptions. It also includes a Dual-track Music Retrieval module for user control. The method improves music quality, modality, and customization alignment, and sets a new standard for interpretable and expressive multimodal music generation....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...16 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation.yaml b/...16 ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Daniel Winter
+title: 'ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.08645
+summary: This paper proposes a method that uses recurring objects in large unlabeled datasets to create massive supervision, enabling the training of a simple text-to-image diffusion architecture for object insertion and subject-driven generation. The method outperforms existing methods in identity preservation and photorealistic composition without requiring slow test-time tuning....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...b (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images.yaml b/...b (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Yasamin Medghalchi
+title: 'Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09910
+summary: We present Prompt2Perturb (P2P), a new approach to generating adversarial attacks on breast ultrasound images using text instructions. Our method uses learnable prompts and optimizes early reverse diffusion steps to create subtle, yet effective, perturbations that remain imperceptible while guiding the model towards targeted outcomes. P2P outperforms existing attack techniques and generates more natural-looking images....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...Tulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs.yaml b/...Tulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Sultan Alrashed
+title: 'SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.08347
+summary: We trained a language model called SmolTulu-1.7b-Instruct, which is based on Tulu 3 and SmolLM2-1.7B, and found that using higher learning rate to batch size ratios can improve its performance on reasoning tasks. It outperforms other models with fewer than 2 billion parameters on instruction following and mathematical reasoning tasks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...Synergistic Image Understanding and Generation with Vision Experts and Token Folding.yaml b/...Synergistic Image Understanding and Generation with Vision Experts and Token Folding.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Hao Li
+title: 'SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.09604
+summary: This paper proposes SynerGen-VL, a simple yet powerful encoder-free MLLM that can both understand and generate images. It uses a token folding mechanism and vision-expert-based progressive alignment pretraining strategy to improve performance while reducing training complexity. SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and it's a promising path for future unified MLLMs....
+opinion: placeholder
+tags:
+    - ML
diff --git a/... Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies.yaml b/... Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-16"
+author: Ruijie Zheng
+title: 'TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.10345
+summary: We present a new approach called visual trace prompting to improve the spatial-temporal awareness of large vision-language-action models for robotic learning. Our method involves encoding state-action trajectories visually, which helps the models better understand the spatial-temporal dynamics in interactive robotics. We demonstrate the effectiveness of our approach by developing a new model called TraceVLA, which outperforms existing models in various robot manipulation tasks and exhibits robus...
+opinion: placeholder
+tags:
+    - ML
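
The totals in the header (17 changed files, 153 additions) are consistent with the hunks shown: each new file carries a `@@ -0,0 +1,9 @@` hunk, and 17 × 9 = 153. A minimal sketch for recomputing those totals from a unified diff follows. The function `summarize_diff`, the helper `make_file_diff`, and the sample entry values are hypothetical and not part of this commit.

```python
def summarize_diff(diff_text: str) -> tuple[int, int]:
    """Count changed files and added lines in a unified diff."""
    files_changed = 0
    lines_added = 0
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            files_changed += 1
        # "+++" marks a file header in full diffs, not an added line.
        elif line.startswith("+") and not line.startswith("+++"):
            lines_added += 1
    return files_changed, lines_added


# Hypothetical nine-line entry with the same fields as the files above.
ENTRY = [
    '+date: "2024-12-16"',
    "+author: Example Author",
    "+title: 'Example Title'",
    '+thumbnail: ""',
    "+link: https://example.com/paper",
    "+summary: Example summary.",
    "+opinion: placeholder",
    "+tags:",
    "+    - ML",
]


def make_file_diff(name: str) -> str:
    """Build the diff text for one newly added nine-line file."""
    header = [f"diff --git a/current/{name} b/current/{name}", "@@ -0,0 +1,9 @@"]
    return "\n".join(header + ENTRY)


# Seventeen synthetic files, mirroring the shape of this commit.
sample = "\n".join(make_file_diff(f"paper-{i}.yaml") for i in range(17))
files_changed, lines_added = summarize_diff(sample)
print(files_changed, lines_added)  # 17 153
```

The count relies only on line prefixes, so it would miscount a diff whose added content itself begins with `++`; for these YAML entries that case does not arise.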