Commit
Automated report
deep-diver committed Dec 16, 2024
1 parent 9a3c6a3 commit 46dc264
Showing 17 changed files with 153 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Orr Zohar
title: 'Apollo: An Exploration of Video Understanding in Large Multimodal Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.10360
summary: This paper explores the factors that contribute to effective video understanding in Large Multimodal Models (LMMs). The researchers found that scaling consistency is a key factor in transferring design and training decisions from smaller models to larger ones. Based on these findings, they introduced Apollo, a family of LMMs that outperform existing models in video understanding tasks, including hour-long videos....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Yingying Deng
title: 'FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing'
thumbnail: ""
link: https://huggingface.co/papers/2412.07517
summary: This paper presents FireFlow, a method that improves the speed and accuracy of inverting and editing images using Rectified Flows. It uses a numerical solver to achieve a 3x speedup compared to existing methods, while also improving image quality and editing results. The code for FireFlow is available online....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Haonan Qiu
title: 'FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion'
thumbnail: ""
link: https://huggingface.co/papers/2412.09626
summary: FreeScale is a method that allows pre-trained visual diffusion models to generate high-quality images and videos at higher resolutions without the need for additional training or tuning. It does this by fusing information from different receptive scales and extracting desired frequency components, which helps to reduce repetitive patterns and improve the overall quality of the generated content. FreeScale has been shown to be effective for both image and video models, and it is the first method ...
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-15 GenEx: Generating an Explorable World.yaml
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Taiming Lu
title: 'GenEx: Generating an Explorable World'
thumbnail: ""
link: https://huggingface.co/papers/2412.09624
summary: GenEx is a system that generates an entire 3D-consistent imaginative environment from a single RGB image, bringing it to life through panoramic video streams. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Tiehan Fan
title: 'InstanceCap: Improving Text-to-Video Generation via Instance-aware Structured Caption'
thumbnail: ""
link: https://huggingface.co/papers/2412.09283
summary: This paper introduces a new method called InstanceCap that creates more detailed and accurate video captions by breaking down videos into smaller parts and using those parts to create better descriptions. This method also improves the quality of the videos generated by text-to-video models....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Lu Wang
title: 'Large Action Models: From Inception to Implementation'
thumbnail: ""
link: https://huggingface.co/papers/2412.10047
summary: This paper introduces a comprehensive framework for developing Large Action Models (LAMs) for action generation and execution within dynamic environments. It provides a step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. The paper also identifies current limitations and discusses future research and industrial deployment directions....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-15"
author: Yucheng Li
title: 'SCBench: A KV Cache-Centric Analysis of Long-Context Methods'
thumbnail: ""
link: https://huggingface.co/papers/2412.10319
summary: SCBench is a benchmark for evaluating long-context methods from a KV cache perspective. It covers four categories of long-context capabilities and evaluates eight categories of long-context solutions. The findings show that sparse encoding with O(n) memory and sub-O(n^2) pre-filling computation performs robustly, and that dynamic sparsity yields more expressive KV caches than static patterns....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Sahal Shaji Mullappilly
title: 'BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities'
thumbnail: ""
link: https://huggingface.co/papers/2412.07769
summary: This study introduces BiMediX2, a bilingual (Arabic-English) AI model that can understand and process text and images related to healthcare. It's trained on a large dataset of medical interactions in both languages and performs well on various medical tasks, even outperforming other models like GPT-4....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Yusuf Dalva
title: 'FluxSpace: Disentangled Semantic Editing in Rectified Flow Transformers'
thumbnail: ""
link: https://huggingface.co/papers/2412.09611
summary: This paper introduces FluxSpace, a method for editing images generated by rectified flow transformers like Flux, by using the representations learned by the transformer blocks within the models. This allows for precise, attribute-specific modifications without affecting unrelated aspects of the image, and enables a wide range of image editing tasks....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Sarkar Snigdha Sarathi Das
title: 'GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers'
thumbnail: ""
link: https://huggingface.co/papers/2412.09722
summary: GReaTer is a new method for optimizing prompts for smaller language models by using task loss gradients. It outperforms previous methods and can even match or surpass the performance of larger language models....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Hongjie Wang
title: 'LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity'
thumbnail: ""
link: https://huggingface.co/papers/2412.09856
summary: LinGen is a new method that can generate high-resolution, minute-length videos on a single GPU, a significant improvement over existing methods, which can only generate videos 10-20 seconds long because of their high computational cost. LinGen replaces a computationally intensive block with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, while the TE-branch focuses on temporal ...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Baisen Wang
title: Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation
thumbnail: ""
link: https://huggingface.co/papers/2412.09428
summary: This paper proposes a new method, Visuals Music Bridge (VMB), for multimodal music generation. It uses explicit bridges of text and music for multimodal alignment, and a Multimodal Music Description Model to convert visual inputs into textual descriptions. It also includes a Dual-track Music Retrieval module for user control. The method improves music quality, modality, and customization alignment, and sets a new standard for interpretable and expressive multimodal music generation....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Daniel Winter
title: 'ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation'
thumbnail: ""
link: https://huggingface.co/papers/2412.08645
summary: This paper proposes a method that uses recurring objects in large unlabeled datasets to create massive supervision, enabling the training of a simple text-to-image diffusion architecture for object insertion and subject-driven generation. The method outperforms existing methods in identity preservation and photorealistic composition without requiring slow test-time tuning....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Yasamin Medghalchi
title: 'Prompt2Perturb (P2P): Text-Guided Diffusion-Based Adversarial Attacks on Breast Ultrasound Images'
thumbnail: ""
link: https://huggingface.co/papers/2412.09910
summary: We present Prompt2Perturb (P2P), a new approach to generating adversarial attacks on breast ultrasound images using text instructions. Our method uses learnable prompts and optimizes early reverse diffusion steps to create subtle, yet effective, perturbations that remain imperceptible while guiding the model towards targeted outcomes. P2P outperforms existing attack techniques and generates more natural-looking images....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Sultan Alrashed
title: 'SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs'
thumbnail: ""
link: https://huggingface.co/papers/2412.08347
summary: We trained a language model called SmolTulu-1.7b-Instruct, which is based on Tulu 3 and SmolLM2-1.7B, and found that using higher learning rate to batch size ratios can improve its performance on reasoning tasks. It outperforms other models with fewer than 2 billion parameters on instruction following and mathematical reasoning tasks....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Hao Li
title: 'SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding'
thumbnail: ""
link: https://huggingface.co/papers/2412.09604
summary: This paper proposes SynerGen-VL, a simple yet powerful encoder-free MLLM that can both understand and generate images. It uses a token folding mechanism and a vision-expert-based progressive alignment pretraining strategy to improve performance while reducing training complexity. SynerGen-VL matches or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and it's a promising path for future unified MLLMs....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-16"
author: Ruijie Zheng
title: 'TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies'
thumbnail: ""
link: https://huggingface.co/papers/2412.10345
summary: We present a new approach called visual trace prompting to improve the spatial-temporal awareness of large vision-language-action models for robotic learning. Our method involves encoding state-action trajectories visually, which helps the models better understand the spatial-temporal dynamics in interactive robotics. We demonstrate the effectiveness of our approach by developing a new model called TraceVLA, which outperforms existing models in various robot manipulation tasks and exhibits robus...
opinion: placeholder
tags:
- ML