Automated report

deep-diver · Dec 3, 2024 · 9af9477
1 parent 80e9a9f
commit 9af9477

Showing 27 changed files with 243 additions and 0 deletions.
diff --git a/...usion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation.yaml b/...usion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-02"
+author: Xin Yan
+title: Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01316
+summary: Presto is a new video diffusion model that creates 15-second videos with a lot of detail and a clear story. It uses a method called Segmented Cross-Attention to help the model understand the story of the video better. Presto also uses a new dataset called LongTake-HD, which has a lot of videos with a clear story. Presto does a better job than other video generation methods at creating videos with a clear story and a lot of detail....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-02 Open-Sora Plan: Open-Source Large Video Generation Model.yaml b/current/2024-12-02 Open-Sora Plan: Open-Source Large Video Generation Model.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-02"
+author: Bin Lin
+title: 'Open-Sora Plan: Open-Source Large Video Generation Model'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00131
+summary: Open-Sora Plan is an open-source project that aims to create a large video generation model using a variety of user inputs. It uses a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and condition controllers. The project also includes efficient training and inference strategies, as well as a data curation pipeline. The project achieves impressive video generation results and is available on GitHub....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...2 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation.yaml b/...2 Steering Rectified Flow Models in the Vector Field for Controlled Image Generation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-02"
+author: Maitreya Patel
+title: Steering Rectified Flow Models in the Vector Field for Controlled Image Generation
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00100
+summary: This paper introduces FlowChef, a new method for controlled image generation that uses the vector field of rectified flow models (RFMs) to guide the denoising trajectory. FlowChef is a unified framework that can handle classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. It outperforms existing methods in terms of performance, memory, and time requirements, achieving new state-of-the-art results....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...TRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video.yaml b/...TRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-02"
+author: Jinyuan Qu
+title: 'TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.18671
+summary: TAPTRv3 is an improved version of TAPTRv2, which is a simple DETR-like framework for tracking any point in real-world videos. TAPTRv3 uses spatial and temporal context to improve feature querying and robustness in long videos. It introduces Context-aware Cross-Attention (CCA) for better spatial feature querying and Visibility-aware Long-Temporal Attention (VLTA) for better temporal feature querying. TAPTRv3 outperforms TAPTRv2 and achieves state-of-the-art performance on most challenging dataset...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...ision Language Models Still Struggle with Visual Perception of Geometric Information.yaml b/...ision Language Models Still Struggle with Visual Perception of Geometric Information.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-02"
+author: Ryo Kamoi
+title: 'VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00947
+summary: A new dataset called VisOnlyQA has been introduced to evaluate the visual perception capabilities of Large Vision Language Models (LVLMs) on scientific figures. The dataset highlights that current LVLMs struggle with visual perception tasks, but fine-tuning on synthetic training data and using stronger language models can improve their performance....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model.yaml b/...: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-02"
+author: Zongjian Li
+title: 'WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.17459
+summary: This paper introduces WF-VAE, a new video VAE that uses wavelet transform to encode videos more efficiently and maintain the integrity of the latent space. It demonstrates better performance in metrics such as PSNR and LPIPS, and is 2x faster and uses 4x less memory compared to existing video VAEs....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models.yaml b/...A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Yanxi Chen
+title: A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
+thumbnail: ""
+link: https://huggingface.co/papers/2411.19477
+summary: This paper proposes a two-stage algorithm that uses a language model to generate and compare candidate solutions. The algorithm has a provable scaling law for test-time compute, meaning the failure probability decreases exponentially with the number of candidate solutions and comparison rounds. The algorithm was tested on the MMLU-Pro benchmark and showed promising results, validating the assumptions and the benefits of scaling up the test-time compute....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...nt/2024-12-03 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge.yaml b/...nt/2024-12-03 Art-Free Generative Models: Art Creation Without Graphic Art Knowledge.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Hui Ren
+title: 'Art-Free Generative Models: Art Creation Without Graphic Art Knowledge'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00176
+summary: The paper explores the creation of art without prior art knowledge by training a text-to-image generation model without access to art-related content. They introduce a method to learn an art adapter using only a few examples of selected artistic styles. The art generated using their method is perceived by users as comparable to art produced by models trained on large, art-rich datasets. Data attribution techniques show how examples from both artistic and non-artistic datasets contributed to the ...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...aborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input.yaml b/...aborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Francesco Taioli
+title: 'Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01250
+summary: The paper introduces a new task called Collaborative Instance Navigation (CoIN) where an AI agent interacts with a human to navigate to a specific target object. The AI agent uses a Self-Questioner model to initiate a self-dialogue and a Large Language Model (LLM) to determine whether to ask a question to the human, continue or halt navigation. This helps to minimize the amount of input needed from the human and makes the navigation process more efficient and effective....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-03 Efficient Track Anything.yaml b/current/2024-12-03 Efficient Track Anything.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Yunyang Xiong
+title: Efficient Track Anything
+thumbnail: ""
+link: https://huggingface.co/papers/2411.18933
+summary: EfficientTAMs are lightweight track-anything models that produce high-quality results with low latency and a small model size. They use a nonhierarchical Vision Transformer as an image encoder and an efficient memory module for video object segmentation. EfficientTAMs perform comparably to the SAM 2 model with faster speed and fewer parameters, and can run on mobile devices for on-device video object segmentation applications....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...rge Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting.yaml b/...rge Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Thilini Wijesiriwardene
+title: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00869
+summary: This paper explores the ability of large language models to solve proportional analogies by introducing a 15K MCQA dataset and evaluating their performance in different knowledge-enhanced prompt settings. The best model achieved an accuracy of 55%, and targeted knowledge was found to be the most helpful....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...2-03 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait.yaml b/...2-03 FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Taekyung Ki
+title: 'FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01064
+summary: A new method called FLOAT is introduced to create more realistic talking portrait videos by using a generative model that focuses on motion instead of individual pixels. This method is faster and more consistent than previous ones, and it can also make the person in the video look happier or sadder depending on the speech....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation.yaml b/...: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Pengfei Zhou
+title: 'GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.18499
+summary: This paper introduces a new benchmark called OpenING for evaluating the performance of models that can generate interleaved image-text content. This benchmark includes 5,400 high-quality human-annotated instances across 56 real-world tasks and covers a wide range of daily scenarios. The paper also presents a judge model called IntJudge that can evaluate open-ended multimodal generation methods. Experiments on OpenING reveal that current interleaved generation methods still have room for improvem...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...2-03 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge.yaml b/...2-03 INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Angelika Romanou
+title: 'INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.19799
+summary: The paper introduces INCLUDE, a comprehensive evaluation suite of QA pairs from local exam sources, measuring the capabilities of multilingual LLMs in regional contexts across 44 written languages. This benchmark focuses on knowledge and reasoning to evaluate the performance of multilingual LLMs in real-world language environments....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...-12-03 Improving speaker verification robustness with synthetic emotional utterances.yaml b/...-12-03 Improving speaker verification robustness with synthetic emotional utterances.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Nikhil Kumar Koditala
+title: Improving speaker verification robustness with synthetic emotional utterances
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00319
+summary: The paper uses CycleGAN to create synthetic emotional speech for each speaker, which helps improve the accuracy of speaker verification systems. This is useful because existing systems often make mistakes when verifying speakers who are emotional, and there is not much labeled emotional speech data available. The approach reduces the error rate by up to 3.64% relative when verifying speakers in emotional speech scenarios....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...t/2024-12-03 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos.yaml b/...t/2024-12-03 PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Meng Cao
+title: 'PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01800
+summary: This paper introduces PhysGame, a benchmark to evaluate physical commonsense understanding in gameplay videos. They find that current video LLMs perform poorly and propose PhysVLM, a physical knowledge-enhanced video LLM, to improve performance....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...ion-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters.yaml b/...ion-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Jianping Jiang
+title: 'SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00174
+summary: This paper introduces SOLAMI, a framework to make 3D characters more social by having them respond to human input with speech and movement. It uses a new dataset and a VR interface, and shows that it works better than other methods....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...ent/2024-12-03 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis.yaml b/...ent/2024-12-03 Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Anton Voronov
+title: 'Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01819
+summary: This work introduces Switti, a scale-wise transformer for text-to-image generation. Switti improves the convergence and overall performance of existing models by making architectural modifications and proposing a non-AR counterpart that is faster and uses less memory. It also reveals that classifier-free guidance at high-resolution scales can be detrimental and disables it to achieve further acceleration and better generation of fine-grained details. Switti outperforms existing text-to-image AR ...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...e Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning.yaml b/...e Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Ruben Ohana
+title: 'The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00568
+summary: 'Machine learning based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain experts and numerical software ...'
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-03 TinyFusion: Diffusion Transformers Learned Shallow.yaml b/current/2024-12-03 TinyFusion: Diffusion Transformers Learned Shallow.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Gongfan Fang
+title: 'TinyFusion: Diffusion Transformers Learned Shallow'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01199
+summary: TinyFusion is a method that removes unnecessary layers from diffusion transformers to make them more efficient, while still keeping their performance high after fine-tuning. It uses a technique called learnable pruning, which helps the model recover its performance after being made smaller. Experiments show that TinyFusion works well for different types of diffusion transformers and can make them faster and more efficient without sacrificing much performance....
+opinion: placeholder
+tags:
+    - ML
diff --git a/... Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning.yaml b/... Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Aditya Narayan Sankaran
+title: Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01408
+summary: We explore the use of pre-trained audio representations for detecting abusive language in low-resource languages by applying Few Shot Learning (FSL) and Model-Agnostic Meta-Learning (MAML) framework. Our approach integrates these representations within the MAML framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200) evaluating the impact of limited data on performance. Additionally, a feature visualization study was conducted to better understand mo...
+opinion: placeholder
+tags:
+    - ML
diff --git a/...uration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation.yaml b/...uration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Weiming Ren
+title: 'VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00927
+summary: The paper introduces VISTA, a video augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA improves the performance of large multimodal models on long-video understanding tasks by an average of 3.3% across four benchmarks. It also introduces HRVideoBench, a high-resolution video understanding benchmark, on which the models achieve a 6.5% performance gain....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-03 VLSBench: Unveiling Visual Leakage in Multimodal Safety.yaml b/current/2024-12-03 VLSBench: Unveiling Visual Leakage in Multimodal Safety.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Xuhao Hu
+title: 'VLSBench: Unveiling Visual Leakage in Multimodal Safety'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.19939
+summary: This paper introduces VLSBench, a new benchmark for evaluating the safety of large language models in multimodal scenarios. The benchmark is designed to prevent visual safety leakage from images to textual queries, and it poses a significant challenge to both open-source and closed-source models. The study demonstrates that textual alignment is enough for multimodal safety scenarios with visual safety leakage, while multimodal alignment is a more promising solution for scenarios without leakage....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...3 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models.yaml b/...3 VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Byung-Kwan Lee
+title: 'VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01822
+summary: We present VLsI, a new family of VLMs that focuses on efficiency while maintaining accuracy. VLsI uses a unique layer-wise distillation process with intermediate verbalizers to align smaller VLMs with the reasoning processes of larger VLMs, outperforming GPT-4V on ten vision-language benchmarks without scaling or architectural changes....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-03 World-consistent Video Diffusion with Explicit 3D Modeling.yaml b/current/2024-12-03 World-consistent Video Diffusion with Explicit 3D Modeling.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Qihang Zhang
+title: World-consistent Video Diffusion with Explicit 3D Modeling
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01821
+summary: We present World-consistent Video Diffusion (WVD), a new method that combines explicit 3D supervision with XYZ images and diffusion transformers to generate 3D-consistent videos and images. WVD can handle various tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation, showing strong performance on multiple benchmarks with a single pretrained model....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...sal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.yaml b/...sal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Zeyi Sun
+title: 'X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.01824
+summary: This paper introduces X-Prompt, a model that uses in-context examples to generate images. It can do both tasks it has seen before and tasks it has never seen before. The model is designed to efficiently use information from the examples and to be good at generalizing to new tasks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-03 o1-Coder: an o1 Replication for Coding.yaml b/current/2024-12-03 o1-Coder: an o1 Replication for Coding.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-03"
+author: Yuxiang Zhang
+title: 'o1-Coder: an o1 Replication for Coding'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.00154
+summary: The O1-CODER framework improves coding by using RL and MCTS, training a Test Case Generator, and fine-tuning a policy model to produce pseudocode and full code. It also discusses the challenges and opportunities of using similar models in real-world applications....
+opinion: placeholder
+tags:
+    - ML