Automated report
deep-diver committed Dec 3, 2024
1 parent 80e9a9f commit 9af9477
Showing 27 changed files with 243 additions and 0 deletions.
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
date: "2024-12-02"
author: Xin Yan
title: Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation
thumbnail: ""
link: https://huggingface.co/papers/2412.01316
summary: Presto is a new video diffusion model that creates 15-second videos with a lot of detail and a clear story. It uses a method called Segmented Cross-Attention to help the model understand the story of the video better. Presto also uses a new dataset called LongTake-HD, which has a lot of videos with a clear story. Presto does a better job than other video generation methods at creating videos with a clear story and a lot of detail....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-02"
author: Bin Lin
title: 'Open-Sora Plan: Open-Source Large Video Generation Model'
thumbnail: ""
link: https://huggingface.co/papers/2412.00131
summary: Open-Sora Plan is an open-source project that aims to create a large video generation model using a variety of user inputs. It uses a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and condition controllers. The project also includes efficient training and inference strategies, as well as a data curation pipeline. The project achieves impressive video generation results and is available on GitHub....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-02"
author: Maitreya Patel
title: Steering Rectified Flow Models in the Vector Field for Controlled Image Generation
thumbnail: ""
link: https://huggingface.co/papers/2412.00100
summary: This paper introduces FlowChef, a new method for controlled image generation that uses the vector field of rectified flow models (RFMs) to guide the denoising trajectory. FlowChef is a unified framework that can handle classifier guidance, linear inverse problems, and image editing without the need for extra training, inversion, or intensive backpropagation. It outperforms existing methods in terms of performance, memory, and time requirements, achieving new state-of-the-art results....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-02"
author: Jinyuan Qu
title: 'TAPTRv3: Spatial and Temporal Context Foster Robust Tracking of Any Point in Long Video'
thumbnail: ""
link: https://huggingface.co/papers/2411.18671
summary: TAPTRv3 is an improved version of TAPTRv2, which is a simple DETR-like framework for tracking any point in real-world videos. TAPTRv3 uses spatial and temporal context to improve feature querying and robustness in long videos. It introduces Context-aware Cross-Attention (CCA) for better spatial feature querying and Visibility-aware Long-Temporal Attention (VLTA) for better temporal feature querying. TAPTRv3 outperforms TAPTRv2 and achieves state-of-the-art performance on most challenging dataset...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-02"
author: Ryo Kamoi
title: 'VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information'
thumbnail: ""
link: https://huggingface.co/papers/2412.00947
summary: A new dataset called VisOnlyQA has been introduced to evaluate the visual perception capabilities of Large Vision Language Models (LVLMs) on scientific figures. The dataset highlights that current LVLMs struggle with visual perception tasks, but fine-tuning on synthetic training data and using stronger language models can improve their performance....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-02"
author: Zongjian Li
title: 'WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model'
thumbnail: ""
link: https://huggingface.co/papers/2411.17459
summary: This paper introduces WF-VAE, a new video VAE that uses wavelet transform to encode videos more efficiently and maintain the integrity of the latent space. It demonstrates better performance in metrics such as PSNR and LPIPS, and is 2x faster and uses 4x less memory compared to existing video VAEs....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Yanxi Chen
title: A Simple and Provable Scaling Law for the Test-Time Compute of Large Language Models
thumbnail: ""
link: https://huggingface.co/papers/2411.19477
summary: This paper proposes a two-stage algorithm that uses a language model to generate and compare candidate solutions. The algorithm has a provable scaling law for test-time compute, meaning the failure probability decreases exponentially with the number of candidate solutions and comparison rounds. The algorithm was tested on the MMLU-Pro benchmark and showed promising results, validating the assumptions and the benefits of scaling up the test-time compute....
opinion: placeholder
tags:
- ML
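The generate-then-compare procedure described in the summary above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the knockout-tournament pairing, the tie-breaking rule, the bye for an odd candidate, and the names `knockout_select`, `two_stage`, `generate`, and `compare` are all assumptions made for this example. In practice `generate` and `compare` would each call a language model.

```python
from typing import Callable, List


def knockout_select(
    candidates: List[str],
    compare: Callable[[str, str], str],
    k: int,
) -> str:
    """Pick one candidate via a knockout tournament.

    Each pair is compared k times; the candidate that wins at least
    half of the comparisons advances (ties go to the first candidate,
    which is an arbitrary choice for this sketch).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    pool = list(candidates)
    while len(pool) > 1:
        next_round = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            wins_a = sum(1 for _ in range(k) if compare(a, b) == a)
            next_round.append(a if 2 * wins_a >= k else b)
        if len(pool) % 2 == 1:
            # An unpaired candidate advances without a comparison (a bye).
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]


def two_stage(
    generate: Callable[[], str],
    compare: Callable[[str, str], str],
    n: int,
    k: int,
) -> str:
    """Stage 1: sample n candidates. Stage 2: select one by comparison."""
    candidates = [generate() for _ in range(n)]
    return knockout_select(candidates, compare, k)
```

Raising `n` makes it more likely that at least one correct candidate is generated, and raising `k` makes each pairwise decision more reliable. Those are the two knobs the summary refers to when it mentions scaling the number of candidate solutions and comparison rounds.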
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Hui Ren
title: 'Art-Free Generative Models: Art Creation Without Graphic Art Knowledge'
thumbnail: ""
link: https://huggingface.co/papers/2412.00176
summary: The paper explores the creation of art without prior art knowledge by training a text-to-image generation model without access to art-related content. The authors introduce a method to learn an art adapter using only a few examples of selected artistic styles. The art generated using this method is perceived by users as comparable to art produced by models trained on large, art-rich datasets. Data attribution techniques show how examples from both artistic and non-artistic datasets contributed to the ...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Francesco Taioli
title: 'Collaborative Instance Navigation: Leveraging Agent Self-Dialogue to Minimize User Input'
thumbnail: ""
link: https://huggingface.co/papers/2412.01250
summary: The paper introduces a new task called Collaborative Instance Navigation (CoIN) where an AI agent interacts with a human to navigate to a specific target object. The AI agent uses a Self-Questioner model to initiate a self-dialogue and a Large Language Model (LLM) to determine whether to ask a question to the human, continue or halt navigation. This helps to minimize the amount of input needed from the human and makes the navigation process more efficient and effective....
opinion: placeholder
tags:
- ML
9 changes: 9 additions & 0 deletions current/2024-12-03 Efficient Track Anything.yaml
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Yunyang Xiong
title: Efficient Track Anything
thumbnail: ""
link: https://huggingface.co/papers/2411.18933
summary: EfficientTAMs are lightweight track anything models that produce high-quality results with low latency and small model size. They use a nonhierarchical Vision Transformer as the image encoder and an efficient memory module for video object segmentation. EfficientTAMs perform comparably to the SAM 2 model while running faster with fewer parameters, and can run on mobile devices for on-device video object segmentation applications....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Thilini Wijesiriwardene
title: Exploring the Abilities of Large Language Models to Solve Proportional Analogies via Knowledge-Enhanced Prompting
thumbnail: ""
link: https://huggingface.co/papers/2412.00869
summary: This paper explores the ability of large language models to solve proportional analogies by introducing a 15K MCQA dataset and evaluating their performance in different knowledge-enhanced prompt settings. The best model achieved an accuracy of 55%, and targeted knowledge was found to be the most helpful....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Taekyung Ki
title: 'FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait'
thumbnail: ""
link: https://huggingface.co/papers/2412.01064
summary: A new method called FLOAT is introduced to create more realistic talking portrait videos by using a generative model that focuses on motion instead of individual pixels. This method is faster and more consistent than previous ones, and it can also make the person in the video look happier or sadder depending on the speech....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Pengfei Zhou
title: 'GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation'
thumbnail: ""
link: https://huggingface.co/papers/2411.18499
summary: This paper introduces a new benchmark called OpenING for evaluating the performance of models that can generate interleaved image-text content. This benchmark includes 5,400 high-quality human-annotated instances across 56 real-world tasks and covers a wide range of daily scenarios. The paper also presents a judge model called IntJudge that can evaluate open-ended multimodal generation methods. Experiments on OpenING reveal that current interleaved generation methods still have room for improvem...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Angelika Romanou
title: 'INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge'
thumbnail: ""
link: https://huggingface.co/papers/2411.19799
summary: The paper introduces INCLUDE, a comprehensive evaluation suite of QA pairs from local exam sources, measuring the capabilities of multilingual LLMs in regional contexts across 44 written languages. This benchmark focuses on knowledge and reasoning to evaluate the performance of multilingual LLMs in real-world language environments....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Nikhil Kumar Koditala
title: Improving speaker verification robustness with synthetic emotional utterances
thumbnail: ""
link: https://huggingface.co/papers/2412.00319
summary: The paper uses CycleGAN, an existing generative framework, as a data augmentation method to synthesize emotional speech for each speaker, which helps improve the accuracy of speaker verification systems. This is useful because existing systems often make mistakes when verifying speakers who are speaking emotionally, and little labeled emotional speech data is available. The approach reduces the error rate by up to 3.64% relative when verifying speakers in emotional speech scenarios....
opinion: placeholder
tags:
- ML
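The "3.64% relative" figure in the summary above is a relative reduction, not a drop of 3.64 percentage points. The short sketch below shows the difference. The function names and the 10.0% baseline are hypothetical, chosen only for illustration; they are not results from the paper.

```python
def apply_relative_reduction(baseline_pct: float, relative_pct: float) -> float:
    """Return the error rate after a relative reduction.

    baseline_pct: baseline error rate in percent (hypothetical input).
    relative_pct: relative reduction in percent of the baseline.
    """
    return baseline_pct * (1.0 - relative_pct / 100.0)


def relative_reduction(baseline_pct: float, improved_pct: float) -> float:
    """Return the relative reduction, in percent of the baseline."""
    return (baseline_pct - improved_pct) / baseline_pct * 100.0


# Hypothetical baseline of 10.0% error with a 3.64% relative reduction:
# the new rate is roughly 9.64%, not 10.0 - 3.64 = 6.36%.
improved = apply_relative_reduction(10.0, 3.64)
```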
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Meng Cao
title: 'PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos'
thumbnail: ""
link: https://huggingface.co/papers/2412.01800
summary: This paper introduces PhysGame, a benchmark to evaluate physical commonsense understanding in gameplay videos. The authors find that current video LLMs perform poorly and propose PhysVLM, a physical knowledge-enhanced video LLM, to improve performance....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Jianping Jiang
title: 'SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters'
thumbnail: ""
link: https://huggingface.co/papers/2412.00174
summary: This paper introduces SOLAMI, a framework to make 3D characters more social by having them respond to human input with speech and movement. It uses a new dataset and a VR interface, and shows that it works better than other methods....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Anton Voronov
title: 'Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis'
thumbnail: ""
link: https://huggingface.co/papers/2412.01819
summary: This work introduces Switti, a scale-wise transformer for text-to-image generation. Switti improves the convergence and overall performance of existing models by making architectural modifications and proposing a non-AR counterpart that is faster and uses less memory. It also reveals that classifier-free guidance at high-resolution scales can be detrimental and disables it to achieve further acceleration and better generation of fine-grained details. Switti outperforms existing text-to-image AR ...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Ruben Ohana
title: 'The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning'
thumbnail: ""
link: https://huggingface.co/papers/2412.00568
summary: 'Machine learning based surrogate models offer researchers powerful tools for accelerating simulation-based workflows. However, as standard datasets in this space often cover small classes of physical behavior, it can be difficult to evaluate the efficacy of new approaches. To address this gap, we introduce the Well: a large-scale collection of datasets containing numerical simulations of a wide variety of spatiotemporal physical systems. The Well draws from domain experts and numerical software ...'
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Gongfan Fang
title: 'TinyFusion: Diffusion Transformers Learned Shallow'
thumbnail: ""
link: https://huggingface.co/papers/2412.01199
summary: TinyFusion is a method that removes unnecessary layers from diffusion transformers to make them more efficient, while still keeping their performance high after fine-tuning. It uses a technique called learnable pruning, which helps the model recover its performance after being made smaller. Experiments show that TinyFusion works well for different types of diffusion transformers and can make them faster and more efficient without sacrificing much performance....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Aditya Narayan Sankaran
title: Towards Cross-Lingual Audio Abuse Detection in Low-Resource Settings with Few-Shot Learning
thumbnail: ""
link: https://huggingface.co/papers/2412.01408
summary: We explore the use of pre-trained audio representations for detecting abusive language in low-resource languages by applying Few-Shot Learning (FSL) with the Model-Agnostic Meta-Learning (MAML) framework. Our approach integrates these representations within the MAML framework to classify abusive language in 10 languages. We experiment with various shot sizes (50-200), evaluating the impact of limited data on performance. Additionally, a feature visualization study was conducted to better understand mo...
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Weiming Ren
title: 'VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation'
thumbnail: ""
link: https://huggingface.co/papers/2412.00927
summary: The paper introduces VISTA, a video augmentation framework that synthesizes long-duration and high-resolution video instruction-following pairs from existing video-caption datasets. VISTA improves the performance of large multimodal models on long-video understanding tasks by an average of 3.3% across four benchmarks. It also introduces HRVideoBench, a high-resolution video understanding benchmark, on which the models achieve a 6.5% performance gain....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Xuhao Hu
title: 'VLSBench: Unveiling Visual Leakage in Multimodal Safety'
thumbnail: ""
link: https://huggingface.co/papers/2411.19939
summary: This paper introduces VLSBench, a new benchmark for evaluating the safety of multimodal large language models. The benchmark is designed to prevent visual safety leakage from images to textual queries, and it poses a significant challenge to both open-source and closed-source models. The study demonstrates that textual alignment is enough for multimodal safety scenarios with visual safety leakage, while multimodal alignment is a more promising solution for scenarios without leakage....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Byung-Kwan Lee
title: 'VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.01822
summary: We present VLsI, a new family of VLMs that focuses on efficiency while maintaining accuracy. VLsI uses a unique layer-wise distillation process with intermediate verbalizers to align smaller VLMs with the reasoning processes of larger VLMs, outperforming GPT-4V on ten vision-language benchmarks without scaling or architectural changes....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Qihang Zhang
title: World-consistent Video Diffusion with Explicit 3D Modeling
thumbnail: ""
link: https://huggingface.co/papers/2412.01821
summary: We present World-consistent Video Diffusion (WVD), a new method that combines explicit 3D supervision with XYZ images and diffusion transformers to generate 3D-consistent videos and images. WVD can handle various tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation, showing strong performance on multiple benchmarks with a single pretrained model....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Zeyi Sun
title: 'X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models'
thumbnail: ""
link: https://huggingface.co/papers/2412.01824
summary: This paper introduces X-Prompt, an auto-regressive vision language model that uses in-context examples to generate images. It handles both tasks seen during training and unseen tasks. The model is designed to use information from the in-context examples efficiently and to generalize to new tasks....
opinion: placeholder
tags:
- ML
@@ -0,0 +1,9 @@
date: "2024-12-03"
author: Yuxiang Zhang
title: 'o1-Coder: an o1 Replication for Coding'
thumbnail: ""
link: https://huggingface.co/papers/2412.00154
summary: The O1-CODER framework improves coding by using reinforcement learning (RL) and Monte Carlo Tree Search (MCTS), training a Test Case Generator, and fine-tuning a policy model to produce pseudocode and full code. It also discusses the challenges and opportunities of using similar models in real-world applications....
opinion: placeholder
tags:
- ML
