diff --git a/current/2024-11-18 GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation.yaml b/current/2024-11-18 GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation.yaml
new file mode 100644
index 00000000..4a9b3726
--- /dev/null
+++ b/current/2024-11-18 GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation.yaml
@@ -0,0 +1,9 @@
+date: "2024-11-18"
+author: Yushi Lan
+title: 'GaussianAnything: Interactive Point Cloud Latent Diffusion for 3D Generation'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.08033
+summary: While 3D content generation has advanced significantly, existing methods still face challenges with input formats, latent space design, and output representations. This paper introduces a novel 3D generation framework that addresses these challenges, offering scalable, high-quality 3D generation with an interactive Point Cloud-structured Latent space. The framework employs a Variational Autoencoder (VAE) with multi-view posed RGB-D(epth)-N(ormal) renderings as input, using a unique latent space ...
+opinion: placeholder
+tags:
+  - ML
diff --git a/current/2024-11-18 LLaVA-o1: Let Vision Language Models Reason Step-by-Step.yaml b/current/2024-11-18 LLaVA-o1: Let Vision Language Models Reason Step-by-Step.yaml
new file mode 100644
index 00000000..89b8d873
--- /dev/null
+++ b/current/2024-11-18 LLaVA-o1: Let Vision Language Models Reason Step-by-Step.yaml
@@ -0,0 +1,9 @@
+date: "2024-11-18"
+author: Guowei Xu
+title: 'LLaVA-o1: Let Vision Language Models Reason Step-by-Step'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.10440
+summary: LLaVA-o1 is a new vision-language model that performs step-by-step reasoning on complex tasks. Using a structured approach and a smaller training dataset than comparable models, it still outperforms them on multimodal reasoning benchmarks, aided by a novel inference-time scaling method that improves its reasoning. ...
+opinion: placeholder
+tags:
+  - ML
diff --git a/current/2024-11-18 Number it: Temporal Grounding Videos like Flipping Manga.yaml b/current/2024-11-18 Number it: Temporal Grounding Videos like Flipping Manga.yaml
new file mode 100644
index 00000000..20772fdc
--- /dev/null
+++ b/current/2024-11-18 Number it: Temporal Grounding Videos like Flipping Manga.yaml
@@ -0,0 +1,9 @@
+date: "2024-11-18"
+author: Yongliang Wu
+title: 'Number it: Temporal Grounding Videos like Flipping Manga'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.10332
+summary: This paper introduces Number-Prompt (NumPro), a method that helps Vid-LLMs understand video content by overlaying a unique number on each frame. This makes it easier for Vid-LLMs to locate specific moments in a video, and experiments show it improves performance on video temporal grounding tasks. ...
+opinion: placeholder
+tags:
+  - ML
diff --git a/current/2024-11-18 Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement.yaml b/current/2024-11-18 Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement.yaml
new file mode 100644
index 00000000..97478d84
--- /dev/null
+++ b/current/2024-11-18 Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement.yaml
@@ -0,0 +1,9 @@
+date: "2024-11-18"
+author: Zhennan Chen
+title: Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
+thumbnail: ""
+link: https://huggingface.co/papers/2411.06558
+summary: This paper presents RAG, a method for generating images from regional descriptions. It decouples multi-region generation into two sub-tasks and lets users modify specific unsatisfactory regions of the previous generation without relying on additional inpainting models. RAG is tuning-free and can be applied to other frameworks to enhance their prompt-following ability. ...
+opinion: placeholder
+tags:
+  - ML
diff --git a/current/2024-11-18 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.yaml b/current/2024-11-18 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.yaml
new file mode 100644
index 00000000..3938fbb9
--- /dev/null
+++ b/current/2024-11-18 The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use.yaml
@@ -0,0 +1,9 @@
+date: "2024-11-18"
+author: Siyuan Hu
+title: 'The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.10323
+summary: Claude 3.5 Computer Use is the first AI model to offer computer use through a graphical user interface (GUI) in public beta. This case study explores its capabilities and limitations by designing representative tasks and providing an agent framework for deployment. The study aims to showcase its abilities and inspire future research into GUI agents. ...
+opinion: placeholder
+tags:
+  - ML
diff --git a/current/2024-11-18 Xmodel-1.5: An 1B-scale Multilingual LLM.yaml b/current/2024-11-18 Xmodel-1.5: An 1B-scale Multilingual LLM.yaml
new file mode 100644
index 00000000..1c4285fd
--- /dev/null
+++ b/current/2024-11-18 Xmodel-1.5: An 1B-scale Multilingual LLM.yaml
@@ -0,0 +1,9 @@
+date: "2024-11-18"
+author: Wang Qun
+title: 'Xmodel-1.5: An 1B-scale Multilingual LLM'
+thumbnail: ""
+link: https://huggingface.co/papers/2411.10083
+summary: Xmodel-1.5 is a new 1-billion-parameter multilingual large language model that performs strongly across Thai, Arabic, French, Chinese, and English, supporting tasks such as question answering. The authors also release a new evaluation set for Thai language understanding, and they plan to keep improving the model in the hope of advancing multilingual-understanding research. ...
+opinion: placeholder
+tags:
+  - ML