Automated report

deep-diver · Dec 9, 2024 · 1d9bed9
1 parent c7c11af
commit 1d9bed9
Showing 14 changed files with 126 additions and 0 deletions.
diff --git a/...an Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction.yaml b/...an Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Wanting Zhang
+title: '2DGS-Room: Seed-Guided 2D Gaussian Splatting with Geometric Constrains for High-Fidelity Indoor Scene Reconstruction'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.03428
+summary: The paper proposes a new method called 2DGS-Room that uses 2D Gaussian Splatting to create detailed indoor scenes. It uses seed points to control the distribution of the Gaussians and adds depth and normal information to make the scenes more accurate. The method also uses multi-view consistency to reduce artifacts and improve the overall quality of the reconstructions. Experiments show that this method performs better than existing ones for indoor scene reconstruction....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-08 APOLLO: SGD-like Memory, AdamW-level Performance.yaml b/current/2024-12-08 APOLLO: SGD-like Memory, AdamW-level Performance.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Hanqing Zhu
+title: 'APOLLO: SGD-like Memory, AdamW-level Performance'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.05270
+summary: APOLLO is a new optimizer that achieves AdamW-level performance while using less memory, allowing for faster training and larger batch sizes on less powerful GPUs. It does this by approximating learning rate scaling using an auxiliary low-rank optimizer state based on random projection, which makes it highly tolerant to further memory reductions. The APOLLO series performs on-par with or better than AdamW, and provides significant system-level benefits, including enhanced throughput, improved mo...
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-08 EXAONE 3.5: Series of Large Language Models for Real-world Use Cases.yaml b/current/2024-12-08 EXAONE 3.5: Series of Large Language Models for Real-world Use Cases.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: LG AI Research
+title: 'EXAONE 3.5: Series of Large Language Models for Real-world Use Cases'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04862
+summary: EXAONE 3.5 is a series of large language models designed for real-world applications. They have exceptional instruction following capabilities, outstanding long-context comprehension, and competitive results compared to other models of similar sizes. They're available for research purposes and can be downloaded from https://huggingface.co/LGAI-EXAONE. For commercial use, contact LG AI Research at [email protected]....
+opinion: placeholder
+tags:
+    - ML
diff --git a/... Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.yaml b/... Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Zhe Chen
+title: Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
+thumbnail: ""
+link: https://huggingface.co/papers/2412.05271
+summary: InternVL 2.5 is an advanced multimodal large language model that improves upon InternVL 2.0 by scaling up model, data, and test-time configurations. It performs well on various benchmarks and is the first open-source model to surpass 70% on the MMMU benchmark. The model is available for the open-source community to use and build upon....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...-12-08 GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration.yaml b/...-12-08 GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Kaiyi Huang
+title: 'GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04440
+summary: 'GenMAC is a text-to-video generation model that uses multiple agents to collaborate and create complex scenes based on text prompts. It has three stages: design, generation, and redesign, with an iterative loop between the last two to correct and improve the generated video. The redesign stage uses four agents to verify the video, suggest corrections, and redesign the text prompt and layout. GenMAC also has a self-routing mechanism to choose the best correction agent for each scenario....'
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-08 LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment.yaml b/current/2024-12-08 LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Yibin Wang
+title: 'LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04814
+summary: The paper proposes a new way to improve text-to-video models by using human feedback. They created a dataset of human ratings and used it to train a model that can predict how well a video matches a text description. They then used this model to improve the text-to-video model by making it more likely to generate videos that match the text description....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...24-12-08 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale.yaml b/...24-12-08 MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Jarvis Guo
+title: 'MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.05237
+summary: This paper introduces a new method to create a large-scale multimodal instruction-tuning dataset with detailed rationales, which improves the reasoning capabilities of open-source multimodal large language models. The dataset is created using only open models and contains 12M instruction-response pairs. Experiments show that training MLLMs on this dataset improves performance on reasoning-intensive tasks and even non-reasoning-based benchmarks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...2024-12-08 Moto: Latent Motion Token as the Bridging Language for Robot Manipulation.yaml b/...2024-12-08 Moto: Latent Motion Token as the Bridging Language for Robot Manipulation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Yi Chen
+title: 'Moto: Latent Motion Token as the Bridging Language for Robot Manipulation'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04445
+summary: This paper proposes Moto, a method to use video data to improve robot learning by converting video content into latent Motion Token sequences and pre-training a model called Moto-GPT. The paper also introduces a co-fine-tuning strategy to transfer learned motion priors to real robot actions, and experiments show that the fine-tuned Moto-GPT is effective in improving robot manipulation tasks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...024-12-08 SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion.yaml b/...024-12-08 SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-08"
+author: Trong-Tung Nguyen
+title: 'SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04301
+summary: This paper introduces SwiftEdit, a fast and efficient tool for text-guided image editing. It uses a one-step inversion framework and a mask-guided editing technique to achieve instant results, which is at least 50 times faster than previous methods while maintaining similar editing quality. The project page is at https://swift-edit.github.io/....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...24-12-09 CompCap: Improving Multimodal Large Language Models with Composite Captions.yaml b/...24-12-09 CompCap: Improving Multimodal Large Language Models with Composite Captions.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-09"
+author: Xiaohui Chen
+title: 'CompCap: Improving Multimodal Large Language Models with Composite Captions'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.05243
+summary: A new framework called CompCap is introduced to improve the understanding of composite images by multimodal large language models. This is done by creating a dataset of 118K image-caption pairs and fine-tuning the models with this data, leading to improved performance on various benchmarks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...t/2024-12-09 DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling.yaml b/...t/2024-12-09 DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-09"
+author: Minzheng Wang
+title: 'DEMO: Reframing Dialogue Interaction with Fine-grained Element Modeling'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04905
+summary: The paper introduces a new research task called Dialogue Element MOdeling and a benchmark called DEMO to improve dialogue generation and assessment. They also build an agent that can model dialogue elements. Experiments show that existing language models can be improved and their agent performs well in both in-domain and out-of-domain tasks....
+opinion: placeholder
+tags:
+    - ML
diff --git a/current/2024-12-09 Mind the Time: Temporally-Controlled Multi-Event Video Generation.yaml b/current/2024-12-09 Mind the Time: Temporally-Controlled Multi-Event Video Generation.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-09"
+author: Ziyi Wu
+title: 'Mind the Time: Temporally-Controlled Multi-Event Video Generation'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.05263
+summary: The paper introduces MinT, a video generator that can create sequences of events with precise timing control. It binds each event to a specific time period and uses a time-based positional encoding method to guide the cross-attention operation. This results in coherent videos with smoothly connected events, and it's the first model in the literature to offer control over event timing. MinT outperforms existing open-source models by a large margin....
+opinion: placeholder
+tags:
+    - ML
diff --git a/...-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction.yaml b/...-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-09"
+author: Jixuan Fan
+title: 'Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04887
+summary: Momentum-GS is a new method that improves the accuracy of large-scale scene reconstruction by using a moving average of a teacher Gaussian decoder to provide guidance to each block during training. It also dynamically adjusts the weight of each block based on its accuracy. This method outperforms existing techniques by 12.8% in LPIPS and requires fewer divided blocks....
+opinion: placeholder
+tags:
+    - ML
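The Momentum-GS summary above describes a teacher decoder maintained as a moving average. A minimal sketch of that general idea follows, using plain Python dictionaries in place of real network parameters. The function name `ema_update`, the parameter names, and the momentum value are hypothetical illustrations; the summary does not state the paper's actual update rule or hyperparameters.

```python
def ema_update(teacher, student, momentum=0.9):
    """Blend the student's parameters into the teacher in place.

    Generic exponential moving average; the momentum value is an
    assumption and is not taken from the Momentum-GS paper.
    """
    for name, value in student.items():
        teacher[name] = momentum * teacher[name] + (1.0 - momentum) * value
    return teacher


# Toy parameters standing in for decoder weights (hypothetical values).
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
ema_update(teacher, student, momentum=0.5)
```

A higher momentum makes the teacher change more slowly, which is the usual reason to use a moving average as a stable source of guidance.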
diff --git a/current/2024-12-09 PanoDreamer: 3D Panorama Synthesis from a Single Image.yaml b/current/2024-12-09 PanoDreamer: 3D Panorama Synthesis from a Single Image.yaml
@@ -0,0 +1,9 @@
+date: "2024-12-09"
+author: Avinash Paliwal
+title: 'PanoDreamer: 3D Panorama Synthesis from a Single Image'
+thumbnail: ""
+link: https://huggingface.co/papers/2412.04827
+summary: PanoDreamer is a new way to make a full 3D scene from one picture, by first guessing what the whole panorama and its depth would be, then filling in the missing parts, and finally turning it into 3D. This is different from other methods that create the scene bit by bit, and PanoDreamer does a better job at making the scene look right and feel whole....
+opinion: placeholder
+tags:
+    - ML
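Every file added in this commit uses the same eight top-level keys, and every `opinion` field is still `placeholder`. A minimal sketch of a check that could flag such entries is shown below. It assumes each file has already been parsed into a Python dict; the function `check_entry` and its rules are hypothetical and are not part of the repository.

```python
# Keys observed in every added file in this diff.
REQUIRED_KEYS = ("date", "author", "title", "thumbnail",
                 "link", "summary", "opinion", "tags")


def check_entry(entry):
    """Return a list of human-readable problems found in one parsed entry."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if entry.get("opinion") == "placeholder":
        problems.append("opinion is still a placeholder")
    if not str(entry.get("link", "")).startswith("https://huggingface.co/papers/"):
        problems.append("link does not point at huggingface.co/papers")
    return problems


# Entry copied from the last file in the diff; the summary is abbreviated here.
entry = {
    "date": "2024-12-09",
    "author": "Avinash Paliwal",
    "title": "PanoDreamer: 3D Panorama Synthesis from a Single Image",
    "thumbnail": "",
    "link": "https://huggingface.co/papers/2412.04827",
    "summary": "...",
    "opinion": "placeholder",
    "tags": ["ML"],
}
problems = check_entry(entry)
```

Whether a placeholder opinion should count as a problem is a judgment call; here it is only reported so that it can be followed up.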