MotionLCM: Real-time Controllable Motion Generation via Latent Consistency Model; Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation; GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting; SAGS: Structure-Aware 3D Gaussian Splatting; Invisible Stitch: Generating Smooth 3D Scenes with Depth Inpainting...
May 09, 2024•12 min•Ep. 25
KAN: Kolmogorov-Arnold Networks; InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation; Better & Faster Large Language Models via Multi-token Prediction; Iterative Reasoning Preference Optimization; Extending Llama-3's Context Ten-Fold Overnight...
May 08, 2024•11 min•Ep. 24
Paint by Inpaint: Learning to Add Image Objects by Removing Them First; Self-Play Preference Optimization for Language Model Alignment; Automatic Creative Selection with Cross-Modal Matching; STT: Stateful Tracking with Transformers for Autonomous Driving; Octopus v4: Graph of language models...
May 07, 2024•12 min•Ep. 23
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models; WildChat: 1M ChatGPT Interaction Logs in the Wild; StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation; LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report; LLM-AD: Large Language Model based Audio Description System...
May 03, 2024•12 min•Ep. 22
Is Bigger Edit Batch Size Always Better? -- An Empirical Study on Model Editing with Llama-3; A Careful Examination of Large Language Model Performance on Grade School Arithmetic; Spectrally Pruned Gaussian Fields with Neural Compensation; SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound; Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge...
May 02, 2024•12 min•Ep. 21
Octopus v4: Graph of language models; InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation; Better & Faster Large Language Models via Multi-token Prediction; GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting; Iterative Reasoning Preference Optimization...
May 01, 2024•12 min•Ep. 20
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models; LEGENT: Open Platform for Embodied Agents; Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations; Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting; BlenderAlchemy: Editing 3D Graphics with Vision-Language Models...
Apr 30, 2024•11 min•Ep. 19
AI Papers Podcast for 04/29/2024: PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning; AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs; HaLo-NeRF: Learning Geometry-Guided Semantics for Exploring Unconstrained Photo Collections; MaPa: Text-driven Photorealistic Material Painting for 3D Shapes...
Apr 29, 2024•9 min•Ep. 18
AI Papers Podcast for 04/26/2024: How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites; Interactive3D: Create What You Want by Interactive 3D Generation; Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding; Tele-FLM Technical Report; SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension...
Apr 26, 2024•12 min•Ep. 17
AI Papers Podcast for 04/25/2024: Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis; A Multimodal Automated Interpretability Agent; SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation; MultiBooth: Towards Generating All Your Concepts in an Image from Text; Learning H-Infinity Locomotion Control...
Apr 25, 2024•11 min•Ep. 16
AI Papers Podcast for 04/24/2024: OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework; Multi-Head Mixture-of-Experts; Pegasus-v1 Technical Report; Align Your Steps: Optimizing Sampling Schedules in Diffusion Models; SnapKV: LLM Knows What You are Looking for Before Generation...
Apr 24, 2024•11 min•Ep. 15
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone; The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions; FlowMind: Automatic Workflow Generation with LLMs; Music Consistency Models; How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study...
Apr 23, 2024•12 min•Ep. 14
AI Papers Podcast for 04/23/2024: PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation; TextSquare: Scaling up Text-Centric Visual Instruction Tuning; Does Gaussian Splatting need SFM Initialization?; How Far Can We Go with Practical Function-Level Program Repair?; AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation...
Apr 22, 2024•12 min•Ep. 13
AI Papers Podcast for 04/21/2024: Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model; HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing; Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization; TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models; On Speculative Decoding for Multimodal Large Language Models...
Apr 21, 2024•12 min•Ep. 12
AI Papers Podcast for 04/20/2024: Dynamic Typography: Bringing Words to Life; Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing; MeshLRM: Large Reconstruction Model for High-Quality Mesh; MoA: Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation; EdgeFusion: On-Device Text-to-Image Generation...
Apr 20, 2024•11 min•Ep. 11
AI Papers Podcast for 04/19/2024: BLINK: Multimodal Large Language Models Can See but Not Perceive; Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models; AniClipart: Clipart Animation with Text-to-Video Priors; TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding; OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data...
Apr 19, 2024•12 min•Ep. 10
AI Papers Podcast for 04/19/2024: Meta releases Llama 3, claims it's among the best open models available
Apr 18, 2024•2 min•Ep. 9
AI Papers Podcast for 04/18/2024: Long-form music generation with latent diffusion; Scaling Instructable Agents Across Many Simulated Worlds
Apr 18, 2024•6 min•Ep. 8
AI Papers Podcast for 04/17/2024: Learn Your Reference Model for Real Good Alignment; Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length; TransformerFAM: Feedback attention is working memory; Compression Represents Intelligence Linearly; Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video...
Apr 17, 2024•12 min•Ep. 7
AI Papers Podcast for 04/16/2024: Octopus v2: On-device language model for super agent; Advancing LLM Reasoning Generalists with Preference Trees; Long-context LLMs Struggle with Long In-context Learning; LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model; Bigger is not Always Better: Scaling Properties of Latent Diffusion Models...
Apr 16, 2024•12 min•Ep. 6
AI Papers Podcast for 04/15/2024: COCONut: Modernizing COCO Segmentation; Scaling (Down) CLIP: A Comprehensive Analysis of Data, Architecture, and Training Strategies; On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation; Pre-training Small Base LMs with Fewer Tokens; Probing the 3D Awareness of Visual Foundation Models...
Apr 15, 2024•11 min•Ep. 5
AI Papers Podcast for 04/14/2024: Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences: https://arxiv.org/abs/2404.03715; No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance: https://arxiv.org/abs/2404.04125; AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent: https://arxiv.org/abs/2404.03648; Stream of Search (SoS): Learning to Search in Language: https://arxiv.org/ab...
Apr 14, 2024•11 min•Ep. 4
AI Papers Podcast for 04/13/2024: OmniFusion Technical Report: https://arxiv.org/abs/2404.06212; LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders: https://arxiv.org/abs/2404.05961; InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD: https://arxiv.org/abs/2404.06512; Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence: https://arxiv.org/abs/2404.05892; MiniCPM: Unveiling the Potential of Small Language M...
Apr 13, 2024•13 min•Ep. 3
AI Papers Podcast for 04/12/2024: RecurrentGemma: Moving Past Transformers for Efficient Open Language Models: https://arxiv.org/abs/2404.07839; WILBUR: Adaptive In-Context Learning for Robust and Accurate Web Agents: https://arxiv.org/abs/2404.05902; Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models: https://arxiv.org/abs/2404.07973; Best Practices and Lessons Learned on Synthetic Data for Language Models: https://arxiv.org/abs/2404.07503; HGRN2: Gated Linear RNNs wi...
Apr 12, 2024•11 min•Ep. 2
AI Papers Podcast for 04/12/2024: ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback: https://arxiv.org/abs/2404.07987; OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments: https://arxiv.org/abs/2404.07972; Rho-1: Not All Tokens Are What You Need: https://arxiv.org/abs/2404.07965; JetMoE: Reaching Llama2 Performance with 0.1M Dollars: https://arxiv.org/abs/2404.07413; Transferable and Principled Efficiency for Open-Vocabulary Segmen...
Apr 12, 2024•11 min•Ep. 1