DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation; BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions; Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs; Evaluating D-MERIT of Partial-annotation on Information Retrieval; Long Context Transfer from Language to Vision...
Jun 27, 2024•11 min•Ep. 55
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs; Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges; Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task; Towards Retrieval Augmented Generation over Large Video Libraries; Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models...
Jun 26, 2024•10 min•Ep. 54
XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning; Make It Count: Text-to-Image Generation with an Accurate Number of Objects; ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation; Needle In A Multimodal Haystack; BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack...
Jun 21, 2024•11 min•Ep. 53
Depth Anything V2; An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels; Transformers meet Neural Algorithmic Reasoners; Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling; OpenVLA: An Open-Source Vision-Language-Action Model; Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models...
Jun 20, 2024•13 min•Ep. 52
NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing; MotionClone: Training-Free Motion Cloning for Controllable Video Generation; What If We Recaption Billions of Web Images with LLaMA-3?; Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing; PowerInfer-2: Fast Large Language Model Inference on a Smartphone...
Jun 16, 2024•12 min•Ep. 51
An Image is Worth 32 Tokens for Reconstruction and Generation; McEval: Massively Multilingual Code Evaluation; Zero-shot Image Editing with Reference Imitation; The Prompt Report: A Systematic Survey of Prompting Techniques; TextGrad: Automatic "Differentiation" via Text...
Jun 15, 2024•11 min•Ep. 50
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation; Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning; Vript: A Video Is Worth Thousands of Words; Lighting Every Darkness with 3DGS: Fast Training and Real-Time Rendering for HDR View Synthesis; VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers...
Jun 14, 2024•11 min•Ep. 49
Mixture-of-Agents Enhances Large Language Model Capabilities; WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild; CRAG -- Comprehensive RAG Benchmark; GenAI Arena: An Open Evaluation Platform for Generative Models; Large Language Model Confidence Estimation via Black-Box Access...
Jun 12, 2024•11 min•Ep. 48
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions; BitsFusion: 1.99 bits Weight Quantization of Diffusion Model; Step-aware Preference Optimization: Aligning Preference with Denoising Performance at Each Step; Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models; SF-V: Single Forward Video Generation Model...
Jun 11, 2024•12 min•Ep. 47
Apple announced new Siri features and Apple Intelligence today. Interestingly, Apple had already released a paper, titled "Ferret-UI," on how it all works. The paper describes a multimodal vision-language model capable of understanding widgets, icons, and text on an iOS mobile screen, and of reasoning about their spatial relationships and functional meanings. https://arxiv.org/abs/2404.05719
Jun 10, 2024•2 min•Ep. 46
Block Transformer: Global-to-Local Language Modeling for Fast Inference; Parrot: Multilingual Visual Instruction Tuning; Mobile-Agent-v2: Mobile Device Operation Assistant with Effective Navigation via Multi-Agent Collaboration; Ouroboros3D: Image-to-3D Generation via 3D-aware Recursive Diffusion; LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes...
Jun 10, 2024•11 min•Ep. 45
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models; To Believe or Not to Believe Your LLM; I4VGen: Image as Stepping Stone for Text-to-Video Generation; Self-Improving Robust Preference Optimization; Guiding a Diffusion Model with a Bad Version of Itself...
Jun 07, 2024•11 min•Ep. 44
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark; Learning Temporally Consistent Video Depth from Video Diffusion Priors; Show, Don't Tell: Aligning Language Models with Demonstrated Feedback; Artificial Generational Intelligence: Cultural Accumulation in Reinforcement Learning; ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation...
Jun 06, 2024•12 min•Ep. 43
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality; Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis; Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models; Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling; 4Diffusion: Multi-view Video Diffusion Model for 4D Generation...
Jun 05, 2024•10 min•Ep. 42
AI Papers Podcast for 06/04/2024: DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation; GECO: Generative Image-to-3D within a SECOnd; PLA4D: Pixel-Level Alignments for Text-to-4D Gaussian Splatting; DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories; Parrot: Efficient Serving of LLM-based Applications with Semantic Variable...
Jun 04, 2024•11 min•Ep. 41
AI Papers Podcast for 06/03/2024: Jina CLIP: Your CLIP Model Is Also Your Text Retriever; Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts; MotionLLM: Understanding Human Behaviors from Human Motions and Videos; Xwin-LM: Strong and Scalable Alignment Practice for LLMs; MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model...
Jun 03, 2024•11 min•Ep. 40
MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series; T2V-Turbo: Breaking the Quality Bottleneck of Video Consistency Model with Mixed Reward Feedback; LLMs achieve adult human performance on higher-order theory of mind tasks; Nearest Neighbor Speculative Decoding for LLM Generation and Attribution; Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities...
May 31, 2024•9 min•Ep. 39
Phased Consistency Model; 2BP: 2-Stage Backpropagation; GFlow: Recovering 4D World from Monocular Video; Instruct-MusicGen: Unlocking Text-to-Music Editing for Music Language Models via Instruction Tuning; LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models...
May 30, 2024•8 min•Ep. 38
An Introduction to Vision-Language Modeling; Transformers Can Do Arithmetic with the Right Embeddings; Matryoshka Multimodal Models; I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models; Zamba: A Compact 7B SSM Hybrid Model; Looking Backward: Streaming Video-to-Video Translation with Feature Banks...
May 29, 2024•10 min•Ep. 37
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models; Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models; Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization; Aya 23: Open Weight Releases to Further Multilingual Progress; Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training; AutoCoder: Enhancing Code Large Language Model with AIEV-Instruct...
May 28, 2024•11 min•Ep. 36
May 26, 2024•11 min•Ep. 35
ReVideo: Remake a Video with Motion and Content Control; Not All Language Model Features Are Linear; RectifID: Personalizing Rectified Flow with Anchored Classifier Guidance; Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation; DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data; Dense Connector for MLLMs...
May 24, 2024•11 min•Ep. 34
Your Transformer is Secretly Linear; Diffusion for World Modeling: Visual Details Matter in Atari; Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control; Reducing Transformer Key-Value Cache Size with Cross-Layer Attention; OmniGlue: Generalizable Feature Matching with Foundation Model Guidance; Personalized Residuals for Concept-Driven Text-to-Image Generation...
May 23, 2024•10 min•Ep. 33
FIFO-Diffusion: Generating Infinite Videos from Text without Training; MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning; OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework; Imp: Highly Capable Large Multimodal Models for Mobile Devices; Octo: An Open-Source Generalist Robot Policy; Towards Modular LLMs by Building and Reusing a Library of LoRAs...
May 22, 2024•9 min•Ep. 32
INDUS: Effective and Efficient Language Models for Scientific Applications; Observational Scaling Laws and the Predictability of Language Model Performance; Grounded 3D-LLM with Referent Tokens; Layer-Condensed KV Cache for Efficient Inference of Large Language Models; Dynamic data sampler for cross-language transfer learning in large language models...
May 21, 2024•10 min•Ep. 31
Chameleon: Mixed-Modal Early-Fusion Foundation Models; LoRA Learns Less and Forgets Less; Many-Shot In-Context Learning in Multimodal Foundation Models; CAT3D: Create Anything in 3D with Multi-View Diffusion Models; Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection; Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode...
May 18, 2024•11 min•Ep. 30
ALPINE: Unveiling the Planning Capability of Autoregressive Learning in Language Models; Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model; BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation; Naturalistic Music Decoding from EEG Data via Latent Diffusion Models; No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding...
May 17, 2024•9 min•Ep. 29
VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models; Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory; Coin3D: Controllable and Interactive 3D Assets Generation with Proxy-Guided Conditioning; Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding; Compositional Text-to-Image Generation with Dense Blob Representations...
May 16, 2024•9 min•Ep. 28
What matters when building vision-language models?; RLHF Workflow: From Reward Modeling to Online RLHF; SUTRA: Scalable Multilingual Language Model Architecture; SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts; Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots...
May 15, 2024•10 min•Ep. 27
BlenderAlchemy: Editing 3D Graphics with Vision-Language Models; Stylus: Automatic Adapter Selection for Diffusion Models; Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations; DressCode: Autoregressively Sewing and Generating Garments from Text Guidance; PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning...
May 14, 2024•10 min•Ep. 26