Top AI developers by monthly star count
Top AI organization accounts by AI repo star count
Top AI projects by category star count
Fastest-growing projects by speed of gaining stars
Little-known developers who have created influential repos
Projects and developers that remain popular despite not having been updated for a long time
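The "growing speed" ranking above is presumably derived from star-count snapshots taken over time. A minimal sketch of such a metric, assuming per-repo (date, star count) snapshots; the repo names and star counts below are illustrative placeholders, not actual measurements:

```python
from datetime import date

def star_growth_per_day(snapshots):
    """Estimate star-gain speed from (date, star_count) snapshots, oldest first.

    Returns the average number of stars gained per day over the window.
    """
    (d0, s0), (d1, s1) = snapshots[0], snapshots[-1]
    days = max((d1 - d0).days, 1)  # avoid division by zero for same-day data
    return (s1 - s0) / days

# Hypothetical snapshots for two repos (values are made up for illustration).
repos = {
    "UI-TARS-desktop": [(date(2025, 6, 1), 12000), (date(2025, 7, 1), 15000)],
    "VLM-R1": [(date(2025, 6, 1), 4800), (date(2025, 7, 1), 5300)],
}

# Rank repos by growth speed, fastest first.
ranked = sorted(repos, key=lambda r: star_growth_per_day(repos[r]), reverse=True)
print(ranked)  # → ['UI-TARS-desktop', 'VLM-R1']
```

In practice the snapshots would come from periodically polling the GitHub API's `stargazers_count` field; the ranking itself is then just a sort on this rate.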
Ranking | Project | Project intro | Star count |
---|---|---|---|
1 | UI-TARS-desktop | The Open All-in-One Multimodal AI Agent Stack connecting Cutting-edge AI Models and Agent Infra. | 15.0K | |
2 | VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 5.3K | |
3 | SpatialLM | SpatialLM: Training Large Language Models for Structured Indoor Modeling | 3.5K | |
4 | MiniMax-01 | The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention | 3.0K | |
5 | Skywork-R1V | Skywork-R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | 2.6K | |
6 | Local-File-Organizer | An AI-powered file management tool that preserves privacy by organizing local text and image files. Using the Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it scans, restructures, and organizes files for quick, seamless access and easy retrieval. | 2.4K | |
7 | vlms-zero-to-hero | This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models. | 1.0K | |
8 | GLM-4.1V-Thinking | GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning. | 644 | |
9 | VisRAG | Parsing-free RAG supported by VLMs | 611 | |
10 | UniWorld-V1 | UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation | 583 | |
11 | vlmrun-hub | A hub for various industry-specific schemas to be used with VLMs. | 510 | |
12 | llama-assistant | AI-powered assistant to help you with your daily tasks, powered by Llama 3, DeepSeek R1, and many more models on HuggingFace. | 486 | |
13 | LLM-RL-Visualized | 🌟 100+ original LLM/RL principle diagrams 📚, contributed by the author of "大模型算法" (Large Model Algorithms) 🎉 (100+ LLM/RL algorithm maps) | 458 | |
14 | ghostwriter | Use the reMarkable2 as an interface to vision-LLMs (ChatGPT, Claude, Gemini). Ghost in the machine! | 436 | |
15 | Flame-Code-VLM | Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development. | 367 | |
16 | joycaption | JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models. | 349 | |
17 | VoRA | [Fully open] [Encoder-free MLLM] Vision as LoRA | 299 | |
18 | VLM2Vec | This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025] | 287 | |
19 | open-cuak | Reliable Automation Agents at Scale | 279 | |
20 | dingo | Dingo: A Comprehensive AI Data Quality Evaluation Tool | 256 | |
21 | Kolosal | Kolosal AI is an OpenSource and Lightweight alternative to LM Studio to run LLMs 100% offline on your device. | 227 | |
22 | Llama3.2-Vision-Finetune | An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta. | 156 | |
23 | ChatRex | Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | 156 | |
24 | Namo-R1 | A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from scratch with ease. | 133 | |
25 | qapyq | An image viewer and AI-assisted editing/captioning/masking tool that helps with curating datasets for generative AI models, finetunes and LoRA. | 130 | |
26 | video-search-and-summarization | Blueprint for ingesting massive volumes of live or archived video and extracting insights for summarization and interactive Q&A | 130 | |
27 | simlingo | [CVPR 2025, Spotlight] SimLingo (CarLLava): Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | 126 | |
28 | BALROG | Benchmarking Agentic LLM and VLM Reasoning On Games | 117 | |
29 | BreezeApp | BreezeAPP is a purely on-device AI application for Android and iOS. Download it from the App Store and enjoy a range of AI features fully offline. Source code is provided by MediaTek Research. It promotes two ideas: anyone is free to choose and run their own LLM on their phone, and any developer can easily create purely phone-based AI apps. | 110 | |
30 | TrustEval-toolkit | Toolkit for evaluating the trustworthiness of generative foundation models. | 105 | |
31 | Surveillance_Video_Summarizer | VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage. | 102 | |
32 | pyvisionai | The PyVisionAI Official Repo | 97 | |
33 | Modality-Integration-Rate | The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate". | 96 | |
34 | Helpful-Doggybot | Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models | 90 | |
35 | Mini-LLaVA | A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability. | 89 | |
36 | VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | 85 | |
37 | SparseVLMs | Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference". | 77 | |
38 | Awesome-Interleaving-Reasoning | Interleaving Reasoning: Next-Generation Reasoning Systems for AGI | 77 | |
39 | tokens | A token management platform that reverse-engineers the conversation interfaces of ChatGPT, Cursor, Grok, Claude, Windsurf, Gemini, and Sora, converting them into the OpenAI API format. | 76 | |
40 | Mirage | Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025) | 63 | |
41 | 3d-conditioning | Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset. | 61 | |
42 | SeeDo | [IROS 2025] Human Demo Videos to Robot Action Plans | 54 | |
43 | InteractVLM | [CVPR 2025] InteractVLM: 3D Interaction Reasoning from 2D Foundational Models | 51 | |
44 | GVA-Survey | Official repository of the paper "Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms" | 49 | |
45 | ReachQA | Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs" | 48 | |
46 | SeeGround | [CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | 46 | |
47 | vlm-grpo | An implementation of GRPO for Unsloth's VLMs training | 40 | |
48 | all-things-multimodal | Hub for researchers exploring VLMs and Multimodal Learning:) | 40 | |
49 | Emma-X | Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | 39 | |
50 | awesome-turkish-language-models | A curated list of Turkish AI models, datasets, papers | 38 | |
51 | PhysBench | [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding> | 36 | |
52 | reverse_vlm | 🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling" | 34 | |
53 | Video-Bench | Video Generation Benchmark | 32 | |
54 | AIN | AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains. | 31 | |
55 | vision-ai-checkup | Take your LLM to the optometrist. | 31 | |
56 | IR3D-Bench | Official Code of IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | 30 | |
57 | Automodel | Day-0 support for any Hugging Face model leveraging PyTorch native functionalities while providing performance and memory optimized training and inference recipes. | 26 | |
58 | SAM_Molmo_Whisper | An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language. | 23 | |
59 | saint | A training-free approach to accelerate ViTs and VLMs by pruning redundant tokens based on similarity | 22 | |
60 | gptparse | Document parser for RAG | 20 | |
61 | Re-Align | A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models. | 19 | |
62 | bubbaloop | 🦄 Serving Platform for Spatial AI and Robotics. | 19 | |
63 | cadrille | cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning | 19 | |
64 | SubtitleAI | An AI-powered tool for summarizing YouTube videos by generating scene descriptions, translating them, and creating subtitled videos with text-to-speech narration | 17 | |
65 | ScaleDP | ScaleDP is an Open-Source extension of Apache Spark for Document Processing | 13 | |
66 | wildcard | The latest guide to the WildCard virtual credit card: how to register, how to activate a WildCard credit card, and how to top it up and withdraw funds. | 13 | |
67 | srbench | Source code for the Paper "Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models" | 12 | |
68 | CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs | 12 | |
69 | TRIM | We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. | 11 | |
70 | computer-agent-arena-hub | Computer Agent Arena Hub: Compare & Test AI Agents on Crowdsourced Real-World Computer Use Tasks | 11 | |
71 | Cross-the-Gap | [ICLR 2025] - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion | 11 | |
72 | VLM-Safety-MU | Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning | 11 | |
73 | CII-Bench | Can MLLMs Understand the Deep Implication Behind Chinese Images? | 9 | |
74 | sentinel | Securade.ai Sentinel - A monitoring and surveillance application that enables visual Q&A and video captioning for existing CCTV cameras. | 9 | |
75 | EgoNormia | EgoNormia: Benchmarking Physical Social Norm Understanding in VLMs | 9 | |
76 | ImagineFSL | Official implementation of "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" [CVPR 2025 Highlight] | 9 | |
77 | Awesome-HCI-LLM | Awesome-HCI (Ubiquitous, LLM, MLLM, Agent, RAG, Embodied-AI, RLHF) | 9 | |
78 | OptVL | AVL + python + optimization = OptVL | 9 | |
79 | MyColPali | A PyQt6 application using ColPali and OpenAI to demonstrate efficient document retrieval with vision-language models | 8 | |
80 | DASH | DASH: Detection and Assessment of Systematic Hallucinations of VLMs | 8 | |
81 | vlm-api | REST API for computing cross-modal similarity between images and text using the ColPaLI vision-language model | 7 | |
82 | Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | 7 | |
83 | ollama | Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models. | 7 | |
84 | VisPruner | [ICCV 2025] Official code for paper: Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs | 7 | |
85 | CoIN | [ICCV 25] Official repository of "Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues" | 7 | |
86 | VLM-CADFeatureRecognition | This repository provides code and resources for automating manufacturing feature recognition in CAD designs using vision-language models. | 7 | |
87 | ide-cap-chan | ide-cap-chan is a utility for batch image captioning with natural language using various VL models | 6 | |
88 | Geminio | [ICCV 2025] Geminio is a VLM-powered gradient inversion attack in federated learning (FL). It allows the adversary (the FL server) to describe the data of value and reconstruct the victim client's private data matching the description. | 6 | |
89 | RadVLM | A Multitask Conversational Vision-Language Model for Radiology | 6 | |
90 | Dex-GAN-Grasp | DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for Task-Oriented Manipulation (IEEE-RAS Humanoids 2024), DOI: 10.1109/Humanoids58906.2024.10769950 | 5 | |
91 | RoomAligner | A focus on aligning room elements for better flow and space utilization. | 5 | |
92 | google-veo3-from-scratch | An implementation of Google Veo 3, a cutting-edge text-to-video generation system, built from scratch. 🎥 Explore the code to create high-quality videos from text prompts and enhance your projects with advanced AI capabilities. 🌟 | 5 | |
93 | VLM-ZSAD-Paper-Review | Reviews of papers on zero-shot anomaly detection using vision-Language models | 4 | |
94 | Multimodal-VideoRAG | Multimodal-VideoRAG: Using BridgeTower Embeddings and Large Vision Language Models | 4 | |
95 | LLMs-Journey | Various LLM resources and experiments | 4 | |
96 | VLMLight | Official implementation of VLMLight | 4 | |
97 | casp | [CVPR 2025 Highlight] CASP: Compression of Large Multimodal Models Based on Attention Sparsity | 4 | |
98 | ComfyUI-YALLM-node | Yet another set of LLM nodes for ComfyUI (for local/remote OpenAI-like APIs, multi-modal models supported) | 3 | |
99 | CIDER | This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models. | 3 | |
100 | awesome-text-to-video-plus | The Ultimate Guide to Effortlessly Creating AI Videos for Social Media Go From Text to Eye-Catching Videos in Just a Few Steps | 3 |