- Top AI developers, ranked by monthly star count
- Top AI organization accounts, ranked by AI repo star count
- Top AI projects by category, ranked by star count
- Fastest-growing projects, ranked by the speed of gaining stars
- Influential repos from little-known developers
- Projects and developers that are still thriving despite not being updated for a long time
| Ranking | Repository | Description | Stars |
|---|---|---|---|
1 | GLM-V | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | 1.5K | |
2 | LLM-RL-Visualized | 100+ original LLM/RL concept diagrams from the author of the book *Large Model Algorithms* (100+ LLM/RL Algorithm Maps) | 1.2K | |
3 | UniPic | Building Kontext Model with Online RL for Unified Multimodal Model | 708 | |
4 | MiMo-VL | MiMo-VL | 467 | |
5 | Mirage | Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025) | 115 | |
6 | DatasetLoom | An intelligent dataset construction and evaluation platform for multimodal LLM training | 101 | |
7 | surfer-h-cli | Run Surfer-H agents powered by Holo1 using the Surfer-H-CLI. Includes example tasks, scripts, and configurations. | 96 | |
8 | InteractVLM | [CVPR 2025] InteractVLM: 3D Interaction Reasoning from 2D Foundational Models | 92 | |
9 | AngelSlim | Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency. | 88 | |
10 | Awesome-Interleaving-Reasoning | Interleaving Reasoning: Next-Generation Reasoning Systems for AGI | 77 | |
11 | vlm-grpo | An implementation of GRPO for Unsloth's VLMs training | 63 | |
12 | Automodel | Fine-tune any Hugging Face LLM or VLM on day-0 using PyTorch-native features for GPU-accelerated distributed training with superior performance and memory efficiency. | 48 | |
13 | reverse_vlm | 🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling" | 39 | |
14 | google-veo3-from-scratch | A Step-by-Step Implementation of Google Veo 3 Architecture from Scratch | 33 | |
15 | vision-ai-checkup | Take your LLM to the optometrist. | 31 | |
16 | IR3D-Bench | Official Code of IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering | 30 | |
17 | cadrille | cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning | 19 | |
18 | sora-json-prompt-crafter | A vibe coded Sora JSON Prompt Crafter for curious humans and prompt engineers | 14 | |
19 | VRBench | [ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos | 14 | |
20 | CAD-GPT | [AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs | 12 | |
21 | Qwen2.5-VL-Batched | A batched implementation for efficient Qwen2.5-VL inference. | 12 | |
22 | BusterX | BusterX and BusterX++ | 12 | |
23 | VLMLight | Official implementation of VLMLight | 9 | |
24 | kolosal-cli | Super lightweight Ollama alternative to run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1 and other large language models. | 9 | |
25 | VisPruner | [ICCV 2025] Official code for paper: Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs | 7 | |
26 | google-veo3-from-scratch | An implementation of Google Veo 3, a cutting-edge text-to-video generation system; explore the code to create high-quality videos from text prompts | 5 | |
27 | merit | Official repo for the paper "MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query" | 5 | |
28 | multimodal-pretraining-pmi | Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models | 5 | |
29 | Awesome-LLM-reasoning-papers | This repository offers a well-organized collection of resources focused on reasoning in Large Language Models (LLMs). Explore foundational papers, evaluation benchmarks, and practical tools to enhance your understanding of LLM reasoning. 🐙🌐 | 4 | |
30 | Qwen2.5-VL-Video-Understanding | The Qwen2.5-VL-7B-Instruct model is a multimodal AI model developed by Alibaba Cloud that excels at understanding both text and images. It's a Vision-Language Model (VLM) designed to handle various visual understanding tasks, including image understanding, video analysis, and even multilingual support. | 4 | |
31 | simpleVLM | building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2 Vision Model. KV-Caching is supported and implemented from scratch as well | 3 | |
32 | UI-TARS-desktop | A GUI Agent application based on UI-TARS(Vision-Language Model) that allows you to control your computer using natural language. | 3 | |
33 | MIRA-Multimodal-Intelligent-Robotic-Assistant | A multimodal LLM system built on the Qwen Agent framework, integrating a JAKA robotic arm, visual detection, speech recognition and synthesis, and an MCP database | 2 | |
34 | vlm4ocr | Python package and Web App for OCR with vision language models. | 2 | |
35 | GeospatialVLM | VLM specially crafted for geospatial reasoning tasks | 2 | |
36 | guide | A 2025 ChatGPT tutorial and best-practices guide covering sign-up and setup, prompt templates, Explorer GPT, and DALL·E/Sora tips, suitable for beginners and advanced users | 2 | |
37 | awesome-turkish-vlm | A curated list of models, datasets and other useful resources for Turkish Vision-Language Models (VLM). | 2 | |
38 | TalkMateAI | 🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync | 2 | |
39 | SpatialFusion-LM | SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding. | 1 | |
40 | bagel | [NVIDIA ONLY] Image generation, image editing, and free-form manipulation with a VLM (minimum 12GB VRAM / 32GB RAM; recommended 24GB VRAM / 48GB RAM) | 1 | |
41 | aeon.ai | AEON is a lightweight, stateless RAG chatbot that answers questions using your Markdown, Text, and JSON documents. It runs locally on your CPU with at least 8GB RAM, leveraging Ollama for LLMs and Chroma as its vector database. | 1 | |
42 | LLaVA-STF | The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models" | 1 | |
43 | Imgscope-OCR-2B-0527 | Imgscope-OCR-2B-0527 is a powerful model designed for messy handwriting recognition and document OCR. It excels in multi-modal tasks, providing users with advanced capabilities for understanding complex visual and textual data. 🐙🌟 | 1 | |
44 | Kimi-VL-Colaboratory-Sample | A sample for trying out Kimi-VL on Google Colaboratory | 1 | |
45 | Stream-Omni | Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations. | 1 | |
46 | vision-token-calculator | 🧮 A calculator for vision tokens in VLMs. | 1 | |
47 | NoteMR | NoteMR enhances multimodal large language models for visual question answering by integrating structured notes. This implementation aims to reduce reasoning errors and improve visual feature perception. 🐙📚 | 1 | |
48 | SSM-As-VLM-Bridge | An exploration into leveraging SSM's as Bridge/Adapter Layers for VLM | 1 | |
49 | calcarine | Desktop VLM: Real-time FastVLM analysis of video & textures with live compute shaders | 1 | |
50 | Caption-Creator | Caption Creator is a fast and portable tool for generating high-quality image captions and tags—ideal for custom dataset creation, especially for (FLUX Dev, Pony, SDXL 1.0 Base, Illustrious), and more. Works seamlessly for both training and image generation. | 1 | |
51 | MoleSearch | A multimodal data retriever covering text, image, video, and audio | 1 | |
52 | crag-mm | CRAG-MM Challenge Solution Code | 1 | |
53 | Zero-shot-s2 | Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models | 1 | |
54 | vlm_instruction_follower | Instruction-following vision-language model (VLM): grounded text instructions executed via multi-modal reasoning | 1 | |
55 | sora-extension | Sora extension support for VS Code | 1 | |
56 | Gemma3_OCR_Text_Extractor_LLM | Gemma-3 OCR exemplifies the confluence of abstruse computer vision and arcane NLP, leveraging Gemma-3 Vision’s neural framework for precise OCR and semantically refined text curation. Powered by Streamlit and Ollama, this hermetic system converts visual data into perspicuous, markdown-rendered output, ensuring maximal accuracy and confidentiality. | 1 | |
57 | moonlabel | Moondream VLM-powered labeler one-click YOLO export | 1 | |
58 | OMGM | OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval (ACL 2025 Main Conference) | 1 | |
59 | AlphaExtract | AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries. | 0 | |
60 | AlphaExtract | AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries. | 0 | |
61 | LLaMA-Factory | Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024) | 0 | |
62 | Agent-S | Agent S: an open agentic framework that uses computers like a human | 0 | |
63 | AICTF_2025 | AI Security in Practice: Analysis of AI CTF Tasks at Positive Hack Days | 0 | |
64 | Multimodal-Product-Intelligence-System | Solution to understand and align product images, text descriptions, and customer reviews to flag issues (e.g., misleading images, product defects, or mismatch between title and image). | 0 | |
65 | Awesome-R1 | A curated list of research papers, models, and resources related to R1-style reasoning models following DeepSeek-R1's breakthrough in January 2025. | 0 | |
66 | SmartGrader | SmartGrader represents a breakthrough in educational technology, combining cutting-edge Vision-Language Models (VLMs) and Large Language Models (LLMs) to revolutionize the assessment of handwritten computer science assignments. | 0 | |
67 | capture-to-vlm | This project provides a comprehensive system for analyzing real-time videos using Vision Language Models (VLM) and generating summaries of the content. The system works in two main phases: real-time frame analysis and post-processing summarization. | 0 | |
68 | In-Context-multiview-img-generation | Generate multiple 2D/4D views of the same object/scene with IC-LoRA and Flux | 0 | |
69 | File-Organizer-Tool | Organize your files effortlessly with the File Organizer Tool, which sorts them into subdirectories based on their prefixes. This versatile tool offers both GUI and command-line interfaces, supporting multiple languages and themes for a personalized experience. 🗂️💻 | 0 | |
70 | Multimodal-RAG | Multimodal RAG using Colsmolvlm in colab free-tier GPU | 0 | |
71 | PaliGemma-Image-Segmentation | An app with FastAPI, Docker, transformers, JAX/Flax for performing image segmentation with PaliGemma 2 mix | 0 | |
72 | FitCheck.AI | 👔 FitCheck.AI is your personal AI stylist. Upload outfits for savage critiques, auto-tag your wardrobe, and get smart recommendations - powered by Streamlit, LangChain, MongoDB, and VLMs like CLIP and Qwen. | 0 | |
73 | aiprof | Software Engineering Team 1 | 0 | |
74 | video_generator_with_sora | Sora Video Generator is a Streamlit app for effortless AI video creation. Just describe your idea in one sentence—no tech skills needed. It uses Azure OpenAI to craft prompts for the Sora API, handling everything from submission to download. | 0 | |
75 | daycare_ollama_analysis | This repo presents codes that allows user to run a pipeline to analyze daycare image using YOLO, Ollama, VLM, and Reasoning LLMs locally | 0 | |
76 | SoraChatGPTDownloader | Download videos from Sora (ChatGPT), written in PHP | 0 | |
77 | VLM-Mamba | Revolutionize vision-language tasks with VLM-Mamba, the first model using State Space Models. Explore innovative multi-modal architecture. 🚀💻 | 0 | |
78 | AutoVisType | Probing vision-language model alignment with human expert visual grouping over stratified sample of VIS30K dataset. | 0 | |
79 | Uc-PrUn | Uc-PrUn: Uncertainty-calibrated Data Pruning and Unlearning framework for vision-language models (VLMs) | 0 | |
80 | TalkMateAI | Create immersive conversations with TalkMateAI, a real-time voice-controlled 3D avatar. Experience natural interactions powered by advanced AI. 🐙🌐 | 0 | |
81 | cutlass | CUTLASS 4.1.0 offers high-performance matrix-matrix multiplication in CUDA, with flexible abstractions for custom kernels. Perfect for efficient linear algebra. 🚀💻 | 0 | |
82 | 3d-vlm-gaussian-splatting-pointclip-on-modelnet40-and-scanobjectnn | Achieved over 96% top-1 accuracy on the ModelNet40 test set and 99.91% top-1 accuracy on the ScanObjectNN test set with lightweight custom 3D models, projecting 3D point clouds into 2D images via Gaussian splatting and classifying with CLIP ViT-16 | 0 | |
83 | dspy-experiments | Hands-on experiments with the DSPy framework using local Ollama models. Features basic QA systems, multimodal image processing with LLaVA, and interactive Jupyter notebooks. Privacy-focused with local inference and no API costs. | 0 | |
84 | Business-card-info-extraction | Detect business cards and extract information in a structured format using VLMs. | 0 | |
85 | School_Behavior_Analyzer | A Python application for detecting, tracking, and analyzing classroom behavior using computer vision and large vision-language models (VLMs). The system detects and tracks people in video streams, saves cropped person videos, and analyzes posture changes using a VLM. | 0 | |
86 | Byaldi-Qwen-img-reader | Image to text reader for English and Hindi. Made with combining Byaldi and Qwen2VL vision language models. | 0 | |
87 | lidar_vqa | Multimodal system combining RGB images and LiDAR depth cues to answer questions about driving scenes using fine-tuned CLIP (ViT-B/32) and fusion strategies. | 0 | |
88 | nuextract-2.0-receipts-fastapi | Efficient parsing of scanned Walmart receipts using the NuExtract 2.0 VLM and FastAPI, with serverless deployment on Modal Labs | 0 | |
89 | Action-RecognitionVLM | A project demonstrating zero-shot and few-shot action recognition on the UCF101 dataset using CLIP. Includes evaluation, fine-tuning, and embedding space visualizations. | 0 | |
90 | vlm-lora | LoRA from scratch for VLM fine tuning | 0 | |
91 | personalization-toolkit-for-lvlm-review | Review of the paper "Personalization Toolkit: Training Free Personalization of Large Vision Language Models" | 0 | |
92 | diverticulitis_ollama_LLM | AI-driven dietary guidance for diverticulitis. Upload meal photos for analysis, get food safety ratings, and receive personalized advice. 🍽️💻 | 0 | |
93 | miramo | A Flask-based web app for managing multimodal datasets text and images with CRUD operations via SQLite, and seamless export as a structured Parquet dataset to Hugging Face Hub. | 0 | |
94 | VLM | Generate natural language captions for images using the BLIP vision-language model by Salesforce. Easily run it in Google Colab with GPU support, using the Flickr8k-2k image dataset from Kaggle. | 0 | |
95 | mllm-gesture-eval | Code and dataset for evaluating Multimodal LLMs on indexical, iconic, and symbolic gestures (Nishida et al., ACL 2025) | 0 | |
96 | smart-vehicle-detector | **Smart Vehicle Detector** is an AI-powered system that combines YOLO for object detection and a VLM to classify vehicle types more accurately. This project demonstrates the integration of modern computer vision and language models for intelligent scene understanding. | 0 | |
97 | Sora | A collection of Sora modules | 0 | |
98 | alexpalms | The special repo for GitHub | 0 | |
99 | bandu | Bandu: AI Agents based on ROS2 | 0 | |
100 | cua | c/ua is the Docker Container for Computer-Use AI Agents. | 0 |
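The star counts in the table use human-readable suffixes ("1.5K" for roughly 1,500). A minimal sketch of how such entries can be normalized and ranked; the `parse_stars` helper is my own illustration (not from any repo listed), and the sample rows are taken from the table above:

```python
def parse_stars(s: str) -> int:
    """Convert a human-readable star count like '1.5K' to an integer."""
    s = s.strip().upper()
    if s.endswith("K"):
        return int(float(s[:-1]) * 1_000)
    return int(s)

# Sample (repository, star count) rows from the table above
rows = [("GLM-V", "1.5K"), ("UniPic", "708"), ("MiMo-VL", "467")]

# Rank by numeric star count, descending
ranked = sorted(rows, key=lambda r: parse_stars(r[1]), reverse=True)
for rank, (name, stars) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {parse_stars(stars):,} stars")
```

Sorting on the parsed integer rather than the raw string matters: lexicographically, "708" would sort ahead of "1.5K".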