- Top AI developers by monthly star count
- Top AI organization accounts by AI repo star count
- Top AI projects by category star count
- Fastest-growing projects by the speed of gaining stars
- Top list of little-known developers behind influential repos
- Projects and developers that are still thriving yet have not been updated for a long time
Ranking | Related Project | Project intro | Star count |
---|---|---|---|
1 | VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 4.7K | |
2 | nexa-sdk | Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio language models, automatic speech recognition (ASR), and text-to-speech (TTS) capabilities. | 4.4K | |
3 | SpatialLM | SpatialLM: Large Language Model for Spatial Understanding | 3.0K | |
4 | UI-TARS-desktop | A GUI Agent application based on UI-TARS (Vision-Language Model) that allows you to control your computer using natural language. | 2.9K | |
5 | Local-File-Organizer | An AI-powered file management tool that ensures privacy by organizing local texts, images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval. | 2.1K | |
6 | Skywork-R1V | Pioneering Multimodal Reasoning with CoT | 2.1K | |
7 | minimind-v | 🚀 Train a 26M-parameter visual multimodal VLM from scratch in just 1 hour! 🌏 | 1.5K | |
8 | vlms-zero-to-hero | This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models. | 1.0K | |
9 | Awesome-Robotics-3D | A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites | 650 | |
10 | VisRAG | Parsing-free RAG supported by VLMs | 611 | |
11 | EAGLE | Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs | 602 | |
12 | Awesome-Jailbreak-on-LLMs | Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses. | 504 | |
13 | llama-assistant | AI-powered assistant to help you with your daily tasks, powered by Llama 3, DeepSeek R1, and many more models on HuggingFace. | 486 | |
14 | vlmrun-hub | A hub for various industry-specific schemas to be used with VLMs. | 459 | |
15 | ghostwriter | Use the reMarkable2 as an interface to vision-LLMs (ChatGPT, Claude, Gemini). Ghost in the machine! | 436 | |
16 | Flame-Code-VLM | Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development. | 367 | |
17 | joycaption | JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models. | 349 | |
18 | open-cuak | Reliable Automation Agents at Scale | 279 | |
19 | lmms-finetune | A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc. | 262 | |
20 | TokenPacker | The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM". | 236 | |
21 | AUITestAgent | AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. | 192 | |
22 | Kolosal | Kolosal AI is an open-source and lightweight alternative to LM Studio for running LLMs 100% offline on your device. | 177 | |
23 | ChatRex | Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | 156 | |
24 | VLM2Vec | This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR25] | 152 | |
25 | DenseFusion | DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | 134 | |
26 | Namo-R1 | A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from scratch with ease. | 133 | |
27 | Llama3.2-Vision-Finetune | An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta. | 131 | |
28 | LLaVA-MORE | LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning | 122 | |
29 | BALROG | Benchmarking Agentic LLM and VLM Reasoning On Games | 117 | |
30 | eureka-ml-insights | A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. | 106 | |
31 | Surveillance_Video_Summarizer | VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage. | 102 | |
32 | qapyq | An image viewer and AI-assisted editing/captioning/masking tool that helps with curating datasets for generative AI models, finetunes and LoRA. | 102 | |
33 | pyvisionai | The PyVisionAI Official Repo | 97 | |
34 | Modality-Integration-Rate | The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate". | 96 | |
35 | Helpful-Doggybot | Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models | 90 | |
36 | Mini-LLaVA | A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability. | 89 | |
37 | VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | 85 | |
38 | Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | 82 | |
39 | TrustEval-toolkit | TrustEval: A modular and extensible toolkit for comprehensive trust evaluation of generative foundation models (GenFMs) | 79 | |
40 | SparseVLMs | Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference". | 77 | |
41 | BreezeApp | Explore the future of AI! The first open-source mobile app from MediaTek Research lets you experience our latest AI models directly on your phone. It brings AI technology into everyday life and runs completely offline, so privacy is better protected. This is an open-source project, and developers and enthusiasts are warmly welcome to join in and contribute to the advancement of AI. Join us now and help build a better AI experience! | 73 | |
42 | 3d-conditioning | Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset. | 61 | |
43 | dingo | Dingo: A Comprehensive Data Quality Evaluation Tool | 57 | |
44 | KDPL | [ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation | 53 | |
45 | ReachQA | Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs" | 48 | |
46 | SeeGround | [CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | 46 | |
47 | MMInstruct | The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types. | 43 | |
48 | SeeDo | Human Demo Videos to Robot Action Plans | 41 | |
49 | Emma-X | Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | 39 | |
50 | GMAI-MMBench | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. | 37 | |
51 | PhysBench | [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding> | 36 | |
52 | CompBench | CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes. | 35 | |
53 | Parrot | 🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch. | 35 | |
54 | AIN | AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains. | 31 | |
55 | UrBench | [AAAI 2025]This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios” | 29 | |
56 | VISTA_Evaluation_FineTuning | Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding. | 28 | |
57 | Florence-2-Vision-Language-Model | Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. | 26 | |
58 | SAM_Molmo_Whisper | An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language. | 23 | |
59 | HiRED | [AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget. | 21 | |
60 | GeoX | Code for GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training | 21 | |
61 | gptparse | Document parser for RAG | 20 | |
62 | sources | READ THE README | 19 | |
63 | Re-Align | A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models. | 19 | |
64 | SubtitleAI | An AI-powered tool for summarizing YouTube videos by generating scene descriptions, translating them, and creating subtitled videos with text-to-speech narration | 17 | |
65 | video-search-and-summarization | Blueprint for ingesting massive volumes of live or archived videos and extracting insights for summarization and interactive Q&A | 17 | |
66 | worldcuisines | WorldCuisines is an extensive multilingual and multicultural benchmark that spans 30 languages, covering a wide array of global cuisines. | 16 | |
67 | Vision-language-models-VLM | Vision-language model fine-tuning notebooks & use cases (PaliGemma, Florence, ...) | 14 | |
68 | GVA-Survey | Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms | 14 | |
69 | awesome-turkish-language-models | A curated list of Turkish AI models, datasets, papers | 14 | |
70 | exif-ai | A Node.js CLI and library that uses OpenAI, Ollama, ZhipuAI, Google Gemini or Coze to write AI-generated image descriptions and/or tags to EXIF metadata based on the image content. | 13 | |
71 | PICG2scoring | [MICCAI'24] Incorporating Clinical Guidelines through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring | 11 | |
72 | TRIM | We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. | 11 | |
73 | macbench | Probing the limitations of multimodal language models for chemistry and materials research | 11 | |
74 | computer-agent-arena-hub | Computer Agent Arena Hub: Compare & Test AI Agents on Crowdsourced Real-World Computer Use Tasks | 11 | |
75 | fiftyone_florence2_plugin | Run SOTA Vision-Language Model Florence-2 on your data! | 9 | |
76 | CII-Bench | Can MLLMs Understand the Deep Implication Behind Chinese Images? | 9 | |
77 | sentinel | Securade.ai Sentinel - A monitoring and surveillance application that enables visual Q&A and video captioning for existing CCTV cameras. | 9 | |
78 | MyColPali | A PyQt6 application using ColPali and OpenAI to demonstrate efficient document retrieval with vision-language models | 8 | |
79 | VortexFusion | Transformers + Mambas + LSTMs, all in one model | 7 | |
80 | vlm-api | REST API for computing cross-modal similarity between images and text using the ColPali vision-language model | 7 | |
81 | Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | 7 | |
82 | Video-Bench | Video Generation Benchmark | 7 | |
83 | ollama | Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models (a minimal usage sketch follows the table). | 7 | |
84 | ide-cap-chan | ide-cap-chan is a utility for batch image captioning with natural language using various VL models | 6 | |
85 | VELOCITI | VELOCITI Benchmark Evaluation and Visualisation Code | 5 | |
86 | vlm_databuilder | This SDK generates datasets for training video LLMs from YouTube videos. | 5 | |
87 | Dex-GAN-Grasp | DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for Task-Oriented Manipulation - IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2024 | DOI: 10.1109/Humanoids58906.2024.10769950 | 5 | |
88 | RoomAligner | A focus on aligning room elements for better flow and space utilization. | 5 | |
89 | simple-multimodal-ai | A simple Gradio application integrated with Hugging Face multimodal models to support a visual question answering chatbot and more features | 5 | |
90 | TextSnap | TextSnap: Demo for Florence 2 model used in OCR tasks to extract and visualize text from images. | 4 | |
91 | VLM-ZSAD-Paper-Review | Reviews of papers on zero-shot anomaly detection using vision-language models | 4 | |
92 | Multimodal-VideoRAG | Multimodal-VideoRAG: Using BridgeTower Embeddings and Large Vision Language Models | 4 | |
93 | hass_ollama_image_analysis | Image analysis with Ollama (AI models) from within Home Assistant | 3 | |
94 | iuys | Intelligently Understanding Your Screenshots | 3 | |
95 | MiniCPM-V2.6-Colaboratory-Sample | A Colaboratory sample for MiniCPM-V2.6, a lightweight VLM | 3 | |
96 | Visual-Question-Answering-using-Gemini-LLM | Explores visual question answering using the Gemini LLM, with images provided as a URL or in other file formats | 3 | |
97 | svlr | SVLR: Scalable, Training-Free Visual Language Robotics: a modular multi-model framework for consumer-grade GPUs | 3 | |
98 | ComfyUI-YALLM-node | Yet another set of LLM nodes for ComfyUI (for local/remote OpenAI-like APIs, multi-modal models supported) | 3 | |
99 | CIDER | This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models. | 3 | |
100 | awesome-text-to-video-plus | The Ultimate Guide to Effortlessly Creating AI Videos for Social Media: Go From Text to Eye-Catching Videos in Just a Few Steps | 3 |
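To give a sense of how the local-inference entries in this table are typically used, here is a minimal sketch for ollama (rank 83). It is an illustration only, not taken from the table: it assumes the Ollama server is installed and running, the official `ollama` Python package is installed, and a vision-capable model has already been pulled. The model name `llama3.2-vision` and the image path are placeholders.

```python
# Minimal sketch: ask a locally served vision-language model about an image via Ollama.
# Assumptions (not from the table above): the Ollama server is running,
# `pip install ollama` has been done, and `ollama pull llama3.2-vision` has been run.
import ollama

response = ollama.chat(
    model="llama3.2-vision",  # placeholder: any pulled vision-capable model
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["example.jpg"],  # placeholder path to a local image file
        }
    ],
)

# The reply text is returned under message.content.
print(response["message"]["content"])
```

The same one-off query can also be issued from the command line with `ollama run <model>`; other local runners in the table expose their own APIs, so consult each project's README for the equivalent call.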