Top AI Project by Categories

A list of Top influential AI open source project listed by different categories. ( Data sourced from GitHub, updated automatically everyday.)

Top AI Developers Top AI Organizations Top AI Project Top Growing Speed The Least Known Devs A Year Without Refresh

LLM Diffusion GPT RAG Multi-modality DeepSeek

Rankings	Organization Account	Related Project	Project intro	Star count
1	bytedance 11.6K followers Singapore	UI-TARS-desktop	A GUI Agent application based on UI-TARS(Vision-Language Model) that allows you to control your computer using natural language.	14.2K
2	om-ai-lab 668 followers -	VLM-R1	Solve Visual Understanding with Reinforced VLMs	5.0K
3	NexaAI 588 followers United States of America	nexa-sdk	Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), Audio Language Model, auto-speech-recognition (ASR), and text-to-speech (TTS) capabilities.	4.6K
4	jingyaogong 805 followers HangZhou, China	minimind-v	🚀 「大模型」1小时从0训练26M参数的视觉多模态VLM！🌏 Train a 26M-parameter VLM from scratch in just 1 hours!	3.6K
5	manycore-research 148 followers Hangzhou, China	SpatialLM	SpatialLM: Large Language Model for Spatial Understanding	3.2K
6	MiniMax-AI 887 followers -	MiniMax-01	The official repo of MiniMax-Text-01 and MiniMax-VL-01, large-language-model & vision-language-model based on Linear Attention	2.7K
7	SkyworkAI 876 followers Singapore	Skywork-R1V	Skywork-R1V2:Multimodal Hybrid Reinforcement Learning for Reasoning	2.6K
8	QiuYannnn 51 followers Los Angeles	Local-File-Organizer	An AI-powered file management tool that ensures privacy by organizing local texts, images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.	2.3K
9	SkalskiP 5.6K followers 127.0.0.1	vlms-zero-to-hero	This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.	1.0K
10	zubair-irshad 252 followers Silicon Valley, CA, USA	Awesome-Robotics-3D	A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites	650
11	OpenBMB 5.0K followers -	VisRAG	Parsing-free RAG supported by VLMs	611
12	vlm-run 40 followers United States of America	vlmrun-hub	A hub for various industry-specific schemas to be used with VLMs.	510
13	nrl-ai 51 followers -	llama-assistant	AI-powered assistant to help you with your daily tasks, powered by Llama 3, DeepSeek R1, and many more models on HuggingFace.	486
14	awwaiid 97 followers Washington, DC	ghostwriter	Use the reMarkable2 as an interface to vision-LLMs (ChatGPT, Claude, Gemini). Ghost in the machine!	436
15	Flame-Code-VLM 10 followers -	Flame-Code-VLM	Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.	367
16	fpgaminer 173 followers -	joycaption	JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.	349
17	Aident-AI 5 followers United States of America	open-cuak	Reliable Automation Agents at Scale	279
18	Hon-Wong 19 followers -	VoRA	[Fully open] [Encoder-free MLLM] Vision as LoRA	235
19	Genta-Technology 20 followers Indonesia	Kolosal	Kolosal AI is an OpenSource and Lightweight alternative to LM Studio to run LLMs 100% offline on your device.	227
20	2U1 89 followers South Korea	Llama3.2-Vision-Finetune	An open-source implementaion for fine-tuning Llama3.2-Vision series by Meta.	156
21	IDEA-Research 2.4K followers China	ChatRex	Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding	156
22	TIGER-AI-Lab 215 followers Canada	VLM2Vec	This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR25]	152
23	lucasjinreal 2.3K followers Sanfancisco	Namo-R1	A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from scratch with ease.	133
24	FennelFetish 7 followers -	qapyq	An image viewer and AI-assisted editing/captioning/masking tool that helps with curating datasets for generative AI models, finetunes and LoRA.	124
25	balrog-ai 1 followers -	BALROG	Benchmarking Agentic LLM and VLM Reasoning On Games	117
26	Ravi-Teja-konda 15 followers -	Surveillance_Video_Summarizer	VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage.	102
27	MDGrey33 13 followers Riga, Latvia	pyvisionai	The PyVisionAI Official Repo	97
28	shikiw 65 followers -	Modality-Integration-Rate	The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".	96
29	WooQi57 41 followers -	Helpful-Doggybot	Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models	90
30	fangyuan-ksgk 37 followers Singapore	Mini-LLaVA	A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.	89
31	OpenRobotLab 531 followers -	VLM-Grounder	[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	85
32	TrustGen 12 followers United States of America	TrustEval-toolkit	TrustEval: A modular and extensible toolkit for comprehensive trust evaluation of generative foundation models (GenFMs)	79
33	Gumpest 129 followers Beijing	SparseVLMs	Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".	77
34	mtkresearch 85 followers -	BreezeApp	探索 AI 的未來！聯發創新基地（MediaTek Research）首次開源的手機應用程式，讓你直接體驗我們最新研發的 AI 模型。透過手機，我們將 AI 科技帶入每個人的生活，完全離線運作，隱私更有保障。這是一個開源專案，我們熱烈歡迎開發者和愛好者一同參與，為 AI 技術發展貢獻力量。立即加入我們，一起打造更優質的 AI 體驗！	73
35	FakeOAI 9 followers -	tokens	A token management platform that reverse-engineers the conversation interfaces of ChatGPT, Cursor, Grok, Claude, Windsurf, Gemini, and Sora, converting them into the OpenAI format./Token管理平台，逆向ChatGPT、Cursor、Grok、Claude、Windsurf、Gemini、Sora平台的对话接口转OpenAI格式	67
36	NVIDIA-Omniverse-blueprints 18 followers -	3d-conditioning	Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset.	61
37	DataEval 1 followers China	dingo	Dingo: A Comprehensive Data Quality Evaluation Tool	57
38	RenzKa 58 followers -	simlingo	[CVPR 2025, Spotlight] SimLingo (CarLLava): Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment	54
39	hewei2001 116 followers Shanghai	ReachQA	Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"	48
40	50n50 12 followers -	sources	READ THE README	48
41	iris0329 57 followers -	SeeGround	[CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding	46
42	ai4ce 206 followers Brooklyn, NY, U.S.	SeeDo	Human Demo Videos to Robot Action Plans	41
43	declare-lab 285 followers Singapore University of Technology and Design	Emma-X	Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning	39
44	uni-medical 106 followers -	GMAI-MMBench	GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.	37
45	USC-GVL 25 followers United States of America	PhysBench	[ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding>	36
46	thubZ09 11 followers India	All-Things-Multimodal	Hub for researchers exploring VLMs and Multimodal Learning:)	36
47	mbzuai-oryx 281 followers -	AIN	AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains.	31
48	kesimeg 17 followers -	awesome-turkish-language-models	A curated list of Turkish AI models, datasets, papers	30
49	opendatalab 1.5K followers China	UrBench	[AAAI 2025]This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios”	29
50	sovit-123 140 followers India	SAM_Molmo_Whisper	An integration of Segment Anything Model, Molmo, and, Whisper to segment objects using voice and natural language.	23
51	roboflow 3.3K followers United States of America	vision-ai-checkup	Take your LLM to the optometrist.	23
52	ArmenJeddi 8 followers Toronto	saint	a training-free approach to accelerate ViTs and VLMs by pruning redundant tokens based on similarity	22
53	hasanar1f 34 followers Blacksburg, VA, USA	HiRED	[AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget.	21
54	gptscript-ai 142 followers -	gptparse	Document parser for RAG	20
55	taco-group 68 followers United States of America	Re-Align	A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.	19
56	oztrkoguz 78 followers Turkey	SubtitleAI	An AI-powered tool for summarizing YouTube videos by generating scene descriptions, translating them, and creating subtitled videos with text-to-speech narration	17
57	NVIDIA-AI-Blueprints 262 followers United States of America	video-search-and-summarization	Blueprint for Ingesting massive volumes of live or archived videos and extract insights for summarization and interactive Q&A	17
58	kornia 198 followers Spain	bubbaloop	🦄 Serving Platform for Spatial AI and Robotics.	17
59	worldcuisines 1 followers -	worldcuisines	WorldCuisines is an extensive multilingual and multicultural benchmark that spans 30 languages, covering a wide array of global cuisines.	16
60	sayedmohamedscu 43 followers Egypt	Vision-language-models-VLM	vision language models finetuning notebooks & use cases (paligemma - florence .....)	14
61	wendell0218 0 followers -	GVA-Survey	Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms	14
62	tychenjiajun 41 followers Guangzhou, China	exif-ai	A Node.js CLI and library that uses OpenAI, Ollama, ZhipuAI, Google Gemini or Coze to write AI-generated image descriptions and/or tags to EXIF metadata by its content.	13
63	FreedomIntelligence 418 followers -	TRIM	We introduce new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.	11
64	lamalab-org 69 followers -	macbench	Probing the limitations of multimodal language models for chemistry and materials research	11
65	xlang-ai 534 followers -	computer-agent-arena-hub	Computer Agent Arena Hub: Compare & Test AI Agents on Crowdsourced Real-World Computer Use Tasks	11
66	miccunifi 41 followers Firenze - Viale Morgagni 65 - Italia	Cross-the-Gap	[ICLR 2025] - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion	11
67	gptbmw 0 followers -	wildcard	最新野卡wildcard虚拟信用卡使用指南：wildcard注册教程，如何开通野卡信用卡？如何为野卡充值和提现？	11
68	GAD-cell 5 followers -	VLM_GRPO	An implementation of GRPO for Unsloth's VLMs training	11
69	MING-ZCH 52 followers Wuhan, China	CII-Bench	Can MLLMs Understand the Deep Implication Behind Chinese Images?	9
70	securade 5 followers Singapore	sentinel	Securade.ai Sentinel - A monitoring and surveillance application that enables visual Q&A and video captioning for existing CCTV cameras.	9
71	hyun-yang 0 followers Brisbane, Australia	MyColPali	The PyQt6 application using ColPali and OpenAI to show Efficient Document Retrieval with Vision Language Models	8
72	DataFog 6 followers United States of America	vlm-api	REST API for computing cross-modal similarity between images and text using the ColPaLI vision-language model	7
73	ola-krutrim 69 followers India	Chitrarth	Chitrarth: Bridging Vision and Language for a Billion People	7
74	Video-Bench 0 followers -	Video-Bench	Video Generation Benchmark	7
75	loong64 14 followers China	ollama	Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.	7
76	2dameneko 0 followers -	ide-cap-chan	ide-cap-chan is a utility for batch image captioning with natural language using various VL models	6
77	david-s-martinez 8 followers Munich, Germany	Dex-GAN-Grasp	DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for Task-Oriented Manipulation - IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2024 \| DOI: 10.1109/Humanoids58906.2024.10769950	5
78	Ashad001 62 followers Karachi	RoomAligner	A focus on aligning room elements for better flow and space utilization.	5
79	nguyennpa412 12 followers Vietnam	simple-multimodal-ai	Simple Gradio application integrated with Hugging Face Multimodals to support visual question answering chatbot and more features	5
80	sonstory 16 followers Seoul, Korea	VLM-ZSAD-Paper-Review	Reviews of papers on zero-shot anomaly detection using vision-Language models	4
81	Bhavik-Ardeshna 55 followers Montreal, Quebec	Multimodal-VideoRAG	Multimodal-VideoRAG: Using BridgeTower Embeddings and Large Vision Language Models	4
82	Kazuhito00 696 followers Aichi, Japan	MiniCPM-V2.6-Colaboratory-Sample	軽量VLMのMiniCPM-V2.6のColaboratoryサンプル	3
83	Pavansomisetty21 24 followers -	Visual-Question-Answering-using-Gemini-LLM	In this we explore into visual Question Answering Using Gemini LLM and image was in URL or any other extension	3
84	bastien-muraccioli 1 followers Tsukuba, Japan	svlr	SVLR: Scalable, Training-Free Visual Language Robotics: a modular multi-model framework for consumer-grade GPUs	3
85	asaddi 10 followers California, USA	ComfyUI-YALLM-node	Yet another set of LLM nodes for ComfyUI (for local/remote OpenAI-like APIs, multi-modal models supported)	3
86	PandragonXIII 1 followers China	CIDER	This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models.	3
87	XiaomingX 9.8K followers japan	awesome-text-to-video-plus	The Ultimate Guide to Effortlessly Creating AI Videos for Social Media Go From Text to Eye-Catching Videos in Just a Few Steps	3
88	StabRise 3 followers -	ScaleDP	ScaleDP is an Open-Source extension of Apache Spark for Document Processing	3
89	Open-Social-World 0 followers -	EgoNormia	EgoNormia \| Benchmarking Physical Social Norm Understanding in VLMs	3
90	wangclnlp 8 followers -	Vision-LLM-Alignment	This repository contains the code for SFT, RLHF, and DPO, designed for vision-based LLMs, including the LLaVA models and the LLaMA-3.2-vision models.	2
91	SJ9VRF 4 followers California	Fine-tune-Vision-Language-Model	This repository contains the implementation of the Vision-and-Language Transformer (ViLT) model fine-tuned for Visual Question Answering (VQA) tasks. The project is structured to be easy to set up and use, providing a streamlined approach for experimenting with different configurations and datasets.	2
92	sourceduty 52 followers International	Sora	🎬 OpenAI's Sora, a new text-to-video AI model, is set to launch later in 2024.	2
93	sandy1990418 5 followers -	Finetune-Qwen2.5-VL	Fine-tuning Qwen2.5-VL for vision-language tasks \| Optimized for Vision understanding \| LoRA & PEFT support.	2
94	MaxLSB 4 followers Paris, France	mini-paligemma2	Minimalist implementation of PaliGemma 2 & PaliGemma VLM from scratch	2
95	fork123aniket 10 followers Montreal, QC, Canada	Multi-Round-VLM-powered-Multimodal-Conversational-AI-Navigation-Bot	Streamlit App Combining Vision, Language, and Audio AI Models	2
96	anyofai 48 followers -	sora	Sora是什么？如何使用Sora？Sora入口在哪？Sora订阅保姆级教程！	2
97	Shard-AI 0 followers United States of America	Shard	Open Source Video Understanding API and Large Vision Model Observability Platform.	2
98	claw1200 3 followers UK	llama-cord	Discord App for Interacting with local Ollama Models. Multiple Agents Supported!	2
99	hamedR96 11 followers Paris, France	User-VLM	Personalized Vision Language Models for Social Human-Robot Interactions	2
100	massimilianoviola 4 followers Zurich, Switzerland	visual-translator	Translate objects in images with a click, get contextual sentences and hear their pronunciation.	2