- Top AI developers by monthly star count
- Top AI organization accounts by AI repo star count
- Top AI projects by category star count
- Fastest-growing projects by the speed of gaining stars
- Top list of little-known developers behind influential repos
- Projects and developers that are still thriving yet have not been updated for a long time
Ranking | Related Project | Project intro | Star count |
---|---|---|---|
1 | VLM-R1 | Solve Visual Understanding with Reinforced VLMs | 4.7K | |
2 | nexa-sdk | Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio language models, automatic speech recognition (ASR), and text-to-speech (TTS) capabilities. | 4.4K | |
3 | SpatialLM | SpatialLM: Large Language Model for Spatial Understanding | 3.0K | |
4 | UI-TARS-desktop | A GUI Agent application based on UI-TARS (Vision-Language Model) that allows you to control your computer using natural language. | 2.9K | |
5 | Local-File-Organizer | An AI-powered file management tool that ensures privacy by organizing local texts, images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval. | 2.1K | |
6 | Skywork-R1V | Pioneering Multimodal Reasoning with CoT | 2.1K | |
7 | minimind-v | 🚀 Train a 26M-parameter visual multimodal VLM from scratch in just 1 hour! 🌏 | 1.5K | |
8 | vlms-zero-to-hero | This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models. | 1.0K | |
9 | Awesome-Robotics-3D | A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites | 650 | |
10 | VisRAG | Parsing-free RAG supported by VLMs | 611 | |
11 | EAGLE | Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs | 602 | |
12 | Awesome-Jailbreak-on-LLMs | Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses. | 504 | |
13 | llama-assistant | AI-powered assistant to help you with your daily tasks, powered by Llama 3, DeepSeek R1, and many more models on HuggingFace. | 486 | |
14 | vlmrun-hub | A hub for various industry-specific schemas to be used with VLMs. | 459 | |
15 | ghostwriter | Use the reMarkable2 as an interface to vision-LLMs (ChatGPT, Claude, Gemini). Ghost in the machine! | 436 | |
16 | Flame-Code-VLM | Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development. | 367 | |
17 | joycaption | JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models. | 349 | |
18 | open-cuak | Reliable Automation Agents at Scale | 279 | |
19 | lmms-finetune | A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc. | 262 | |
20 | TokenPacker | The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM". | 236 | |
21 | AUITestAgent | AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. | 192 | |
22 | Kolosal | Kolosal AI is an open-source and lightweight alternative to LM Studio for running LLMs 100% offline on your device. | 177 | |
23 | ChatRex | Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding | 156 | |
24 | VLM2Vec | This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR25] | 152 | |
25 | DenseFusion | DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception | 134 | |
26 | Namo-R1 | A CPU Realtime VLM in 500M. Surpassed Moondream2 and SmolVLM. Training from scratch with ease. | 133 | |
27 | Llama3.2-Vision-Finetune | An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta. | 131 | |
28 | LLaVA-MORE | LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning | 122 | |
29 | BALROG | Benchmarking Agentic LLM and VLM Reasoning On Games | 117 | |
30 | eureka-ml-insights | A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. | 106 | |
31 | Surveillance_Video_Summarizer | VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage. | 102 | |
32 | qapyq | An image viewer and AI-assisted editing/captioning/masking tool that helps with curating datasets for generative AI models, finetunes and LoRA. | 102 | |
33 | pyvisionai | The PyVisionAI Official Repo | 97 | |
34 | Modality-Integration-Rate | The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate". | 96 | |
35 | Helpful-Doggybot | Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models | 90 | |
36 | Mini-LLaVA | A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability. | 89 | |
37 | VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | 85 | |
38 | Lexicon3D | [NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding | 82 | |
39 | TrustEval-toolkit | TrustEval: A modular and extensible toolkit for comprehensive trust evaluation of generative foundation models (GenFMs) | 79 | |
40 | SparseVLMs | Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference". | 77 | |
41 | BreezeApp | Explore the future of AI! The first open-source mobile app from MediaTek Research lets you experience our latest AI models directly on your phone. It brings AI technology into everyday life and runs completely offline, so privacy is better protected. This is an open-source project, and developers and enthusiasts are warmly welcome to join in and contribute to the advancement of AI. Join us now and help build a better AI experience! | 73 | |
42 | 3d-conditioning | Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset. | 61 | |
43 | dingo | Dingo: A Comprehensive Data Quality Evaluation Tool | 57 | |
44 | KDPL | [ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation | 53 | |
45 | ReachQA | Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs" | 48 | |
46 | SeeGround | [CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding | 46 | |
47 | MMInstruct | The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types. | 43 | |
48 | SeeDo | Human Demo Videos to Robot Action Plans | 41 | |
49 | Emma-X | Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning | 39 | |
50 | GMAI-MMBench | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. | 37 | |
51 | PhysBench | [ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding> | 36 | |
52 | CompBench | CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes. | 35 | |
53 | Parrot | 🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch. | 35 | |
54 | AIN | AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains. | 31 | |
55 | UrBench | [AAAI 2025]This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios” | 29 | |
56 | VISTA_Evaluation_FineTuning | Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding. | 28 | |
57 | Florence-2-Vision-Language-Model | Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks. | 26 | |
58 | SAM_Molmo_Whisper | An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language. | 23 | |
59 | HiRED | [AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget. | 21 | |
60 | GeoX | Code for GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training | 21 | |
61 | gptparse | Document parser for RAG | 20 | |
62 | sources | READ THE README | 19 | |
63 | Re-Align | A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models. | 19 | |
64 | SubtitleAI | An AI-powered tool for summarizing YouTube videos by generating scene descriptions, translating them, and creating subtitled videos with text-to-speech narration | 17 | |
65 | video-search-and-summarization | Blueprint for ingesting massive volumes of live or archived videos and extracting insights for summarization and interactive Q&A | 17 | |
66 | worldcuisines | WorldCuisines is an extensive multilingual and multicultural benchmark that spans 30 languages, covering a wide array of global cuisines. | 16 | |
67 | Vision-language-models-VLM | Vision-language model fine-tuning notebooks & use cases (PaliGemma, Florence, ...) | 14 | |
68 | GVA-Survey | Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms | 14 | |
69 | awesome-turkish-language-models | A curated list of Turkish AI models, datasets, papers | 14 | |
70 | exif-ai | A Node.js CLI and library that uses OpenAI, Ollama, ZhipuAI, Google Gemini or Coze to write AI-generated image descriptions and/or tags to EXIF metadata based on the image content. | 13 | |
71 | PICG2scoring | [MICCAI'24] Incorporating Clinical Guidelines through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring | 11 | |
72 | TRIM | We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance. | 11 | |
73 | macbench | Probing the limitations of multimodal language models for chemistry and materials research | 11 | |
74 | computer-agent-arena-hub | Computer Agent Arena Hub: Compare & Test AI Agents on Crowdsourced Real-World Computer Use Tasks | 11 | |
75 | fiftyone_florence2_plugin | Run SOTA Vision-Language Model Florence-2 on your data! | 9 | |
76 | CII-Bench | Can MLLMs Understand the Deep Implication Behind Chinese Images? | 9 | |
77 | sentinel | Securade.ai Sentinel - A monitoring and surveillance application that enables visual Q&A and video captioning for existing CCTV cameras. | 9 | |
78 | MyColPali | A PyQt6 application using ColPali and OpenAI to demonstrate efficient document retrieval with vision-language models | 8 | |
79 | VortexFusion | Transformers + Mambas + LSTMs, all in one model | 7 | |
80 | vlm-api | REST API for computing cross-modal similarity between images and text using the ColPali vision-language model | 7 | |
81 | Chitrarth | Chitrarth: Bridging Vision and Language for a Billion People | 7 | |
82 | Video-Bench | Video Generation Benchmark | 7 | |
83 | ollama | Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models (a minimal usage sketch follows the table). | 7 | |
84 | ide-cap-chan | ide-cap-chan is a utility for batch image captioning with natural language using various VL models | 6 | |
85 | VELOCITI | VELOCITI Benchmark Evaluation and Visualisation Code | 5 | |
86 | vlm_databuilder | This SDK generates datasets for training video LLMs from YouTube videos. | 5 | |
87 | Dex-GAN-Grasp | DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for Task-Oriented Manipulation - IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2024 | DOI: 10.1109/Humanoids58906.2024.10769950 | 5 | |
88 | RoomAligner | A focus on aligning room elements for better flow and space utilization. | 5 | |
89 | simple-multimodal-ai | A simple Gradio application integrated with Hugging Face multimodal models to support a visual question answering chatbot and more features | 5 | |
90 | TextSnap | TextSnap: Demo for Florence 2 model used in OCR tasks to extract and visualize text from images. | 4 | |
91 | VLM-ZSAD-Paper-Review | Reviews of papers on zero-shot anomaly detection using vision-language models | 4 | |
92 | Multimodal-VideoRAG | Multimodal-VideoRAG: Using BridgeTower Embeddings and Large Vision Language Models | 4 | |
93 | hass_ollama_image_analysis | Image analysis with Ollama (AI models) from within Home Assistant | 3 | |
94 | iuys | Intelligently Understanding Your Screenshots | 3 | |
95 | MiniCPM-V2.6-Colaboratory-Sample | A Colaboratory sample for MiniCPM-V2.6, a lightweight VLM | 3 | |
96 | Visual-Question-Answering-using-Gemini-LLM | Explores visual question answering using the Gemini LLM, with images provided as a URL or in other file formats | 3 | |
97 | svlr | SVLR: Scalable, Training-Free Visual Language Robotics: a modular multi-model framework for consumer-grade GPUs | 3 | |
98 | ComfyUI-YALLM-node | Yet another set of LLM nodes for ComfyUI (for local/remote OpenAI-like APIs, multi-modal models supported) | 3 | |
99 | CIDER | This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models. | 3 | |
100 | awesome-text-to-video-plus | The Ultimate Guide to Effortlessly Creating AI Videos for Social Media: Go From Text to Eye-Catching Videos in Just a Few Steps | 3 |
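To give a sense of how the local-inference entries in this table are typically used, here is a minimal sketch for ollama (rank 83). It is an illustration only, not taken from the table: it assumes the Ollama server is installed and running, the official `ollama` Python package is installed, and a vision-capable model has already been pulled. The model name `llama3.2-vision` and the image path are placeholders.

```python
# Minimal sketch: ask a locally served vision-language model about an image via Ollama.
# Assumptions (not from the table above): the Ollama server is running,
# `pip install ollama` has been done, and `ollama pull llama3.2-vision` has been run.
import ollama

response = ollama.chat(
    model="llama3.2-vision",  # placeholder: any pulled vision-capable model
    messages=[
        {
            "role": "user",
            "content": "Describe this image in one sentence.",
            "images": ["example.jpg"],  # placeholder path to a local image file
        }
    ],
)

# The reply text is returned under message.content.
print(response["message"]["content"])
```

The same one-off query can also be issued from the command line with `ollama run <model>`; other local runners in the table expose their own APIs, so consult each project's README for the equivalent call.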