Top AI Projects by Category

A list of the most influential open-source AI projects, grouped by category. (Data sourced from GitHub and updated automatically every day.)

LLM Diffusion GPT RAG Multi-modality

Rankings	Organization Account	Related Project	Project intro	Star count
1	NexaAI 588 followers United States of America	nexa-sdk	Nexa SDK is a comprehensive toolkit for supporting ONNX and GGML models. It supports text generation, image generation, vision-language models (VLM), auto-speech-recognition (ASR), and text-to-speech (TTS) capabilities.	4.2K
2	dvlab-research 652 followers -	MGM	Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"	3.2K
3	SoraWebui 60 followers -	SoraWebui	SoraWebui is an open-source Sora web client, enabling users to easily create videos from text with OpenAI's Sora model.	2.3K
4	deepseek-ai 2.3K followers -	DeepSeek-VL	DeepSeek-VL: Towards Real-World Vision-Language Understanding	2.1K
5	BAAI-Agents 74 followers -	Cradle	The Cradle framework is a first attempt at General Computer Control (GCC). Cradle supports agents to ace any computer task by enabling strong reasoning abilities, self-improvement, and skill curation, in a standardized general environment with minimal requirements.	1.9K
6	cambrian-mllm 33 followers -	cambrian	Cambrian-1 is a family of multimodal LLMs with a vision-centric design.	1.8K
7	QiuYannnn 35 followers Los Angeles	Local-File-Organizer	An AI-powered file management tool that ensures privacy by organizing local texts, images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.	1.7K
8	ShareGPT4Omni 20 followers -	ShareGPT4Video	[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions	1.3K
9	mini-sora 36 followers -	minisora	MiniSora: A community aims to explore the implementation path and future development direction of Sora.	1.2K
10	illuin-tech 39 followers Paris, France	colpali	The code used to train and run inference with the ColPali architecture.	1.2K
11	heshengtao 28 followers -	comfyui_LLM_party	LLM Agent Framework in ComfyUI includes Omost, GPT-sovits, ChatTTS, GOT-OCR2.0, and FLUX prompt nodes, access to Feishu, discord, and adapts to all llms with similar openai/gemini interfaces, such as o1, ollama, grok, qwen, GLM, deepseek, moonshot, doubao. Adapted to local llms, vlm, gguf such as llama-3.2, Linkage neo4j KG, graphRAG / RAG / html 2 img	1.1K
12	all-in-aigc 842 followers -	sorafm	Sora AI Video Generator by Sora.FM	956
13	mbzuai-oryx 220 followers -	LLaVA-pp	🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)	812
14	TinyLLaVA 6 followers -	TinyLLaVA_Factory	A Framework of Small-scale Large Multimodal Models	661
15	FoundationVision 359 followers -	Groma	[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization	568
16	zubair-irshad 242 followers Silicon Valley, CA, USA	Awesome-Robotics-3D	A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites	559
17	NVlabs 6.0K followers -	EAGLE	EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders	541
18	AIDC-AI 61 followers -	Ovis	A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.	539
19	Blaizzy 199 followers Poland	mlx-vlm	MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.	513
20	gokayfem 134 followers Turkey	awesome-vlm-architectures	Famous Vision Language Models and Their Architectures	445
21	nrl-ai 42 followers -	llama-assistant	AI-powered assistant to help you with your daily tasks, powered by Llama 3.2. It can recognize your voice, process natural language, and perform various actions based on your commands: summarizing text, rephrasing sentences, answering questions, writing emails, and more.	423
22	OpenBMB 4.3K followers -	VisRAG	Parsing-free RAG supported by VLMs	421
23	neonwatty 309 followers -	meme_search	Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast.	409
24	xiaoachen98 97 followers -	Open-LLaVA-NeXT	An open-source implementation for training LLaVA-NeXT.	398
25	jingyaogong 159 followers China	minimind-v	[Large Model] Train a 27M-parameter vision multimodal VLM from scratch in 3 hours; inference and training run on a personal GPU!	378
26	yueliu1999 269 followers Singapore	Awesome-Jailbreak-on-LLMs	Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.	368
27	developersdigest 425 followers -	ai-devices	AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI and more	281
28	RLHF-V 16 followers -	RLAIF-V	RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness	246
29	zhengli97 87 followers Hangzhou, China	PromptKD	[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"	239
30	JosefAlbers 24 followers -	Phi-3-Vision-MLX	Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon	237
31	baaivision 546 followers China	EVE	[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models	233
32	Curated-Awesome-Lists 27 followers -	Awesome-Open-AI-Sora	Sora AI Awesome List – Your go-to resource hub for all things Sora AI, OpenAI's groundbreaking model for crafting realistic scenes from text. Explore a curated collection of articles, videos, podcasts, and news about Sora's capabilities, advancements, and more.	216
33	CircleRadon 64 followers Hangzhou	TokenPacker	The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".	215
34	SoraFlows 1 followers -	SoraFlows	The most powerful and modular Sora WebUI, api and backend with OpenAI's Sora Model. Collecting the highest quality prompts for Sora. using NextJs and Tailwind CSS	195
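35	JiaqiLi404 7 followers Hong Kong	IAmDirector-Text2Video-NextJS-Client	This project open-sources a NextJS-based front end, intended as a reference web front-end platform for generative-AI text-to-video, especially for films from scriptwriting to video generation. Everyone can become a director. The Nextjs front-end of an AI driven platform for automatic movie/video generation (from GPT script generation to text2video movie generation). This is a free-trial AI video creation platform that integrates GPT-based video script generation and video generation. Our goal is to let everyone become a director and turn any everyday idea into a high-quality video as quickly as possible, whether a film, a marketing video, or social media content.	190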
36	TIGER-AI-Lab 165 followers Canada	Mantis	Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024)	184
37	AviSoori1x 93 followers San Francisco	seemore	From scratch implementation of a vision language model in pure PyTorch	164
38	mbodiai 19 followers United States of America	embodied-agents	Seamlessly integrate state-of-the-art transformer models into robotics stacks	164
39	RobotecAI 154 followers Poland	rai	RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more.	162
40	LostXine 83 followers Stony Brook, NY	LLaRA	LLaRA: Large Language and Robotics Assistant	156
41	OpenDriveLab 1.5K followers Hong Kong	ELM	[ECCV 2024] Embodied Understanding of Driving Scenarios	151
42	bz-lab 1 followers -	AUITestAgent	AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.	151
43	opendilab 1.3K followers China	PsyDI	PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements. (e.g. MBTI Measurement Agent)	151
44	sterzhang 8 followers -	image-textualization	Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)	145
45	fpgaminer 158 followers -	joycaption	JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.	144
46	mala-lab 58 followers -	InCTRL	Official implementation of CVPR'24 paper 'Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts'.	133
47	zytx121 128 followers Singapore	Awesome-VLGFM	A Survey on Vision-Language Geo-Foundation Models (VLGFMs)	129
48	ZebangCheng 9 followers -	Emotion-LLaMA	Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning	121
49	notune 12 followers -	captcha-solver	basic google recaptcha solver using llava-v1.6-7b	120
50	WangWenhao0716 68 followers Sydney	VidProM	[NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models	115
51	xlang-ai 446 followers -	Spider2-V	[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?	109
52	thu-ml 719 followers FIT Building, Tsinghua University, Beijing, China	MMTrustEval	A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)	108
53	OpenGVLab 2.4K followers -	MM-NIAH	[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.	102
54	graphic-design-ai 5 followers -	graphist	Official Repo of Graphist	100
55	chs20 5 followers -	RobustVLM	[ICML 2024] Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models	99
56	microsoft 79.6K followers Redmond, WA	eureka-ml-insights	A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.	90
57	aimagelab 122 followers Modena, Italy	LLaVA-MORE	LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1	86
58	2U1 33 followers -	Llama3.2-Vision-Finetune	An open-source implementation for fine-tuning Llama3.2-Vision series by Meta.	85
59	shikiw 58 followers -	Modality-Integration-Rate	The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".	85
60	mu-cai 49 followers Madison, WI	matryoshka-mm	Matryoshka Multimodal Models	84
61	fangyuan-ksgk 34 followers Singapore	Mini-LLaVA	A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.	84
62	Yxxxb 49 followers Shenzhen	VoCo-LLaMA	VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".	83
63	zjysteven 56 followers United States	VLM-Visualizer	Visualizing the attention of vision-language models	79
64	princeton-nlp 1.2K followers -	CharXiv	[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs	75
65	KwaiVGI 514 followers -	Uniaa	Unified Multi-modal IAA Baseline and Benchmark	70
66	ruili3 50 followers Zürich, Switzerland	Know-Your-Neighbors	[CVPR 2024] 🏡Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning	69
67	WisconsinAIVision 33 followers -	YoLLaVA	🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant	69
68	OpenRobotLab 414 followers -	VLM-Grounder	[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	69
69	HieuPhan33 15 followers Australia	CVPR2024_MAVL	Multi-Aspect Vision Language Pretraining - CVPR2024	64
70	skit-ai 42 followers Bangalore, India	SpeechLLM	This repository contains the training, inference, evaluation code for SpeechLLM models and details about the model releases on huggingface.	61
71	yihedeng9 25 followers -	STIC	Enhancing Large Vision Language Models with Self-Training on Image Comprehension.	59
72	Hon-Wong 7 followers -	Elysium	[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM	58
73	richard-peng-xia 43 followers Chapel Hill, NC, U.S.	CARES	[NeurIPS'24 & ICMLW'24] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models	56
74	Gumpest 122 followers Beijing	SparseVLMs	Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" proposed by Peking University and UC Berkeley.	56
75	BUAADreamer 81 followers Beijing	Chinese-LLaVA-Med	Chinese medical multimodal large model: Large Chinese Language-and-Vision Assistant for BioMedicine	55
76	jamjamjon 18 followers Shanghai	usls	A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models.	51
77	whwu95 144 followers -	FreeVA	FreeVA: Offline MLLM as Training-Free Video Assistant	49
78	miccunifi 41 followers Firenze - Viale Morgagni 65 - Italia	KDPL	[ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation	48
79	VisualWebBench 1 followers -	VisualWebBench	Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"	47
80	tmlr-group 97 followers Hong Kong	WCA	[ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models"	43
81	Victorwz 96 followers -	MLM_Filter	Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".	42
82	chenshuang-zhang 0 followers -	imagenet_d	[CVPR 2024 Highlight] ImageNet-D	38
83	bonjour-npy 18 followers -	UndergraduateDissertation	Undergraduate Dissertation of Guilin University of Electronic Technology	38
84	ProGamerGov 131 followers Multiverse	VLM-Captioning-Tools	Python scripts to use for captioning images with VLMs	34
85	yuecao0119 14 followers -	MMInstruct	The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.	34
86	balrog-ai 1 followers -	BALROG	Benchmarking Agentic LLM and VLM Reasoning On Games	34
87	Gahyeonkim09 4 followers Naju-si, South Korea	AAPL	AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (CVPRw 2024)	32
88	ParadoxZW 48 followers Hangzhou, China	LLaVA-UHD-Better	A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo	32
89	hewei2001 109 followers Shanghai	ReachQA	Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"	32
90	ai4ce 188 followers Brooklyn, NY, U.S.	LLM4VPR	Can multimodal LLM help visual place recognition?	31
91	RaptorMai 69 followers Columbus	CompBench	CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.	31
92	uni-medical 94 followers -	GMAI-MMBench	GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.	31
93	foundation-multimodal-models 6 followers -	ConBench	[NeurIPS'24] Official implementation of paper "Unveiling the Tapestry of Consistency in Large Vision-Language Models".	30
94	YunzeMan 77 followers Champaign, Illinois	Situation3D	[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning	26
95	NishilBalar 6 followers Germany	Awesome-LVLM-Hallucination	up-to-date curated list of state-of-the-art Large vision language models hallucinations research work, papers & resources	26
96	erfanshayegani 15 followers California, USA 🌴 🇺🇸	Jailbreak-In-Pieces	[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models	26
97	Oztobuzz 5 followers Ho Chi Minh	Vista	This is the official repository for the Vista dataset, a Vietnamese multimodal dataset containing more than 700,000 samples of conversations and images	24
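98	jurieo 15 followers -	chatgpt-share-web	A complete recreation of the official ChatGPT and Claude websites, including all of their features. Comes with a full user system and a traffic monetization system.	24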
99	bigai-nlco 18 followers -	VideoTGB	[EMNLP 2024] A Video Chat Agent with Temporal Prior	24
100	ai-aigc-studio 12 followers -	Kling-AI-Webui	Kling AI, Make Imagination Alive. This is a revolutionary text-to-video model like Sora. Kling AI WebUI is the open source project to integrate Kling AI Video Generation Model.	24