Top AI Projects by Category

A list of the most influential open-source AI projects, grouped by category. (Data sourced from GitHub and updated automatically every day.)


LLM Diffusion GPT RAG Multi-modality

Rank	Account (followers, location)	Project	Description	Stars
1	NexaAI 704 followers United States of America	nexa-sdk	Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio-language models, automatic speech recognition (ASR), and text-to-speech (TTS) capabilities.	5.2K
2	cambrian-mllm 33 followers -	cambrian	Cambrian-1 is a family of multimodal LLMs with a vision-centric design.	1.8K
3	QiuYannnn 35 followers Los Angeles	Local-File-Organizer	An AI-powered file management tool that ensures privacy by organizing local texts and images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.	1.7K
4	ShareGPT4Omni 20 followers -	ShareGPT4Video	[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions	1.3K
5	illuin-tech 39 followers Paris, France	colpali	The code used to train and run inference with the ColPali architecture.	1.2K
6	heshengtao 30 followers -	comfyui_LLM_party	LLM agent framework in ComfyUI. Includes Omost, GPT-sovits, ChatTTS, GOT-OCR2.0, and FLUX prompt nodes; provides access to Feishu and Discord; and adapts to all LLMs with OpenAI- or aisuite-style interfaces, such as o1, ollama, gemini, grok, qwen, GLM, deepseek, moonshot, and doubao. Also adapted to local LLMs, VLMs, and GGUF models such as llama-3.2, and links with GraphRAG / RAG.	1.1K
7	mbzuai-oryx 220 followers -	LLaVA-pp	🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)	812
8	Blaizzy 226 followers Poland	mlx-vlm	MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.	663
9	FoundationVision 359 followers -	Groma	[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization	568
10	zubair-irshad 242 followers Silicon Valley, CA, USA	Awesome-Robotics-3D	A curated list of 3D vision papers relating to the robotics domain in the era of large models (i.e., LLMs/VLMs), inspired by awesome-computer-vision, including papers, code, and related websites.	559
11	NVlabs 6.0K followers -	EAGLE	EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders	541
12	AIDC-AI 61 followers -	Ovis	A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.	539
13	nrl-ai 42 followers -	llama-assistant	AI-powered assistant to help you with your daily tasks, powered by Llama 3.2. It can recognize your voice, process natural language, and perform various actions based on your commands: summarizing text, rephrasing sentences, answering questions, writing emails, and more.	423
14	OpenBMB 4.3K followers -	VisRAG	Parsing-free RAG supported by VLMs	421
15	neonwatty 309 followers -	meme_search	Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast.	409
16	yueliu1999 273 followers Singapore	Awesome-Jailbreak-on-LLMs	Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.	409
17	xiaoachen98 97 followers -	Open-LLaVA-NeXT	An open-source implementation for training LLaVA-NeXT.	398
18	jingyaogong 159 followers China	minimind-v	[Large Models] Train a 27M-parameter vision multimodal VLM from scratch in 3 hours; both inference and training run on a personal GPU!	378
19	developersdigest 425 followers -	ai-devices	AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI and more	281
20	RLHF-V 16 followers -	RLAIF-V	RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness	246
21	JosefAlbers 24 followers -	Phi-3-Vision-MLX	Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon	237
22	baaivision 546 followers China	EVE	[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models	233
23	CircleRadon 64 followers Hangzhou	TokenPacker	The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".	215
24	TIGER-AI-Lab 173 followers Canada	Mantis	Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024)	189
25	RobotecAI 163 followers Poland	rai	RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more.	178
26	AviSoori1x 93 followers San Francisco	seemore	From scratch implementation of a vision language model in pure PyTorch	164
27	mbodiai 19 followers United States of America	embodied-agents	Seamlessly integrate state-of-the-art transformer models into robotics stacks	164
28	LostXine 83 followers Stony Brook, NY	LLaRA	LLaRA: Large Language and Robotics Assistant	156
29	bz-lab 1 follower -	AUITestAgent	AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.	151
30	opendilab 1.3K followers China	PsyDI	PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements. (e.g. MBTI Measurement Agent)	151
31	sterzhang 8 followers -	image-textualization	Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)	145
32	fpgaminer 158 followers -	joycaption	JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.	144
33	ZebangCheng 9 followers -	Emotion-LLaMA	Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning	121
34	xlang-ai 446 followers -	Spider2-V	[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?	109
35	thu-ml 719 followers FIT Building, Tsinghua University, Beijing, China	MMTrustEval	A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)	108
36	OpenGVLab 2.4K followers -	MM-NIAH	[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.	102
37	Ravi-Teja-konda 15 followers -	Surveillance_Video_Summarizer	VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage.	92
38	microsoft 81.1K followers Redmond, WA	eureka-ml-insights	A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.	90
39	aimagelab 122 followers Modena, Italy	LLaVA-MORE	LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1	86
40	2U1 33 followers -	Llama3.2-Vision-Finetune	An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.	85
41	shikiw 58 followers -	Modality-Integration-Rate	The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".	85
42	mu-cai 49 followers Madison, WI	matryoshka-mm	Matryoshka Multimodal Models	84
43	fangyuan-ksgk 34 followers Singapore	Mini-LLaVA	A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.	84
44	Yxxxb 49 followers Shenzhen	VoCo-LLaMA	VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".	83
45	zjysteven 56 followers United States	VLM-Visualizer	Visualizing the attention of vision-language models	79
46	balrog-ai 1 follower -	BALROG	Benchmarking Agentic LLM and VLM Reasoning On Games	77
47	princeton-nlp 1.2K followers -	CharXiv	[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs	75
48	ruili3 50 followers Zürich, Switzerland	Know-Your-Neighbors	[CVPR 2024] 🏡Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning	69
49	WisconsinAIVision 33 followers -	YoLLaVA	🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant	69
50	OpenRobotLab 414 followers -	VLM-Grounder	[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	69
51	skit-ai 42 followers Bangalore, India	SpeechLLM	This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.	61
52	yihedeng9 25 followers -	STIC	Enhancing Large Vision Language Models with Self-Training on Image Comprehension.	59
53	richard-peng-xia 43 followers Chapel Hill, NC, U.S.	CARES	[NeurIPS'24 & ICMLW'24] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models	56
54	Gumpest 122 followers Beijing	SparseVLMs	Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" proposed by Peking University and UC Berkeley.	56
55	BUAADreamer 81 followers Beijing	Chinese-LLaVA-Med	Chinese medical multimodal large model: Large Chinese Language-and-Vision Assistant for BioMedicine	55
56	whwu95 144 followers -	FreeVA	FreeVA: Offline MLLM as Training-Free Video Assistant	49
57	miccunifi 41 followers Firenze - Viale Morgagni 65 - Italia	KDPL	[ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation	48
58	tmlr-group 97 followers Hong Kong	WCA	[ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models"	43
59	yuecao0119 14 followers -	MMInstruct	The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.	34
60	Gahyeonkim09 4 followers Naju-si, South Korea	AAPL	AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (CVPRw 2024)	32
61	ParadoxZW 48 followers Hangzhou, China	LLaVA-UHD-Better	A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo	32
62	hewei2001 109 followers Shanghai	ReachQA	Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"	32
63	ai4ce 188 followers Brooklyn, NY, U.S.	LLM4VPR	Can multimodal LLM help visual place recognition?	31
64	RaptorMai 69 followers Columbus	CompBench	CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.	31
65	uni-medical 94 followers -	GMAI-MMBench	GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.	31
66	foundation-multimodal-models 6 followers -	ConBench	[NeurIPS'24] Official implementation of paper "Unveiling the Tapestry of Consistency in Large Vision-Language Models".	30
67	H-Freax 87 followers Boston	ThinkGrasp	[CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter. https://arxiv.org/abs/2407.11298	30
68	YunzeMan 77 followers Champaign, Illinois	Situation3D	[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning	26
69	erfanshayegani 15 followers California, USA 🌴 🇺🇸	Jailbreak-In-Pieces	[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models	26
70	ai-aigc-studio 12 followers -	Kling-AI-Webui	Kling AI, Make Imagination Alive. This is a revolutionary text-to-video model like Sora. Kling AI WebUI is the open source project to integrate Kling AI Video Generation Model.	24
71	sStonemason 0 followers -	RET-CLIP	RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports	23
72	declare-lab 257 followers Singapore University of Technology and Design	Emma-X	Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning	23
73	JUNJIE99 11 followers -	VISTA_Evaluation_FineTuning	Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding.	21
74	Sid2697 76 followers Bristol	HOI-Ref	Code implementation for paper titled "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision"	20
75	hllj 38 followers Ho Chi Minh City, Vietnam	Vistral-V	Vistral-V: Visual Instruction Tuning for Vistral - Vietnamese Large Vision-Language Model.	19
76	TobyYang7 18 followers Beijing & Shenzhen	Llava_Qwen2	Visual Instruction Tuning for Qwen2 Base Model	19
77	gptscript-ai 132 followers -	gptparse	Document parser for RAG	18
78	obiyoag 14 followers Shanghai, China	evi-CEM	Official implementation of MICCAI2024 paper "Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis"	17
79	reidbarber 28 followers -	webmarker	Mark web pages for use with vision-language models	16
80	AIDevBytes 20 followers -	LLava-Image-Analyzer	Llava, Ollama and Streamlit | Create POWERFUL Image Analyzer Chatbot for FREE - Windows & Mac	16
81	NVIDIA-Omniverse-blueprints 2 followers -	3d-conditioning	Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset.	16
82	zabir-nabil 211 followers California, USA	awesome-multilingual-large-language-models	A comprehensive collection of multilingual datasets and large language models, meticulously curated for evaluating and enhancing the performance of large language models across diverse languages and tasks.	15
83	egeozsoy 27 followers -	ORacle	Official code of the paper ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling accepted at MICCAI 2024.	15
84	HyperMink 0 followers Sydney, Australia	inferenceable	Scalable AI Inference Server for CPU and GPU with Node.js | Utilizes llama.cpp and parts of llamafile C/C++ core under the hood.	14
85	hasanar1f 31 followers Blacksburg, VA, USA	HiRED	[AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget.	14
86	ANYANTUDRE 23 followers Morocco	Florence-2-Vision-Language-Model	Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.	13
87	showlab 631 followers -	VisInContext	Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning	13
88	worldcuisines 1 follower -	worldcuisines	WorldCuisines is an extensive multilingual and multicultural benchmark that spans 30 languages, covering a wide array of global cuisines.	13
89	aliencaocao 68 followers Singapore	TIL-2024	Brainhack TIL 2024: Team 12000SGDPLUSHIE	12
90	S4mpl3r 6 followers Dubai, United Arab Emirates	okra	Okra, your all in one personal AI assistant	12
91	JinhaoLee 5 followers Melbourne, Australia	WCA	[ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models	11
92	wendell0218 0 followers -	GVA-Survey	Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms	10
93	jacobmarks 133 followers NYC	fiftyone_florence2_plugin	Run SOTA Vision-Language Model Florence-2 on your data!	9
94	Fsoft-AIC 45 followers -	Z-GMOT	[NAACL 2024] Z-GMOT: Zero-shot Generic Multiple Object Tracking	9
95	lamalab-org 56 followers -	mac-bench	Probing the limitations of multimodal language models for chemistry and materials research	9
96	xyproto 549 followers Oslo	describeimage	Describe images by using LLMs	8
97	med-air 163 followers -	PICG2scoring	[MICCAI'24] Incorporating Clinical Guidelines through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring	8
98	sovit-123 129 followers India	SAM_Molmo_Whisper	An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language.	8
99	tian1327 13 followers College Station	SWAT		7
100	sayedmohamedscu 40 followers Egypt	Vision-language-models-VLM	Fine-tuning notebooks and use cases for vision language models (paligemma - florence .....)	7