Top AI Projects by Category

A list of the most influential open-source AI projects, organized by category. (Data sourced from GitHub and updated automatically every day.)
Rankings
Organization Account
Related Project
Project intro
Star count
1

bytedance

12.3K followers
Singapore
UI-TARS-desktop
The Open All-in-One Multimodal AI Agent Stack connecting Cutting-edge AI Models and Agent Infra.
15.0K
2

om-ai-lab

682 followers
-
VLM-R1
Solve Visual Understanding with Reinforced VLMs
5.3K
3

manycore-research

159 followers
Hangzhou, China
SpatialLM
SpatialLM: Training Large Language Models for Structured Indoor Modeling
3.5K
4

MiniMax-AI

2.0K followers
-
MiniMax-01
The official repo of MiniMax-Text-01 and MiniMax-VL-01, a large language model and a vision-language model based on linear attention.
3.0K
5

SkyworkAI

958 followers
Singapore
Skywork-R1V
Skywork-R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
2.6K
6

QiuYannnn

51 followers
Los Angeles
Local-File-Organizer
An AI-powered file management tool that ensures privacy by organizing local texts and images. Using the Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.
2.4K
7

SkalskiP

5.6K followers
127.0.0.1
vlms-zero-to-hero
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
1.0K
8

THUDM

10.8K followers
FIT Building, Tsinghua University
GLM-4.1V-Thinking
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning.
644
9

OpenBMB

5.1K followers
-
VisRAG
Parsing-free RAG supported by VLMs
611
10

PKU-YuanGroup

1.2K followers
China
UniWorld-V1
UniWorld: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation
583
11

vlm-run

46 followers
United States of America
vlmrun-hub
A hub for various industry-specific schemas to be used with VLMs.
510
12

nrl-ai

51 followers
-
llama-assistant
AI-powered assistant to help you with your daily tasks, powered by Llama 3, DeepSeek R1, and many more models on HuggingFace.
486
13

changyeyu

31 followers
China
LLM-RL-Visualized
🌟100+ original LLM / RL diagrams📚, a major contribution from the author of "Large Model Algorithms" (《大模型算法》)🎉 (100+ LLM/RL Algorithm Maps)
458
14

awwaiid

97 followers
Washington, DC
ghostwriter
Use the reMarkable2 as an interface to vision-LLMs (ChatGPT, Claude, Gemini). Ghost in the machine!
436
15

Flame-Code-VLM

10 followers
-
Flame-Code-VLM
Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.
367
16

fpgaminer

173 followers
-
joycaption
JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.
349
17

Hon-Wong

20 followers
-
VoRA
[Fully open] [Encoder-free MLLM] Vision as LoRA
299
18

TIGER-AI-Lab

307 followers
Canada
VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR 2025]
287
19

Aident-AI

5 followers
United States of America
open-cuak
Reliable Automation Agents at Scale
279
20

MigoXLab

1 follower
-
dingo
Dingo: A Comprehensive AI Data Quality Evaluation Tool
256
21

KolosalAI

24 followers
United States of America
Kolosal
Kolosal AI is an open-source, lightweight alternative to LM Studio for running LLMs 100% offline on your device.
227
22

2U1

89 followers
South Korea
Llama3.2-Vision-Finetune
An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.
156
23

IDEA-Research

2.5K followers
China
ChatRex
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
156
24

lucasjinreal

2.3K followers
San Francisco
Namo-R1
A real-time CPU VLM in 500M. Surpasses Moondream2 and SmolVLM. Easy to train from scratch.
133
25

FennelFetish

8 followers
-
qapyq
An image viewer and AI-assisted editing/captioning/masking tool that helps with curating datasets for generative AI models, finetunes and LoRA.
130
26

NVIDIA-AI-Blueprints

474 followers
United States of America
video-search-and-summarization
Blueprint for ingesting massive volumes of live or archived videos and extracting insights for summarization and interactive Q&A
130
27

RenzKa

63 followers
-
simlingo
[CVPR 2025, Spotlight] SimLingo (CarLLava): Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment
126
28

balrog-ai

1 follower
-
BALROG
Benchmarking Agentic LLM and VLM Reasoning On Games
117
29

mtkresearch

101 followers
-
BreezeApp
BreezeAPP is a purely phone-based AI app developed for Android and iOS. Download it from the App Store to enjoy a range of AI features without an internet connection. The source code is provided by MediaTek Research. We aim to promote two ideas: anyone is free to choose which LLM to run on their own phone, and any app developer can easily create purely phone-based AI apps.
110
30

TrustGen

13 followers
United States of America
TrustEval-toolkit
Toolkit for evaluating the trustworthiness of generative foundation models.
105
31

Ravi-Teja-konda

15 followers
-
Surveillance_Video_Summarizer
VLM-driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 vision-language model. Includes a Gradio-based interface for querying and analyzing video footage.
102
32

MDGrey33

13 followers
Riga, Latvia
pyvisionai
The PyVisionAI Official Repo
97
33

shikiw

65 followers
-
Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
96
34

WooQi57

41 followers
-
Helpful-Doggybot
Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models
90
35

fangyuan-ksgk

37 followers
Singapore
Mini-LLaVA
A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.
89
36

OpenRobotLab

531 followers
-
VLM-Grounder
[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
85
37

Gumpest

129 followers
Beijing
SparseVLMs
Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".
77
38

Osilly

55 followers
-
Awesome-Interleaving-Reasoning
Interleaving Reasoning: Next-Generation Reasoning Systems for AGI
77
39

FakeOAI

9 followers
-
tokens
A token management platform that reverse-engineers the conversation interfaces of ChatGPT, Cursor, Grok, Claude, Windsurf, Gemini, and Sora and converts them into the OpenAI format.
76
40

UMass-Embodied-AGI

141 followers
United States of America
Mirage
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)
63
41
3d-conditioning
Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset.
61
42

ai4ce

220 followers
Brooklyn, NY, U.S.
SeeDo
[IROS 2025] Human Demo Videos to Robot Action Plans
54
43

saidwivedi

85 followers
Germany
InteractVLM
[CVPR 2025] InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
51
44

wendell0218

0 followers
-
GVA-Survey
Official repository of the paper "Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms"
49
45

hewei2001

116 followers
Shanghai
ReachQA
Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
48
46

iris0329

57 followers
-
SeeGround
[CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
46
47

GAD-cell

10 followers
-
vlm-grpo
An implementation of GRPO for training VLMs with Unsloth
40
48

thubZ09

13 followers
India
all-things-multimodal
Hub for researchers exploring VLMs and Multimodal Learning:)
40
49

declare-lab

285 followers
Singapore University of Technology and Design
Emma-X
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
39
50

kesimeg

19 followers
-
awesome-turkish-language-models
A curated list of Turkish AI models, datasets, and papers
38
51

USC-GVL

25 followers
United States of America
PhysBench
[ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding>
36
52

tsunghan-wu

61 followers
Berkeley, CA
reverse_vlm
🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling"
34
53

Video-Bench

0 followers
-
Video-Bench
Video Generation Benchmark
32
54

mbzuai-oryx

281 followers
-
AIN
AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains.
31
55

roboflow

3.5K followers
United States of America
vision-ai-checkup
Take your LLM to the optometrist.
31
56

LiuHengyu321

59 followers
Hong Kong
IR3D-Bench
Official Code of IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
30
57

NVIDIA-NeMo

34 followers
-
Automodel
Day-0 support for any Hugging Face model, leveraging PyTorch-native functionality while providing performance- and memory-optimized training and inference recipes.
26
58

sovit-123

140 followers
India
SAM_Molmo_Whisper
An integration of the Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language.
23
59

ArmenJeddi

8 followers
Toronto
saint
A training-free approach to accelerate ViTs and VLMs by pruning redundant tokens based on similarity
22
60

gptscript-ai

142 followers
-
gptparse
Document parser for RAG
20
61

taco-group

111 followers
United States of America
Re-Align
A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.
19
62

kornia

205 followers
Spain
bubbaloop
🦄 Serving Platform for Spatial AI and Robotics.
19
63

col14m

15 followers
-
cadrille
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
19
64

oztrkoguz

78 followers
Turkey
SubtitleAI
An AI-powered tool for summarizing YouTube videos by generating scene descriptions, translating them, and creating subtitled videos with text-to-speech narration
17
65

StabRise

5 followers
-
ScaleDP
ScaleDP is an open-source extension of Apache Spark for document processing.
13
66

gptbmw

0 followers
-
wildcard
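The latest guide to the wildcard (野卡) virtual credit card: a wildcard registration tutorial, how to open a wildcard credit card, and how to top up and withdraw funds.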
13
67

stogiannidis

13 followers
Edinburgh
srbench
Source code for the Paper "Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models"
12
68

SiyuWang0906

1 follower
-
CAD-GPT
[AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
12
69

FreedomIntelligence

455 followers
-
TRIM
We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.
11
70

xlang-ai

554 followers
-
computer-agent-arena-hub
Computer Agent Arena Hub: Compare & Test AI Agents on Crowdsourced Real-World Computer Use Tasks
11
71

miccunifi

41 followers
Firenze, Viale Morgagni 65, Italy
Cross-the-Gap
[ICLR 2025] - Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion
11
72

OPTML-Group

76 followers
East Lansing, Michigan
VLM-Safety-MU
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-tuning
11
73

MING-ZCH

52 followers
Wuhan, China
CII-Bench
Can MLLMs Understand the Deep Implication Behind Chinese Images?
9
74

securade

5 followers
Singapore
sentinel
Securade.ai Sentinel - A monitoring and surveillance application that enables visual Q&A and video captioning for existing CCTV cameras.
9
75

Open-Social-World

0 followers
-
EgoNormia
EgoNormia | Benchmarking Physical Social Norm Understanding in VLMs
9
76

HaoyuanYang-2023

1 follower
-
ImagineFSL
Official implementation of "ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning" [CVPR 2025 Highlight]
9
77

WILLOSCAR

0 followers
China
Awesome-HCI-LLM
Awesome-HCI (Ubiquitous, LLM, MLLM, Agent, RAG, Embodied-AI, RLHF)
9
78

joanibal

23 followers
California
OptVL
AVL + python + optimization = OptVL
9
79

hyun-yang

0 followers
Brisbane, Australia
MyColPali
A PyQt6 application that uses ColPali and OpenAI to demonstrate efficient document retrieval with vision-language models
8
80

YanNeu

11 followers
-
DASH
DASH: Detection and Assessment of Systematic Hallucinations of VLMs
8
81

DataFog

6 followers
United States of America
vlm-api
REST API for computing cross-modal similarity between images and text using the ColPali vision-language model
7
82

ola-krutrim

69 followers
India
Chitrarth
Chitrarth: Bridging Vision and Language for a Billion People
7
83

loong64

14 followers
China
ollama
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
7
84

Theia-4869

26 followers
Beijing, China
VisPruner
[ICCV 2025] Official code for paper: Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
7
85

intelligolabs

12 followers
Italy
CoIN
[ICCV 25] Official repository of "Collaborative Instance Object Navigation: Leveraging Uncertainty-Awareness to Minimize Human-Agent Dialogues"
7
86

Davidlequnchen

16 followers
Singapore
VLM-CADFeatureRecognition
This repository provides code and resources for automating manufacturing feature recognition in CAD designs using vision-language models.
7
87

2dameneko

0 followers
-
ide-cap-chan
ide-cap-chan is a utility for batch image captioning with natural language using various VL models
6
88

HKU-TASR

13 followers
Hong Kong
Geminio
[ICCV 2025] Geminio is a VLM-powered gradient inversion attack in federated learning (FL). It allows the adversary (the FL server) to describe the data of value and reconstruct the victim client's private data matching the description.
6
89

uzh-dqbm-cmi

17 followers
Zurich
RadVLM
A Multitask Conversational Vision-Language Model for Radiology
6
90

david-s-martinez

8 followers
Munich, Germany
Dex-GAN-Grasp
DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for Task-Oriented Manipulation - IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2024 | DOI: 10.1109/Humanoids58906.2024.10769950
5
91

Ashad001

62 followers
Karachi
RoomAligner
A focus on aligning room elements for better flow and space utilization.
5
92

Sathees2482

0 followers
-
google-veo3-from-scratch
Google Veo 3 Implemented from Scratch. This repository contains an implementation of Google Veo 3, a cutting-edge text-to-video generation system. 🎥 Explore the code to create high-quality videos from text prompts and enhance your projects with advanced AI capabilities. 🌟
5
93

sonstory

16 followers
Seoul, Korea
VLM-ZSAD-Paper-Review
Reviews of papers on zero-shot anomaly detection using vision-language models
4
94

Bhavik-Ardeshna

55 followers
Montreal, Quebec
Multimodal-VideoRAG
Multimodal-VideoRAG: Using BridgeTower Embeddings and Large Vision Language Models
4
95

JoeJoe1313

24 followers
Sofia, Bulgaria
LLMs-Journey
Various LLM resources and experiments
4
96

Traffic-Alpha

22 followers
-
VLMLight
Official implementation of VLMLight
4
97

vbdi

6 followers
Canada
casp
[CVPR 2025 Highlight] CASP: Compression of Large Multimodal Models Based on Attention Sparsity
4
98

asaddi

10 followers
California, USA
ComfyUI-YALLM-node
Yet another set of LLM nodes for ComfyUI (for local/remote OpenAI-like APIs, multi-modal models supported)
3
99

PandragonXIII

1 follower
China
CIDER
This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models.
3
100

XiaomingX

9.8K followers
Japan
awesome-text-to-video-plus
The Ultimate Guide to Effortlessly Creating AI Videos for Social Media: Go from Text to Eye-Catching Videos in Just a Few Steps
3