Top AI Projects by Category

A list of top influential open-source AI projects, organized by category. (Data sourced from GitHub, updated automatically every day.)
Ranking
Organization Account
Related Project
Project Intro
Star Count
1

om-ai-lab

652 followers
-
VLM-R1
Solve Visual Understanding with Reinforced VLMs
4.7K
2

NexaAI

577 followers
United States of America
nexa-sdk
Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio language models, automatic speech recognition (ASR), and text-to-speech (TTS).
4.4K
3

manycore-research

138 followers
Hangzhou, China
SpatialLM
SpatialLM: Large Language Model for Spatial Understanding
3.0K
4

bytedance

9.0K followers
Singapore
UI-TARS-desktop
A GUI agent application based on UI-TARS (Vision-Language Model) that allows you to control your computer using natural language.
2.9K
5

QiuYannnn

48 followers
Los Angeles
Local-File-Organizer
An AI-powered file management tool that ensures privacy by organizing local text and image files. Using the Llama3.2 3B and LLaVA v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.
2.1K
6

SkyworkAI

573 followers
Singapore
Skywork-R1V
Pioneering Multimodal Reasoning with CoT
2.1K
7

jingyaogong

522 followers
HangZhou, China
minimind-v
🚀 Train a 26M-parameter vision-language model (VLM) from scratch in just 1 hour!
1.5K
8

SkalskiP

5.6K followers
127.0.0.1
vlms-zero-to-hero
This series will take you on a journey from the fundamentals of NLP and Computer Vision to the cutting edge of Vision-Language Models.
1.0K
9

zubair-irshad

252 followers
Silicon Valley, CA, USA
Awesome-Robotics-3D
A curated list of 3D vision papers related to robotics in the era of large models (LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites.
650
10

OpenBMB

4.8K followers
-
VisRAG
Parsing-free RAG supported by VLMs
611
11

NVlabs

7.0K followers
-
EAGLE
Eagle Family: Exploring Model Designs, Data Recipes and Training Strategies for Frontier-Class Multimodal LLMs
602
12

yueliu1999

296 followers
Singapore
Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.
504
13

nrl-ai

51 followers
-
llama-assistant
AI-powered assistant to help you with your daily tasks, powered by Llama 3, DeepSeek R1, and many more models on HuggingFace.
486
14

vlm-run

31 followers
United States of America
vlmrun-hub
A hub for various industry-specific schemas to be used with VLMs.
459
15

awwaiid

97 followers
Washington, DC
ghostwriter
Use the reMarkable2 as an interface to vision-LLMs (ChatGPT, Claude, Gemini). Ghost in the machine!
436
16

Flame-Code-VLM

10 followers
-
Flame-Code-VLM
Flame is an open-source multimodal AI system designed to translate UI design mockups into high-quality React code. It leverages vision-language modeling, automated data synthesis, and structured training workflows to bridge the gap between design and front-end development.
367
17

fpgaminer

173 followers
-
joycaption
JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.
349
18

Aident-AI

5 followers
United States of America
open-cuak
Reliable Automation Agents at Scale
279
19

zjysteven

70 followers
United States
lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.
262
20

CircleRadon

72 followers
Hangzhou
TokenPacker
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
236
21

bz-lab

4 followers
-
AUITestAgent
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
192
22

Genta-Technology

20 followers
Indonesia
Kolosal
Kolosal AI is an open-source, lightweight alternative to LM Studio for running LLMs 100% offline on your device.
177
23

IDEA-Research

2.3K followers
China
ChatRex
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
156
24

TIGER-AI-Lab

215 followers
Canada
VLM2Vec
This repo contains the code for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks" [ICLR25]
152
25

baaivision

600 followers
China
DenseFusion
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
134
26

lucasjinreal

2.3K followers
San Francisco
Namo-R1
A 500M-parameter real-time VLM that runs on CPU. Surpasses Moondream2 and SmolVLM. Train from scratch with ease.
133
27

2U1

62 followers
-
Llama3.2-Vision-Finetune
An open-source implementation for fine-tuning Meta's Llama3.2-Vision series.
131
28

aimagelab

142 followers
Modena, Italy
LLaVA-MORE
LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
122
29

balrog-ai

1 follower
-
BALROG
Benchmarking Agentic LLM and VLM Reasoning On Games
117
30

microsoft

89.7K followers
Redmond, WA
eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
106
31

Ravi-Teja-konda

15 followers
-
Surveillance_Video_Summarizer
VLM-driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage.
102
32

FennelFetish

8 followers
-
qapyq
An image viewer and AI-assisted editing/captioning/masking tool that helps with curating datasets for generative AI models, finetunes and LoRA.
102
33

MDGrey33

13 followers
Riga, Latvia
pyvisionai
The PyVisionAI Official Repo
97
34

shikiw

65 followers
-
Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
96
35

WooQi57

41 followers
-
Helpful-Doggybot
Helpful DoggyBot: Open-World Object Fetching using Legged Robots and Vision-Language Models
90
36

fangyuan-ksgk

37 followers
Singapore
Mini-LLaVA
A minimal implementation of a LLaVA-style VLM with interleaved image, text, and video processing.
89
37

OpenRobotLab

531 followers
-
VLM-Grounder
[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
85
38

YunzeMan

88 followers
Champaign, Illinois
Lexicon3D
[NeurIPS 2024] Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
82
39

TrustGen

12 followers
United States of America
TrustEval-toolkit
TrustEval: A modular and extensible toolkit for comprehensive trust evaluation of generative foundation models (GenFMs)
79
40

Gumpest

129 followers
Beijing
SparseVLMs
Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference".
77
41

mtkresearch

85 followers
-
BreezeApp
Explore the future of AI! MediaTek Research's first open-source mobile app lets you experience our latest AI models directly on your phone. We bring AI into everyone's life, running fully offline for better privacy. This is an open-source project, and we warmly welcome developers and enthusiasts to join in and contribute to the advancement of AI. Join us now and help build a better AI experience!
73
42
3d-conditioning
Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset.
61
43

DataEval

1 follower
China
dingo
Dingo: A Comprehensive Data Quality Evaluation Tool
57
44

miccunifi

41 followers
Firenze - Viale Morgagni 65 - Italia
KDPL
[ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
53
45

hewei2001

116 followers
Shanghai
ReachQA
Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
48
46

iris0329

57 followers
-
SeeGround
[CVPR'25] SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
46
47

yuecao0119

17 followers
-
MMInstruct
The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.
43
48

ai4ce

206 followers
Brooklyn, NY, U.S.
SeeDo
Human Demo Videos to Robot Action Plans
41
49

declare-lab

285 followers
Singapore University of Technology and Design
Emma-X
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
39
50

uni-medical

106 followers
-
GMAI-MMBench
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.
37
51

USC-GVL

25 followers
United States of America
PhysBench
[ICLR 2025] Official implementation and benchmark evaluation repository of <PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding>
36
52

RaptorMai

70 followers
Columbus
CompBench
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
35
53

AIDC-AI

122 followers
-
Parrot
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
35
54

mbzuai-oryx

281 followers
-
AIN
AIN - The First Arabic Inclusive Large Multimodal Model. It is a versatile bilingual LMM excelling in visual and contextual understanding across diverse domains.
31
55

opendatalab

1.5K followers
China
UrBench
[AAAI 2025]This repo contains evaluation code for the paper “UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios”
29
56

JUNJIE99

16 followers
-
VISTA_Evaluation_FineTuning
Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding.
28
57

ANYANTUDRE

26 followers
Morocco
Florence-2-Vision-Language-Model
Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
26
58

sovit-123

140 followers
India
SAM_Molmo_Whisper
An integration of Segment Anything Model (SAM), Molmo, and Whisper to segment objects using voice and natural language.
23
59

hasanar1f

34 followers
Blacksburg, VA, USA
HiRED
[AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget.
21
60

Alpha-Innovator

37 followers
-
GeoX
Code for GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
21
61

gptscript-ai

142 followers
-
gptparse
Document parser for RAG
20
62

50n50

7 followers
-
sources
READ THE README
19
63

taco-group

68 followers
United States of America
Re-Align
A novel alignment framework that leverages image retrieval to mitigate hallucinations in Vision Language Models.
19
64

oztrkoguz

78 followers
Turkey
SubtitleAI
An AI-powered tool for summarizing YouTube videos by generating scene descriptions, translating them, and creating subtitled videos with text-to-speech narration
17
65

NVIDIA-AI-Blueprints

262 followers
United States of America
video-search-and-summarization
Blueprint for ingesting massive volumes of live or archived video and extracting insights for summarization and interactive Q&A.
17
66

worldcuisines

1 follower
-
worldcuisines
WorldCuisines is an extensive multilingual and multicultural benchmark that spans 30 languages, covering a wide array of global cuisines.
16
67

sayedmohamedscu

43 followers
Egypt
Vision-language-models-VLM
Vision-language model fine-tuning notebooks & use cases (PaliGemma, Florence, …)
14
68

wendell0218

0 followers
-
GVA-Survey
Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms
14
69

kesimeg

16 followers
-
awesome-turkish-language-models
A curated list of Turkish AI models, datasets, papers
14
70

tychenjiajun

41 followers
Guangzhou, China
exif-ai
A Node.js CLI and library that uses OpenAI, Ollama, ZhipuAI, Google Gemini, or Coze to write AI-generated image descriptions and/or tags to EXIF metadata based on the image's content.
13
71

med-air

175 followers
-
PICG2scoring
[MICCAI'24] Incorporating Clinical Guidelines through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring
11
72

FreedomIntelligence

418 followers
-
TRIM
We introduce a new approach, Token Reduction using CLIP Metric (TRIM), aimed at improving the efficiency of MLLMs without sacrificing their performance.
11
73

lamalab-org

69 followers
-
macbench
Probing the limitations of multimodal language models for chemistry and materials research
11
74

xlang-ai

489 followers
-
computer-agent-arena-hub
Computer Agent Arena Hub: Compare & Test AI Agents on Crowdsourced Real-World Computer Use Tasks
11
75

jacobmarks

147 followers
NYC
fiftyone_florence2_plugin
Run SOTA Vision-Language Model Florence-2 on your data!
9
76

MING-ZCH

52 followers
Wuhan, China
CII-Bench
Can MLLMs Understand the Deep Implication Behind Chinese Images?
9
77

securade

5 followers
Singapore
sentinel
Securade.ai Sentinel - A monitoring and surveillance application that enables visual Q&A and video captioning for existing CCTV cameras.
9
78

hyun-yang

0 followers
Brisbane, Australia
MyColPali
A PyQt6 application using ColPali and OpenAI to demonstrate efficient document retrieval with vision-language models.
8
79

kyegomez

1.9K followers
Palo Alto
VortexFusion
Transformers + Mambas + LSTMs All in One Model
7
80

DataFog

6 followers
United States of America
vlm-api
REST API for computing cross-modal similarity between images and text using the ColPali vision-language model
7
81

ola-krutrim

69 followers
India
Chitrarth
Chitrarth: Bridging Vision and Language for a Billion People
7
82

Video-Bench

0 followers
-
Video-Bench
Video Generation Benchmark
7
83

loong64

14 followers
China
ollama
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
7
84

2dameneko

0 followers
-
ide-cap-chan
ide-cap-chan is a utility for batch image captioning with natural language using various VL models
6
85

katha-ai

7 followers
India
VELOCITI
VELOCITI Benchmark Evaluation and Visualisation Code
5
86

tensorsense

11 followers
United States of America
vlm_databuilder
This SDK generates datasets for training video LLMs from YouTube videos.
5
87

david-s-martinez

8 followers
Munich, Germany
Dex-GAN-Grasp
DexGANGrasp: Dexterous Generative Adversarial Grasping Synthesis for Task-Oriented Manipulation - IEEE-RAS International Conference on Humanoid Robots (Humanoids) 2024 | DOI: 10.1109/Humanoids58906.2024.10769950
5
88

Ashad001

62 followers
Karachi
RoomAligner
A focus on aligning room elements for better flow and space utilization.
5
89

nguyennpa412

12 followers
Vietnam
simple-multimodal-ai
A simple Gradio application integrated with Hugging Face multimodal models to support a visual question answering chatbot and more features.
5
90

sitamgithub-MSIT

34 followers
Howrah, West Bengal
TextSnap
TextSnap: Demo for Florence 2 model used in OCR tasks to extract and visualize text from images.
4
91

sonstory

16 followers
Seoul, Korea
VLM-ZSAD-Paper-Review
Reviews of papers on zero-shot anomaly detection using vision-language models
4
92

Bhavik-Ardeshna

55 followers
Montreal, Quebec
Multimodal-VideoRAG
Multimodal-VideoRAG: Using BridgeTower Embeddings and Large Vision Language Models
4
93

the-smart-home-maker

5 followers
Germany
hass_ollama_image_analysis
Image analysis with Ollama (AI models) from within Home Assistant
3
94

BuddyLim

2 followers
Singapore
iuys
Intelligently Understanding Your Screenshots
3
95

Kazuhito00

696 followers
Aichi, Japan
MiniCPM-V2.6-Colaboratory-Sample
A Colaboratory sample for MiniCPM-V2.6, a lightweight VLM.
3
96

Pavansomisetty21

24 followers
-
Visual-Question-Answering-using-Gemini-LLM
Explores visual question answering using the Gemini LLM, with images provided via URL or in any file format.
3
97

bastien-muraccioli

1 follower
Tsukuba, Japan
svlr
SVLR: Scalable, Training-Free Visual Language Robotics: a modular multi-model framework for consumer-grade GPUs
3
98

asaddi

10 followers
California, USA
ComfyUI-YALLM-node
Yet another set of LLM nodes for ComfyUI (for local/remote OpenAI-like APIs, multi-modal models supported)
3
99

PandragonXIII

1 follower
China
CIDER
This is the official repository for Cross-modality Information Check for Detecting Jailbreaking in Multimodal Large Language Models.
3
100

XiaomingX

9.7K followers
Japan
awesome-text-to-video-plus
The ultimate guide to effortlessly creating AI videos for social media: go from text to eye-catching videos in just a few steps.
3
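The star counts in this list are displayed in abbreviated form (e.g. 4,712 stars shown as "4.7K"). As a minimal sketch, a small formatter like the hypothetical helper below could produce those labels from the raw counts returned by GitHub:

```python
def abbreviate_stars(count: int) -> str:
    """Format a raw GitHub star count the way this list displays it.

    Counts below 1000 are shown as-is; larger counts are shown with
    one decimal place and a "K" suffix (a hypothetical helper, not
    the site's actual code).
    """
    if count < 1000:
        return str(count)
    return f"{count / 1000:.1f}K"

print(abbreviate_stars(4712))   # 4.7K
print(abbreviate_stars(650))    # 650
print(abbreviate_stars(89700))  # 89.7K
```

Note that this scheme rounds rather than truncates, so a repo at 4,960 stars would display as "5.0K"; the exact rounding rule on the live page is an assumption here.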