Top AI Projects by Category

A list of the most influential open-source AI projects, grouped by category. (Data sourced from GitHub, updated automatically every day.)
Rankings
Organization Account
Related Project
Project intro
Star count
1

zai-org

673 followers
-
GLM-V
GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
1.5K
2

changyeyu

47 followers
China
LLM-RL-Visualized
🌟 100+ original LLM / RL principle diagrams 📚, presented by the author of "Large Model Algorithms" (《大模型算法》)! 💥 (100+ LLM/RL Algorithm Maps)
1.2K
3

SkyworkAI

1.2K followers
Singapore
UniPic
Building Kontext Model with Online RL for Unified Multimodal Model
708
4

XiaomiMiMo

294 followers
-
MiMo-VL
MiMo-VL
467
5

UMass-Embodied-AGI

150 followers
United States of America
Mirage
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (arXiv 2025)
115
6

599yongyang

3 followers
-
DatasetLoom
An intelligent dataset construction and evaluation platform for training multimodal large models
101
7

hcompai

9 followers
-
surfer-h-cli
Run Surfer-H agents powered by Holo1 using the Surfer-H-CLI. Includes example tasks, scripts, and configurations.
96
8

saidwivedi

89 followers
Germany
InteractVLM
[CVPR 2025] InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
92
9

Tencent

11.6K followers
Shenzhen, China
AngelSlim
Model compression toolkit engineered for enhanced usability, comprehensiveness, and efficiency.
88
10

Osilly

55 followers
-
Awesome-Interleaving-Reasoning
Interleaving Reasoning: Next-Generation Reasoning Systems for AGI
77
11

GAD-cell

15 followers
Paris
vlm-grpo
An implementation of GRPO for Unsloth's VLMs training
63
12

NVIDIA-NeMo

79 followers
-
Automodel
Fine-tune any Hugging Face LLM or VLM on day-0 using PyTorch-native features for GPU-accelerated distributed training with superior performance and memory efficiency.
48
13

tsunghan-wu

63 followers
Berkeley, CA
reverse_vlm
🔥 Official implementation of "Generate, but Verify: Reducing Visual Hallucination in Vision-Language Models with Retrospective Resampling"
39
14

FareedKhan-dev

575 followers
Karachi, Pakistan
google-veo3-from-scratch
A Step-by-Step Implementation of Google Veo 3 Architecture from Scratch
33
15

roboflow

3.8K followers
United States of America
vision-ai-checkup
Take your LLM to the optometrist.
31
16

LiuHengyu321

59 followers
Hong Kong
IR3D-Bench
Official Code of IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering
30
17

col14m

15 followers
-
cadrille
cadrille: Multi-modal CAD Reconstruction with Online Reinforcement Learning
19
18

supermarsx

14 followers
Δ on earth™
sora-json-prompt-crafter
A vibe coded Sora JSON Prompt Crafter for curious humans and prompt engineers
14
19

OpenGVLab

3.1K followers
-
VRBench
[ICCV 2025] A Benchmark for Multi-Step Reasoning in Long Narrative Videos
14
20

SiyuWang0906

1 followers
-
CAD-GPT
[AAAI2025] CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
12
21

liuyifan22

12 followers
-
Qwen2.5-VL-Batched
A batched implementation for efficient Qwen2.5-VL inference.
12
22

l8cv

0 followers
China
BusterX
BusterX and BusterX++
12
23

Traffic-Alpha

26 followers
-
VLMLight
Official implementation of VLMLight
9
24

KolosalAI

27 followers
United States of America
kolosal-cli
Super lightweight Ollama alternative to run Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1 and other large language models.
9
25

Theia-4869

26 followers
Beijing, China
VisPruner
[ICCV 2025] Official code for paper: Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
7
26

Sathees2482

0 followers
-
google-veo3-from-scratch
Google Veo 3 Implemented from Scratch. This repository contains an implementation of Google Veo 3, a cutting-edge text-to-video generation system. 🎥 Explore the code to create high-quality videos from text prompts and enhance your projects with advanced AI capabilities. 🌟
5
27

weichow23

28 followers
-
merit
Official Repo for Paper <MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query>
5
28

helenqu

20 followers
Philadelphia, PA
multimodal-pretraining-pmi
Impact of Pretraining Word Co-occurrence on Compositional Generalization in Multimodal Models
5
29

Oznake

0 followers
-
Awesome-LLM-reasoning-papers
This repository offers a well-organized collection of resources focused on reasoning in Large Language Models (LLMs). Explore foundational papers, evaluation benchmarks, and practical tools to enhance your understanding of LLM reasoning. 🐙🌐
4
30

PRITHIVSAKTHIUR

79 followers
India
Qwen2.5-VL-Video-Understanding
The Qwen2.5-VL-7B-Instruct model is a multimodal AI model developed by Alibaba Cloud that excels at understanding both text and images. It's a Vision-Language Model (VLM) designed to handle various visual understanding tasks, including image understanding, video analysis, and even multilingual support.
4
31

SuyogKamble

0 followers
-
simpleVLM
building a simple VLM. Implementing LlaMA-SmolLM2 from scratch + SigLip2 Vision Model. KV-Caching is supported and implemented from scratch as well
3
32

alaa-nadi

0 followers
Cairo
UI-TARS-desktop
A GUI Agent application based on UI-TARS(Vision-Language Model) that allows you to control your computer using natural language.
3
33

johnnyhank

4 followers
Shanghai
MIRA-Multimodal-Intelligent-Robotic-Assistant
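A multimodal large model built on the Qwen Agent framework, integrating a JAKA robotic arm, visual detection, speech recognition and synthesis, and an MCP database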
2
34

daviden1013

6 followers
-
vlm4ocr
Python package and Web App for OCR with vision language models.
2
35

argirovga

1 followers
Moscow
GeospatialVLM
VLM specially crafted for geospatial reasoning tasks
2
36

chatgpt-helper-tech

40 followers
China
guide
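2025 ChatGPT usage tutorial and best practices, covering registration and setup, prompt templates, Explorer GPT, and DALL·E/Sora usage tips; suitable for beginners and advanced users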
2
37

atasoglu

51 followers
-
awesome-turkish-vlm
A curated list of models, datasets and other useful resources for Turkish Vision-Language Models (VLM).
2
38

kiranbaby14

9 followers
London, UK
TalkMateAI
🎭 Real-time voice-controlled 3D avatar with multimodal AI - speak naturally and watch your AI companion respond with perfect lip-sync
2
39

jagennath-hari

28 followers
California
SpatialFusion-LM
SpatialFusion-LM is a real-time spatial reasoning framework that combines neural depth, 3D reconstruction, and language-driven scene understanding.
1
40

6Morpheus6

53 followers
-
bagel
[NVIDIA ONLY] Image generation, image editing and free-form manipulation with a VLM (Minimum requirements: 12GB VRAM / 32GB RAM; recommended requirements: 24GB VRAM / 48GB RAM)
1
41

gustavokuklinski

94 followers
Rio de Janeiro
aeon.ai
AEON is a lightweight, stateless RAG chatbot that answers questions using your Markdown, Text, and JSON documents. It runs locally on your CPU with at least 8GB RAM, leveraging Ollama for LLMs and Chroma as its vector database.
1
42

visresearch

18 followers
-
LLaVA-STF
The official implementation of "Learning Compact Vision Tokens for Efficient Large Multimodal Models"
1
43

Mitsuya133

0 followers
-
Imgscope-OCR-2B-0527
Imgscope-OCR-2B-0527 is a powerful model designed for messy handwriting recognition and document OCR. It excels in multi-modal tasks, providing users with advanced capabilities for understanding complex visual and textual data. 🐙🌟
1
44

Kazuhito00

720 followers
Aichi, Japan
Kimi-VL-Colaboratory-Sample
A sample for trying out Kimi-VL on Colaboratory
1
45

ictnlp

247 followers
Beijing, China
Stream-Omni
Stream-Omni is an end-to-end language-vision-speech chatbot that simultaneously supports interaction across various modality combinations.
1
46

thisisiron

31 followers
Seoul, South Korea
vision-token-calculator
🧮 A calculator for vision tokens in VLMs.
1
47

Ethel75

0 followers
-
NoteMR
NoteMR enhances multimodal large language models for visual question answering by integrating structured notes. This implementation aims to reduce reasoning errors and improve visual feature perception. 🐙📚
1
48

kyegomez

2.0K followers
Palo Alto
SSM-As-VLM-Bridge
An exploration into leveraging SSM's as Bridge/Adapter Layers for VLM
1
49

altunenes

87 followers
-
calcarine
Desktop VLM: Real-time FastVLM analysis of video & textures with live compute shaders
1
50

Merserk

0 followers
-
Caption-Creator
Caption Creator is a fast and portable tool for generating high-quality image captions and tags—ideal for custom dataset creation, especially for (FLUX Dev, Pony, SDXL 1.0 Base, Illustrious), and more. Works seamlessly for both training and image generation.
1
51

adam-aimoscloud

0 followers
-
MoleSearch
Multimodal data Retriever, including text, image, video, audio
1
52

rbiswasfc

146 followers
Singapore
crag-mm
CRAG-MM Challenge Solution Code
1
53

Zoher15

11 followers
Bloomington
Zero-shot-s2
Task-aligned prompting improves zero-shot detection of AI-generated images by Vision-Language Models
1
54

youcefgheffari3

2 followers
Oran, Algeria
vlm_instruction_follower
Instruction-following vision-language model (VLM): grounded text instructions executed via multi-modal reasoning
1
55

iameas

2 followers
Abuja, Nigeria
sora-extension
Sora extension support for VS Code
1
56

ricochetservice

0 followers
-
Gemma3_OCR_Text_Extractor_LLM
Gemma-3 OCR exemplifies the confluence of abstruse computer vision and arcane NLP, leveraging Gemma-3 Vision’s neural framework for precise OCR and semantically refined text curation. Powered by Streamlit and Ollama, this hermetic system converts visual data into perspicuous, markdown-rendered output, ensuring maximal accuracy and confidentiality.
1
57

muratcanlaloglu

2 followers
Istanbul
moonlabel
Moondream VLM-powered labeler one-click YOLO export
1
58

ChaoLinAViy

1 followers
-
OMGM
OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval (ACL 2025 Main Conference)
1
59

RauhanAhmed

16 followers
Bhopal, Madhya Pradesh, India
AlphaExtract
AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.
0
60

LoupXpro

0 followers
-
AlphaExtract
AlphaExtract is a sophisticated PDF summarization tool that combines cutting-edge AI technology with efficient document processing. The project is built using Python and leverages Meta's LLaMA 4 MOE Maverick model along with Groq's inference engine to provide fast and accurate PDF summaries.
0
61

kosaokis

0 followers
-
LLaMA-Factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
0
62

elesxx

0 followers
-
Agent-S
Agent S: an open agentic framework that uses computers like a human
0
63

sevkaz

12 followers
-
AICTF_2025
AI Security in Practice: Analysis of AI CTF Tasks at Positive Hack Days
0
64

txkhaund

2 followers
Plano, TX
Multimodal-Product-Intelligence-System
Solution to understand and align product images, text descriptions, and customer reviews to flag issues (e.g., misleading images, product defects, or mismatch between title and image).
0
65

kaicheng001

23 followers
Nanjing, China
Awesome-R1
A curated list of research papers, models, and resources related to R1-style reasoning models following DeepSeek-R1's breakthrough in January 2025.
0
66

manumishra12

43 followers
New Delhi, Delhi, India
SmartGrader
SmartGrader represents a breakthrough in educational technology, combining cutting-edge Vision-Language Models (VLMs) and Large Language Models (LLMs) to revolutionize the assessment of handwritten computer science assignments.
0
67

lschirripa

12 followers
Buenos Aires
capture-to-vlm
This project provides a comprehensive system for analyzing real-time videos using Vision Language Models (VLM) and generating summaries of the content. The system works in two main phases: real-time frame analysis and post-processing summarization.
0
68

rmsandu

16 followers
Switzerland
In-Context-multiview-img-generation
Generate multiple 2D/4D views of the same object/scene with IC-LoRA and Flux
0
69

esdraskololo

0 followers
-
File-Organizer-Tool
Organize your files effortlessly with the File Organizer Tool, which sorts them into subdirectories based on their prefixes. This versatile tool offers both GUI and command-line interfaces, supporting multiple languages and themes for a personalized experience. 🗂️💻
0
70

blackteck

1 followers
Bangalore
Multimodal-RAG
Multimodal RAG using Colsmolvlm in colab free-tier GPU
0
71

JoeJoe1313

27 followers
Sofia, Bulgaria
PaliGemma-Image-Segmentation
An app with FastAPI, Docker, transformers, JAX/Flax for performing image segmentation with PaliGemma 2 mix
0
72

daniel-mehta

3 followers
Toronto, ON
FitCheck.AI
👔 FitCheck.AI is your personal AI stylist. Upload outfits for savage critiques, auto-tag your wardrobe, and get smart recommendations - powered by Streamlit, LangChain, MongoDB, and VLMs like CLIP and Qwen.
0
73

yewonseowill

0 followers
-
aiprof
Software Engineering Team 1
0
74

easonlai

67 followers
Hong Kong
video_generator_with_sora
Sora Video Generator is a Streamlit app for effortless AI video creation. Just describe your idea in one sentence—no tech skills needed. It uses Azure OpenAI to craft prompts for the Sora API, handling everything from submission to download.
0
75

ParthaPRay

15 followers
India
daycare_ollama_analysis
This repo presents code that allows users to run a pipeline to analyze daycare images using YOLO, Ollama, VLM, and reasoning LLMs locally
0
76

AzozzALFiras

110 followers
127.0.0.1
SoraChatGPTDownloader
Download Videos from Sora ChatGPT php
0
77

mrvaibhavsoni

0 followers
-
VLM-Mamba
Revolutionize vision-language tasks with VLM-Mamba, the first model using State Space Models. Explore innovative multi-modal architecture. 🚀💻
0
78

peter-gy

71 followers
Vienna
AutoVisType
Probing vision-language model alignment with human expert visual grouping over stratified sample of VIS30K dataset.
0
79

Phantom-fs

15 followers
Asia
Uc-PrUn
Uc-PrUn: Uncertainty-calibrated Data Pruning and Unlearning framework for vision-language models (VLMs)
0
80

mustapharochdi

0 followers
-
TalkMateAI
Create immersive conversations with TalkMateAI, a real-time voice-controlled 3D avatar. Experience natural interactions powered by advanced AI. 🐙🌐
0
81

Fortunato777a

0 followers
-
cutlass
CUTLASS 4.1.0 offers high-performance matrix-matrix multiplication in CUDA, with flexible abstractions for custom kernels. Perfect for efficient linear algebra. 🚀💻
0
82

genji970

0 followers
Korea, Gyeonggi-do
3d-vlm-gaussian-splatting-pointclip-on-modelnet40-and-scanobjectnn
Achieved over 96% top-1 accuracy on the ModelNet40 test dataset and 99.91% top-1 accuracy on the ScanObjectNN test dataset with lightweight custom 3D models, projecting the 3D point cloud dataset (with the Gaussian splatting method) into 2D images. And lastly, CLIP ViT-16
0
83

DoMaLi94

4 followers
Germany
dspy-experiments
Hands-on experiments with the DSPy framework using local Ollama models. Features basic QA systems, multimodal image processing with LLaVA, and interactive Jupyter notebooks. Privacy-focused with local inference and no API costs.
0
84

skanda-vijaykumar

0 followers
-
Business-card-info-extraction
Detect business cards and extract information in a structured format using VLMs.
0
85

dingdingboy

0 followers
-
School_Behavior_Analyzer
A Python application for detecting, tracking, and analyzing classroom behavior using computer vision and large vision-language models (VLMs). The system detects and tracks people in video streams, saves cropped person videos, and analyzes posture changes using a VLM.
0
86

gaurisharan

28 followers
Mumbai, Maharashtra, India
Byaldi-Qwen-img-reader
Image to text reader for English and Hindi. Made with combining Byaldi and Qwen2VL vision language models.
0
87

ChiragVaghela10

1 followers
Mönchengladbach, Germany
lidar_vqa
Multimodal system combining RGB images and LiDAR depth cues to answer questions about driving scenes using fine-tuned CLIP (ViT-B/32) and fusion strategies.
0
88

mahimairaja

71 followers
Toronto 🇨🇦
nuextract-2.0-receipts-fastapi
Efficient parsing of scanned receipts from walmart using NuExtract 2.0 VLM, FastAPI and hosted in Modal Labs Serverless Deployment
0
89

liswahyuni

5 followers
-
Action-RecognitionVLM
A project demonstrating zero-shot and few-shot action recognition on the UCF101 dataset using CLIP. Includes evaluation, fine-tuning, and embedding space visualizations.
0
90

phrugsa-limbunlom

5 followers
United Kingdom
vlm-lora
LoRA from scratch for VLM fine tuning
0
91

simon-gardier

7 followers
Europe
personalization-toolkit-for-lvlm-review
Review of the paper "Personalization Toolkit: Training Free Personalization of Large Vision Language Models"
0
92

mbodua1

0 followers
-
diverticulitis_ollama_LLM
AI-driven dietary guidance for diverticulitis. Upload meal photos for analysis, get food safety ratings, and receive personalized advice. 🍽️💻
0
93

kucingcoder

0 followers
Kota Tegal, Jawa Tengah, Indonesia
miramo
A Flask-based web app for managing multimodal datasets (text and images) with CRUD operations via SQLite, and seamless export as a structured Parquet dataset to Hugging Face Hub.
0
94

FaNa-AI

3 followers
Tehran,IRI
VLM
Generate natural language captions for images using the BLIP vision-language model by Salesforce. Easily run it in Google Colab with GPU support, using the Flickr8k-2k image dataset from Kaggle.
0
95

norikinishida

14 followers
Tokyo, Japan
mllm-gesture-eval
Code and dataset for evaluating Multimodal LLMs on indexical, iconic, and symbolic gestures (Nishida et al., ACL 2025)
0
96

Y1D1R

4 followers
Paris
smart-vehicle-detector
**Smart Vehicle Detector** is an AI-powered system that combines YOLO for object detection and a VLM to classify vehicle types more accurately. This project demonstrates the integration of modern computer vision and language models for intelligent scene understanding.
0
97

waruhachi

51 followers
-
Sora
A collection of Sora modules
0
98

alexpalms

19 followers
Montreal, Quebec, Canada
alexpalms
The special repo for GitHub
0
99

sachinkum0009

40 followers
Dortmund, Germany
bandu
Bandu: AI Agents based on ROS2
0
100

Y-0023

1 followers
-
cua
c/ua is the Docker Container for Computer-Use AI Agents.
0