Top AI Projects by Category

A list of top influential AI open-source projects, listed by category. (Data sourced from GitHub, updated automatically every day.)
Rankings | Organization Account | Related Project | Project intro | Star count
1

NexaAI

563 followers
United States of America
nexa-sdk
Nexa SDK is a comprehensive toolkit supporting ONNX and GGML models. It supports text generation, image generation, vision-language models (VLM), automatic speech recognition (ASR), and text-to-speech (TTS) capabilities.
3.9K
2

dvlab-research

651 followers
-
MGM
Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
3.2K
3

SoraWebui

60 followers
-
SoraWebui
SoraWebui is an open-source Sora web client, enabling users to easily create videos from text with OpenAI's Sora model.
2.3K
4

deepseek-ai

2.2K followers
-
DeepSeek-VL
DeepSeek-VL: Towards Real-World Vision-Language Understanding
2.1K
5

cambrian-mllm

33 followers
-
cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
1.8K
6

QiuYannnn

35 followers
Los Angeles
Local-File-Organizer
An AI-powered file management tool that ensures privacy by organizing local text and image files. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval. (A conceptual sketch follows this entry.)
1.7K
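Purely as an illustration of the scan-and-reorganize idea this entry describes (this is not the project's actual model-driven logic, and the folder paths below are hypothetical), a minimal sketch could look like:

```python
from pathlib import Path
import shutil

# Hypothetical source and destination folders for this sketch.
SRC = Path("~/Downloads").expanduser()
DST = Path("~/Organized").expanduser()

# Map file suffixes to category folders; the real tool instead infers
# categories from file *content* using local LLM/VLM models.
CATEGORIES = {".txt": "texts", ".md": "texts", ".jpg": "images", ".png": "images"}

for f in SRC.glob("*"):
    if not f.is_file():
        continue
    category = CATEGORIES.get(f.suffix.lower(), "other")
    target_dir = DST / category
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(f, target_dir / f.name)  # copy rather than move, to stay safe
```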
7

ShareGPT4Omni

20 followers
-
ShareGPT4Video
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
1.3K
8

mini-sora

36 followers
-
minisora
MiniSora: A community that aims to explore the implementation path and future development direction of Sora.
1.2K
9

illuin-tech

39 followers
Paris, France
colpali
The code used to train and run inference with the ColPali architecture.
1.1K
10

heshengtao

28 followers
-
comfyui_LLM_party
LLM Agent Framework in ComfyUI. It includes Omost, GPT-SoVITS, ChatTTS, GOT-OCR2.0, and FLUX prompt nodes, provides access to Feishu and Discord, and adapts to all LLMs with OpenAI/Gemini-style interfaces, such as o1, ollama, grok, qwen, GLM, deepseek, moonshot, and doubao. It also supports local LLMs, VLMs, and GGUF models such as llama-3.2, plus Neo4j knowledge-graph linkage, GraphRAG/RAG, and HTML-to-image nodes.
1.0K
11

all-in-aigc

837 followers
-
sorafm
Sora AI Video Generator by Sora.FM
956
12

BAAI-DCAI

133 followers
-
Bunny
A family of lightweight multimodal models.
933
13

mbzuai-oryx

220 followers
-
LLaVA-pp
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
813
14

TinyLLaVA

6 followers
-
TinyLLaVA_Factory
A Framework of Small-scale Large Multimodal Models
657
15

FoundationVision

358 followers
-
Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
563
16

zubair-irshad

242 followers
Silicon Valley, CA, USA
Awesome-Robotics-3D
A curated list of 3D Vision papers relating to Robotics domain in the era of large models i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites
555
17

NVlabs

6.0K followers
-
EAGLE
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
539
18

AIDC-AI

29 followers
-
Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
527
19

Blaizzy

195 followers
Poland
mlx-vlm
MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.
498
20

gokayfem

133 followers
Turkey
awesome-vlm-architectures
Famous Vision Language Models and Their Architectures
431
21

nrl-ai

41 followers
-
llama-assistant
AI-powered assistant to help you with your daily tasks, powered by Llama 3.2. It can recognize your voice, process natural language, and perform various actions based on your commands: summarizing text, rephrasing sentences, answering questions, writing emails, and more.
415
22

neonwatty

309 followers
-
meme_search
Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast.
408
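As a rough, purely illustrative sketch of the idea behind meme_search, indexing images by their extracted text and then searching that index (the caption-extraction step is stubbed out here and the folder name is hypothetical):

```python
from pathlib import Path

def extract_text(image_path: Path) -> str:
    """Stub: the real project uses vision/OCR models to read text from a meme."""
    return image_path.stem.replace("_", " ")

# Build a tiny in-memory index mapping each image to its extracted text.
meme_dir = Path("memes")  # hypothetical folder
index = {p: extract_text(p).lower() for p in meme_dir.glob("*.jpg")}

def search(query: str):
    """Return memes whose extracted text contains the query."""
    q = query.lower()
    return [path for path, text in index.items() if q in text]

print(search("this is fine"))
```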
23

OpenBMB

4.3K followers
-
VisRAG
Parsing-free RAG supported by VLMs
399
24

xiaoachen98

97 followers
-
Open-LLaVA-NeXT
An open-source implementation for training LLaVA-NeXT.
395
25

jingyaogong

155 followers
China
minimind-v
Train a 27M-parameter vision multimodal model (VLM) from scratch in 3 hours; both inference and training can run on a personal GPU!
365
26

yueliu1999

265 followers
Singapore
Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.
350
27

developersdigest

425 followers
-
ai-devices
AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI and more
281
28

RLHF-V

16 followers
-
RLAIF-V
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
244
29

zhengli97

87 followers
Hangzhou, China
PromptKD
[CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models"
237
30

JosefAlbers

24 followers
-
Phi-3-Vision-MLX
Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon
237
31

baaivision

544 followers
China
EVE
[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
231
32
Awesome-Open-AI-Sora
Sora AI Awesome List – Your go-to resource hub for all things Sora AI, OpenAI's groundbreaking model for crafting realistic scenes from text. Explore a curated collection of articles, videos, podcasts, and news about Sora's capabilities, advancements, and more.
216
33

CircleRadon

64 followers
Hangzhou
TokenPacker
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
214
34

SoraFlows

1 follower
-
SoraFlows
The most powerful and modular Sora WebUI, API, and backend for OpenAI's Sora model. Collects the highest-quality prompts for Sora. Built with Next.js and Tailwind CSS.
195
35

JiaqiLi404

7 followers
Hong Kong
IAmDirector-Text2Video-NextJS-Client
This project open-sources a Next.js-based front end, intended as a reference web platform for generative-AI text-to-video creation, especially the full movie pipeline from scriptwriting to video generation. Everyone can become a director: this is the Next.js front end of an AI-driven platform for automatic movie/video generation (from GPT script generation to text-to-video movie generation). It is a free-to-try AI video creation platform that integrates GPT-based script generation with video generation. Our vision is to let everyone become a director, turning any everyday idea into a high-quality video as quickly as possible, whether a film, a marketing video, or social-media content.
190
36

TIGER-AI-Lab

164 followers
Canada
Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024)
184
37

mbodiai

19 followers
United States of America
embodied-agents
Seamlessly integrate state-of-the-art transformer models into robotics stacks
163
38

AviSoori1x

93 followers
San Francisco
seemore
From scratch implementation of a vision language model in pure PyTorch
162
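To make the "VLM from scratch in pure PyTorch" idea concrete, here is a minimal conceptual sketch (not seemore's actual code; the dimensions are arbitrary) of the core pattern LLaVA-style models share: project image features into the language model's embedding space and prepend them to the text token embeddings:

```python
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=768, vocab=32000):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, txt_dim)      # text token embeddings
        self.projector = nn.Linear(vis_dim, txt_dim)     # vision -> text embedding space
        layer = nn.TransformerEncoderLayer(txt_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the LLM
        self.lm_head = nn.Linear(txt_dim, vocab)

    def forward(self, image_feats, input_ids):
        # image_feats: (B, num_patches, vis_dim) from a (frozen) vision encoder
        vis_tokens = self.projector(image_feats)
        txt_tokens = self.tok_emb(input_ids)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)  # prepend image tokens to text
        return self.lm_head(self.backbone(seq))

model = TinyVLM()
logits = model(torch.randn(1, 16, 512), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```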
39

RobotecAI

154 followers
Poland
rai
RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more.
159
40

LostXine

83 followers
Stony Brook, NY
LLaRA
LLaRA: Large Language and Robotics Assistant
155
41

bz-lab

1 follower
-
AUITestAgent
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
151
42

OpenDriveLab

1.5K followers
Hong Kong
ELM
[ECCV 2024] Embodied Understanding of Driving Scenarios
149
43

opendilab

1.3K followers
China
PsyDI
PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements. (e.g. MBTI Measurement Agent)
149
44

sterzhang

8 followers
-
image-textualization
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)
144
45

fpgaminer

157 followers
-
joycaption
JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.
135
46

mala-lab

58 followers
-
InCTRL
Official implementation of CVPR'24 paper 'Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts'.
131
47

zytx121

128 followers
Singapore
Awesome-VLGFM
A Survey on Vision-Language Geo-Foundation Models (VLGFMs)
127
48

notune

12 followers
-
captcha-solver
A basic Google reCAPTCHA solver using llava-v1.6-7b.
120
49

ZebangCheng

7 followers
-
Emotion-LLaMA
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
115
50

WangWenhao0716

68 followers
Sydney
VidProM
[NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models
115
51

thu-ml

717 followers
FIT Building, Tsinghua University, Beijing, China
MMTrustEval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
108
52

xlang-ai

442 followers
-
Spider2-V
[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
107
53

BAAI-Agents

75 followers
-
GPA-LM
This repo is a live list of papers on game-playing agents and large multimodal models, accompanying "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges".
106
54

OpenGVLab

2.4K followers
-
MM-NIAH
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
102
55

chs20

5 followers
-
RobustVLM
[ICML 2024] Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
99
56

graphic-design-ai

5 followers
-
graphist
Official Repo of Graphist
99
57

microsoft

79.4K followers
Redmond, WA
eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
87
58

aimagelab

122 followers
Modena, Italy
LLaVA-MORE
LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1
86
59

fangyuan-ksgk

34 followers
Singapore
Mini-LLaVA
A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.
84
60

2U1

33 followers
-
Llama3.2-Vision-Finetune
An open-source implementation for fine-tuning Meta's Llama3.2-Vision series.
83
61

Yxxxb

49 followers
Shenzhen
VoCo-LLaMA
The official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
82
62

mu-cai

49 followers
Madison, WI
matryoshka-mm
Matryoshka Multimodal Models
82
63

shikiw

57 followers
-
Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
80
64

zjysteven

56 followers
United States
VLM-Visualizer
Visualizing the attention of vision-language models
76
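As a generic illustration of the kind of output such a visualizer produces (this is not the repo's API; the attention values below are random placeholders), one can upsample a patch-level attention map and overlay it on the image:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: a 224x224 "image" and a 16x16 grid of patch attention scores.
image = np.random.rand(224, 224, 3)
patch_attn = np.random.rand(16, 16)

# Upsample the 16x16 attention grid to image resolution (nearest-neighbour, 14x14 per patch).
attn_map = np.kron(patch_attn, np.ones((14, 14)))

plt.imshow(image)
plt.imshow(attn_map, cmap="jet", alpha=0.4)  # semi-transparent heatmap overlay
plt.axis("off")
plt.savefig("attention_overlay.png", bbox_inches="tight")
```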
65

princeton-nlp

1.2K followers
-
CharXiv
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
75
66

KwaiVGI

511 followers
-
Uniaa
Unified Multi-modal IAA Baseline and Benchmark
70
67

ruili3

50 followers
Zürich, Switzerland
Know-Your-Neighbors
[CVPR 2024] 🏡Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning
69
68

WisconsinAIVision

33 followers
-
YoLLaVA
🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant
67
69

OpenRobotLab

406 followers
-
VLM-Grounder
[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
67
70

HieuPhan33

15 followers
Australia
CVPR2024_MAVL
Multi-Aspect Vision Language Pretraining - CVPR2024
64
71

fly-apps

287 followers
-
ollama-open-webui
Self-host a ChatGPT-style web interface for Ollama 🦙
61
72

skit-ai

42 followers
Bangalore, India
SpeechLLM
This repository contains the training, inference, evaluation code for SpeechLLM models and details about the model releases on huggingface.
61
73

yihedeng9

25 followers
-
STIC
Enhancing Large Vision Language Models with Self-Training on Image Comprehension.
59
74

FlyCole

30 followers
UK
Dream2Real
[ICRA 2024] Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
59
75

Hon-Wong

7 followers
-
Elysium
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
58
76

richard-peng-xia

43 followers
Chapel Hill, NC, U.S.
CARES
[NeurIPS'24 & ICMLW'24] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
56
77

Gumpest

122 followers
Beijing
SparseVLMs
Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" proposed by Peking University and UC Berkeley.
55
78

BUAADreamer

80 followers
Beijing
Chinese-LLaVA-Med
A Chinese medical multimodal large model: Large Chinese Language-and-Vision Assistant for BioMedicine.
52
79

jamjamjon

18 followers
Shanghai
usls
A Rust library integrated with ONNX Runtime, providing a collection of Computer Vision and Vision-Language models.
50
80

whwu95

144 followers
-
FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
49
81

miccunifi

41 followers
Firenze - Viale Morgagni 65 - Italia
KDPL
[ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
48
82

VisualWebBench

1 follower
-
VisualWebBench
Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?"
47
83

ys-zong

16 followers
Edinburgh
VLGuard
[ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models.
45
84

tmlr-group

97 followers
Hong Kong
WCA
[ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models"
43
85

Victorwz

96 followers
-
MLM_Filter
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
42
86

chenshuang-zhang

0 followers
-
imagenet_d
[CVPR 2024 Highlight] ImageNet-D
38
87

bonjour-npy

18 followers
-
UndergraduateDissertation
Undergraduate Dissertation of Guilin University of Electronic Technology
38
88

ProGamerGov

131 followers
Multiverse
VLM-Captioning-Tools
Python scripts to use for captioning images with VLMs
34
89

yuecao0119

13 followers
-
MMInstruct
The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.
33
90

Gahyeonkim09

4 followers
Naju-si, South Korea
AAPL
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (CVPRw 2024)
31
91

RaptorMai

69 followers
Columbus
CompBench
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
31
92

ParadoxZW

48 followers
Hangzhou, China
LLaVA-UHD-Better
A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo
31
93

ai4ce

187 followers
Brooklyn, NY, U.S.
LLM4VPR
Can multimodal LLM help visual place recognition?
30
94
ConBench
[NeurIPS'24] Official implementation of paper "Unveiling the Tapestry of Consistency in Large Vision-Language Models".
30
95

uni-medical

92 followers
-
GMAI-MMBench
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.
29
96

hewei2001

108 followers
Shanghai
ReachQA
Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
29
97

YunzeMan

77 followers
Champaign, Illinois
Situation3D
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
26
98

erfanshayegani

15 followers
California, USA 🌴 🇺🇸
Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
26
99

NishilBalar

6 followers
Germany
Awesome-LVLM-Hallucination
An up-to-date curated list of state-of-the-art research on hallucinations in large vision-language models (LVLMs), including papers and resources.
25
100

Oztobuzz

5 followers
Ho Chi Minh
Vista
This is the official repository for the Vista dataset, a Vietnamese multimodal dataset containing more than 700,000 samples of conversations and images.
24