Top AI Projects by Category

A list of the most influential open-source AI projects, grouped by category. (Data sourced from GitHub and updated automatically every day.)
Rankings
Organization Account
Related Project
Project intro
Star count
1

NexaAI

704 followers
United States of America
nexa-sdk
Nexa SDK is a comprehensive toolkit for supporting GGML and ONNX models. It supports text generation, image generation, vision-language models (VLM), audio language models, automatic speech recognition (ASR), and text-to-speech (TTS) capabilities.
5.2K
2

cambrian-mllm

33 followers
-
cambrian
Cambrian-1 is a family of multimodal LLMs with a vision-centric design.
1.8K
3

QiuYannnn

35 followers
Los Angeles
Local-File-Organizer
An AI-powered file management tool that ensures privacy by organizing local texts and images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval.
1.7K
4

ShareGPT4Omni

20 followers
-
ShareGPT4Video
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
1.3K
5

illuin-tech

39 followers
Paris, France
colpali
The code used to train and run inference with the ColPali architecture.
1.2K
6

heshengtao

30 followers
-
comfyui_LLM_party
An LLM agent framework in ComfyUI. It includes Omost, GPT-sovits, ChatTTS, GOT-OCR2.0, and FLUX prompt nodes, provides access to Feishu and Discord, and adapts to all LLMs with OpenAI- or aisuite-style interfaces, such as o1, ollama, gemini, grok, qwen, GLM, deepseek, moonshot, and doubao. It also supports local LLMs, VLMs, and GGUF models such as llama-3.2, and links with GraphRAG / RAG.
1.1K
7

mbzuai-oryx

220 followers
-
LLaVA-pp
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
812
8

Blaizzy

226 followers
Poland
mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
663
9

FoundationVision

359 followers
-
Groma
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
568
10

zubair-irshad

242 followers
Silicon Valley, CA, USA
Awesome-Robotics-3D
A curated list of 3D vision papers relating to the robotics domain in the era of large models (i.e. LLMs/VLMs), inspired by awesome-computer-vision, including papers, code, and related websites
559
11

NVlabs

6.0K followers
-
EAGLE
EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
541
12

AIDC-AI

61 followers
-
Ovis
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
539
13

nrl-ai

42 followers
-
llama-assistant
AI-powered assistant to help you with your daily tasks, powered by Llama 3.2. It can recognize your voice, process natural language, and perform various actions based on your commands: summarizing text, rephrasing sentences, answering questions, writing emails, and more.
423
14

OpenBMB

4.3K followers
-
VisRAG
Parsing-free RAG supported by VLMs
421
15

neonwatty

309 followers
-
meme_search
Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast.
409
16

yueliu1999

273 followers
Singapore
Awesome-Jailbreak-on-LLMs
Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses.
409
17

xiaoachen98

97 followers
-
Open-LLaVA-NeXT
An open-source implementation for training LLaVA-NeXT.
398
18

jingyaogong

159 followers
China
minimind-v
[Large model] Train a 27M-parameter vision multimodal VLM from scratch in 3 hours; inference and training run on a personal GPU!
378
19

developersdigest

425 followers
-
ai-devices
AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI and more
281
20

RLHF-V

16 followers
-
RLAIF-V
RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness
246
21

JosefAlbers

24 followers
-
Phi-3-Vision-MLX
Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon
237
22

baaivision

546 followers
China
EVE
[NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models
233
23

CircleRadon

64 followers
Hangzhou
TokenPacker
The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM".
215
24

TIGER-AI-Lab

173 followers
Canada
Mantis
Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024)
189
25

RobotecAI

163 followers
Poland
rai
RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more.
178
26

AviSoori1x

93 followers
San Francisco
seemore
From scratch implementation of a vision language model in pure PyTorch
164
27

mbodiai

19 followers
United States of America
embodied-agents
Seamlessly integrate state-of-the-art transformer models into robotics stacks
164
28

LostXine

83 followers
Stony Brook, NY
LLaRA
LLaRA: Large Language and Robotics Assistant
156
29

bz-lab

1 follower
-
AUITestAgent
AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification.
151
30

opendilab

1.3K followers
China
PsyDI
PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements. (e.g. MBTI Measurement Agent)
151
31

sterzhang

8 followers
-
image-textualization
Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024)
145
32

fpgaminer

158 followers
-
joycaption
JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models.
144
33

ZebangCheng

9 followers
-
Emotion-LLaMA
Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
121
34

xlang-ai

446 followers
-
Spider2-V
[NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
109
35

thu-ml

719 followers
FIT Building, Tsinghua University, Beijing, China
MMTrustEval
A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks)
108
36

OpenGVLab

2.4K followers
-
MM-NIAH
[NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.
102
37

Ravi-Teja-konda

15 followers
-
Surveillance_Video_Summarizer
VLM driven tool that processes surveillance videos, extracts frames, and generates insightful annotations using a fine-tuned Florence-2 Vision-Language Model. Includes a Gradio-based interface for querying and analyzing video footage.
92
38

microsoft

81.1K followers
Redmond, WA
eureka-ml-insights
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
90
39

aimagelab

122 followers
Modena, Italy
LLaVA-MORE
LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1
86
40

2U1

33 followers
-
Llama3.2-Vision-Finetune
An open-source implementation for fine-tuning the Llama3.2-Vision series by Meta.
85
41

shikiw

58 followers
-
Modality-Integration-Rate
The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate".
85
42

mu-cai

49 followers
Madison, WI
matryoshka-mm
Matryoshka Multimodal Models
84
43

fangyuan-ksgk

34 followers
Singapore
Mini-LLaVA
A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability.
84
44

Yxxxb

49 followers
Shenzhen
VoCo-LLaMA
VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models".
83
45

zjysteven

56 followers
United States
VLM-Visualizer
Visualizing the attention of vision-language models
79
46

balrog-ai

1 follower
-
BALROG
Benchmarking Agentic LLM and VLM Reasoning On Games
77
47

princeton-nlp

1.2K followers
-
CharXiv
[NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
75
48

WisconsinAIVision

33 followers
-
YoLLaVA
🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant
69
49

OpenRobotLab

414 followers
-
VLM-Grounder
[CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding
69
50

skit-ai

42 followers
Bangalore, India
SpeechLLM
This repository contains the training, inference, and evaluation code for SpeechLLM models, along with details about the model releases on Hugging Face.
61
51

yihedeng9

25 followers
-
STIC
Enhancing Large Vision Language Models with Self-Training on Image Comprehension.
59
52

richard-peng-xia

43 followers
Chapel Hill, NC, U.S.
CARES
[NeurIPS'24 & ICMLW'24] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models
56
53

Gumpest

122 followers
Beijing
SparseVLMs
Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" proposed by Peking University and UC Berkeley.
56
54

BUAADreamer

81 followers
Beijing
Chinese-LLaVA-Med
Chinese medical multimodal large model: Large Chinese Language-and-Vision Assistant for BioMedicine
55
55

whwu95

144 followers
-
FreeVA
FreeVA: Offline MLLM as Training-Free Video Assistant
49
56

miccunifi

41 followers
Firenze - Viale Morgagni 65 - Italia
KDPL
[ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation
48
57

tmlr-group

97 followers
Hong Kong
WCA
[ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models"
43
58

yuecao0119

14 followers
-
MMInstruct
The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types.
34
59

Gahyeonkim09

4 followers
Naju-si, South Korea
AAPL
AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (CVPRw 2024)
32
60

ParadoxZW

48 followers
Hangzhou, China
LLaVA-UHD-Better
A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo
32
61

hewei2001

109 followers
Shanghai
ReachQA
Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs"
32
62

ai4ce

188 followers
Brooklyn, NY, U.S.
LLM4VPR
Can multimodal LLM help visual place recognition?
31
63

RaptorMai

69 followers
Columbus
CompBench
CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes.
31
64

uni-medical

94 followers
-
GMAI-MMBench
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI.
31
65
ConBench
[NeurIPS'24] Official implementation of paper "Unveiling the Tapestry of Consistency in Large Vision-Language Models".
30
66

H-Freax

87 followers
Boston
ThinkGrasp
[CoRL2024] ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter. https://arxiv.org/abs/2407.11298
30
67

YunzeMan

77 followers
Champaign, Illinois
Situation3D
[CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning
26
68

erfanshayegani

15 followers
California, USA 🌴 🇺🇸
Jailbreak-In-Pieces
[ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models
26
69

ai-aigc-studio

12 followers
-
Kling-AI-Webui
Kling AI, Make Imagination Alive. This is a revolutionary text-to-video model like Sora. Kling AI WebUI is the open source project to integrate Kling AI Video Generation Model.
24
70

sStonemason

0 followers
-
RET-CLIP
RET-CLIP: A Retinal Image Foundation Model Pre-trained with Clinical Diagnostic Reports
23
71

declare-lab

257 followers
Singapore University of Technology and Design
Emma-X
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning
23
72

JUNJIE99

11 followers
-
VISTA_Evaluation_FineTuning
Evaluation code and datasets for the ACL 2024 paper, VISTA: Visualized Text Embedding for Universal Multi-Modal Retrieval. The original code and model can be accessed at FlagEmbedding.
21
73

Sid2697

76 followers
Bristol
HOI-Ref
Code implementation for the paper "HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision"
20
74

hllj

38 followers
Ho Chi Minh City, Vietnam
Vistral-V
Vistral-V: Visual Instruction Tuning for Vistral - Vietnamese Large Vision-Language Model.
19
75

TobyYang7

18 followers
Beijing & Shenzhen
Llava_Qwen2
Visual Instruction Tuning for Qwen2 Base Model
19
76

gptscript-ai

132 followers
-
gptparse
Document parser for RAG
18
77

obiyoag

14 followers
Shanghai, China
evi-CEM
Official implementation of MICCAI2024 paper "Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis"
17
78

reidbarber

28 followers
-
webmarker
Mark web pages for use with vision-language models
16
79

AIDevBytes

20 followers
-
LLava-Image-Analyzer
Llava, Ollama and Streamlit | Create POWERFUL Image Analyzer Chatbot for FREE - Windows & Mac
16
80
3d-conditioning
Enhance and modify high-quality compositions using real-time rendering and generative AI output without affecting a hero product asset.
16
81

zabir-nabil

211 followers
California, USA
awesome-multilingual-large-language-models
A comprehensive collection of multilingual datasets and large language models, meticulously curated for evaluating and enhancing the performance of large language models across diverse languages and tasks.
15
82

egeozsoy

27 followers
-
ORacle
Official code of the paper ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling accepted at MICCAI 2024.
15
83

HyperMink

0 followers
Sydney, Australia
inferenceable
Scalable AI Inference Server for CPU and GPU with Node.js | Utilizes llama.cpp and parts of llamafile C/C++ core under the hood.
14
84

hasanar1f

31 followers
Blacksburg, VA, USA
HiRED
[AAAI 2025] HiRED strategically drops visual tokens in the image encoding stage to improve inference efficiency for High-Resolution Vision-Language Models (e.g., LLaVA-Next) under a fixed token budget.
14
85

ANYANTUDRE

23 followers
Morocco
Florence-2-Vision-Language-Model
Florence-2 is a novel vision foundation model with a unified, prompt-based representation for a variety of computer vision and vision-language tasks.
13
86

showlab

631 followers
-
VisInContext
Official implementation of Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
13
87

worldcuisines

1 follower
-
worldcuisines
WorldCuisines is an extensive multilingual and multicultural benchmark that spans 30 languages, covering a wide array of global cuisines.
13
88

aliencaocao

68 followers
Singapore
TIL-2024
Brainhack TIL 2024: Team 12000SGDPLUSHIE
12
89

S4mpl3r

6 followers
Dubai, United Arab Emirates
okra
Okra, your all in one personal AI assistant
12
90

JinhaoLee

5 followers
Melbourne, Australia
WCA
[ICML 2024] Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
11
91

wendell0218

0 followers
-
GVA-Survey
Generalist Virtual Agents: A Survey on Autonomous Agents Across Digital Platforms
10
92

jacobmarks

133 followers
NYC
fiftyone_florence2_plugin
Run SOTA Vision-Language Model Florence-2 on your data!
9
93

Fsoft-AIC

45 followers
-
Z-GMOT
[NAACL 2024] Z-GMOT: Zero-shot Generic Multiple Object Tracking
9
94

lamalab-org

56 followers
-
mac-bench
Probing the limitations of multimodal language models for chemistry and materials research
9
95

xyproto

549 followers
Oslo
describeimage
Describe images by using LLMs
8
96

med-air

163 followers
-
PICG2scoring
[MICCAI'24] Incorporating Clinical Guidelines through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring
8
97

sovit-123

129 followers
India
SAM_Molmo_Whisper
An integration of Segment Anything Model, Molmo, and Whisper to segment objects using voice and natural language.
8
98

tian1327

13 followers
College Station
SWAT
7
99

sayedmohamedscu

40 followers
Egypt
Vision-language-models-VLM
Vision language model fine-tuning notebooks and use cases (PaliGemma, Florence, and others)
7
100

RPIDIAL

8 followers
United States of America
Disease-informed-VLM-Adaptation
MICCAI 2024 - Disease-informed Adaptation of Vision-Language Models
6