Top AI Developers by monthly star count
Top AI Organization Accounts by AI repo star count
Top AI Projects by category star count
Top Growing Speed list, ranked by the speed of gaining stars
Top list of little-known developers who created influential repos
Rankings | Related Project | Project intro | Star count |
---|---|---|---|
1 | nexa-sdk | Nexa SDK is a comprehensive toolkit for supporting ONNX and GGML models. It supports text generation, image generation, vision-language models (VLM), auto-speech-recognition (ASR), and text-to-speech (TTS) capabilities. | 3.9K | |
2 | MGM | Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models" | 3.2K | |
3 | SoraWebui | SoraWebui is an open-source Sora web client, enabling users to easily create videos from text with OpenAI's Sora model. | 2.3K | |
4 | DeepSeek-VL | DeepSeek-VL: Towards Real-World Vision-Language Understanding | 2.1K | |
5 | cambrian | Cambrian-1 is a family of multimodal LLMs with a vision-centric design. | 1.8K | |
6 | Local-File-Organizer | An AI-powered file management tool that ensures privacy by organizing local texts and images. Using Llama3.2 3B and Llava v1.6 models with the Nexa SDK, it intuitively scans, restructures, and organizes files for quick, seamless access and easy retrieval. | 1.7K | |
7 | ShareGPT4Video | [NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | 1.3K | |
8 | minisora | MiniSora: A community that aims to explore the implementation path and future development direction of Sora. | 1.2K | |
9 | colpali | The code used to train and run inference with the ColPali architecture. | 1.1K | |
10 | comfyui_LLM_party | LLM Agent Framework in ComfyUI. Includes Omost, GPT-sovits, ChatTTS, GOT-OCR2.0, and FLUX prompt nodes, offers access to Feishu and Discord, and adapts to all LLMs with OpenAI/Gemini-style interfaces, such as o1, ollama, grok, qwen, GLM, deepseek, moonshot, and doubao. Also adapted to local LLMs, VLMs, and gguf models such as llama-3.2, with neo4j KG linkage, graphRAG / RAG / html 2 img | 1.0K | |
11 | sorafm | Sora AI Video Generator by Sora.FM | 956 | |
12 | Bunny | A family of lightweight multimodal models. | 933 | |
13 | LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | 813 | |
14 | TinyLLaVA_Factory | A Framework of Small-scale Large Multimodal Models | 657 | |
15 | Groma | [ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization | 563 | |
16 | Awesome-Robotics-3D | A curated list of 3D Vision papers relating to the robotics domain in the era of large models, i.e. LLMs/VLMs, inspired by awesome-computer-vision, including papers, codes, and related websites | 555 | |
17 | EAGLE | EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | 539 | |
18 | Ovis | A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. | 527 | |
19 | mlx-vlm | MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX. | 498 | |
20 | awesome-vlm-architectures | Famous Vision Language Models and Their Architectures | 431 | |
21 | llama-assistant | AI-powered assistant to help you with your daily tasks, powered by Llama 3.2. It can recognize your voice, process natural language, and perform various actions based on your commands: summarizing text, rephrasing sentences, answering questions, writing emails, and more. | 415 | |
22 | meme_search | Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast. | 408 | |
23 | VisRAG | Parsing-free RAG supported by VLMs | 399 | |
24 | Open-LLaVA-NeXT | An open-source implementation for training LLaVA-NeXT. | 395 | |
25 | minimind-v | "Large model": train a 27M-parameter visual multimodal VLM from scratch in 3 hours; inference and training both run on a personal GPU! | 365 | |
26 | Awesome-Jailbreak-on-LLMs | Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, exciting jailbreak methods on LLMs. It contains papers, codes, datasets, evaluations, and analyses. | 350 | |
27 | ai-devices | AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI and more | 281 | |
28 | RLAIF-V | RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness | 244 | |
29 | PromptKD | [CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models" | 237 | |
30 | Phi-3-Vision-MLX | Phi-3.5 for Mac: Locally-run Vision and Language Models for Apple Silicon | 237 | |
31 | EVE | [NeurIPS'24 Spotlight] EVE: Encoder-Free Vision-Language Models | 231 | |
32 | Awesome-Open-AI-Sora | Sora AI Awesome List – Your go-to resource hub for all things Sora AI, OpenAI's groundbreaking model for crafting realistic scenes from text. Explore a curated collection of articles, videos, podcasts, and news about Sora's capabilities, advancements, and more. | 216 | |
33 | TokenPacker | The code for "TokenPacker: Efficient Visual Projector for Multimodal LLM". | 214 | |
34 | SoraFlows | The most powerful and modular Sora WebUI, API, and backend with OpenAI's Sora Model. Collecting the highest quality prompts for Sora. Uses NextJs and Tailwind CSS. | 195 | |
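35 | IAmDirector-Text2Video-NextJS-Client | This project open-sources a NextJS-based front end, intended as a reference web front-end platform for generative-AI text-to-video, especially for film production from scriptwriting to video generation. Everyone can become a director. The Nextjs front-end of an AI driven platform for automatic movie/video generation (from GPT script generation to text2video movie generation). This is a free-trial AI video creation platform that integrates GPT-based video script generation and video generation. Our ideal is to let everyone become a director and turn any everyday idea into a high-quality video as quickly as possible, whether it is a film, a marketing video, or a self-media video. | 190 | |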
36 | Mantis | Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024) | 184 | |
37 | embodied-agents | Seamlessly integrate state-of-the-art transformer models into robotics stacks | 163 | |
38 | seemore | From scratch implementation of a vision language model in pure PyTorch | 162 | |
39 | rai | RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more. | 159 | |
40 | LLaRA | LLaRA: Large Language and Robotics Assistant | 155 | |
41 | AUITestAgent | AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. | 151 | |
42 | ELM | [ECCV 2024] Embodied Understanding of Driving Scenarios | 149 | |
43 | PsyDI | PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements. (e.g. MBTI Measurement Agent) | 149 | |
44 | image-textualization | Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions (NeurIPS 2024) | 144 | |
45 | joycaption | JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training Diffusion models. | 135 | |
46 | InCTRL | Official implementation of CVPR'24 paper 'Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts'. | 131 | |
47 | Awesome-VLGFM | A Survey on Vision-Language Geo-Foundation Models (VLGFMs) | 127 | |
48 | captcha-solver | Basic Google reCAPTCHA solver using llava-v1.6-7b | 120 | |
49 | Emotion-LLaMA | Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning | 115 | |
50 | VidProM | [NeurIPS 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models | 115 | |
51 | MMTrustEval | A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust, NeurIPS 2024 Track Datasets and Benchmarks) | 108 | |
52 | Spider2-V | [NeurIPS 2024] Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? | 107 | |
53 | GPA-LM | This repo is a live list of papers on game playing and large multimodality model - "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges". | 106 | |
54 | MM-NIAH | [NeurIPS 2024] Needle In A Multimodal Haystack (MM-NIAH): A comprehensive benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. | 102 | |
55 | RobustVLM | [ICML 2024] Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models | 99 | |
56 | graphist | Official Repo of Graphist | 99 | |
57 | eureka-ml-insights | A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings. | 87 | |
58 | LLaVA-MORE | LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1 | 86 | |
59 | Mini-LLaVA | A minimal implementation of LLaVA-style VLM with interleaved image & text & video processing ability. | 84 | |
60 | Llama3.2-Vision-Finetune | An open-source implementation for fine-tuning Llama3.2-Vision series by Meta. | 83 | |
61 | VoCo-LLaMA | VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". | 82 | |
62 | matryoshka-mm | Matryoshka Multimodal Models | 82 | |
63 | Modality-Integration-Rate | The official code of the paper "Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate". | 80 | |
64 | VLM-Visualizer | Visualizing the attention of vision-language models | 76 | |
65 | CharXiv | [NeurIPS 2024] CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | 75 | |
66 | Uniaa | Unified Multi-modal IAA Baseline and Benchmark | 70 | |
67 | Know-Your-Neighbors | [CVPR 2024] 🏡Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning | 69 | |
68 | YoLLaVA | 🌋👵🏻 Yo'LLaVA: Your Personalized Language and Vision Assistant | 67 | |
69 | VLM-Grounder | [CoRL 2024] VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding | 67 | |
70 | CVPR2024_MAVL | Multi-Aspect Vision Language Pretraining - CVPR2024 | 64 | |
71 | ollama-open-webui | Self-host a ChatGPT-style web interface for Ollama 🦙 | 61 | |
72 | SpeechLLM | This repository contains the training, inference, and evaluation code for SpeechLLM models and details about the model releases on Hugging Face. | 61 | |
73 | STIC | Enhancing Large Vision Language Models with Self-Training on Image Comprehension. | 59 | |
74 | Dream2Real | [ICRA 2024] Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models | 59 | |
75 | Elysium | [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM | 58 | |
76 | CARES | [NeurIPS'24 & ICMLW'24] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models | 56 | |
77 | SparseVLMs | Official implementation of paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference" proposed by Peking University and UC Berkeley. | 55 | |
78 | Chinese-LLaVA-Med | Chinese medical multimodal large model: Large Chinese Language-and-Vision Assistant for BioMedicine | 52 | |
79 | usls | A Rust library integrated with ONNXRuntime, providing a collection of Computer Vision and Vision-Language models. | 50 | |
80 | FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | 49 | |
81 | KDPL | [ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation | 48 | |
82 | VisualWebBench | Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?" | 47 | |
83 | VLGuard | [ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models. | 45 | |
84 | WCA | [ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models" | 43 | |
85 | MLM_Filter | Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters". | 42 | |
86 | imagenet_d | [CVPR 2024 Highlight] ImageNet-D | 38 | |
87 | UndergraduateDissertation | Undergraduate Dissertation of Guilin University of Electronic Technology | 38 | |
88 | VLM-Captioning-Tools | Python scripts to use for captioning images with VLMs | 34 | |
89 | MMInstruct | The official implementation of the paper "MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity". The MMInstruct dataset includes 973K instructions from 24 domains and four instruction types. | 33 | |
90 | AAPL | AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (CVPRw 2024) | 31 | |
91 | CompBench | CompBench evaluates the comparative reasoning of multimodal large language models (MLLMs) with 40K image pairs and questions across 8 dimensions of relative comparison: visual attribute, existence, state, emotion, temporality, spatiality, quantity, and quality. CompBench covers diverse visual domains, including animals, fashion, sports, and scenes. | 31 | |
92 | LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo | 31 | |
93 | LLM4VPR | Can multimodal LLM help visual place recognition? | 30 | |
94 | ConBench | [NeurIPS'24] Official implementation of paper "Unveiling the Tapestry of Consistency in Large Vision-Language Models". | 30 | |
95 | GMAI-MMBench | GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI. | 29 | |
96 | ReachQA | Code & Dataset for Paper: "Distill Visual Chart Reasoning Ability from LLMs to MLLMs" | 29 | |
97 | Situation3D | [CVPR 2024] Situational Awareness Matters in 3D Vision Language Reasoning | 26 | |
98 | Jailbreak-In-Pieces | [ICLR 2024 Spotlight 🔥 ] - [ Best Paper Award SoCal NLP 2023 🏆] - Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models | 26 | |
99 | Awesome-LVLM-Hallucination | An up-to-date curated list of state-of-the-art research work, papers & resources on hallucinations in large vision language models | 25 | |
100 | Vista | This is the official repository for the Vista dataset, a Vietnamese multimodal dataset containing more than 700,000 samples of conversations and images | 24 |