Top AI developers by monthly star count
Top AI organization accounts by AI repo star count
Top AI projects by star count within each category
Fastest-growing projects, ranked by the rate at which they gain stars
Lesser-known developers who have created influential repos
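Rankings like the table below boil down to sorting repo metadata by star count. A minimal sketch, assuming the repo records have already been fetched (e.g. from the GitHub REST API's `GET /search/repositories` endpoint, which returns a `stargazers_count` field); the helper names here are illustrative, not part of any library:

```python
# Rank repos by star count and format counts the way the table below
# does (e.g. 4900 -> "4.9K"). Input records mimic the GitHub API's
# repository objects; only "name" and "stargazers_count" are used.

def format_stars(n: int) -> str:
    """Render a star count in the table's style: 994 -> '994', 4900 -> '4.9K'."""
    return f"{n / 1000:.1f}K" if n >= 1000 else str(n)

def rank_repos(repos: list[dict]) -> list[tuple[int, str, str]]:
    """Return (rank, name, formatted_stars) rows, sorted by stars descending."""
    ordered = sorted(repos, key=lambda r: r["stargazers_count"], reverse=True)
    return [(i + 1, r["name"], format_stars(r["stargazers_count"]))
            for i, r in enumerate(ordered)]

# Sample data taken from the table below.
sample = [
    {"name": "SUPIR", "stargazers_count": 4000},
    {"name": "sglang", "stargazers_count": 4900},
    {"name": "ShareGPT4Video", "stargazers_count": 994},
]
for rank, name, stars in rank_repos(sample):
    print(f"{rank} | {name} | {stars}")
```

The monthly and growth-speed lists work the same way, just sorting by the star delta over a time window instead of the absolute count.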
Ranking | Project | Project intro | Star count |
---|---|---|---|
1 | sglang | SGLang is a fast serving framework for large language models and vision language models. | 4.9K | |
2 | SUPIR | SUPIR aims at developing Practical Algorithms for Photo-Realistic Image Restoration In the Wild. Our new online demo is also released at suppixel.ai. | 4.0K | |
3 | MobileAgent | Mobile-Agent: The Powerful Mobile Device Operation Assistant Family | 2.4K | |
4 | SoraWebui | SoraWebui is an open-source Sora web client, enabling users to easily create videos from text with OpenAI's Sora model. | 2.3K | |
5 | DeepSeek-VL | DeepSeek-VL: Towards Real-World Vision-Language Understanding | 1.9K | |
6 | cambrian | Cambrian-1 is a family of multimodal LLMs with a vision-centric design. | 1.5K | |
7 | minisora | MiniSora: A community that aims to explore the implementation path and future development direction of Sora. | 1.1K | |
8 | ShareGPT4Video | An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions | 994 | |
9 | sorafm | Sora AI Video Generator by Sora.FM | 907 | |
10 | Bunny | A family of lightweight multimodal models. | 808 | |
11 | VLMEvalKit | Open-source evaluation toolkit for large vision-language models (LVLMs), supporting ~100 VLMs and 30+ benchmarks | 747 | |
12 | LLaVA-pp | 🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3) | 730 | |
13 | Osprey | [CVPR2024] The code for "Osprey: Pixel Understanding with Visual Instruction Tuning" | 728 | |
14 | AlphaCLIP | [CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want | 595 | |
15 | Awesome-LM-SSP | A reading list for large models safety, security, and privacy (including Awesome LLM Security, Safety, etc.). | 534 | |
16 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization | 466 | |
17 | TinyLLaVA_Factory | A Framework of Small-scale Large Multimodal Models | 462 | |
18 | EAGLE | EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders | 281 | |
19 | ai-devices | AI Device Template Featuring Whisper, TTS, Groq, Llama3, OpenAI and more | 262 | |
20 | ViP-LLaVA | [CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts | 242 | |
21 | meme_search | Index your memes by their content and text, making them easily retrievable for your meme warfare pleasures. Find funny fast. | 235 | |
22 | ScreenAgent | ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (IJCAI-24) | 234 | |
23 | chatgpt-share-web | A complete recreation of the official ChatGPT and Claude web apps, including all of their features, with a full user-account system and traffic-monetization system. | 233 | |
24 | Awesome-Open-AI-Sora | Sora AI Awesome List – Your go-to resource hub for all things Sora AI, OpenAI's groundbreaking model for crafting realistic scenes from text. Explore a curated collection of articles, videos, podcasts, and news about Sora's capabilities, advancements, and more. | 208 | |
25 | OPERA | [CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation | 206 | |
26 | awesome-vlm-architectures | Famous Vision Language Models and Their Architectures | 200 | |
27 | colpali | The code used to train and run inference with the ColPali architecture. | 196 | |
28 | SoraFlows | The most powerful and modular Sora WebUI, API, and backend for OpenAI's Sora model, collecting the highest-quality prompts for Sora. Built with NextJs and Tailwind CSS. | 189 | |
29 | RLHF-V | [CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback | 188 | |
30 | IAmDirector-Text2Video-NextJS-Client | Everyone can become a director: the open-source NextJS front end of an AI-driven platform for automatic movie/video generation, from GPT-based script generation to text-to-video movie generation. A free-to-try AI video creation platform that aims to let anyone quickly turn everyday ideas into high-quality videos, whether films, marketing videos, or social-media content. | 175 | |
31 | Awesome-LLM-related-Papers-Comprehensive-Topics | Awesome LLM-related papers and repos on very comprehensive topics. | 170 | |
32 | EVE | EVE: Encoder-Free Vision-Language Models from BAAI | 168 | |
33 | mlx-vlm | MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX. | 166 | |
34 | Phi-3-Vision-MLX | Phi-3 for Mac: Locally-run Vision and Language Models for Apple Silicon | 160 | |
35 | PromptKD | [CVPR 2024] Official PyTorch Code for "PromptKD: Unsupervised Prompt Distillation for Vision-Language Models" | 157 | |
36 | Open-LLaVA-NeXT | An open-source implementation for training LLaVA-NeXT. | 148 | |
37 | seemore | From scratch implementation of a vision language model in pure PyTorch | 136 | |
38 | ollama-ai | A Ruby gem for interacting with Ollama's API that allows you to run open source AI LLMs (Large Language Models) locally. | 134 | |
39 | embodied-agents | Seamlessly integrate state-of-the-art transformer models into robotics stacks | 134 | |
40 | t2v_metrics | Evaluating text-to-image/video/3D models with VQAScore | 132 | |
41 | Mantis | Official code for Paper "Mantis: Multi-Image Instruction Tuning" | 127 | |
42 | ELM | [ECCV 2024] Embodied Understanding of Driving Scenarios | 119 | |
43 | image-textualization | Image Textualization: An Automatic Framework for Generating Rich and Detailed Image Descriptions | 117 | |
44 | Prompt-Highlighter | [CVPR 2024] Prompt Highlighter: Interactive Control for Multi-Modal LLMs | 110 | |
45 | LLaRA | LLaRA: Large Language and Robotics Assistant | 110 | |
46 | captcha-solver | A basic Google reCAPTCHA solver using llava-v1.6-7b | 99 | |
47 | AUITestAgent | AUITestAgent is the first automatic, natural language-driven GUI testing tool for mobile apps, capable of fully automating the entire process of GUI interaction and function verification. | 91 | |
48 | VidProM | VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models | 91 | |
49 | InCTRL | Official implementation of CVPR'24 paper 'Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts'. | 86 | |
50 | KarmaVLM | 🧘🏻♂️KarmaVLM (相生): A family of efficient and powerful vision-language models. | 84 | |
51 | Awesome-VLGFM | A Survey on Vision-Language Geo-Foundation Models (VLGFMs) | 82 | |
52 | graphist | Official Repo of Graphist | 82 | |
53 | Spider2-V | Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? | 82 | |
54 | freegenius | FreeGenius AI, an advanced AI assistant that can talk and take multi-step actions. Supports numerous open-source LLMs via Llama.cpp or Ollama or Groq Cloud API, with optional integration with AutoGen agents, OpenAI API, Google Gemini Pro and unlimited plugins. | 81 | |
55 | merlin | [ECCV2024] Official code implementation of Merlin: Empowering Multimodal LLMs with Foresight Minds | 77 | |
56 | RobustVLM | [ICML 2024] Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models | 73 | |
57 | MoE-Mamba | Implementation of MoE Mamba from the paper: "MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts" in Pytorch and Zeta | 72 | |
58 | rai | RAI is a multi-vendor agent framework for robotics, utilizing Langchain and ROS 2 tools to perform complex actions, defined scenarios, free interface execution, log summaries, voice interaction and more. | 71 | |
59 | PsyDI | PsyDI: Towards a Personalized and Progressively In-depth Chatbot for Psychological Measurements. (e.g. MBTI Measurement Agent) | 68 | |
60 | VL-RLHF | A RLHF Infrastructure for Vision-Language Models | 67 | |
61 | MM-NIAH | This is the official implementation of the paper "Needle In A Multimodal Haystack" | 66 | |
62 | Uniaa | Unified Multi-modal IAA Baseline and Benchmark | 66 | |
63 | Know-Your-Neighbors | [CVPR 2024] 🏡Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning | 64 | |
64 | MMTrustEval | A toolbox for benchmarking trustworthiness of multimodal large language models (MultiTrust) | 61 | |
65 | matryoshka-mm | Matryoshka Multimodal Models | 61 | |
66 | GPA-LM | A live list of papers on game-playing agents and large multimodality models, accompanying "A Survey on Game Playing Agents and Large Models: Methods, Applications, and Challenges". | 57 | |
67 | VoCo-LLaMA | VoCo-LLaMA: This repo is the official implementation of "VoCo-LLaMA: Towards Vision Compression with Large Language Models". | 55 | |
68 | rscir | Official PyTorch implementation and benchmark dataset for IGARSS 2024 ORAL paper: "Composed Image Retrieval for Remote Sensing" | 54 | |
69 | Ovis | A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings. | 52 | |
70 | M3DBench | M3DBench introduces a comprehensive 3D instruction-following dataset with support for interleaved multi-modal prompts. Furthermore, M3DBench provides a new benchmark to assess large models across 3D vision-centric tasks. | 52 | |
71 | Dream2Real | [ICRA 2024] Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models | 50 | |
72 | LLaVA-JP | LLaVA-JP is a Japanese VLM trained with the LLaVA method | 47 | |
73 | captain | Give your computer an AI Brain | 47 | |
74 | DMN | CVPR2024: Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models | 45 | |
75 | STIC | Enhancing Large Vision Language Models with Self-Training on Image Comprehension. | 45 | |
76 | Awesome-Robotics-3D | A curated list of 3D vision papers relating to the robotics domain in the era of large models (LLMs/VLMs), inspired by awesome-computer-vision; includes papers, code, and related websites | 44 | |
77 | ollama-open-webui | Self-host a ChatGPT-style web interface for Ollama 🦙 | 42 | |
78 | Awesome-SD-Inference | 📖A small curated list of Awesome SD/DiT/ViT/Diffusion Inference with Distributed/Caching/Sampling: DistriFusion, PipeFusion, AsyncDiff, DeepCache, Block Caching etc. | 42 | |
79 | CVPR2024_MAVL | Multi-Aspect Vision Language Pretraining - CVPR2024 | 39 | |
80 | CharXiv | CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs | 39 | |
81 | MLM_Filter | Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters". | 39 | |
82 | LLaVA-CLI-with-multiple-images | LLaVA inference with multiple images at once for cross-image analysis. | 38 | |
83 | FreeVA | FreeVA: Offline MLLM as Training-Free Video Assistant | 38 | |
84 | VisualWebBench | Evaluation framework for paper "VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?" | 38 | |
85 | UndergraduateDissertation | Undergraduate Dissertation of Guilin University of Electronic Technology | 38 | |
86 | imagenet_d | [CVPR2024 Highlight] Official Code for "ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object" | 36 | |
87 | LLaVA-MORE | LLaVA-MORE: Enhancing Visual Instruction Tuning with LLaMA 3.1 | 33 | |
88 | CARES | [arXiv'24 & ICMLW'24] CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models | 32 | |
89 | Elysium | [ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM | 32 | |
90 | SpeechLLM | This repository contains the training, inference, evaluation code for SpeechLLM models and details about the model releases on huggingface. | 31 | |
91 | LLM-Image-Classification | Image Classification Testing with LLMs | 30 | |
92 | WCA | [ICML 2024] "Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models" | 29 | |
93 | AAPL | AAPL: Adding Attributes to Prompt Learning for Vision-Language Models (CVPRw 2024) | 28 | |
94 | KDPL | [ECCV 2024] - Improving Zero-shot Generalization of Learned Prompts via Unsupervised Knowledge Distillation | 26 | |
95 | Chinese-LLaVA-Med | A Chinese multimodal large model for medicine: Large Chinese Language-and-Vision Assistant for BioMedicine | 26 | |
96 | ConBench | Official implementation of paper "Unveiling the Tapestry of Consistency in Large Vision-Language Models". | 26 | |
97 | VLM-Captioning-Tools | Python scripts to use for captioning images with VLMs | 24 | |
98 | VLGuard | [ICML 2024] Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models. | 24 | |
99 | LLaVA-UHD-Better | A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo | 24 | |
100 | Kling-AI-Webui | Kling AI, Make Imagination Alive. This is a revolutionary text-to-video model like Sora. Kling AI WebUI is the open source project to integrate Kling AI Video Generation Model. | 24 |