lechmazur

CEO, Advameg, Inc.

Advameg

GitHub Data

Followers 63

Following 0

Links

https://x.com/lechmazur https://www.advameg.com

AI Project

Public repos: 12

Public gists: 0

pgg_bench

Public Goods Game (PGG) Benchmark: Contribute & Punish is a multi-agent benchmark that tests cooperative and self-interested strategies among Large Language Models (LLMs) in a resource-sharing economic scenario. Our experiment extends the classic PGG with a punishment phase, allowing players to penalize free-riders or retaliate against others.

star: 33

fork: 2

created at: 2025-03-19

updated at: 2025-04-10
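The contribute-then-punish structure described for pgg_bench can be illustrated with a minimal sketch of one round. The listing does not state the benchmark's actual parameters, so the endowment, multiplier, punishment cost, and fine below are assumptions taken from the textbook form of the game, and the function name `pgg_round` is hypothetical.

```python
def pgg_round(contributions, punishments,
              endowment=10.0, multiplier=2.0,
              punish_cost=1.0, punish_fine=3.0):
    """Compute payoffs for one Public Goods Game round with punishment.

    contributions: dict of player -> amount put into the common pool
    punishments:   dict of punisher -> {target: punishment units}
    All numeric defaults are illustrative assumptions, not pgg_bench values.
    """
    n = len(contributions)
    # The pool is multiplied and split equally among all players.
    share = multiplier * sum(contributions.values()) / n
    payoffs = {p: endowment - c + share for p, c in contributions.items()}

    # Punishment phase: the punisher pays a cost, the target pays a fine.
    for punisher, targets in punishments.items():
        for target, units in targets.items():
            payoffs[punisher] -= punish_cost * units
            payoffs[target] -= punish_fine * units
    return payoffs


contributions = {"A": 10, "B": 10, "C": 10, "D": 0}  # D free-rides
punishments = {"A": {"D": 2}}                         # A spends 2 units on D
print(pgg_round(contributions, punishments))
# {'A': 13.0, 'B': 15.0, 'C': 15.0, 'D': 19.0}
```

Under these assumed numbers the free-rider D still earns the most, which is the tension the punishment phase is meant to probe.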

elimination_game

A multi-player tournament benchmark that tests LLMs on social reasoning, strategy, and deception. Players engage in public and private conversations, form alliances, and vote to eliminate each other.

star: 277

fork: 9

created at: 2025-02-22

updated at: 2025-06-10

step_game

Multi-Agent Step Race Benchmark: Assessing LLM Collaboration and Deception Under Pressure. A multi-player “step-race” that challenges LLMs to engage in public conversation before secretly picking a move (1, 3, or 5 steps). Whenever two or more players choose the same number, all colliding players fail to advance.

star: 49

fork: 2

created at: 2025-01-21

updated at: 2025-05-06
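The collision rule in the step_game description (moves of 1, 3, or 5 steps, with all players who pick the same number failing to advance) can be sketched as follows. This is only an illustration of that rule as stated; the function name `resolve_round` and the data layout are hypothetical and not taken from the repository.

```python
from collections import Counter

ALLOWED_MOVES = {1, 3, 5}  # step sizes named in the description


def resolve_round(positions, moves):
    """Return new positions after one round of secret move selection.

    positions: dict of player -> current step count
    moves:     dict of player -> chosen move (1, 3, or 5)
    """
    for player, move in moves.items():
        if move not in ALLOWED_MOVES:
            raise ValueError(f"{player} chose an invalid move: {move}")

    # Count how many players picked each number.
    counts = Counter(moves.values())

    new_positions = dict(positions)
    for player, move in moves.items():
        # Only a player whose number is unique this round advances.
        if counts[move] == 1:
            new_positions[player] += move
    return new_positions


positions = {"A": 0, "B": 0, "C": 0}
moves = {"A": 5, "B": 5, "C": 3}  # A and B collide on 5
print(resolve_round(positions, moves))
# {'A': 0, 'B': 0, 'C': 3}
```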

writing

This benchmark tests how well LLMs incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) in a short creative story.

star: 239

fork: 6

language: Batchfile

created at: 2025-01-05

updated at: 2025-06-13

divergent

LLM Divergent Thinking Creativity Benchmark. LLMs generate 25 unique words that start with a given letter and have no connection to each other or to 50 initial random words.

star: 30

fork: 1

created at: 2024-12-28

updated at: 2025-03-20

deception

Benchmark evaluating LLMs on their ability to create and resist disinformation. Includes comprehensive testing across major models (Claude, GPT-4, Gemini, Llama, etc.) with standardized evaluation metrics.

star: 24

fork: 2

created at: 2024-10-22

updated at: 2025-03-20

nyt-connections

Benchmark that evaluates LLMs using 436 NYT Connections puzzles.

star: 36

fork: 3

language: Python

created at: 2024-10-15

updated at: 2025-03-03

confabulations

Hallucinations (Confabulations) Document-Based Benchmark for RAG. Includes human-verified questions and answers.

star: 171

fork: 4

language: HTML

created at: 2024-10-10

updated at: 2025-06-11