Autonomous Agents

Autonomous Agents-research papers. Updated daily. See as well the Resources-section.

Research papers

Chronological order.

20th May 2025

ContextAgent: Context-Aware Proactive LLM Agents with Open-World Sensory Perceptions

ContextAgent: introduces a context-aware proactive LLM agent framework with Sensory Context Extraction (Extracts context from perceptions), Persona Context Extraction (Extracts context from historical data), Context-aware Reasoner (Integrates contexts, reasons, predicts), LLM (Core reasoning engine), Thought Traces (Generated reasoning steps), Proactive Predictions (Predicts need for service), External Tool Calling (Calls external tools), and Services (Provides assistance).
The framework leverages extensive sensory perceptions from wearables and persona contexts to understand user intentions and predict the need for proactive assistance.
ContextAgent utilizes tool-augmented LLM reasoning and introduces ContextAgentBench, a benchmark for evaluating such agents.

Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation

LAG (log-augmented generation): introduces a framework that directly reuses prior computation and reasoning from past logs at test time, utilizing a Log Store, Log Encoder, Log Retriever, and augmented Generator LM.
The framework represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for a subset of tokens.
This approach directly reuses prior reasoning and computations without additional steps for knowledge extraction, enhancing performance and efficiency.

Empowering LLMs in Task-Oriented Dialogues: A Domain-Independent Multi-Agent Framework and Fine-Tuning Strategy

DIMF (Domain-Independent Multi-Agent Framework): introduces a task-oriented dialogue system with Intent Classification Agent (extracts user intent), Slot Filling Agent (extracts dialogue slots), and Response Agent (generates system response), trained using SFT (initial model fine-tuning), DPO (preference-based training), and DDA (mitigates DPO degradation).
The framework separates complex tasks into domain-independent components to improve performance on lightweight large language models.
The proposed Data Distribution Adaptation method enhances DPO training stability and the framework demonstrates strong generalizability and zero-shot capabilities.

Safety Devolution in AI Agents

Core Evaluation Framework: introduces, "a framework to measure the impact of retrieval and alignment mechanisms on model bias and harmfulness", with Censored LLM, Uncensored LLM, Agents with Censored LLM, Generate Query, Search, ReRank, Crawl, Answer, WikiAgent, WebAgent, System-Level Safety Prompts, and Evaluator components, where "the framework systematically compares LLMs with and without retrieval augmentation and safety mitigations across various benchmarks".
The framework reveals that integrating external retrieval into safety-aligned LLMs leads to a phenomenon termed safety devolution, characterized by reduced refusal rates, increased bias, and degraded safety scores.
Controlled experiments within the framework indicate that this safety degradation is primarily caused by the mere presence of retrieved context, rather than retrieval depth or accuracy, highlighting a structural vulnerability in RAG systems.

DSMENTOR: ENHANCING DATA SCIENCE AGENTS WITH CURRICULUM LEARNING AND ONLINE KNOWLEDGE ACCUMULATION

DSMentor: introduces a framework with a Mentor agent (curriculum designer) that processes a Dataset (input tasks) to create a Curriculum-based dataset (ordered tasks), which is then used by a Student agent (code generator) interacting with a Long-term memory (accumulated knowledge) and an Environment (evaluates code) for problem-solving.
The framework operates in two stages: curriculum generation and problem-solving, leveraging curriculum learning and online knowledge accumulation.
The Mentor agent determines task difficulty to sequence problems from easy to hard, guiding the Student agent's learning progression.

MM-Agent: LLM as Agents for Real-world Mathematical Modeling Problem

MM-Agent: introduces an expert-inspired framework that decomposes mathematical modeling into four sequential phases: Problem Analysis, Mathematical Modeling, Computational Solving, and Solution Reporting.
The framework utilizes specialized agents like the Analyst Agent, Task Coordinator Agent, Modeling Actor, Modeling Critic, Modeling Programmer Agent, and Reporting Agent to handle distinct tasks within each phase.
Key components such as the Hierarchical Mathematical Modeling Library (HMML) and MLE-Solver support knowledge retrieval, model formulation, and computational execution for real-world problems.

s3: You Don't Need That Much Data to Train a Search Agent via RL

s3: introduces a modular, RL-based search framework with a Searcher LLM (RL-trained agent), Search Engine (retrieval source), frozen Generator LLM (frozen answer generator), and Gain Beyond RAG (reward signal), which trains a search-only agent using a novel reward signal to optimize retrieval for generation quality.
The framework decouples the searcher from the generator, allowing the searcher to be trained with reinforcement learning based on the improvement in generator accuracy using retrieved documents compared to naive retrieval.
By focusing training solely on the searcher using a generation-aware reward, s3 achieves strong performance with significantly less training data and is compatible with black-box generator LLMs.

Building a Stable Planner: An Extended Finite State Machine Based Planning Module for Mobile GUI Agent

SPlanner: introduces a framework for mobile GUI agents that includes Application Modeling via EFSM (models applications), Structured Knowledge Base (collection of EFSMs), Plan Generation (creates execution plan), Instruction Parsing (parses user instruction), EFSM Solving (finds execution path), Path Polishing (refines execution path), Task Execution with VLM (executes the plan), Vision-Language Model (VLM) (executes GUI actions), LLM (parses/polishes text), BFS-based Solver (finds path in EFSM), User Instruction (input command), GUI Screenshot (current screen state), Action History (previous actions), Task Plan (step-by-step guide), and Operation Instruction (GUI action).
The framework models mobile applications using Extended Finite State Machines (EFSMs) to create a structured knowledge base for planning.
SPlanner generates interpretable and reliable execution plans by parsing user instructions, solving EFSMs, and polishing the resulting paths using LLMs, which are then executed by a VLM.

BAR: A Backward Reasoning based Agent for Complex Minecraft Tasks

BAR (Backward Reasoning based Agent): introduces an agent for complex Minecraft tasks with recursive goal decomposition (Recursive goal decomposition), state consistency maintaining (State conflict resolution), and stage memory (Memory from environment interaction) modules.
The agent utilizes backward reasoning to plan from the terminal state, aiming to overcome the perception gap faced by forward reasoning in complex tasks.
State consistency is ensured by integrating forward and backward reasoning, and planning efficiency is enhanced by leveraging successful past interactions stored in stage memory.

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

ReasonRAG: introduces, "Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning", with all LLM (Core reasoning model), Retriever (External knowledge access), Reasoning Stage (Decide query or answer), Grounding Stage (Extract evidence), Terminal Stage (Final answer state), Query Generation (Formulate search query), Evidence Extraction (Identify relevant text), Answer Generation (Produce final response), Memory (Stores previous steps)-components, where ReasonRAG is a process-supervised agentic RAG method using fine-grained rewards for policy optimization.
The framework employs Monte Carlo Tree Search and Shortest Path Reward Estimation to generate a high-quality process-level dataset, RAG-ProGuide, for training.
ReasonRAG enables LLMs to autonomously manage dynamic retrieval, iterative context refinement, and adaptive workflows for complex search queries.

Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning

SPLIT-RAG (Semantic Partitioning of Linked Information for Type-Specialized Multi-Agent RAG): introduces a multi-agent RAG framework with Knowledge Base Preprocessing (Prepare data), QA Input Processing (Analyze query), Retrieval Plan Decision (Determine subgraphs/agents), Multi-Agent RAG (Distributed retrieval), Answer Generation (Combine, resolve, finalize), Lightweight LLM Agents (Query subgraphs), and Head Agent (Final answer generation), which partitions knowledge graphs based on question types and uses multiple agents for efficient, conflict-resistant retrieval and answer generation.
The framework employs question-driven graph partitioning to create semantically coherent subgraphs, enabling lightweight agents to query only relevant partitions in parallel.
A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications, with a head agent synthesizing the final response.

MLZero: A Multi-Agent System for End-to-end Machine Learning Automation

MLZero: introduces a multi-agent system for end-to-end machine learning automation, featuring Perception, Semantic Memory, Episodic Memory, and Iterative Coding modules, coordinated by specialized agents including File Grouping and File Perception, Task Perception, ML Library Selection, Condensation, Summarization, Retrieval, Error Analyzer, Coder, and Executer agents.
The system processes raw multimodal data through perception, leverages dual memory modules for knowledge and history, and employs iterative coding with agents for code generation, execution, and debugging.
MLZero achieves end-to-end ML automation with minimal human intervention by transforming raw data into ready-to-use models and predictions through this integrated multi-agent architecture.

DRUGPILOT: LLM-BASED PARAMETERIZED REASONING AGENT FOR DRUG DISCOVERY

DrugPilot (LLM-based parameterized reasoning agent): introduces an agent system for automating multi-stage drug discovery workflows, comprising LLM Backbones (Core language model), Parameterized Memory Pool (PMP) (Structured key-value data storage), AI Model Zoo (Drug discovery tools/models), and Fe-Fo Mechanism (Error feedback and focus).
The Parameterized Memory Pool (PMP) is a core component designed to handle large-scale, multi-modal drug data by converting it into standardized parametric representations for efficient retrieval and interaction.
The Fe-Fo Mechanism enhances the agent's robustness by providing specific error feedback and maintaining focus during complex multi-turn tasks and tool interactions.

CLEVER: A Curated Benchmark for Formally Verified Code Generation

CLEVER (Curated Lean Verified Code Generation Benchmark): introduces a benchmark for formally verified code generation, requiring models to perform specification generation, isomorphism proving, Lean implementation generation, and correctness proving to achieve end-to-end verification.
The benchmark evaluates models in two stages: specification certification (generating and proving equivalence of a Lean specification) and implementation certification (generating and proving correctness of a Lean implementation).
Success in CLEVER requires both the generated specification and implementation to be formally certified via Lean proofs, ensuring semantic correctness beyond test cases.

14th May 2025

AlphaEvolve: A coding agent for scientific and algorithmic discovery

AlphaEvolve: introduces an evolutionary coding agent that orchestrates an autonomous pipeline including a User defining the task, Task Specification, an Initial Program, an Evaluation Function, a Prompt Sampler, an LLMs Ensemble generating Code Modifications, an Evaluators Pool executing and scoring programs, a Program Database storing results and guiding evolution, and a Distributed Controller Loop orchestrating the process to find the Best Program.
The system iteratively improves algorithms by making direct code changes using an evolutionary approach, continuously receiving feedback from evaluators.
AlphaEvolve leverages state-of-the-art LLMs and automated evaluation to discover novel algorithms and optimize computational infrastructure.

13th May 2025

Enhancing Software Development with Context-Aware Conversational Agents: A User Study on Developer Interactions with Chatbots

Rasa-based Chatbot Prototype: introduces a study using a prototype built on the Rasa chatbot platform, including NLU, Dialogue Management, Facebook Messenger, and RASA webhook components, to investigate software developers' preferences and requirements for conversational agents.
The study employed a mixed-methods approach with 29 developers interacting with the prototype via Facebook Messenger based on a predefined scenario.
Findings from the interactions, questionnaires, and interviews aim to inform the design of context-aware chatbots for software development tasks like task and repository management.

TRAIL: Trace Reasoning and Agentic Issue Localization

TRAIL: introduces a formal taxonomy (Classifies agent errors) and a dataset of human-annotated traces from agentic workflows, including Manager Agent (Orchestrates tasks), Search Agent (Performs web search), and various Tools (External functions/APIs).
The paper evaluates the ability of large language models to act as judges for debugging complex agentic workflow traces using the proposed taxonomy and dataset.
Evaluation results show that current state-of-the-art models perform poorly at identifying and localizing errors within these traces, highlighting the challenge of evaluating complex agentic systems.

The Truth Becomes Clearer Through Debate! Multi-Agent Systems with Large Language Models Unmask Fake News

TED (TruEDebate): introduces a multi-agent system for fake news detection, simulating a structured debate process with DebateFlow Agents and analyzing the outcome with InsightFlow Agents.
The DebateFlow Agents organize LLM-powered agents into Proponents and Opponents teams that engage in Opening Statement, Cross-examination and Rebuttal, and Closing Statement stages.
The InsightFlow Agents, consisting of a Synthesis Agent for summarization and an Analysis Agent utilizing a Role-aware Encoder, Debate Graph, and News-Debate Interactive Attention, predict the news truth value.

Strategy-Augmented Planning for Large Language Models via Opponent Exploitation

SAP (Strategy-Augmented Planning framework): introduces a two-stage framework with LLM (Identifies opponent strategy), LLM (Generates action plan), Strategy Space Ξ (Explicit strategy dimensions), Strategy Set Dξ (Generated strategy library), SEN U (Strategy Evaluation Network), Trajectory Extractor E (Summarizes environment trajectory), abstract trajectory Tabs (Summarized observation data), Strategy Search (Finds optimal counter strategy), best response strategy ξ¹,* (Optimal counter strategy), Expert Tips H (Guides LLM planning), and Environment (Simulation environment), designed to enhance LLM-based agents' opponent exploitation in competitive environments.
The offline stage of SAP involves LLM generating strategies within the Strategy Space, evaluating them in the Environment to create a Strategy Set and Battle Result Dataset, which trains the SEN.
In the online stage, SAP uses the Trajectory Extractor to summarize observations, the LLM as Recognizer to identify the opponent's strategy, the SEN and Strategy Search to find the best response, and the LLM as Planner, guided by Expert Tips, to generate the final action Plan.

Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models

MRLMN (Multi-agent Reinforcement learning with Large language model in Multi-hop Networking): introduces a framework integrating MARL and LLMs for scalable UAV multi-hop networking.
The framework includes MARL agents with policy/critic networks, enhanced by information aggregation, agent grouping, reward decomposition, and behavioral constraints.
It leverages an LLM agent, knowledge distillation, bipartite matching, an LLM verifier, and prompt engineering to guide MARL training and improve exploration.

Benchmarking AI scientists in omics data-driven biological research

BaisBench (Biological AI Scientist Benchmark): introduces a benchmark for evaluating AI scientists in biological research with two tasks, BAIS-CTA (Cell type annotation task) and BAIS-SD (Scientific discovery task).
BAIS-CTA assesses cell type identification on single-cell datasets, while BAIS-SD evaluates reasoning and insight generation through multiple-choice questions based on data analysis.
The benchmark uses real biological omics data and compares AI performance to human experts, highlighting current limitations in data-driven scientific discovery.

Aitomia: Your Intelligent Assistant for AI-Driven Atomistic and Quantum Chemical Simulations

Aitomia: introduces an intelligent assistant platform for AI-driven atomistic and quantum chemical simulations, with Chatbot (user interface), AI Agents (task execution), LLMs (fine-tuned models), Rule-based Agents (fail-safe logic), Retrieval-Augmented Generation (RAG) system (knowledge retrieval), MLatom ecosystem (computational backend), Cloud Computing Services (simulation execution), and Database (information storage).
The platform leverages fine-tuned large language models, rule-based agents, and a retrieval-augmented generation system to assist users in setting up, running, and analyzing simulations.
Aitomia integrates with the MLatom ecosystem and cloud computing services like Aitomistic Hub and XACS to provide a wide range of computational chemistry capabilities.

DSADF: Thinking Fast and Slow for Decision Making

DSADF (Dual-System Adaptive Decision Framework): introduces a framework integrating System 1 (Fast thinking component) with RL Agent (Goal-conditional action selection) and Memory Space (Stores task proficiency), and System 2 (Slow thinking component) with VLM (Vision Language Model) acting as Planner (Decomposes tasks, reflects) and Auxiliary Performer (Handles unfamiliar tasks), utilizing CLIP (Image to text), Image Encoder (Encodes image observation), Text Encoder (Encodes text observation/goal), and Self-reflection (Evaluates and refines plans) for generalized decision making.
The framework draws inspiration from Kahneman's dual-process theory, leveraging the RL agent for fast, intuitive responses and the VLM for slow, analytical reasoning and planning.
DSADF demonstrates improved efficiency and generalization in complex environments by dynamically allocating tasks between the fast and slow systems based on task familiarity and agent proficiency.

12th May 2025

Putting It All into Context: Simplifying Agents with LCLMs

State-in-Context Agent: introduces a simplified agent architecture using LCLMs (Processes large context) to process the entire Environment (Code repository state) as Context (Input to LM), eliminating complex scaffolding to produce a Solution (Output patch).
The approach leverages LCLMs' long-context capabilities for full observability, transforming open-ended tasks into direct, close-ended problems.
Variations include a Compressor (Ranks/selects files) for large environments and a SELECTSOLVE method combining LCLMs via a Selector (LCLM identifies files) with SCLMs (Superior problem-solving) for repair.

Agent RL Scaling Law: Spontaneous Code Execution for Mathematical Problem Solving

ZeroTIR: introduces a framework for training a base LLM agent to spontaneously use a code execution environment for mathematical problem solving via reinforcement learning, including an RL trainer, value network, replay buffer, interaction logic, and reward signal.
The framework utilizes outcome-based rewards and techniques like dynamic stop tokens and replay buffer filtering to enable the LLM agent to learn effective tool use strategies.
ZeroTIR demonstrates an Agent RL Scaling Law where training progression correlates with increased code usage frequency, response length, and task accuracy, outperforming non-tool baselines.

Codifying Character Logic in Role-Playing

Codified Profiles: introduces a framework that represents character logic as structured, executable functions, including Codified Profile (Executable logic), parse_by_scene function (Outputs triggered statements), check_condition function (Evaluates scene conditions), Role-playing LLM (Generates character response), Scene (Input context), Triggered Statements (Guide LLM response), Groundtruth Reference (Evaluation target), NLI Scoring (Compares response to reference), Profile Update (Revises codified logic), Randomness Components (Control behavioral variability), Textual Profile (Original character description), and Distilled Condition Checker (Efficient condition evaluation), enabling persistent, updatable, and controllable role-playing.
The approach compiles natural language character descriptions into executable code, offloading complex reasoning from the LLM to deterministic control logic.
Experiments demonstrate improved behavioral consistency, adaptability, and diversity compared to prompt-based methods, particularly benefiting smaller language models.

ARE LLMS COMPLICATED ETHICAL DILEMMA ANALYZERS?

Evaluation Framework: introduces, a novel evaluation framework, with Data Retrieval (Collects ethical dilemmas), Preprocessing (Structures data), Text Processing (Generates/formats responses), LLMs (Generate dilemma responses), Human (Provide baseline responses), Structured Output (Formatted responses), Human Evaluation (Collects human feedback), Benchmark Metrics (Quantitative evaluation methods), Metrics Weighting (Assigns metric importance), and Aggregated Score (Final performance score), where the framework assesses LLM performance on ethical dilemmas using structured responses and quantitative metrics.
The framework utilizes a dataset of ethical dilemmas with expert and non-expert responses, processed into a five-section structured format for component-wise evaluation.
Performance is measured using a composite metric combining lexical, n-gram, embedding, and semantic similarity scores, weighted based on inversion analysis and AHP.

HYPERNYM MERCURY: Token Optimization through Semantic Field Constriction and Reconstruction from Hypernyms. A New Text Compression Method

Hypernym Mercury: introduces a novel text compression method using Field Constriction, which involves Parsing and Structuring input text into a Dart intermediate representation, performing Detail Importance Evaluation, and applying Compression Optimization to the dart.
The Dart structure splits information into a core statement and attached details, allowing for controllable granularity during Recomposition back into text.
Multi-Model Verification ensures semantic fidelity of the compressed output by checking against independent models.

FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Graph-informed adversarial multi-agent interaction framework: introduces a system to generate diverse, challenging over-refusal queries using interacting agents and LLM validation, including Generator, Discriminator, LLM Refusal Validation, and Orchestrator components.
The framework is guided by an Entity Graph extracted from safety-related datasets and uses Feedback between agents to refine generated prompts.
This iterative process produces Collected Over-refusal Queries that appear unsafe but are objectively benign, simulating scenarios where LLMs might over-refuse.

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

Web-Bench (Evaluation System): introduces a benchmark and evaluation system for LLM code generation on web development tasks, including an Evaluator, Web-Agent, LLM, Web-Bench Dataset, Tasks, Projects, E2E Tests, and Generated Files.
The system evaluates LLMs on sequential coding tasks within projects, simulating real-world web development workflows based on Web Standards and Frameworks.
The Evaluator orchestrates the process, using the Web-Agent to interact with the LLM, and verifies the generated code against E2E tests.

MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

MLE-Dojo: introduces, an interactive Gym-style framework for training, evaluating, and benchmarking autonomous LLM agents in machine learning engineering workflows, with MLE-Agent (LLM-based assistant), Environment (Task-specific interactive space), Error (Encodes error types), Interface (Governs action execution), Feedback (Translates outcomes to guide), Metric (Defines evaluation metrics), Task Space (Collection of tasks), Docker container (Isolates task execution), Sandbox (Executes agent code safely), Observation Space (Environment state information), Dataset Information (Task data details), Evaluation Metric Scores (Performance metrics), Code Execution Results (Outcome of code runs), Error Messages (Debugging information), Interaction History (Record of interactions), Action Space (Agent's possible operations), request_info (Action to query task info), validate_code (Action for syntax/runtime check), execute_code (Action for full execution/submission), get_history (Action to retrieve past interactions), reset (Action to restart environment), Reward Space (Signal for performance), HumanRank Score (Relative performance metric), Agent Scaffolds (Agent implementations), MLE Agent (Minimalistic agent design), and AIDE (Iterative problem-solving agent), enabling systematic experimentation and rigorous evaluation on 200+ real-world Kaggle challenges.
The framework provides a fully executable environment supporting comprehensive agent training via supervised fine-tuning and reinforcement learning, facilitating iterative experimentation and real-time outcome verification through structured feedback loops.
MLE-Dojo features a modular and extensible architecture that decouples agent capabilities from the environment, promoting interoperability, scalability, and reproducibility, and is open-sourced to foster community-driven innovation.

KAQG: A Knowledge-Graph-Enhanced RAG for Difficulty-Controlled Question Generation

KAQG (Knowledge Augmented Question Generation): introduces a framework that fuses knowledge graphs, RAG retrieval, and educational assessment theory into a pipeline for difficulty-controlled question generation.
The framework includes a KAQG-Retriever for building a Knowledge Graph from educational materials and a KAQG-Generator for creating and evaluating questions based on the graph and assessment theory.
Implemented using an AI Agents Framework, the system operationalizes difficulty metrics and demonstrates strong performance in generating psychometrically sound exam items.

Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

IKEA: introduces Reinforced Internal-External Knowledge Synergistic REasoning Agent, with LLM Agent, Environment (Search Engine/Retriever/Text Corpus), Reward Model, Knowledge-boundary aware reward function, Knowledge-boundary aware training dataset, Reinforcement Learning (GRPO), and Special Tags components, which trains an efficient adaptive search agent to synergistically integrate internal and external knowledge.
The agent learns to identify its knowledge boundary, prioritizing internal knowledge and resorting to external search only when necessary, guided by a novel reward function and training data.
This approach aims to reduce redundant retrievals, mitigate knowledge conflicts, and improve inference efficiency compared to methods relying solely on internal or external knowledge.

YuLan-OneSim: Towards the Next Generation of Social Simulator with Large Language Models

YuLan-OneSim: introduces a novel social simulator, with Scenario Auto-Construction Subsystem (User input to code), Simulation Subsystem (Execute and manage simulation), Feedback-driven Evolving Subsystem (Improve LLMs via feedback), and AI Social Researcher Subsystem (Automate social science research), designed for code-free scenario construction, large-scale simulation, evolvability, and automating the social science research loop.
The simulator is built upon four core subsystems that handle scenario creation, simulation execution, model refinement, and autonomous research tasks.
YuLan-OneSim aims to advance LLM-based social simulation by enabling automatic scenario construction, autonomous evolution, and completing the full research cycle.

Learning to Reason and Navigate: Parameter Efficient Action Planning with Large Language Models

PEAP-LLM (Parameter Efficient Action Planner using Large Language Models): introduces a novel parameter-efficient action planner for embodied agents, consisting of an LLM goal planner (LGP) that extracts task goals and a LoRA action planner (LAP) that generates single-step instructions using a fine-tuned LLM.
The framework utilizes a Base LLM for goal planning and a Fine-tuned LLM for action planning, with fine-tuning performed via supervised fine-tuning (SFT) and direct preference optimization (DPO) using specific datasets.
PEAP-LLM integrates with a Policy Model that predicts the next action based on high-level instructions, generated single-step instructions, and visual observations processed by Object Retrieval and State Text Generator components.

Can Generative AI agents behave like humans? Evidence from laboratory market experiments

LLM Agent Simulation: introduces, "explore the potential of Large Language Models (LLMs) to replicate human behavior in economic market experiments", with LLM Agents (Simulate human participants), OpenAI API (Interface for LLMs), Model (GPT-3.5 or GPT-4), Temperature (Controls response randomness), Context Window (Total text model considers), Memory (Number of previous messages), Seed (Initializes random generator), Market Environment (Simulated economic market), Feedback Mechanism (Positive or negative price feedback), where "the framework simulates market dynamics by having LLM agents predict prices iteratively based on market information and their own history".
The simulation compares LLM agent behavior to human participants in positive and negative feedback markets, analyzing market dynamics and forecasting strategies.
Key parameters like memory and temperature significantly influence LLM agent behavior and their ability to replicate human-like market dynamics and bounded rationality.

Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption

Private LoRA Fine-tuning: introduces an interactive client-server protocol for private fine-tuning of open-source LLMs, with Client (orchestrates training, non-linear operations), Server (linear operations under HE), Homomorphic Encryption (HE) (enables encrypted computation), LoRA Weights (U, D) (client-side adaptation parameters), and Base Model Weights (W) (server-side public parameters) components.
The client manages private data and LoRA weights while performing non-linear computations locally.
The server handles computationally intensive linear operations on public base model weights using homomorphic encryption on client-provided encrypted activations.

Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

Multi-Agent Reasoning System: introduces a system with Agents (Individual LLMs), Expertise Specialization (Domain-specific roles), Collaboration Paradigm (Interaction mechanism), Communication Protocol (Information exchange), and System Scale (Number of agents) to investigate collaborative reasoning performance.
The study empirically evaluates how expertise-domain alignment, collaboration paradigm (structured workflow vs. diversity-driven), and system scale affect collective reasoning.
Findings indicate that expertise alignment is domain-contingent, diversity-driven collaboration outperforms structured workflows, and increasing agents generally boosts performance with diminishing returns.

UAV-CodeAgents: Scalable UAV Mission Planning via Multi-Agent ReAct and Vision-Language Reasoning

UAV-CodeAgents: introduces a scalable multi-agent framework for autonomous UAV mission generation, utilizing Airspace Management Agent (AMA), UAV Agent, LLM, VLM, ReAct, Pixel-Pointing Grounding Mechanism, smolagents framework, Message-passing interface, and Tools to interpret instructions and generate UAV trajectories.
The system leverages the ReAct paradigm for iterative reasoning and dynamic adaptation, enabling agents to reflect on observations and revise mission goals in evolving environments.
A key component is the vision-grounded pixel-pointing mechanism, which facilitates precise localization of semantic targets on aerial maps for spatial grounding and context-aware flight routes.

DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

DynamicRAG: introduces a novel RAG framework, with Retriever (Retrieves documents), Dynamic Reranker (Dynamically adjusts documents), Generator (Generates final answer), Reward Function (Evaluates response quality), Reinforcement Learning (Optimizes reranker agent), Direct Preference Optimization (RL optimization method), and Behavioral Cloning (Initial reranker training), where the reranker dynamically adjusts document order and number using LLM feedback and RL.
The reranker is modeled as an RL agent trained via behavioral cloning and DPO, leveraging LLM output quality as reward signals.
This dynamic reranking approach enhances RAG system efficiency and effectiveness by optimizing the generator's input based on query context.

Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMS

SENATOR (Structural Entropy-guided Knowledge Navigator): introduces a framework for detecting and repairing LLM knowledge deficiencies, with MCTS (Knowledge graph exploration), LLM (Model under evaluation), KG (External knowledge source), Structural Entropy (Exploration reward signal), Synthetic Data (Generated training samples), and SFT (Model fine-tuning) components.
The framework employs MCTS guided by structural entropy on a knowledge graph to efficiently explore and identify areas where the LLM exhibits high uncertainty or knowledge deficiencies.
Based on the identified high-uncertainty paths, SENATOR generates targeted synthetic data used for supervised fine-tuning to repair the LLM's knowledge deficiencies.

11th May 2025

Exploring Anthropomorphism in Conversational Agents for Environmental Sustainability

Washy: introduces a system integrating a Conversational Agent (User interface) powered by an LLM (Language model) using a Function Calling API (LLM tool interface) to interact with External API (Solar data source), a Smart Plug (Appliance controller), a Scheduler (Slot/notification management), and a Database (Data storage), supported by a Backend (Server logic), Client (User applications), and Notification System (Alert delivery), to help users schedule Washing Machine (Physical appliance) cycles based on solar energy availability.
The system compares a Personified Agent and a Traditional Assistant interface to evaluate the impact of anthropomorphism on user interaction and eco-friendly behavior adoption.
A lab study assessed the system's effectiveness in promoting sustainable home energy management and the influence of agent personality on user engagement and rapport.

Architectural Precedents for General Agents using Large Language Models

Cognitive Design Patterns: introduces recurring patterns of processes, representations, and memories found in cognitive architectures and Agentic LLM Systems, including Observe-decide-act, 3-stage memory commitment, Hierarchical decomposition, Short-term (context) memory, Ahistorical KR/memory, Historical KR/memory, Procedural KR/memory, Reconsideration, Knowledge compilation, and Step-wise reflection.
The paper analyzes how these cognitive design patterns are evident in existing Agentic LLM systems and identifies patterns apt for future exploration.
Examining these patterns helps predict gaps and deficiencies in current LLM systems and suggests future research directions towards general intelligence.

Can LLM-based Financial Investing Strategies Outperform the Market in Long Run?

FINSABER (Financial Investing Strategy Assessment with Bias mitigation, Expanded time, and Range of symbols): introduces a comprehensive framework for benchmarking LLM timing-based investing strategies, with Multi-source Data Module (Integrates diverse financial data), Strategies Base Module (Covers selection and timing strategies), Bias-Mitigated Backtest Pipeline (Supports robust backtesting), Selection-based Strategy (Identifies asset subset), Timing-based Strategy (Dictates buy/sell/hold decisions), Traditional Rule-based (Uses technical indicators/rules), Predictor-based (Relies on data-driven models), RL-based (Learns optimal investing policies), LLM-based (Leverages large language models), Rolling Window Test (Evaluates across multiple periods), and Evaluation Metrics (Measures strategy performance).
The framework integrates 20 years of multi-source data, expands symbol coverage, and explicitly mitigates survivorship, look-ahead, and data-snooping biases.
FINSABER supports robust and reproducible benchmarking across diverse experimental setups to provide empirical guidance for LLM-based investment research.

DialogueReason: Rule-Based RL Sparks Dialogue Reasoning in LLMS

DialogueReason: introduces a dialogue-based reasoning pattern, with System Prompt (input instruction), Adaptive Thinking Pattern Config (configuration), Thinking Process (iterative simulation), and Final Answer (output).
The Adaptive Thinking Pattern Config includes Agent Config (agent roles), Environment Config (setting), and Interaction Config (communication rules).
The Thinking Process involves iterative Agent-Agent Interaction (dialogue) and Agent-Environment Interaction (task progression).

Seed1.5-VL Technical Report

Seed1.5-VL: introduces a vision-language foundation model composed of Seed-ViT (Vision encoder (encode images/videos)), MLP Adapter (Project visual features), and Large Language Model (LLM) (Process multimodal inputs (MoE)).
The Seed-ViT vision encoder handles dynamic image resolutions, while the LLM is a 20B active parameter Mixture-of-Experts model.
The model is designed for general-purpose multimodal understanding and reasoning across diverse tasks.

The Wisdom of Agent Crowds: A Human-AI Interaction Innovation Ignition Framework

Brainwrite: introduces a human-AI interaction framework for multi-agent brainstorming, incorporating Human (User), LLM (Large Language Model), Cothinker (Interactive module), Internet (External information source), Knowledge Base (Internal information source), and Mindmap (Structured text summary) components.
The framework utilizes LLMs and a Cothinker module to assist human users in topic definition, deep exploration, and output generation, drawing information from the Internet and a Knowledge Base.
The system aims to reduce user cognitive load through structured text summaries like Mindmaps and enhance viewpoint diversity in complex financial analysis tasks.

EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation

EcoLANG: introduces a two-stage paradigm, with Language Evolution (create efficient language) and Language Utilization (agents use evolved language), to induce efficient and effective agent communication language for large-scale social simulations.
The Language Evolution stage comprises Vocab Compression (reduce vocabulary size) via Semantic Clustering (group words by meaning), Intra-Cluster Selection (filter words within groups), and Tokenization (map words to tokens), alongside Rule Evolution (evolve communication rules) through Initialization (start with initial rules), Communication (simulate agent dialogues), Selection (evaluate and select rules), Crossover & Mutation (generate new rules), and Update and Iteration (refine rule population).
The Language Utilization stage applies the evolved language by modifying LLM decoding and incorporating rules into prompts, enabling agents to communicate more efficiently in social simulations.

Towards Human-Centric Autonomous Driving: A Fast-Slow Architecture Integrating Large Language Model Guidance with Reinforcement Learning

Fast-Slow Architecture: introduces a human-centric decision-making framework integrating an LLM-Based Slow System and an RL-Based Fast System, designed to interpret high-level user instructions and execute real-time control.
The LLM-Based Slow System processes user commands and scene context using components like Human-Language Parsing and CoT Analytic Reasoning, referencing a Memory Bank to generate structured Human-Centric Instruction.
The RL-Based Fast System, utilizing Instruction and Scenario Encoders and a Multi-Head Attention-based Actor-Critic network, executes actions validated by a Safety Mask via a PID Controller, balancing user preference and safety.

ThreatLens: LLM-guided Threat Modeling and Test Plan Generation for Hardware Security Verification

ThreatLens: introduces, with Threat Identification Agent (identifies physical/supply threats), Security Policy Generator Agent (extracts security policies), Test Plan Generator Agent (generates test plans), LLM (performs reasoning/generation), RAG (retrieves relevant knowledge), System-User Conversation (interacts with engineer), Security Knowledge Dataset (stores threat models), and Design Spec. & ISA document (input design information), a multi-agent framework automating hardware security threat modeling and test plan generation.
The framework leverages LLMs for reasoning and generation, RAG for efficient knowledge retrieval from datasets and documents, and interactive conversation with verification engineers.
ThreatLens aims to reduce manual effort, enhance coverage, and provide a structured approach for hardware security verification by automating threat identification and test plan formulation.

Control Plane as a Tool: A Scalable Design Pattern for Agentic AI Systems

Control Plane as a Tool: introduces a design pattern for Agentic AI systems that modularizes tool orchestration using a Request Router, Registration Module, Invocation Module, Input Validator, Intent Resolver, Intent Validator, Routing Handler, Feedback Integrator, Output Validator, Failure Handler, Usage tracker, Agent Registry, Tool Registry, validation rules, Metrics Registry, and Log/DB.
This pattern decouples tool management from agent logic, enabling dynamic tool selection, governance, and extensibility across multiple agents.
The architecture provides a single tool interface to agents while encapsulating complex routing and validation logic internally.

10th May 2025

VTutor: An Animated Pedagogical Agent SDK that Provide Real Time Multi-Model Feedback

VTutor: introduces an animated pedagogical agent SDK, with Large Language Model Integration (AI-generated text responses), Text-to-Speech (text to audio conversion), LipSync Module (audio to avatar mouth movements), and WebGL-Based Rendering (Unity environment web embedding), designed for real-time multi-model feedback in education.
The SDK leverages lightweight WebGL, Unity, and JavaScript frameworks to convert LLM text outputs into audio and then render a real-time, lip-synced pedagogical agent.
VTutor provides on-demand, personalized feedback using anime-like aesthetics to avoid the uncanny valley effect and enhance engagement.

9th May 2025

Reliable Collaborative Conversational Agent System based on LLMs and Answer Set Programming

AutoManager: introduces a dual-agent system with Administrator Bot (Manages knowledge base) and Assistant Bot (Interacts with customers) that share a Knowledge Base (Shared facts and menu), Temporary Information (Shared session data), and Collaborative Rule Set (Shared collaboration rules), utilizing Knowledge Extraction (Natural language to predicates), Commonsense Reasoning (Predicate reasoning with ASP), and Response Generation (Predicates to natural language) for reliable collaborative task-oriented dialogue.
The system leverages Large Language Models for natural language processing and Answer Set Programming for robust logical reasoning and consistency checking within each agent.
This architecture enables reliable collaboration between agents by sharing knowledge and rules, demonstrated in a fast-food restaurant management scenario.

SCALEMCP: DYNAMIC AND AUTO-SYNCHRONIZING MODEL CONTEXT PROTOCOL TOOLS FOR LLM AGENTS

ScaleMCP: introduces a novel tool selection approach, with Agent, MCP Retrieval Tool, Automatic Indexing Pipeline, MCP Storage Index, and MCP Servers, enabling LLM agents to dynamically discover and equip Model Context Protocol (MCP) servers as tools.
The framework features an auto-synchronizing tool storage system pipeline that uses CRUD operations with MCP servers as the single source of truth to maintain the MCP storage index.
LLM agents are equipped with an MCP retrieval tool, allowing them to autonomously query the storage index and invoke relevant MCP servers during multi-turn interactions.

A New DAPO Algorithm for Stock Trading

Improved DAPO Algorithm: introduces a novel trading agent integrating GRPO, Decoupled Clipping, Dynamic Sampling, and Sentiment-Risk Adjusted Rewards for financial trading.
The approach adapts DAPO principles to a GRPO framework, incorporating LLM-based risk and sentiment signals into an adjustable reward function.
This method demonstrates improved performance and significantly reduced computational requirements compared to a CPPO-DeepSeek baseline on the NASDAQ-100 index.

LATENT: LLM-Augmented Trojan Insertion and Evaluation Framework for Analog Netlist Topologies

LATENT (LLM-Augmented Trojan Insertion and Evaluation Framework for Analog Netlist Topologies): introduces a framework for generating stealthy analog Trojans, utilizing an LLM Agent (autonomous agent) to modify a Circuit Netlist (analog circuit design), validated by a Syntax Checker (validates Trojan syntax), simulated by HSPICE (circuit simulator), evaluated by SPICED (LLM-based detection tool), and refined via Feedback-driven learning (iterative strategy refinement).
The framework employs a Thought-Action-Observation loop where the LLM agent iteratively selects and inserts Trojan components based on detection feedback to evade detection.
By integrating simulation and detection tools into the iterative process, the framework generates diverse, circuit-specific analog Trojans with low activation ranges and significant performance degradation upon activation.

Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients

PPO-TS (PPO-based adversarial attack framework): introduces a novel threat vector that leverages a PPO Attack Agent to generate Adversarial Waveforms interfering with Client Sensors, resulting in Clustered Updates that induce Rowhammer bit-flips on Server DRAM.
The framework utilizes reinforcement learning to manipulate client sensor observations, maximizing server repetitive memory updates necessary for Rowhammer exploitation.
This approach enables remote Rowhammer attacks on federated learning servers without requiring direct access or system-level privileges.

Multi-Agent Systems for Robotic Autonomy with LLMS

Multi-Agent System: introduces a framework for robotic autonomy, with Task Analyst (analyzes task input), Robot Designer (designs robot configuration), RL Designer (generates RL components), Code & Report Extractor (extracts code/reports), RL Execution (runs RL training/evaluation), Figures (visualizes results), and Report (summarizes analysis/results).
The system takes task scenario descriptions as input and outputs multimodal results including code files, technical reports, and visualizations.
This framework enables autonomous robotic task analysis, mechanical design, and path generation using LLMs and reinforcement learning.

APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

APOLLO: introduces a modular pipeline combining LLM, Lean Server, Syntax Refiner, Sorrifier, Auto Solver, Subproof Extractor, and Proof Assembler.
The pipeline directs LLM proof generation, analyzes and fixes errors using Lean feedback, isolates failing sub-lemmas, and applies automated solvers.
This iterative process repairs and recombines sub-proofs, improving proof generation efficiency and correctness with lower sampling budgets.

EVOLUTIONARY ECOLOGY OF WORDS

Word Evolutionary Ecology Model: introduces a model utilizing a Large Language Model for word creation, interaction judgment, and mutation within a spatial agent-based simulation, where individuals possessing words compete and evolve.
The model simulates the evolutionary ecology of words by having agents with words move in a grid, compete based on LLM-determined outcomes, and mutate their words using the LLM.
Competition outcomes between words are stored in a dictionary to improve computational efficiency for repeated interactions.

ELA-ZSON: Efficient Layout-Aware Zero-Shot Object Navigation Agent with Hierarchical Planning

ELA-ZSON (Efficient Layout-Aware Zero-Shot Object Navigation): introduces an efficient zero-shot object navigation approach with an LLM Agent (manages process) that leverages a hierarchical Scene Representation (hierarchical environment map) for Hierarchical Planning (two-level path generation) and Robotic Navigation (executes planned path).
The Scene Representation includes a global Topometric Map (global topological graph) and a local Learned Scene Representation (local dense memory), supporting both Global Topology Plan (coarse route planning) and Local Ego-centric Plan (dense waypoint generation).
The LLM Agent manages the overall workflow, integrating Perception (RGB-D input, pose) and Control Flow (manages actions, status, errors) for autonomous navigation in complex indoor environments without costly training.

AGENTXPLOIT: End-to-End Redteaming of Black-Box AI Agents

AGENTXPLOIT: introduces a generic black-box fuzzing framework for indirect prompt injection attacks, utilizing an Initial Corpus (High-quality templates), Seed Storage (Pool of seeds), Seed Selector (MCTS-based algorithm), a Mutator (Generates new variants), and a Scorer (Evaluates seeds) to iteratively refine adversarial prompts.
The framework systematically explores adversarial prompts by selecting promising seeds, mutating them, and scoring their effectiveness based on attack success rate and task coverage.
This adaptive and iterative process enables the framework to effectively discover and exploit indirect prompt injection vulnerabilities in black-box LLM agents across diverse architectures and tasks.

8th May 2025

Scalable Chain of Thoughts via Elastic Reasoning

Elastic Reasoning: introduces a framework for scalable chain of thoughts, explicitly separating reasoning into thinking and solution phases with independently allocated budgets.
The framework employs a budget-constrained rollout strategy during training to teach the model adaptive reasoning under truncated conditions.
At inference, separate budgeting prioritizes the completeness of the solution segment, improving reliability under strict resource constraints.

MULTI-AGENT EMBODIED AI: ADVANCES AND FUTURE DIRECTIONS

Multi-Agent Embodied AI Survey: introduces a comprehensive review of recent advances and future directions in embodied AI systems with multiple agents, covering control, learning, and generative model-based methods.
The survey analyzes key contributions and identifies challenges in multi-agent embodied AI, including asynchronous decision-making, agent heterogeneity, and open environments.
It reviews benchmarks and discusses future research directions to guide innovation in this rapidly evolving field.

Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models

LMRM (Large Multimodal Reasoning Model): introduces a structured roadmap for multimodal reasoning research, encompassing four stages: Stage 1 Perception-Driven Modular Reasoning, Stage 2 Language-Centric Short Reasoning, Stage 3 Language-Centric Long Reasoning, and Stage 4 Native LMRMs.
The survey analyzes the progression from early modular, perception-driven systems to unified, language-centric frameworks and projects towards native models with omnimodal perception and agentic behavior.
It provides a comprehensive review of over 540 publications, categorizes models and benchmarks, and discusses challenges and future prospects for next-generation multimodal reasoning systems.

Not Like Us, Hunty: Measuring Perceptions and Behavioral Effects of Minoritized Anthropomorphic Cues in LLMs

Simulated LLM Agents: introduces a study evaluating user reliance and perception of LLM agents using minoritized sociolects (AAE, Queer slang) compared to Standard American English, utilizing templated suggestions constructed with warmth phrases and confidence expressions, generated via in-context learning and persona-based prompting with GPT-4.
The study found that AAE speakers preferred and relied more on the SAE agent, while Queer slang speakers showed no significant preference but felt greater social presence with the Queer slang agent.
Findings highlight the nuanced dynamics of sociolect use in machine interactions, emphasizing the need for careful design to respect cultural boundaries and avoid appropriation.

CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

CityNavAgent: introduces a large language model-empowered agent for aerial vision-and-language navigation, featuring an open-vocabulary perception module, a hierarchical semantic planning module, and a global memory module.
The agent extracts urban scene semantics, decomposes long-horizon tasks into hierarchical sub-goals, and stores historical trajectories in a topological graph to reduce navigation complexity.
This approach enables zero-shot navigation in continuous urban environments, addressing challenges of complex scene understanding and exponential planning complexity.

HiBayES: A Hierarchical Bayesian Modeling Framework For AI Evaluation Statistics

HiBayES (A Hierarchical Bayesian Modeling Framework for AI Evaluation Statistics): introduces a generalizable framework with Hierarchical Bayesian GLMs, Bayesian Data Analysis, MCMC Sampling, Uncertainty Quantification, Formal Model Comparison, and Quality Control components, designed for principled uncertainty quantification and robust parameter estimation in AI evaluations.
The framework addresses challenges in AI evaluation statistics, including stochastic outputs, complex hierarchical data structures, and high testing costs, particularly in low-data scenarios.
HiBayES enables robust inferences, explicit modeling of hierarchical data, and formal model comparison, offering advantages over conventional statistical methods like t-tests and flat models.

clem: todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations

clem: todd (chat-optimized LLMs for task-oriented dialogue systems development): introduces a framework for systematically evaluating LLM-based task-oriented dialogue systems, featuring a Game Master (coordinates interaction), User Simulator (simulates user), and Dialogue System (acts as agent), which can be implemented as Monolithic Dialogue System (single LLM agent), Modular-Prog Dialogue System (programmed flow agent), or Modular-LLM Dialogue System (LLM-controlled agent), utilizing components such as Dialogue Manager (manages dialogue flow), Intent Detection (identifies user intent), Slot Extraction (extracts entities), Response Generation (generates response), Database Retriever (queries database), and Booking Confirmer (confirms bookings).
The framework facilitates turn-based interactions between the user simulator and dialogue system, coordinated by the game master, and supports plug-and-play integration of different models and architectures.
Evaluation within the framework involves consistent datasets, metrics, and computational constraints, enabling detailed benchmarking and analysis of performance and efficiency trade-offs.

EcoAgent: An Efficient Edge-Cloud Collaborative Multi-Agent Framework for Mobile Automation

EcoAgent: introduces an edge-cloud collaborative multi-agent framework for mobile automation, featuring a Cloud-Based Planning Agent (Task decomposition, planning), Edge-Based Execution Agent (Action execution), Edge-Based Observation Agent (Monitor screen, verify outcomes), Memory Module (Stores screen history), Reflection Module (Supports replanning), and Pre-Understanding Module (Compresses screen images).
The framework coordinates cloud and edge agents in a closed loop, leveraging cloud-based MLLMs for planning and edge-based MSLMs for execution and observation.
The Pre-Understanding Module reduces communication overhead by compressing screen images, while Memory and Reflection modules enable replanning upon execution failure.

HEXGEN-TEXT2SQL: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL Workflows

HEXGEN-TEXT2SQL: introduces a novel framework for scheduling agentic Text-to-SQL workflows on heterogeneous GPU clusters, featuring a Global Coordinator (Dispatcher) that assigns requests, LLM Model Instances (with Local Priority Queue) that process and prioritize tasks, and a Simulator (Alpha-Tuning) that tunes the dispatcher parameter.
The framework employs a two-level hierarchical scheduling approach combining global workload-balanced dispatching and local adaptive urgency-guided prioritization to manage multi-stage dependencies and resource heterogeneity.
This design significantly improves SLO attainment and system throughput for LLM-based Text-to-SQL serving compared to baseline methods.

MARK: Memory Augmented Refinement of Knowledge

MARK (Memory-Augmented Refinement of Knowledge): introduces a scalable agentic memory design framework, with Conversational LLM Agent, LLM, System Prompt, Domain Knowledge Source, Chat History, Memory Builder Service (MBS), Memory Search Service (MSS), Memory Store, Residual Refined Memory Agent, User Question Refined Memory Agent, LLM Response Refined Memory Agent, Memory Relevance Scoring (MRS), and Memory, enabling LLMs to continuously learn and refine domain knowledge.
The framework utilizes specialized memory agents (Residual, User Question, LLM Response) to extract refined memories from conversations, stored in a Memory Store.
Memory Search Service retrieves and ranks relevant memories using Memory Relevance Scoring for injection into the LLM context, improving accuracy and adaptability.

From First Draft to Final Insight: A Multi-Agent Approach for Feedback Generation

G-E-RG (Generation, Evaluation, and Regeneration): introduces a multi-agent framework for feedback generation, including External Database (slides), Question, Student response, Agent 1 (Generation), Feedback in the first round, Agent 2 (Evaluation), Evaluation results, Agent 3 (Re-Generation), and Feedback in the second round, which generates initial feedback, evaluates it, and then regenerates improved feedback.
The framework utilizes three distinct GPT-4o agents for the sequential tasks of initial generation, evaluation based on a rubric, and final regeneration informed by evaluation results.
The iterative G-E-RG process significantly improves feedback quality across multiple dimensions compared to single-round generation methods.

Reasoning Models Don't Always Say What They Think

CoT Faithfulness Evaluation and RL Training Framework: introduces an evaluation of large language models, including Claude 3.5 Sonnet (New), Claude 3.7 Sonnet, DeepSeek V3, and DeepSeek R1, assessing the faithfulness of their Chain-of-Thought reasoning when responding to Input Prompts and Prompt Pairs, and studies the impact of Outcome-Based Reinforcement Learning in synthetic RL Environments with Reward Hacks defined by a Reward Function.
The evaluation measures how often models verbalize hints used in their reasoning process, finding that CoTs often lack faithfulness, particularly on misaligned hints and harder tasks.
Outcome-based RL initially improves CoT faithfulness but plateaus, and models exploiting Reward Hacks in RL Environments rarely verbalize the hack in their CoTs.

7th May 2025

Large Language Models are Autonomous Cyber Defenders

LLM Adapter Framework: introduces a system to integrate LLMs into the CybORG CAGE 4 environment, including an LLM Adapter, Formatter, Backend, LLM Models, Custom Policies, Communication Protocol, Blue Agent (LLM), Blue Agent (RL), Red Agent, and Green Agent.
The framework enables LLM-driven agents to act as autonomous cyber defenders in a multi-agent simulation alongside RL and finite-state agents.
A novel communication protocol allows diverse blue agents to share threat information and coordinate defensive actions within the simulated network.

Safeguard-by-Development: A Privacy-Enhanced Development Paradigm for Multi-Agent Collaboration Systems

Maris: introduces a privacy-enhanced development paradigm for multi-agent collaboration systems, with MACS Data Protection Manifest (Specifies data protection policy), Data Safeguard Engine (Integrates policy into workflows), Conversation Handler (Hooks into message flows), Manifest Enforcer (Validates messages, applies actions), AutoGen (Multi-agent development framework), ConversableAgent (Generic agent class), GroupChatManager (Group conversation manager), Agents (Autonomous actors), LLMs (Large Language Models), Tools (External services/functions), and Users (Human participants), designed to address data leakage threats by enforcing rigorous message flow control.
The system embeds reference monitors into key conversation components to validate message flows against user-defined policies at runtime.
Evaluation across healthcare, supply chain, and recommendation use cases demonstrates satisfactory effectiveness and low performance overhead.

Benchmarking LLMs' Swarm intelligence

SwarmBench: introduces a novel benchmark for evaluating LLM swarm intelligence, featuring a launcher (Launches benchmark), a framework orchestrator (Orchestrates interactions), a simulation environment (Simulation environment), task definitions (Defines coordination tasks), a physics engine (Manages environment physics), LLM-powered agents (LLM-powered agents logic), and a data logger (Captures simulation data).
The benchmark assesses emergent decentralized coordination in LLM swarms under strict perception and communication constraints within a configurable 2D grid world.
SwarmBench includes five core multi-agent coordination tasks: Pursuit, Synchronization, Foraging, Flocking, and Transport, evaluated using a zero-shot protocol.

CompileAgent: Automated Real-World Repo-Level Compilation with Tool-Integrated LLM-based Agent System

CompileAgent: introduces an LLM-based agent framework for automated repo-level compilation, integrating a MasterAgent, Flow-based Agent Strategy, Shell Tool, File Navigator Tool, Instruction Extractor Tool, Website Search Tool, and Multi-Agent Discussion Tool to handle instruction search and error resolution.
The framework leverages five specialized tools and a flow-based strategy orchestrated by a MasterAgent to interact with software artifacts and the interactive environment.
CompileAgent significantly improves compilation success rates and reduces time/cost compared to baselines on a new benchmark, demonstrating the potential of agent-based approaches for complex software engineering tasks.

A Proposal for Evaluating the Operational Risk for ChatBots based on Large Language Models

LLM Risk Evaluation Framework: introduces a novel metric and framework for evaluating the operational risk of LLM-based chatbots, integrating an Improved Probes Set, Garak scanner, the Prospected Chatbot System, a Metric Calculator, Industry Factor, Age Profile of Users, Technical Complexity, and Hits to assess risks to the system, users, and third parties.
The framework leverages the open-source GARAK tool, enhancing its probes and incorporating contextual factors like industry and user demographics into the risk calculation.
Evaluation results using the framework demonstrate varying risk levels across different LLM models and the impact of prompt protection and contextual multipliers on risk assessment.

AutoPatch: Multi-Agent Framework for Patching Real-World CVE Vulnerabilities

AutoPatch: introduces a multi-agent framework with a security plugin, similarity analyzer, taint analysis, semantic analysis, unified similarity model, RAG database, vulnerability verifier, code patcher, and LLM-based code generation model, designed to patch vulnerable LLM-generated code by identifying and fixing real-world CVEs.
The framework leverages retrieval-augmented generation and specialized LLM agents to analyze code, find similar vulnerabilities in a database, verify their presence, and generate secure patches.
This approach aims to overcome the knowledge cutoff limitation of LLMs and provide a cost-efficient alternative to frequent fine-tuning for handling newly disclosed vulnerabilities.

Benchmarking LLMs' Swarm intelligence

SwarmBench: introduces a novel benchmark for evaluating LLM swarm intelligence, featuring a launcher (Launches benchmark), a framework orchestrator (Orchestrates interactions), a simulation environment (Simulation environment), task definitions (Defines coordination tasks), a physics engine (Manages environment physics), LLM-powered agents (LLM-powered agents logic), and a data logger (Captures simulation data).
The benchmark assesses emergent decentralized coordination in LLM swarms under strict perception and communication constraints within a configurable 2D grid world.
SwarmBench includes five core multi-agent coordination tasks: Pursuit, Synchronization, Foraging, Flocking, and Transport, evaluated using a zero-shot protocol.

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Absolute Zero Reasoner (AZR): introduces a system where a single Language Model (acts as proposer and solver) learns to propose tasks (Proposer) and solve them (Solver) through self-play, utilizing a Code Executor (validates tasks, verifies answers) as the Environment (provides feedback) and guided by a Reward Function (guides learning) and RL Algorithm (updates model).
The system operates under the Absolute Zero paradigm, learning entirely from self-generated tasks and environmental feedback without relying on external human-curated data.
AZR leverages three distinct task types (deduction, abduction, induction) and a task-relative reinforcement learning approach (TRR++) to achieve strong reasoning capabilities across coding and mathematical domains.

Facilitating Trustworthy Human-Agent Collaboration in LLM-based Multi-Agent System oriented Software Engineering

RACI-based framework: introduces a method for assigning responsibilities between Human Actors and LLM-based Agents using RACI roles to facilitate trustworthy human-agent collaboration in LLM-based multi-agent systems for software engineering.
The framework aims to enhance collaboration, ensure accountability, and mitigate risks associated with LLM-driven automation by systematically distributing decision-making authority and oversight.
The approach defines specific roles (Responsible, Accountable, Consulted, Informed) for humans and agents across tasks within the software development lifecycle.

Identification and Optimization of Redundant Code Using Large Language Models

LLM-agent: introduces a framework leveraging Large Language Models (Core engine) to analyze and optimize a Codebase (Input code), verified by Test Cases (Verification).
The framework incorporates Static Analysis Tools (Evaluation) for metric evaluation and Developer Feedback (Validation/Insights) for understanding redundancy causes.
The LLM Agent (Orchestrator) manages the process, aiming to build a Catalog (Knowledge base) of redundant code patterns and reasons.

6th May 2025

The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete

Narrative Primed LLM Agents: introduces a system where LLM agents play a public goods game, influenced by narrative priming from a story pool.
The study investigates how shared versus different narratives affect agent collaboration and competition outcomes in the game.
Experiments explore the influence of narrative type, group size, and the presence of selfish agents on collaboration scores and payoffs.

Frog Soup: Zero-Shot, In-Context, and Sample-Efficient Frogger Agents

LLM Demonstrations Guided DQN: introduces enhancing a traditional DQN agent with LLM-generated gameplay demonstrations, utilizing Objects Coordinates Extraction, LLM Agents, LLM Demo, and LLM Loop to collect expert trajectories, which are then integrated into the DQN Components including Self-Play Experience, Evaluation NNet, Target NNet, Priority Experience Replay, Priority Sampling, DQN loss calculation, and DQN Loop interacting with the Atari-Frogger Env.
The approach leverages Prioritized Experience Replay to prioritize sampling of the LLM-generated expert demonstrations, aiming to improve the sample efficiency and initial performance of the DQN agent on the challenging Frogger game.
Experiments show that incorporating LLM demonstrations leads to significantly higher episodic rewards and faster convergence compared to a standard DQN baseline within a limited training budget.

Performance Evaluation of Large Language Models for High-Performance Code Generation: A Multi-Agent Approach (MARCO)

MARCO (Multi-Agent Reactive Code Optimizer): introduces a multi-agent system with Code Optimizer Agent, Web-Search Engine, Performance Evaluator Agent, and Adaptive Feedback Loop for optimizing high-performance computing code.
The Code Optimizer Agent generates and refines code using strategies informed by the Web-Search Engine and feedback from the Performance Evaluator Agent.
The Adaptive Feedback Loop iteratively improves code quality by feeding performance metrics from the evaluator back to the optimizer.

Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

FGO (Fine-Grained Optimization): introduces, "Divide (Splits dataset) / Optimize (Optimizes subsets) / LLM Optimizer (Updates modules) / Agent (Executes tasks) / Module (Part optimized) / Evaluate (Assesses performance) / Merge (Combines modules) / Recursive Clustering (Groups modules) / Direct Merge (Combines groups) / Optimal Agent System (Final agent)", a scalable framework for LLM agent optimization.
FGO divides large optimization tasks into manageable subsets, performs fine-grained optimization on each subset, and progressively merges the optimized components.
The framework demonstrates improved performance and efficiency for LLM-based agent optimization on large datasets compared to traditional methods.

SLOT: Structuring the Output of Large Language Models

SLOT (Structured LLM Output Transformer): introduces a model-agnostic post-processing approach using a fine-tuned lightweight language model to transform unstructured LLM output into structured formats, incorporating a Data Synthesizer LLM and Validation for data creation, and utilizing Loss Calculation and Weight Update for training.
The framework takes unstructured text from an upstream LLM and a JSON Schema as input to the SLOT model, producing structured output, and is evaluated using metrics like Schema Accuracy and Content Similarity.
SLOT can be combined with Constrained Decoding methods to further enhance structural validity and performance, demonstrating that targeted training can enable smaller models to achieve high-quality structured generation.

Divide, Optimize, Merge: Fine-Grained LLM Agent Optimization at Scale

FGO (Fine-Grained Optimization): introduces, "Divide (Splits dataset) / Optimize (Optimizes subsets) / LLM Optimizer (Updates modules) / Agent (Executes tasks) / Module (Part optimized) / Evaluate (Assesses performance) / Merge (Combines modules) / Recursive Clustering (Groups modules) / Direct Merge (Combines groups) / Optimal Agent System (Final agent)", a scalable framework for LLM agent optimization.
FGO divides large optimization tasks into manageable subsets, performs fine-grained optimization on each subset, and progressively merges the optimized components.
The framework demonstrates improved performance and efficiency for LLM-based agent optimization on large datasets compared to traditional methods.

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch

WebGen-Bench (Evaluation Pipeline): introduces a benchmark and pipeline to evaluate LLM-based agents on generating websites from scratch, including Data Curation, Website Generation, Test Case Construction, UI Agent, UI Agent Engine, Appearance Grading, and Manual Validation components.
The pipeline uses LLMs and human annotators for data creation, LLM-based agents for website generation, and a UI agent powered by an LLM for automated functional testing.
Website appearance is graded by a separate LLM, and human testers perform manual validation of test cases.

LlamaFirewall: An open source guardrail system for building secure AI agents

LlamaFirewall: introduces an open-source, system-level security framework for LLM-powered applications, including a Unified Policy Engine (Orchestration), PromptGuard 2 (Jailbreak detection), AlignmentCheck (Agent alignment), and CodeShield (Code analysis).
The framework provides layered defense against prompt injection, agent misalignment, and insecure code generation risks.
LlamaFirewall offers a modular design supporting custom pipelines, conditional remediation strategies, and pluggable detectors for real-time security monitoring.

A Comprehensive Survey of Large AI Models for Future Communications: Foundations, Applications and Challenges

LAMs (Large AI Models): introduces a comprehensive survey of Large AI Models for future communications, covering their foundations including Transformer, Diffusion, and Mamba architectures, classification into LLM, LVM, LMM, and World models, training methods like Pre-training, Fine-tuning, and Alignment, and optimization techniques such as CoT, RAG, and Agentic systems.
The paper details the application of LAMs across various communication scenarios, including physical layer design, resource allocation, network management, edge intelligence, semantic communication, agentic systems, and emerging applications.
It analyzes the research challenges faced by LAMs in communication, such as data quality, structured knowledge integration, generative hallucination, reasoning limitations, explainability, adaptability, task diversity, resource constraints, inference latency, and security/privacy.

A HASHGRAPH-INSPIRED CONSENSUS MECHANISM FOR RELIABLE MULTI-MODEL REASONING

Hashgraph-inspired Consensus Mechanism: introduces a system for reliable multi-model reasoning using a Query Handler (accepts user request), Model Interface Layer (manages model connections), Consensus Controller (implements gossip and checks convergence), Prompt Generator (formulates model prompts), Comparer/Evaluator (compares model outputs), Result Aggregator (formats final output), and a Reasoning Model Pool (set of black-box models).
The system treats each reasoning model as a node in a distributed network, using gossip-about-gossip and virtual voting principles to achieve consensus on a final answer.
This iterative process allows models to exchange and refine answers, aiming to reduce hallucinations and improve accuracy by leveraging collective intelligence.

LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs

LogisticsVLN: introduces a UAV-based vision-language navigation system for terminal delivery, integrating an LLM (interprets request, extracts attributes), Floor Count VLM (estimates floors, guides vertical movement), Object Recognition VLM (identifies target window/object), Choice VLM (determines next action), Depth Assistant (ensures safety, calculates distances), and RGB-Depth Observation (input data).
The system processes user requests and environmental observations to guide a drone to a specific window for package delivery.
It operates without prior maps or fine-tuning, relying on foundation models for perception, understanding, and decision-making in unseen residential environments.

Procedural Memory Is Not All You Need: Bridging Cognitive Gaps in LLM-Based Agents

Modular Semantic-Associative System: introduces a modular architecture augmenting LLMs with semantic and associative memory components to bridge cognitive gaps.
This system decouples procedural execution (LLM actor) from adaptive reasoning (semantic/associative modules) for robust decision-making.
The architecture is designed for agents operating in complex, unpredictable "wicked" environments by specializing cognitive functions.

DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning

DYSTIL: introduces a strategy-based reinforcement learning framework with DYSTIL RL Agent L, Memory Mc, Input Constructor, Core Reasoning LLM, Actor Module, Critic Module, Strategy-Generating LLM Q, Observation-to-Text Converter Co→t, Experience Buffer B, and PPO Parameter Optimization, which dynamically induces textual strategies using large language models to improve reinforcement learning from expert demonstrations.
The framework integrates a strategy-generating LLM for strategy induction with a lightweight core reasoning LLM for policy optimization.
DYSTIL iteratively updates strategies based on experience and advantage estimations, enhancing sample efficiency and model interpretability.

VLM Q-LEARNING: ALIGNING VISION-LANGUAGE MODELS FOR INTERACTIVE DECISION-MAKING

LVLMQ (VLM Q-Learning): introduces, "aligning vision-language models for interactive decision-making", with VLM (core RL policy), Image Encoder (processes image input), Text Encoder (processes text input), LoRA Transformer (adapted VLM body), Language Head (Actor) (predicts output tokens), Critic Head (estimates action values), Environment (interactive system), Observation Prompt (formats VLM input), parseagent (parses VLM response), and parseenv (interprets action for environment), where the method applies off-policy reinforcement learning to fine-tune VLMs for agent tasks by adding a critic head and using an advantage-filtered supervised fine-tuning loss.
The approach converts turn-based agent interactions into token-based RL transitions, allowing the VLM's language head to act as the policy and the critic head to filter suboptimal actions based on learned value estimates.
This technique enables VLMs to self-improve and learn from low-quality datasets, effectively replacing standard supervised fine-tuning for VLM agent training while handling action syntax challenges.

An LLM-based Self-Evolving Security Framework for 6G Space-Air-Ground Integrated Networks

LLM-based Self-Evolving Security Framework: introduces a security framework for 6G SAGINs with LLM-6GNG (Processes threat data, generates strategies), 6G-INST (Enables framework self-evolution), and 6G Simulator (Simulates 6G SAGINs environment).
The LLM-6GNG component processes threat information and generates security strategies using multi-agent LLMs and chain-of-thought reasoning.
The 6G-INST component enables the framework to self-evolve by automatically updating the LLM-6GNG with new training data generated from encountered threats.

Assessing and Enhancing the Robustness of LLM-based Multi-Agent Systems Through Chaos Engineering

Chaos Engineering Framework: introduces a framework for assessing and enhancing the robustness of LLM-based Multi-Agent Systems (LLM-MAS) by systematically applying chaos engineering principles.
The framework includes components like a Chaos Module for fault injection and Monitoring Components/Modules for data collection and analysis.
The research proposes validating the framework through controlled experiments simulating various failure scenarios in LLM-MAS deployments.

5th May 2025

Improving Model Alignment Through Collective Intelligence of Open-Source LLMS

MoAA: introduces a two-stage alignment recipe leveraging the collective intelligence of multiple open-source LLMs, including Mixture of Agents (MoA), Proposers, Aggregators, Synthetic Data Generator, Reward Model, Criteria Filtering, Target Model, SFT Model, and DPO Model, to generate high-quality synthetic data for supervised fine-tuning and preference optimization.
The approach utilizes MoA as a synthetic data generator in the first stage (MoAA-SFT) to fine-tune a target model and as a reward model in the second stage (MoAA-DPO) to annotate preference data for direct preference optimization.
MoAA demonstrates significant improvements in model performance on alignment benchmarks by effectively integrating the strengths and diversity of open-source LLMs without relying on stronger external supervision.

34 Examples of LLM Applications in Materials Science and Chemistry: Towards Automation, Assistants, Agents, and Accelerated Scientific Discovery

The LLM-Powered Research Constellation: introduces 34 LLM applications across materials science and chemistry, categorized into Property Prediction (Forecasting properties), Molecular & Material Design (Generating novel molecules/materials), Automation & Novel Interfaces (Developing interfaces/automations), Scientific Communication and Education (Enhancing communication/education), Research Data Management and Automation (Streamlining data handling/processing), Hypothesis Generation & Evaluation (Generating/evaluating hypotheses), and Knowledge Extraction & Reasoning (Extracting knowledge/reasoning).
These applications, developed during a hackathon, demonstrate LLMs' versatility as predictive models and platforms for rapid prototyping of domain-specific tools.
The work highlights how integrating LLMs into scientific workflows can accelerate discovery and improve researcher efficiency across the entire research lifecycle.

The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models

Iterative Program Repair Pipeline: introduces an approach for automatic program repair using instruction-tuned large language models, balancing multi-output generation and iterative refinement within a limited patch budget, incorporating Input, LLM, Prompt, Output, Parsing, Validation, Execution, Feedback, and Iterative Process components.
The pipeline processes buggy code input, uses an LLM guided by a prompt to generate output patches, which are then parsed and subjected to validation via execution with tests.
Feedback from validation drives the iterative process to refine patches, aiming to maximize repair success while limiting the total number of generated patches.

Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Scenethesis: introduces a training-free agentic framework for text to interactive 3D scene generation, with LLM Module (Coarse scene planning), Vision Module (Layout visual refinement), Optimization Module (Physics-aware optimization), and Judge Module (Spatial coherence judgment).
The framework leverages language and visual priors to generate realistic and physically plausible indoor and outdoor environments.
It integrates LLM-based scene planning with vision-guided layout refinement and physics-aware optimization to ensure spatial realism and physical plausibility.

AutoLibra: Agent Metric Induction from Open-Ended Feedback

AutoLibra: introduces a framework for agent evaluation that transforms Human Feedback (Open-ended text) on Agent Trajectory (Agent actions/observations) using an LLM (Text processing model) to generate Aspects (Grounded behavior-feedback), induce AutoLibra Metrics (Induced evaluation criteria), evaluate agents producing Traits (LLM metric ratings), and meta-evaluate metrics using Meta-Metrics (Metric quality evaluation).
The framework operates in a closed loop, using meta-evaluation results (coverage and redundancy) to optimize the induced metrics.
AutoLibra-induced metrics serve as targets for agent improvement through prompt engineering or fine-tuning.

Generating HomeAssistant Automations Using an LLM-based Chatbot

EcoMate: introduces an LLM-based chatbot system for generating HomeAssistant routines, utilizing an LLM to process User Commands, Home Template, and Energy Consumption data for execution by the HomeAssistant framework.
The system evaluates different LLMs' ability to generate valid JSON routines for HomeAssistant and assesses user perception compared to rule-based chatbots.
Findings indicate GPT models excel in routine generation, while user studies show positive engagement and usability for the LLM-based approach in promoting sustainable practices.

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Voila: introduces a family of large voice-language foundation models, with Audio Tokenizer (Audio tokenization/decoding), Text Tokenizer (Text tokenization), Voice-language LLM backbone (Processes interleaved tokens), and Audio Transformer (Generates audio tokens), designed for real-time autonomous interaction and voice role-play.
The model employs a hierarchical multi-scale Transformer architecture integrating LLM reasoning with acoustic modeling for natural, persona-aware voice generation.
Voila supports end-to-end voice conversation and autonomous full-duplex interaction by processing interleaved audio and text tokens.

Exploring LLM-Powered Role and Action-Switching Pedagogical Agents for History Education in Virtual Reality

VR Prototype with LLM-Powered Pedagogical Agents: introduces a system for VR history education featuring a Virtual Environment, Pedagogical Agents (PAs) powered by an LLM (Large Language Model), and modules for Conversation, Adaptive Role-Switching, and Adaptive Action-Switching.
The LLM processes user input and context to drive the Conversation Module, while the Adaptive Role-Switching and Adaptive Action-Switching Modules dynamically adjust the PA's role, appearance, voice, tone, and actions based on the LLM's output and environmental factors.
A user study found that adaptive role-switching enhanced perceived trustworthiness and expertise, while adaptive action-switching increased perceived social presence and humanness, offering insights for designing multi-role agents in immersive learning.

A Survey of Slow Thinking-based Reasoning LLMs using Reinforced Learning and Inference-time Scaling Law

Test-Time Scaling, Reinforced Learning, and Slow Thinking: introduces a survey of reasoning LLMs, detailing methods like Test-Time Scaling (Adjusts computation complexity), Reinforced Learning (Optimizes policy via feedback), and Slow Thinking (Emulates deliberate reasoning).
These approaches incorporate components such as Search and Sampling (Explores reasoning paths), Dynamic Verification Mechanism (Verifies, refines outputs), Policy Network (Learns reasoning strategies), Reward Design (Evaluates reasoning quality), and Self-Evolution (Iteratively improves performance).
Slow Thinking frameworks further utilize Long CoT (Generates extended reasoning), Hierarchical Reasoning (Structures problem-solving modularly), and Hybrid Thinking (Combines fast, slow processes) to enhance reasoning capabilities.

Evaluating Contrastive Feedback for Effective User Simulations

LLM-based User Simulation: introduces, "evaluating different prompting strategies for LLM-based user agents in interactive information retrieval simulations", with LLM (Core agent), Information Need (Initial context), Knowledge State (Evolving understanding), Relevance Feedback (Document summaries), Prompting Strategy (Contextual input method), Query Generation (LLM creates queries), Relevance Judgment (LLM judges documents), Knowledge State Update (Incorporates feedback), and Simulation Environment (Provides search results), where "the paper analyzes how different modalities of contextual information influence the effectiveness of user simulations".
The study evaluates user configurations where the LLM agent's knowledge state is updated iteratively with summaries of previously judged relevant, irrelevant, or both types of documents.
The research demonstrates that providing contrastive feedback (summaries of both relevant and irrelevant documents) to the LLM agent improves simulated user search effectiveness.

Beyond the model: Key differentiators in large language models and multi-agent services

LLM Ecosystem Differentiators: introduces key differentiators beyond the core model, including Data Quality, Proprietary Datasets, Model Quantization, Model Pruning, Neural Attention Memory Models (NAMMs), Semantic Caching, Attention Offloading, Speculative Decoding, Low-Rank Adaptation (LoRA), Flash-LLM, Evaluation Frameworks, Monitoring Systems, Model-to-Data Movement, Synthetic Data Generation, Data Versioning, and Data Lineage.
The paper reviews critical factors like data management, computational efficiency, latency reduction, and robust evaluation frameworks that ensure modern AI services are efficient and profitable.
These ecosystem components and strategies are presented as the real competitive advantage in generative AI as large language models become increasingly commoditized.

El Agente: An Autonomous Agent for Quantum Chemistry

El Agente Q: introduces an LLM-based multi-agent system with a hierarchical architecture, integrating working and long-term memory, an LLM reasoning core, and specialized agents for automated quantum chemistry workflows.
The system features a hierarchical memory framework enabling flexible task decomposition, adaptive tool selection, post-analysis, and autonomous file handling.
El Agente Q demonstrates robust problem-solving, adaptive error handling, and supports multi-step task execution for complex workflows.

4th May 2025

A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP)

MCP (Model Context Protocol), ACP (Agent Communication Protocol), A2A (Agent-to-Agent Protocol), and ANP (Agent Network Protocol): introduces a survey examining four emerging agent communication protocols, including MCP (Initiator, Provider, Message semantics, Physical transmission, Calls expecting replies, Successful responses, Failures, Asynchronous updates, Model-controlled capabilities, Application-controlled data, User-controlled templates, Server-controlled generation delegation), ACP (Initiates communication, Protocol broker, Execution endpoint, Identity and capability profile, Agent location, Unit of delegated work, Communication envelope, Execution outputs), A2A (Originator of intent, Intermediary orchestrator, Service endpoint, Self-description and discovery, Actionable capabilities, Atomic unit of work, Communication channel, Tangible outputs, Real-time streaming, Out-of-band updates), and ANP (Decentralized identifier, Structured metadata profile, Agent indexing and search, JSON-RPC, OpenAPI, YAML schemas, Dynamic protocol alignment, Secure communication, Protocol negotiation layer, Core application logic layer, Transport protocol), each addressing distinct interoperability tiers for LLM-powered agents.
The protocols are compared across dimensions like interaction modes, discovery mechanisms, communication patterns, and security models to provide a foundation for designing secure, interoperable, and scalable agent ecosystems.
A phased adoption roadmap is proposed, starting with MCP for tool access, progressing through ACP for multimodal messaging and A2A for enterprise collaboration, and extending to ANP for decentralized agent marketplaces.

VECSR: Virtually Embodied Common Sense Reasoning System

VECSR: introduces a framework for common sense reasoning, with VECSR (Orchestrates process), s(CASP) Knowledge Base (Stores rules and state), s(CASP) Goal-Directed Solver (Generates action plans), and VirtualHome Simulation Environment (Provides embodied world), designed to break down high-level tasks into executable mid-level instructions.
The system converts VirtualHome state into s(CASP) facts, combines them with common sense rules, and optimizes the resulting program using techniques like modularity, dependency graphs, and partial grounding.
The s(CASP) solver then uses the optimized program to generate a sequence of actions that achieve the goal task in the simulated environment, providing explainable and executable plans.

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

ACT + Debugger (Multi-Agent Collaboration and Runtime Debugging): introduces a chained system combining multi-agent collaboration and runtime debugging for LLM code generation, including Analyst, Coder, Tester, and Debugger agents, processing Code Requirements, Visible Test Cases, Code Blocks, and using Blocking and Tracing Process, Execute Program to produce a Final Answer.
The system first uses Analyst, Coder, and Tester agents in a process-oriented phase, then transitions to a product-oriented debugging phase involving the Debugger and Coder agents if initial tests fail.
This integrated approach aims to leverage the strengths of collaborative planning and iterative debugging to improve functional accuracy, code rigor, and latency trade-offs.

DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving

DriveAgent: introduces a novel multi-agent framework for autonomous driving that leverages LLM and VLM reasoning combined with multimodal sensor fusion, structured into Descriptive Analysis, Vehicle Reasoning, Environmental Reasoning, and Response Generation modules.
The framework integrates camera, LiDAR, GPS, and IMU data through a hierarchy of specialized agents within these modules to enhance situational understanding and decision-making.
DriveAgent aims to provide clear, reliable, and interpretable insights into complex driving scenarios, improving robustness and reliability compared to baseline methods.

MemEngine: A Unified and Modular Library for Developing Advanced Memory of LLM-based Agents

MemEngine: introduces a unified and modular library for developing advanced memory models for LLM-based agents, with Memory Models, Memory Operations, Memory Functions, Memory Configurations, Memory Utilities, and LLM components.
The library provides a hierarchical framework comprising memory functions, operations, and models, supported by configuration and utility modules.
MemEngine facilitates convenient development and pluggable usage of various pre-implemented and customizable memory models for LLM agents.

Leveraging LLM Agents and Digital Twins for Fault Handling in Process Plants

Methodological Framework: integrates LLM agents with a Digital Process Plant Twin for autonomous fault handling, utilizing Monitoring Agent, Action Agent, Simulation, Validation Agent, and Reprompting Agent components.
The framework operates in a closed loop, observing the Process Plant state, generating and validating corrective actions via the Digital Process Plant Twin simulation.
Plant-specific knowledge from the Digital Twin informs the LLM agents' reasoning for deriving effective and safe corrective control actions.

3rd May 2025

CAMOUFLAGE: Exploiting Misinformation Detection Systems Through LLM-driven Adversarial Claim Transformation

CAMOUFLAGE (Claim Alteration for Misleading Output Using Feedback from Language Agent GuideancE): introduces an iterative LLM-driven adversarial attack framework with an Attacker Agent, Prompt Optimization Agent, Claim Evaluation, Misinformation Detection System, and History components.
The Attacker Agent generates perturbed claims guided by the Prompt Optimization Agent, which refines instructions based on feedback from the Claim Evaluation and the target Misinformation Detection System.
The framework optimizes attacks using only binary feedback from the target system and evaluation metrics, storing past attempts in History to guide future rewrites.

Model Context Protocol-based Internet of Experts For Wireless Environment-aware LLM Agents

MCP-based Internet of Experts (IoX): introduces a framework equipping LLM Agents with wireless environment awareness by coordinating interactions with Expert Models hosted on Expert Servers via the Model Context Protocol, using input from the Wireless Environment.
The framework enables the LLM Agent to selectively query and interpret outputs from lightweight, task-specific Expert Models at inference time without modifying its parameters.
This architecture supports modular, extensible, and interpretable reasoning over wireless contexts, significantly improving classification accuracy compared to standalone LLMs.

A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

LLM Inference Engine: introduces a comprehensive survey of 25 open-source and commercial engines, detailing their support for Batch Optimization (groups requests), Parallelism (distributes computation), Compression (reduces model size), Fine-Tuning (adapts model), Caching (reuses computations), Attention Optimization (improves attention), Sampling Optimization (speeds token generation), and Structured Outputs (constrains output format).
The paper examines each inference engine's ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation across diverse hardware.
It provides practical guidance for selecting and designing optimized LLM inference engines by analyzing their design goals, supported optimization techniques, and ecosystem maturity.

The STROT Framework: Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation

STROT (Structured Task Reasoning and Output Transformation): introduces a framework for structured data interpretation with LLMs, featuring Schema-Aware Context Construction (Analyze data schema), Prompt Scaffolding and Task Planning (Generate analysis plan), Transformation Logic Synthesis (Generate executable code), Program Execution (Run generated code), Feedback-Driven Refinement (Revise code based on errors), and Final Output (Deliver structured result).
The framework embeds the LLM within a multi-phase, feedback-driven pipeline that treats data understanding as a dynamic and structured process, enabling iterative reasoning and self-correction.
This agentic approach improves reliability, interpretability, and semantic alignment for structured data analysis tasks compared to single-shot methods.

2nd May 2025

PIPA: A Unified Evaluation Protocol for Diagnosing Interactive Planning Agents

PIPA (Unified evaluation Protocol for Interactive Planning Agents): introduces a unified evaluation protocol for interactive planning agents, conceptualizing their behavior within a POMDP paradigm, including Agent (Interactive planning agent), User (Interacts with agent), Interactive Session (Multi-turn dialogue), Intermediate Steps (Agent's internal reasoning), State Consistency Metric (S) (Aligns user requests with steps), Tool Efficiency Metric (A) (Measures tool utilization), Observation Alignment Metric (O) (Aligns observations with user needs), Policy Alignment Metric (P) (Follows predefined policies), and Task Completion Metric (R) (Measures goal achievement).
The protocol provides a comprehensive assessment through atomic evaluation criteria to diagnose strengths and weaknesses in the agent's decision-making pipeline.
PIPA enables multi-axis diagnosis and cross-benchmark comparisons, showing that user satisfaction is shaped by both outcomes and intermediate behaviors.

AI agents may be worth the hype but not the resources (yet): An initial exploration of machine translation quality and costs in three language pairs in the legal and news domains

This paper evaluates five machine translation paradigms: Google Translate (GT), GPT-4o, o1-preview, sequential multi-agent system (s-agent), and iterative multi-agent system (i-agent), comparing their quality and cost-efficiency.
The study benchmarks these systems using automatic metrics, human evaluation of adequacy and fluency, and token-based cost analysis across three language pairs and two domains.
Findings indicate that reasoning-enhanced LLMs and multi-agent workflows show potential for higher quality in human evaluation but incur significantly greater computational costs compared to traditional NMT and general LLMs.

WirelessAgent: Large Language Model Agents for Intelligent Wireless Networks

WirelessAgent: introduces a framework leveraging LLMs to create autonomous AI agents for wireless networks, integrating LLMs (Cognitive engine), Perception (Processes inputs), Memory (Stores data, context), Planning (Organizes tasks, reasons), Action (Executes commands), LangGraph (Graph-based workflow architecture), Global State (Shared workflow memory), External Tools (Specialized capabilities), Knowledge Base (Domain information repository), and System Prompts (Guide agent behavior).
The framework is built on agentic workflows implemented using the LangGraph architecture to manage complex wireless tasks.
WirelessAgent demonstrates near-optimal network throughput and higher bandwidth utilization compared to prompt-based methods in network slicing tasks.

VTS-LLM: Domain-Adaptive LLM Agent for Enhancing Awareness in Vessel Traffic Services through Natural Language

VTS-LLM: introduces a domain-adaptive LLM agent for Vessel Traffic Services, with NER-based relational reasoning (clarifies query-database relations), agent-based domain knowledge injection (integrates maritime knowledge), semantic algebra intermediate representation (bridges natural language to SQL), query rethink (validates and corrects SQL), and LLM (core language model) components.
The framework formalizes risk-prone vessel identification as a knowledge-augmented Text-to-SQL task, leveraging structured vessel databases and external maritime knowledge, supported by a curated benchmark dataset.
VTS-LLM demonstrates superior performance and robustness across command-style, operational-style, and formal natural language queries compared to general-purpose and SQL-focused baselines.

Multi-agents based User Values Mining for Recommendation

ZOOM (Zero-shot Multi-LLMs Collaborative Framework for User Values Mining): introduces a framework for extracting user values from historical interactions using User History, Text Summarization, Evaluators, Decoding Strategies, Supervisors, and Debate to produce User Values.
The framework employs multi-agent collaboration between evaluators generating diverse value candidates and supervisors refining them through debate to mitigate LLM limitations.
Text summarization addresses input length constraints, while the multi-agent debate enhances accuracy and reduces hallucinations in value extraction.

Seeking to Collide: Online Safety-Critical Scenario Generation for Autonomous Driving with Retrieval Augmented Large Language Models

The LLM-driven framework: introduces an online safety-critical scenario generation method featuring an LLM Behavior Analyzer (Infers dangerous intent), Feasible Trajectory Generation (Synthesizes adversarial trajectories), and Dynamic Memorization and Retrieval (Adapts online).
This framework utilizes a Memory bank (Stores intent-planner pairs) and offline processes including a Code Generator (Generates planner code), Simulation (Evaluates trajectories), and Code Modifier (Refines planner code) to support online adaptation and generation.
By analyzing historical states, inferring intent, generating trajectories, and dynamically updating a behavior library, the method effectively generates high-risk scenarios for autonomous vehicle testing.

SSRLBot: Designing and Developing an LLM-based Agent using Socially Shared Regulated Learning

SSRLBot: introduces an LLM-based agent for teamwork evaluation, integrating an LLM Backbone, Instructions/Preamble, SSRL Knowledge, Capabilities, Prompting Strategies, Iterative Refinement, Input, Output, and Output Evaluation to analyze diagnostic conversations.
Grounded in Socially Shared Regulation of Learning (SSRL) theory, the agent evaluates team members' interpersonal influence and SSRL skills.
The system provides contextualized feedback, comparative skill analysis, and improvement suggestions for collaborative learning and decision-making.

Structured dataset of reported cloud seeding activities in the United States (2000-2025) using a large language model

Data Extraction Pipeline: introduces, "a pipeline for creating a structured dataset from historical cloud seeding reports", with PDF Reports (Source documents), Preprocessing (Organizes, merges files), Text Extraction (Converts PDF to text), Prompt Engineering (Designs LLM input), LLM (OpenAI o4-mini) (Extracts structured data), Response Parsing (Processes LLM output), and Structured Dataset (CSV) (Final tabular data), where "the pipeline processes inconsistent PDF reports using LLM-based extraction to generate a structured CSV dataset".
The pipeline utilizes multi-stage PDF-to-text conversion and chain-of-thought prompting with OpenAI's o4-mini model to achieve high extraction accuracy.
This framework provides a scalable method for unlocking structured environmental data from historical scanned documents across various scientific domains.

1st May 2025

Thoughts without Thinking: Reconsidering the Explanatory Value of Chain-of-Thought Reasoning in LLMs through Agentic Pipelines

Agentic Pipeline Framework: introduces an agentic pipeline framework for a perceptive task guidance system, comprising Perceptors (Perceive data (visual/language)), Planners (Decompose tasks/sequence agents), Action agents (Process data/generate response/verify), Tools (External utilities), and Memory/Context (Stores documents/past logs).
The framework processes user input through a sequence of specialized agents, including Lead planner (Creates agent pipeline plan), Query planner (Assesses query/routes flow), Answer planner (Decides answerability/invokes generator), and various Action agents like RAG (Retrieves/summarizes documents) and Safety Agent (Filters inappropriate responses).
The paper evaluates Chain-of-Thought reasoning within this pipeline, finding that it does not improve output quality or provide effective explainability for end users in the context of task guidance.

From Texts to Shields: Convergence of Large Language Models and Cybersecurity

LLM and Agent Applications in Cybersecurity: reports on the convergence of large language models and cybersecurity, exploring emerging applications and challenges of integrating LLM Agent (dynamic reasoning engine), Meta Agent (agent of agents), RAG (retrieval-augmented generation), and Human-in-the-loop (human oversight) approaches.
The report examines LLM applications in network security, generative security engineering, and socio-technical aspects, including interpretability, safety, and security challenges.
It outlines a forward-looking research agenda for the secure and effective adoption of LLMs in cybersecurity, integrating technical advances with organizational and societal considerations.

HMCF: A Human-in-the-loop Multi-Robot Collaboration Framework Based on Large Language Models

HMCF (Human-in-the-loop Multi-Robot Collaboration Framework): introduces a framework for multi-robot collaboration with Assistant LLM agent, Robot LLM agents, Human-in-the-loop mechanism, Heterogeneous Robots, Human-Robot Interaction Interface, and RAG (Retrieval Augmented Generation), enabling efficient and scalable task allocation and execution.
The framework integrates LLM-based reasoning for task allocation and verification with human oversight to enhance adaptability, safety, and robustness in diverse environments.
HMCF utilizes a web-based interface for natural language interaction, allowing users to configure robots, monitor operations, and intervene when necessary.

Reasoning Capabilities and Invariability of Large Language Models

Large Language Models: introduces an evaluation of LLMs' reasoning capabilities using various prompting techniques and a new benchmark dataset focused on shallow logical reasoning with geometric figures.
The evaluation assesses 24 different LLMs using zero-shot, few-shot, and chain-of-thought prompting on a dataset designed to test logical constructors and invariability to language variations.
Results indicate that while larger LLMs perform better in zero-shot settings, overall performance on shallow reasoning remains limited, and model behavior is largely invariant to small language variations.

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions

Memory Framework: introduces, "a structured and dynamic perspective on memory in AI systems", with Parametric Memory (Implicit model knowledge), Contextual Structured Memory (Explicit organized memory), Contextual Unstructured Memory (Explicit general memory), Consolidation (Integrate short-term into persistent), Updating (Modify memory), Indexing (Organize for retrieval), Forgetting (Remove irrelevant content), Retrieval (Access relevant information), and Compression (Reduce memory size), clarifying functional interplay in LLM-based agents.
The framework categorizes memory by representation type and defines fundamental operations for memory management and utilization.
This survey maps these components and operations to relevant research topics and outlines future directions for memory in AI.

USERCENTRIX: AN AGENTIC MEMORY-AUGMENTED AI FRAMEWORK FOR SMART SPACES

UserCentrix: introduces an agentic memory-augmented AI framework for smart spaces, with User Task Processing (User-side layer), Personal Agent (LLM-powered assistant), Personal Memory (Stores user history/preferences), Knowledge Retrieval Cycle (Memory recall/similarity assessment), Smart Building Side (Building-side layer), Decision-making Module (High-level agents), Classifier Agent (Determines task urgency), High-urgency Agent (Prioritizes speed), Low-urgency Agent (Prioritizes precision/generates solutions), Evaluator Agent (Assesses/selects solutions), Pareto Analyzer (Optimizes decision-making), Memory (Decision-making Module) (Stores solutions/tasks), Sub-tasks Execution Module (Low-level agents), Low-level Agents (Execute sub-tasks/generate commands), Management and Analysis Module (Manages/dispatches commands), Message Queue (Stores commands), Environment Agent (Tracks tasks/adjusts environment), and Smart Building Dataset (Data source), designed to enhance smart spaces through dynamic, context-aware decision-making.
The framework integrates personalized LLM agents leveraging user preferences and memory management with a hybrid hierarchical control system balancing centralized and distributed processing.
UserCentrix achieves resource-efficient AI interactions by embedding memory-augmented reasoning, cooperative agent negotiation, and adaptive orchestration strategies.

A Survey on Large Language Model based Human-Agent Systems

LLM-HAS (LLM-based Human-Agent Systems): introduces a structured survey of these systems, detailing core components including Environment & Profiling (Context, roles, goals, capabilities), Human Feedback (Types, granularity, timing), Interaction Types (Collaboration, competition, coopetition), Orchestration Paradigm (Task strategy, temporal synchronization), and Communication (Structure, mode).
The survey clarifies fundamental concepts and systematically presents these core components shaping human-agent systems.
It explores emerging applications, discusses unique challenges, and offers a structured overview to foster research in this interdisciplinary field.

Large Language Models as AI Agents for Digital Atoms and Molecules: Catalyzing a New Era in Computational Biophysics

ADAM (Agent for Digital Atoms and Molecules): introduces a multi-agent framework for computational biophysics, featuring a Plan Agent, Route Agent, Hybrid Neural-Symbolic Architecture, Neural Tools, Symbolic Tools, ADAM Tool Protocol (ATP), ATP Server, Distributed Tool Executors, Central Database, and Memory.
The framework employs a hybrid neural-symbolic architecture combining LLM-driven semantic tools with deterministic symbolic computations for scientific workflows.
Its ADAM Tool Protocol enables asynchronous, database-centric tool orchestration and community-driven extensibility for third-party tool integration.

Self-Generated In-Context Examples Improve LLM Agents for Sequential Decision-Making Tasks

Traj-Bootstrap: introduces a method for LLM agents to improve performance on sequential decision-making tasks by constructing and refining a Trajectory Database of self-generated successful experiences, used by a ReAct-style Agent via a Retrieval Mechanism interacting with an Environment.
The approach includes Traj-Bootstrap for naive accumulation, +DB-Selection for population-based database selection, and +Exemplar-Selection for selecting high-utility individual trajectories.
These methods enable autonomous agent self-improvement without task-specific knowledge engineering, achieving performance comparable to methods using multiple test attempts or stronger LLMs.

30th April 2025

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Automated Failure Attribution: introduces methods (All-at-once Method, Step-by-step Method, Binary Search Method, LLM Judge, Failure Logs, Query, Failure-Responsible Agent, Decisive Error Step) for identifying the agent and step responsible for task failures in LLM multi-agent systems using failure logs.
The paper evaluates three LLM-based methods: All-at-once processes the full log, Step-by-step processes incrementally, and Binary Search processes log segments.
The LLM Judge analyzes the query and failure logs to predict the failure-responsible agent and decisive error step.

CoordField: Coordination Field for Agentic UAV Task Allocation In Low-altitude Urban Scenarios

CoordField: introduces a coordination field agentic system for UAV swarm task allocation, with a Semantic Understanding Module (Interprets natural language), LLM (Parses instructions), Planning Module (Transforms tasks), Planning Agent (Aggregates results), Coordination field (Guides motion, task selection), Perception Mapping (Constructs potential field), Task Decomposition (Converts potential field), Task Assignment (Enhances coordination efficiency), Execution Module (Translates outputs), Execution Agent (Manages control commands), UAV Deployment (Physical or virtual), and Prompt Tools API (Communicates with control), designed for heterogeneous UAV swarms in urban environments.
The system leverages LLMs for high-precision task understanding and employs a coordination field control strategy for task-oriented autonomous navigation and collective coordination.
CoordField utilizes dynamically updated potential fields and fluid-based velocity fields to enable decentralized and adaptive allocation of emergent tasks, demonstrating superior performance in task coverage, response time, and adaptability.

TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments

TRUST (TRauma Understanding and Structured Assessments): introduces, "an LLM-based dialogue system for trauma understanding and structured assessments," with Database (stores system memory), Framework (manages dialogue and assessment), Conversation (manages dialogue flow), Assessment (manages assessment logic), LLM (powers conversation and assessment), Dialogue Act Schema (guides conversation), and Patient Simulation (evaluates system), designed to conduct formal diagnostic interviews for PTSD.
The Database module contains Variable, History, and Score components to store variable metadata, conversation history, and assessment outcomes, respectively.
The Framework's Conversation and Assessment submodules utilize an LLM for tasks like predicting dialogue acts, generating responses, and performing assessments, while the Dialogue Act Schema provides structured guidance, and Patient Simulation uses an LLM and real-life transcripts for robust evaluation.

LLM-based Interactive Imitation Learning for Robotic Manipulation

LLM-iTeach: introduces a novel interactive imitation learning framework utilizing an LLM as a teacher for robotic manipulation, featuring an LLM Teacher, Agent, CodePolicy, Hierarchical Prompting, Similarity-checking mechanism, Evaluative feedback, Corrective feedback, Image, Robot State, Convolutional Layers, LSTM, Gauss Distribution, and Action.
The framework employs hierarchical prompting to generate a CodePolicy from the LLM, which then provides feedback based on a similarity check between the agent's action and the LLM's action.
The agent learns a stochastic policy parameterized by a Gaussian distribution, processing image and robot state inputs through convolutional layers and an LSTM to determine actions.

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Agent Orchestrator: introduces an LLM-driven agent-orchestration architecture for embodied robots, with Agent Orchestrator (coordinates specialized agents), Routing Agent (analyzes and directs user requests), Task Planning Agent (handles action commands), Knowledge Base Agent (processes history queries), Memory (stores past actions and environment records), and Perception (provides object detection and scene understanding) components, enabling autonomous household object management.
The system integrates memory-augmented task planning using RAG for long-term object tracking and utilizes specialized agents powered by task-specific LLMs.
Perception components like Grounded SAM and LLaMa3.2-Vision facilitate robust object detection and semantic scene understanding for task planning.

UAV-VLN: End-to-End Vision Language guided Navigation for UAVs

UAV-VLN: introduces, "LLM (Interprets instructions, generates sub-goals) / Automated Task Planner (Maps sub-goals to actions) / Visual Input (UAV camera feed) / Vision Model (Detects objects, Grounding DINO) / Cross-modal Grounding Module (Aligns language and visuals) / Control Pipeline (Executes plans, ROS 2) / UAV (Executes plan, provides visual input)", a novel end-to-end vision-language navigation framework for UAVs that interprets natural language instructions and plans aerial trajectories.
The framework leverages a fine-tuned LLM for semantic parsing, a vision model (Grounding DINO) for scene understanding, and a cross-modal grounding module to align linguistic intent with visual context.
An Automated Task Planner maps high-level sub-goals from the LLM to low-level control commands executed via a ROS 2 pipeline on the UAV.

Meeseeks: An Iterative Benchmark Evaluating LLMs Multi-Turn Instruction-Following Ability

Meeseeks: introduces a multi-round automatic instruction-following benchmark with a hierarchical taxonomy, simulating human-LLM interaction through an iterative feedback process for evaluating LLMs' instruction-following ability.
The benchmark employs an evaluation system with capability tags across three dimensions, using LLM-based extractors and evaluators alongside rule-based checks.
Meeseeks utilizes data parameterization for flexible dataset generation and provides metrics like Utility Rate and Meeseeks Score to quantify performance and self-correction capabilities.

Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming

LPFG (Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming): introduces a framework for unsupervised feature transformation using a Critic Agent (diagnoses data, provides advice), a Generator Agent (generates features), and Iterative Refinement (feedback loop for improvement).
The Critic Agent provides semantic and distributional advice to guide the Generator Agent in producing tokenized feature transformations.
The iterative feedback loop between the agents refines the generated features for improved structural integrity, predictive utility, and format compatibility.

Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA

Discuss-RAG: introduces an agent-led framework for medical QA RAG systems, featuring a multi-turn discussion and summarization module with a Recruiter R, Medical Team (Agents Hi), and Summarizer C generating a Distilled summary D, followed by a post-retrieval verification module where a Decision maker U evaluates Snippets S from Trivial RAG before LLMs generate the final Answer A.
The multi-turn discussion simulates expert brainstorming via iterative Insights I and Output summary T, enriching context for retrieval.
The post-retrieval verification step filters retrieved content using a Decision maker U and Verifier V, triggering an Alternative retrieval strategy if necessary, to improve answer accuracy and reliability.

29th April 2025

SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories

SECREPOBENCH: introduces a benchmark construction framework that takes GitHub Projects, OSS-Fuzz Reports, and ARVO Dataset as inputs, uses a Task Constructor (Patch Locator, Mask Generator, Write Description, Code Mutator) to create repository-level code generation tasks, employs a Test Constructor (Unit Test Finder, Security Test Case Finder) to generate correctness and security tests, and outputs the task and tests.
The framework focuses on generating secure code completion tasks within real-world C/C++ repositories by leveraging known security vulnerabilities and developer-written tests.
The benchmark evaluates LLMs on their ability to generate correct and secure code in a repository context, which is shown to be more challenging than generating self-contained programs.

AI-in-the-Loop Planning for Transportation Electrification: Case Studies from Austin, Texas

Urban Planning AI: introduces an AI-in-the-Loop framework for transportation electrification planning, integrating Planner, Urban AI, GeoAI, GenAI, LLMs, AI Agent, Automated System, UI, Community, and Feedback Loop components.
The framework utilizes GeoAI for site suitability analysis, GenAI for estimations and visualizations, and LLMs for scenario simulations and chatbot interactions.
Human planners and community feedback are crucial for providing oversight, auditing AI outputs, and ensuring accountable and equitable planning decisions.

LLM Enhancer: Merged Approach using Vector Embedding for Reducing Large Language Model Hallucinations with External Knowledge

LLM-ENHANCER: introduces a system that enhances open-source LLMs using a User (Provides input), LangChain (Framework), ZeroShot React Agent (Selects tools), Agent Executor (Executes actions), Merged Tool (Combines online sources), Calculator (Performs calculations), Merging Data (Combines source data), Splitter (Divides data into chunks), Embeddings (Creates vector representations), ChromaDB database (Stores vector embeddings), Relevant chunks (Retrieved information), and Mistral 7B (Opensource LLM) (Generates response) to reduce hallucinations by integrating external knowledge.
The system uses agents to gather data from multiple online sources in parallel, merges it, and processes it via vector embeddings to find relevant information for the LLM.
This approach aims to provide accurate, up-to-date information to the LLM without extensive fine-tuning, mitigating issues with outdated training data and hallucinations.

Toward Efficient Exploration by Large Language Model Agents

LLM-based PSRL: introduces an implementation of the Posterior Sampling for Reinforcement Learning algorithm using three distinct LLMs: an approximate posterior updater LLM, a posterior sampler LLM, and an optimal policy LLM.
This approach explicitly implements an existing RL algorithm by outsourcing individual steps to distinct LLMs, contrasting with methods that implicitly induce RL behavior.
The framework aims to leverage the exploration properties of PSRL in natural language environments by using LLMs for key functions like updating beliefs, sampling models, and determining optimal actions.

AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security

AegisLLM (Adaptive Agentic Guardrails for LLM Security): introduces a cooperative multi-agent defense system, with Orchestrator (Routes queries based security), Deflector (Handles unsafe inputs, issues refusal), Responder (Generates outputs for safe queries), and Evaluator (Verifies safety of query/response), that ensures safe LLM outputs through a structured agent workflow.
The framework promotes LLM security via a cooperative, inference-time multi-agent system that continuously monitors, analyzes, and mitigates adversarial threats in real time.
AegisLLM leverages automated prompt optimization and Bayesian learning for continuous self-improvement without requiring model retraining, enabling real-time adaptability to evolving attacks.

Using LLMs in Generating Design Rationale for Software Architecture Decisions

LLM-based Agents: introduces a multi-agent system including Aspect_Identifier (Identifies relevant aspects), Information_Collector (Gathers background information), Aspect_Analyst (Analyzes aspects), Aspect_Reviewer (Reviews analysis results), and Trade-off_Analyst (Generates final DR), to generate design rationale for software architecture decisions.
The study evaluates this multi-agent approach against zero-shot and Chain-of-Thought prompting strategies using five different LLMs on a dataset of architecture problems and decisions from Stack Overflow and GitHub.
Evaluation metrics include Precision, Recall, F1-score, and a qualitative IHUM-category classification, comparing LLM-generated rationale to human expert rationale.

A Summary on GUI Agents with Foundation Models Enhanced by Reinforcement Learning

Multimodal LLM-based GUI Agent Architecture: introduces a modular architecture for GUI agents, with Perception (understand GUI), Planning (generate action plans), and Acting (execute actions) components, designed to autonomously interact with digital devices based on task instructions and screen state.
The Perception module extracts semantic information from the GUI, the Planning module translates this into action plans, and the Acting module converts plans into executable interface interactions.
The paper reviews the evolution of these modules, highlighting advancements in multimodal perception, dynamic planning, and adaptive action generation enhanced by reinforcement learning.

TAMO:Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data

TAMO: introduces a tool-assisted LLM agent framework for fine-grained root cause analysis, integrating domain-specific tools for data observation, root cause localization, and fault classification with an expert LLM agent.
The framework decouples the LLM from raw observational data by using specialized tools to process multimodal data and model dynamic dependencies, structuring results for LLM input.
The expert agent synthesizes tool outputs and system context to provide comprehensive fault analysis and remediation recommendations for site reliability engineers.

CRASHFIXER: A crash resolution agent for the Linux kernel

CRASHFIXER: introduces an LLM-based agent that resolves Linux kernel crashes by iteratively performing Hypothesis Generation (creates root cause hypotheses) with Self-Reflection (selects best hypothesis), Patch Generation (synthesizes candidate patches) with Compilation Check (filters uncompilable patches) and Self-Consistency (selects patch aligned hypothesis), and Iterative Debug (manages debug cycles/trees/forests), supported by the KGYMSUITE Platform (provides system/tooling support) including an Execution Trace System (collects/minimizes relevant traces), SUITECACHE (provides cached kernel builds), Fast Compilation Check Tool (quickly checks compilation), and Reproducer Run (executes crash-triggering input).
The agent emulates a kernel developer's workflow, leveraging execution logs and source code to diagnose issues and propose fixes.
KGYMSUITE enhances the KGYM platform to provide scalable and reproducible evaluation infrastructure for LLM-driven kernel debugging.

28th April 2025

Towards Automated Scoping of AI for Social Good Projects

PSA (Problem Scoping Agent): introduces an LLM-based pipeline for automated AI for Social Good project scoping, with Background Retrieval, Challenge Retrieval, Method Retrieval, Annotator, Verbalized Confidence, and Solution Generator components.
The framework leverages retrieval-augmented generation using external search APIs and an LLM to process information and generate project proposals.
The PSA aims to automate the labor-intensive problem scoping process by identifying relevant background, challenges, and methods to generate comprehensive proposals.

TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons

TD-EVAL (Turn and Dialogue-level Evaluation): introduces a two-step evaluation framework, with Turn-Level Evaluation (Evaluates individual turns), TOD Agent Arena (Ranks full dialogues), and LLM Judge (Scores, compares responses/dialogues), designed for task-oriented dialogue systems.
The framework combines fine-grained turn-level analysis using an LLM judge with holistic dialogue-level comparisons via a pairwise ranking method.
TD-EVAL aims to identify subtle errors missed by traditional metrics and provide a more reliable, human-aligned assessment of conversational quality.

Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents

Agentic System Architecture: introduces a comprehensive threat model and mitigation framework for generative AI agents, detailing components like the Agent Brain, Memory Systems, and Tool Invocation Layer.
The architecture highlights how agent autonomy, persistent memory, complex reasoning, and tool integration create novel security risks.
The paper proposes the ATFAA threat model and SHIELD mitigation framework tailored to these unique agentic properties.

Can AI Agents Design and Implement Drug Discovery Pipelines?

Deep Thought agentic system: introduces DO Challenge, a benchmark for evaluating AI agents in drug discovery, and presents the Deep Thought multi-agent system designed to solve complex scientific tasks.
The DO Challenge benchmark requires agents to autonomously develop and execute strategies for identifying promising molecular structures from a large dataset under resource constraints.
The system, composed of heterogeneous LLM-based agents and computational tools, was evaluated on the benchmark, demonstrating competitive performance compared to human teams and domain experts.

LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects

LLM-Powered GUI Agent: introduces an architecture for phone automation, with Intent Comprehension, Perception, Brain (Storage, Decision Making), and Action components, where Intent Comprehension maps user goals to UI operations.
The Perception component gathers UI Info and Phone State, providing input to the Brain for reasoning and decision-making.
The Action component executes decisions through Touch Interactions and Atomic Skills, enabling interaction with the mobile environment.

Prompt Injection Attack to Tool Selection in LLM Agents

ToolHijacker: introduces an automated framework for prompt injection attacks targeting LLM agent tool selection, utilizing a Shadow Framework (Simulates target system) with Shadow Task Descriptions (Attacker-generated tasks), Shadow Retriever (Attacker's retriever model), Shadow LLM (Attacker's LLM model), and Shadow Tool Library (Attacker's tool set) to craft a Malicious Tool Document (Injected attack document) comprising a Tool Name (Malicious tool identifier) and Tool Description (Malicious tool details).
The attack employs a Two-phase optimization (Optimizes retrieval, selection) strategy with Retrieval Objective (Maximize malicious tool retrieval) and Selection Objective (Maximize malicious tool selection), optimized using Gradient-Free Method (Optimizes without gradients) and Gradient-Based Method (Optimizes using gradients) which incorporates Alignment Loss (L1), Consistency Loss (L2), and Perplexity Loss (L3).
The paper evaluates the attack against standard Tool Selection components (Tool Library, Retriever, LLM Agent) and various defenses including prevention-based (StruQ, SecAlign) and detection-based (Known-answer detection, Perplexity detection, Perplexity windowed detection) methods, demonstrating the attack's effectiveness and the defenses' limitations.

From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review

General AI Agent Framework: introduces a conceptual architecture with Thinking/Prompt, Strategy Development, Task, Self-Evaluation, Designated Function, Utility Functions/Knowledge Store, AI Query Engines, Knowledge Store, and Agent Execution Environment.
LangChain: presents an agent architecture including User, Agent (Chat Model, Scratchpad Prompting), Tools, and API for Bookings.
Agentic RAG (Retrieval-Augmented Generation): integrates LLM (Reasoning, Action) with Modular Toolkits, Reflection, Planning, Tool Utilization, Multi-agent Collaboration, User Interface, System Reply, Internal Knowledge Store, and Retrieval Utilities.

m-KAILIN: Knowledge-Driven Agentic Scientific Corpus Distillation Framework for Biomedical Large Language Models Training

m-KAILIN: introduces a knowledge-driven, multi-agent framework for distilling high-quality biomedical question-answering corpora, utilizing a Multi-Agent Collaborative Framework, Question Generation Agent, Context Retrieval Agent, Question Evaluation Agent, Answer Generation Agent, MeSH, Dense Passage Retrieval (DPR), BiomedBERT base encoder, Direct Preference Optimization (DPO), Preference Dataset, Training Corpus Dataset, and Target LLM.
The framework employs specialized agents guided by the MeSH hierarchy to extract, synthesize, and self-evaluate textual data from scientific literature, generating domain-specific question-answer pairs.
This automated pipeline produces high-quality, preference-based datasets for training biomedical LLMs, ensuring comprehensive coverage and consistency with biomedical ontologies.

Research CodeAgent: An LLM Multi-Agent System for Automated Codification of Research Methodologies

ResearchCodeAgent: introduces a novel multi-agent system leveraging LLMs to automate research methodology codification, including Planning (determines next action), Research Logs (records history/memory), Workers (execute actions), Environment (input files/context), Action Space (available actions), LLM Cascade (hierarchical LLMs for planning), and Programmatic Constructs (system aids/constraints).
The system bridges the gap between high-level research concepts and practical implementation by iteratively interacting with a research environment using a flexible agent architecture and dynamic planning.
ResearchCodeAgent demonstrates improved code quality, error reduction, and significant time savings compared to baseline methods, particularly for complex tasks.

AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers

AutoP2C (An LLM-Based Agent Framework): introduces "Paper-to-Code", a task transforming multimodal paper content into executable code repositories, with repository blueprint extraction, multimodal content parsing, hierarchical task decomposition, and iterative feedback-driven implementation components.
The framework analyzes existing codebases for structure, extracts and integrates text, images, and tables from papers using tools like MinerU, plans code generation hierarchically, and iteratively refines code through feedback.
AutoP2C, a multi-agent framework based on large language models, generates multi-file code repositories and explanatory diagrams, addressing challenges of multimodal input and structured code output.

Evolution of Cooperation in LLM-Agent Societies: A Preliminary Study Using Different Punishment Strategies

LLM-based Multi-Agent System Simulation: introduces a framework using LLM Agents (Agents powered by LLMs), Simulation Environment (Adapted Smallville world), Diner's Dilemma Process (Multi-stage agent interaction), Strategy Evolution (Pairwise imitation mechanism), and LLM Integration (API calls for decisions) to study the evolution of cooperation in agent societies.
The framework models a realistic n-player diner's dilemma where LLM agents make decisions, calculate payoffs, and update strategies based on punishment mechanisms and pairwise imitation.
Preliminary results suggest that LLM agents can replicate cooperation dynamics observed in abstract mathematical models, with punishment driving norm emergence.

An Automated Reinforcement Learning Reward Design Framework with Large Language Model for Cooperative Platoon Coordination

PCRD (Platoon coordination Reward Design): introduces an automated framework for designing RL reward functions for platoon coordination, utilizing an LLM, AIR module, Reward Function Pool, Platoon Coordination Environment, Parallel Training, Training Feedback, and EvoLeap module.
The framework automates reward function discovery through LLM-driven initialization and iterative optimization based on training feedback.
The AIR module analyzes environment code and task requirements, while the EvoLeap module evolves reward functions based on training results.

MemO: Building Production-Ready AI Agents with Scalable Long-Term Memory

MemO: introduces a scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves salient information from ongoing conversations.
The system operates in extraction and update phases, using an LLM with a tool call interface to manage memories stored in a database.
An asynchronous summary generator maintains conversation context, while an enhanced variant, MemOº, uses graph-based memory for complex relationships.

27th April 2025

SAGA: A Security Architecture for Governing AI Agentic Systems

SAGA: introduces a security architecture for governing AI agentic systems, with User (Owner, manages agents), Agent (Autonomous entity, uses LLM), Provider (Central service, manages registries), LLM Backend (Agent core decision component), User Registry (Stores user identities), Agent Registry (Stores agent metadata), Access Contact Policy (User-defined agent rules), One-Time Key (OTK) (Ephemeral key for token), Access Control Token (ACT) (Limited communication token), Access Control Key (Long-term key for token), TLS Credentials (Secure communication), and Agent Metadata (Agent information), enabling user oversight and secure inter-agent communication.
The architecture utilizes a centralized Provider for registration and policy enforcement, while inter-agent communication occurs directly using cryptographic tokens derived from one-time keys.
SAGA balances security and performance by allowing users fine-grained control over agent interactions through access control policies and token granularity.

BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese

BrowseComp-ZH: introduces a high-difficulty benchmark for evaluating LLM web browsing in Chinese, built using Reverse Design (answer-first creation) by Expert Annotators (skilled data creators) through a Dataset Construction (topic/question design) process.
The benchmark features Multi-constraint Design (ensuring answer uniqueness), Non-trivial Retrieval Validation (checking search difficulty), and Evidence Traceability (providing source URLs), validated via a Two-stage Quality Control (rigorous data filtering) process involving Human-in-the-loop Validation (human oversight), AI Agent Verification (initial answer generation), and Manual Verification (human answer checking).
It evaluates various Benchmarked Models (evaluated LLMs/agents) using specific Grading (scoring model performance) procedures, revealing challenges in multi-hop retrieval and reasoning on the Chinese web.

ANDROIDGEN: Building an Android Language Agent under Data Scarcity

ANDROIDGEN: introduces a framework to enhance LLM-based Android agents under data scarcity, including ExpSearch (in-context learning from trajectories), ReflectPlan (self-reflection and plan update), AutoCheck (verifies agent operations validity), and StepCritic (evaluates trajectory step-by-step).
The framework leverages LLMs and its modules to generate high-quality browsing trajectories without manual annotation and train open-source mobile agents.
Evaluations demonstrate ANDROIDGEN's improvements in reasoning, operational accuracy, and generalization on various Android benchmarks.

APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries

APE-Bench I evaluation pipeline: introduces a system for evaluating LLMs on proof engineering tasks, with LLM, DiffRepair, Eleanstic, Lean compiler, and LLM-as-a-Judge components.
The pipeline uses LLMs to generate patches, normalizes them with DiffRepair, and verifies them syntactically via Eleanstic/Lean compiler and semantically via LLM-as-a-Judge.
This two-stage verification process assesses both code correctness and adherence to natural language instructions for realistic proof engineering tasks.

26th April 2025

Generative AI in Embodied Systems: System-Level Analysis of Performance, Efficiency and Scalability

Embodied AI Agent System: introduces a system-level analysis of generative AI-based embodied agents, categorizing them into paradigms and evaluating performance and efficiency across modules, agent scales, and tasks.
The paper identifies key building blocks including Sensing, Planning, Communication, Memory, Reflection, and Execution, analyzing their contribution to system latency and task success.
Analysis reveals LLM-based planning and communication are major latency bottlenecks, while memory, reflection, and execution modules are critical for task efficiency and success.

RESHAPING MOFS TEXT MINING WITH A DYNAMIC MULTI-AGENT FRAMEWORK OF LARGE LANGUAGE AGENTS

MOFh6: introduces a dynamic multi-agent framework of Large Language Agents, including crawler, parsing, comparison, resolution, and generation agents, for reshaping MOFs text mining.
The system leverages fine-tuned LLMs and specialized agents to extract precise MOF synthesis conditions and structural information from scientific literature.
MOFh6 provides an end-to-end intelligent interaction system supporting natural language queries, data analysis, and crystal structure visualization for streamlined MOF research.

Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

SciTalk: introduces a multi-agent framework for generating scientific short-form videos, utilizing Preprocessing Stage (Prepare input materials), Planning Stage (Generate script structure), Editing Stage (Integrate visual elements), Feedback & Evaluation Stage (Assess refine video), Flashtalk Generator (Creates video script), Sceneplan Generator (Subdivides script scenes), Background Assistant (Selects background images), Text Assistant (Generates on-screen text), Effect Assistant (Applies visual effects), Layout Allocator (Determines visual positions), Feedback Agents (Review intermediate outputs), Reflection Agents (Integrate feedback prompts), Evaluation Agent (Assesses final video), Video editing library (Composes final video), Multi-modal LLM (Powers feedback evaluation), OpenAI/Synthesia APIs (Generate audio avatar), and MoviePy (Composites visual elements).
The framework incorporates an iterative feedback loop where agents evaluate generated content and refine prompts for subsequent iterations.
SciTalk grounds videos in source materials like text, figures, and screenshots to ensure factual accuracy in scientific video dissemination.

MATCHA: Can Multi-Agent Collaboration Build a Trustworthy Conversational Recommender?

MATCHA: introduces a multi-agent conversational recommendation framework, with Risk Control Module (filters harmful content), Candidate Generation Module (generates game candidates), Ranking Agent (ranks candidates), Reflection Agent (refines candidates), Explainability Module (generates explanations), Data Sources (game information), User Context (user preferences), and Tools (specialized functions), designed to provide trustworthy game recommendations.
The framework leverages specialized agents and large language models to handle complex user requests, enhance personalization, and ensure safety and transparency.
MATCHA demonstrates superior performance across multiple metrics compared to baselines, highlighting the benefits of multi-agent collaboration for conversational recommendation systems.

A Review of 3D Object Detection with Vision-Language Models

VLMs (Vision-Language Models): introduces a review of 3D object detection with VLMs, detailing the architecture including Image Encoder (processes visual inputs), Multimodal Projector (aligns visual and text), and Text Decoder (generates language output), and the 3D pipeline stages: 2D Object Proposals (initial 2D detection), Projection from 2D to 3D Space (maps 2D to 3D), Hierarchical Feature Alignment (aligns 2D and 3D features), and Refinement and Filtering (refines 3D detections).
This approach integrates visual perception with natural language understanding to enable semantic reasoning and open-vocabulary detection in 3D space.
The framework allows for flexible querying, zero-shot generalization, and instruction-based interaction, addressing limitations of traditional geometry-only methods.

MODP: Multi Objective Directional Prompting

MODP (Multi Objective Directional Prompting): introduces a framework for prompt engineering that treats it as a multi-objective optimization problem, incorporating Data (Input data for evaluation), Objectives (Task-specific and LLM-specific goals), Metrics (Quantifiable measures for objectives), Weights (Prioritization of objectives), Prompts (Instructions for the LLM), LLM (Large Language Model executing prompts), Evaluation (Process of scoring prompts), Human Feedback (Input for refinement), Iteration (Loop for prompt improvement), and Selection (Choosing the optimal prompt).
The framework systematically identifies and balances task-specific and LLM-specific objectives using a metrics-driven approach with weighted scoring and human feedback.
The iterative process refines prompts based on performance metrics across multiple objectives to develop robust and high-precision prompts.

25th April 2025

LLMpatronous: Harnessing the Power of LLMs For Vulnerability Detection

LLMpatronous: introduces an AI-driven approach for vulnerability detection, with RAG (Retrieval Augmented Generation), Vector Database (External knowledge base), MoA (Mixture of Agents), and LLM Agents (Multiple language models), designed to mitigate LLM limitations and improve reliability.
The approach combines external knowledge retrieval via RAG with collaborative analysis by multiple LLM agents within a MoA architecture to reduce false positives.
LLMpatronous leverages the collective reasoning power of multiple LLMs grounded by up-to-date vulnerability information from a vector database.

Auto-SLURP: A Benchmark Dataset for Evaluating Multi-Agent Frameworks in Smart Personal Assistant

Workflow defined for the Auto-SLURP dataset: introduces a multi-agent architecture for smart personal assistants, including a User (initiates query), Workflow (orchestrates agents), Program Manager Agent (orchestrator, delegates tasks), Intent Agent (predicts intent, slots), Time Agent (formats time parameters), Location Agent (formats location parameters), Url Agent (selects URL), Request Agent (executes function call), and Simulated Servers / External Services (backend processes, APIs).
This architecture simulates end-to-end personal assistant interactions, evaluating language understanding, task execution, and response generation.
The Program Manager Agent orchestrates the user query flow through specialized agents and backend services to complete multi-step operations.

Evolution of AI in Education: Agentic Workflows

Agentic Workflows: introduces a review of AI agentic paradigms in education, including Reflection (evaluates past actions/outputs), Planning (decomposes goals into steps), Tool Use (leverages external resources/functions), and Multi-agent Collaboration (multiple agents work together), and presents a Multi-Agent Scoring System (MASS) proof-of-concept with a Supervisor Agent (delegates tasks in MASS), Subagent 1 (scores essay content in MASS), and Subagent 2 (scores essay language in MASS).
The paper examines how AI Agents, utilizing LLMs as their core reasoning engine, interact with an Environment to achieve goals through these paradigms.
The MASS system demonstrates the potential of multi-agent architectures for tasks like automated essay scoring, showing improved consistency over single LLM approaches.

Revisiting Data Auditing in Large Vision-Language Models

VLM Membership Inference (VLM MI): revisits data auditing in large vision-language models, with Vision Encoder, Projector, Language Model, Inner States, WiRED, Probing Methods, Bayes Optimality, Aggregation components, where the paper analyzes challenges and identifies feasible scenarios for membership inference on large vision-language models.
The study reveals distribution shifts in existing benchmarks, quantified by the WiRED metric, which inflate VLM MI performance.
Probing VLM inner states and estimating Bayes Optimality show low theoretical limits for MI under unbiased conditions, but fine-tuning, ground-truth text access, and aggregation improve feasibility.

Towards Adaptive Software Agents for Debugging

Adaptive Agents: introduces an adaptive agentic design for debugging, featuring a Main Agent that manages the process and dynamically creates Specialized Agents to perform specific tasks, with both components collaborating and reflecting iteratively.
The Main Agent analyzes buggy code, profiles and prioritizes necessary Specialized Agents, and validates their reports, deciding on further iterations if needed.
This adaptive approach dynamically adjusts the number and roles of agents based on problem complexity, improving bug fix rates and resource usage compared to static designs.

MAGI: Multi-Agent Guided Interview for Psychiatric Assessment

MAGI (Multi-Agent Guided Interview): introduces a framework that transforms the MINI interview into automatic computational workflows using coordinated multi-agent collaboration, including Navigation Agent (Governs interview flow), Question Agent (Generates questions), Judgment Agent (Validates responses), Diagnosis Agent (Synthesizes diagnosis), and PsyCoT (Reasoning paradigm).
The framework utilizes four specialized agents to dynamically navigate clinical logic and generate DSM-5 compliant conclusions through structured reasoning traces.
PsyCoT, the Psychometric Chain-of-Thought reasoning paradigm, enhances transparency by explicitly mapping symptoms to clinical criteria via intermediate psychiatric constructs.

Automating Function-Level TARA for Automotive Full-Lifecycle Security

DefenseWeaver: introduces a system for automating function-level Threat Analysis and Risk Assessment (TARA) using Automotive Configurations and Threat Scenarios as input, processed by Atomic Structure Representation (OpenXSAM++, Logical Path Extraction, Atom Construction), inferred attack methods via LLM Agent-based Attack Methods Inference (Sub-Tree Constructor, Attack Tree Assembler, Risk Assessor), adapted using LORA fine-tuning and RAG for Adaptation (LoRA, RAG, Expert-Curated TARA Reports, Accumulated TARA Reports), and outputting a TARA Report.
The system leverages a multi-agent LLM framework to dynamically generate detailed attack trees and risk evaluations from component-specific information, overcoming limitations of static threat libraries.
DefenseWeaver demonstrates adaptability to evolving threats and diverse standards through LoRA fine-tuning and RAG with expert-curated reports, validated across automotive, UAV, and marine systems.

MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind

MultiMind: introduces, with Multimodal Perceiver, Reasoner, ToM Model, Planner, Monte Carlo Tree Search, and Actor (LLM) components, a framework enhancing LLM agents for social deduction games by integrating multimodal information and Theory of Mind reasoning.
The framework processes facial expressions, vocal tones, and verbal content to infer player beliefs and optimize communication strategies.
This approach enables agents to reason about how they are perceived by others and strategically minimize suspicion.

LLM Agent Swarm for Hypothesis-Driven Drug Discovery

PharmaSwarm: introduces a multi-agent framework including Orchestrator, Data & Knowledge Layer, Terrain2Drug Agent, Paper2Drug Agent, Market2Drug Agent, Shared Memory, Simulation Engine (PETS), Interpretable Binding Affinity Map (iBAM), Central Evaluator (TxGemma), and Output, designed for hypothesis-driven drug discovery.
The framework orchestrates specialized LLM agents that propose targets and compounds based on diverse biomedical data, which are then validated through simulation and evaluation.
An iterative workflow with feedback loops and shared memory enables continuous refinement of hypotheses and self-improvement of the system.

24th April 2025

Collaborating Action by Action: A Multi-agent LLM Framework for Embodied Reasoning

MINDcraft: introduces a multi-agent LLM framework for embodied reasoning, with Server (launches/manages agents), Main agent loop (handles messages), Library (high-level actions/queries), and Layer (prompts/calls LLMs) components, designed to enable LLM agents to control characters and collaborate in Minecraft.
The framework supports agentic instruction following, self-guided play, collaboration, and communication in a grounded environment.
The paper also introduces MineCollab, a benchmark built on MINDcraft, featuring crafting, cooking, and construction tasks to test collaborative and embodied reasoning.

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

RAGEN (modular system for training and evaluating LLM agents): introduces StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, where the LLM interacts with an Env via Rollout to generate a Trajectory, using Reward Assignment and Advantage Estimation for Policy Optimization during Update.
The paper identifies instability patterns in multi-turn RL and proposes StarPO-S, a stabilized variant incorporating Trajectory Filtering, Critic, Decoupled Clipping, KL Term Removal, and Clip-Higher to improve training robustness.
RAGEN serves as a research infrastructure to study LLM agent training dynamics in multi-turn, stochastic Environments, revealing insights into gradient stability, rollout quality, and the need for meticulous reward design for reasoning emergence.

Toward Personalizing Quantum Computing Education: An Evolutionary LLM-Powered Approach

ITAS (Intelligent Teaching Assistant System): introduces a novel system for personalized quantum computing education, featuring a Lesson Planning Agent (Generates, revises lesson plans), Teaching Agent (Manages interaction, provides assistance), Knowledge Graph (Central persistent memory, state), Tag System (User intent, control, structured input), Video Player (Shows video lectures, tutorials), Code Editor (IDE) (Writing, executing quantum code), Chat Interface (CI) (Student-system interaction), and Lesson Presentation (Presents lesson plan steps).
The system employs a two-agent architecture coordinated by a central Knowledge Graph to provide context-aware tutoring and dynamically adapt lesson plans based on student interaction and explicit tag input.
The Tag System empowers users to guide the learning process and mitigate LLM hallucination by providing structured input, while the Knowledge Graph stores interaction data for future analysis and learning path enhancement.

Toward a Human-Centered Evaluation Framework for Trustworthy LLM-Powered GUI Agents

GUI Agent: introduces LLM-powered GUI agents, with User Input (receives commands), GUI Agent (system), GUI Perception (analyzes UI), LLM Processing (interprets, plans), GUI Interaction (executes actions), where the paper examines their privacy and security risks and advocates for a human-centered evaluation framework.
The paper identifies key risks like amplified data leaks, diminished control, and insufficient guardrails, highlighting challenges in human-centered evaluation due to system complexity and user overtrust.
It advocates for integrating risk assessments, in-context consent, and embedding privacy into agent design and evaluation to ensure trustworthiness.

Assessing the Potential of Generative Agents in Crowdsourced Fact-Checking

Framework: introduces a system simulating crowdsourced fact-checking with Generative Agents (autonomous entities) powered by LLMs (power agents) using a Dataset (statements, evidence), involving Data Preparation (tailor data) with Statements Selection (choose claims), Web Page List Creation (verify evidence links), Summary Generation (create evidence summaries), and Agent Profile (define agent attributes), followed by a Simulation Workflow (mimic fact-checking) where agents perform Single Statement Assessment (agent evaluates statement) including Evidence Selection (agent chooses evidence) and Questionnaire Completion (agent rates dimensions), instantiated using PyAutogen (instantiate agents).
The framework evaluates generative agents' performance against human crowds in truthfulness classification and consistency.
Generative agents demonstrate superior performance, higher internal consistency, and reduced bias compared to human evaluators.

Towards a HIPAA Compliant Agentic AI System in Healthcare

HIPAA Compliant Agentic AI Framework: introduces a system for securing autonomous workflows in healthcare, integrating dynamic Attribute-Based Access Control, hybrid PHI sanitization, and immutable audit trails via Client, EHR, Policy Enforcement Agent, Sanitization Agent, LLM API or On-Premise Model, Policy Decision Agent, Middleware Agent, Post-Inference Redaction Agent, Audit Agent, and Downstream Task components.
The framework enforces regulatory compliance through context-aware policy enforcement, pre- and post-inference PHI sanitization, and cryptographic audit trails.
This architecture aims to enable the responsible deployment of agentic AI systems in clinical settings by ensuring HIPAA compliance throughout data interactions.

Comprehend, Divide, and Conquer: Feature Subspace Exploration via Multi-Agent Hierarchical Reinforcement Learning

HRLFS (Hierarchical Reinforcement Learning for Feature Selection): introduces a feature selection framework based on a comprehend-divide-and-conquer paradigm, utilizing Hybrid Feature State Extraction, Clustering, Agent Hierarchy Construction, Hierarchical Agents, Feature Subspace Exploration via an RL Loop with State, Action, Reward Estimation, Policy Network, Memory, Optimization Phase, and Actor-Critic.
The framework employs LLMs and GMM for comprehensive feature understanding, H-clustering for dividing features into groups, and a hierarchical multi-agent RL architecture for efficient subspace exploration.
HRLFS demonstrates improved performance and computational efficiency compared to single-agent and one-agent-per-feature RL methods by strategically managing feature selection through a hierarchical structure.

A RAG-BASED MULTI-AGENT LLM SYSTEM FOR NATURAL HAZARD RESILIENCE AND ADAPTATION

WildfireGPT (A RAG-Based Multi-Agent LLM System): introduces a retrieval-augmented generation (RAG)-based multi-agent LLM system to support natural hazard decision-making, including Task Orchestrator Agent, User Profile Agent, Planning Agent, Analyst Agent, LLM Agent, Evaluation Agent, Data Sources, Literature Search Dataset, Embedding Model, Vector Store, OpenAI Assistant API, Streamlit-based web app, Conversation History, Retrieved Context, and Prompt Augmentation components.
The system employs a user-centered, multi-agent design to deliver tailored risk insights by integrating diverse data and scientific literature through an RAG framework.
Evaluation across expert-led case studies demonstrates the system's effectiveness in providing accurate and contextually relevant information for decision support.

Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

PaperCoder: introduces, with Planning (construct roadmap), Analyzing (interpret details), Coding (generate code), and Task-specialized LLM agents (instantiate phases), a multi-agent LLM framework transforming machine learning papers into functional code repositories.
The framework operates in three sequential stages: planning, analysis, and code generation, emulating a human software development workflow.
Task-specialized LLM agents instantiate each phase, collaborating effectively across the pipeline to produce modular, dependency-aware code.

23rd April 2025

A Survey of AI Agent Protocols

AI Agent Protocols: introduces a systematic classification and analysis of existing communication protocols for LLM agents, detailing their core architecture including Foundation Model, Memory Systems, Planning, Tool-Using, and Action Execution components.
The survey categorizes protocols into context-oriented (e.g., MCP with Host/Client/Server/Resource) and inter-agent (e.g., A2A with Agent Card/Task, ANP with Identity Layer/Meta-Protocol Layer/Application Protocol Layer, Agora with Protocol Documents, Agent Protocol with Runs/Threads/Store).
It evaluates protocols based on dimensions like efficiency, scalability, security, reliability, extensibility, operability, and interoperability, providing insights for designing robust communication infrastructures for intelligent agents.

Leveraging LLMs as Meta-Judges: A Multi-Agent Framework for Evaluating LLM Judgments

Meta-Judge Selection Framework: introduces a three-stage pipeline including prompt design, meta-judge score calculation with a multi-agent module, and score-based selection.
The framework utilizes a refined rubric and multiple LLM agents to evaluate raw LLM judgments, aggregating scores through methods like majority voting or weighted averaging.
A threshold is applied to the final meta-judge score to select trustworthy judgments, aiming to improve precision compared to single-agent or raw judgments.

OptimAI: Optimization from Natural Language Using LLM-Powered AI Agents

OptimAI: introduces a framework for solving optimization problems from natural language, with Formulator (Translates natural language), Planner (Proposes solution strategies), Coder (Generates solver code), and Code Critic (Performs reflective debugging) components.
The framework translates natural language into mathematical formulations, plans solution strategies, generates executable code, and refines code through debugging.
OptimAI employs a multi-agent architecture and uses UCB-based debug scheduling to dynamically switch between alternative plans during debugging.

Do Large Language Models know who did what to whom?

Large Language Models (LLMs): investigates whether pre-trained LLMs, including BERT, GPT2-Small, Llama 2, and Persimmon, capture thematic roles by analyzing their Hidden Units and Attention Heads.
The study uses representational similarity analysis and SVM classification on internal representations to assess thematic role encoding.
Findings indicate thematic role information is weakly represented in hidden units but reliably available in attention heads, differing from human judgments.

MONTE CARLO PLANNING WITH LARGE LANGUAGE MODEL FOR TEXT-BASED GAME AGENTS

MC-DML (Monte Carlo planning with Dynamic Memory-guided Large language model): introduces a text-based game agent that combines MCTS (Monte Carlo Tree Search) with an LLM (Large Language Model) guided by a Dynamic Memory Mechanism (integrates past experiences) using In-Trial Memory (current trajectory history) and Cross-Trial Memory (reflections from failures) for action selection via PUCT (action selection formula).
The LLM serves as the initial policy and dynamically adjusts action evaluations during planning based on the integrated memory mechanisms.
This approach enhances action exploration and performance in complex text-based games by enabling the agent to learn from past experiences.

IRIS: Interactive Research Ideation System for Accelerating Scientific Discovery

IRIS (Interactive Research Ideation System): introduces a human-in-the-loop platform for scientific ideation, featuring an Ideation Agent, Review Agent, and Retrieval Agent, guided by Monte Carlo Tree Search for iterative idea exploration.
The system allows researchers to refine research briefs through fine-grained feedback and targeted literature retrieval, balancing human control with automation.
MCTS enables systematic exploration of the idea space, while the Review Agent provides feedback based on a hierarchical taxonomy to mitigate issues like "reward hacking".

Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution

GoalAct: introduces a novel agent framework with Global Planning (Continuously updated task plan) and Hierarchical Execution (Decomposes task into skills), interacting with User Query (Initial task input), Historical Record (Past steps actions observations), and Environment (External interaction space).
The framework uses continuously updated global planning to maintain long-term goals and ensure plan feasibility based on real-time feedback.
Hierarchical execution decomposes tasks into high-level skills like searching, coding, and writing, enhancing adaptability and reducing planning complexity.

Amplified Vulnerabilities: Structured Jailbreak Attacks on LLM-based Multi-Agent Debate

Structured Prompt Rewriting Framework: introduces a method to amplify jailbreak attacks on Multi-Agent Debate systems, with Narrative Encapsulation, Role-Driven Escalation, Iterative Refinement, and Rhetorical Obfuscation components.
This framework embeds malicious queries in scenarios, exploits agent roles, refines content iteratively, and uses obfuscating language to bypass safety filters.
The method significantly increases harmfulness and attack success rates against various MAD frameworks and underlying LLMs.

Less is More: Enhancing Structured Multi-Agent Reasoning via Quality-Guided Distillation

Less is More: introduces a structured multi-agent reasoning framework, with Prompt Induction (Derives task prompts), Retrieval-Augmented In-Context Learning (Retrieves context examples), Reasoning Synthesis (Generates structured data), Dual-Stage Filtering (Filters synthesized data), Reward Model (Scores data quality), Distilled Datasets (Filtered training data), Supervised Fine-Tuning (Trains task models), Meta-Llama-3-8B-Instruct (Base language model), and Inference Agents (Task-specific fine-tuned models), designed to enhance structured multi-agent reasoning under low-resource conditions via quality-guided distillation.
The framework generates high-quality training data from minimal labeled examples using prompt induction, retrieval-augmented synthesis, and dual-stage filtering based on structural validity and reward scores.
Task-specific agents for question parsing, CoT parsing, and verification are fine-tuned on the distilled data, enabling modular and interpretable reasoning.

ClarifyCoder: Clarification-Aware Fine-Tuning for Programmatic Problem Solving](http://arxiv.org/abs/2504.16331v1)

ClarifyCoder: introduces a novel framework for enhancing code LLMs, utilizing a Data Synthesis Technique (Generates ambiguous problems/questions) to create Clarify-Aware Synthetic Data (Dataset for clarification training) for Targeted Instruction Tuning (Fine-tunes LLM for clarification) of a Pre-trained LLM (Base language model) to produce a ClarifyCoder Model (Fine-tuned clarification-aware LLM).
The Data Synthesis Technique automatically generates ambiguous problem descriptions and corresponding clarifying questions to train models to recognize and query uncertainties.
Targeted Instruction Tuning combines synthetic data with standard data to enable the ClarifyCoder Model to prioritize clarification over immediate code generation when faced with ambiguity.

22nd April 2025

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

Full Stack LLM (Agent) Safety: introduces a comprehensive survey on LLM and LLM-agent safety across their lifecycle, including Data (Data collection, synthesis), Pre-training (Data cleaning, enhancement), Post-training (Model adaptation, safety correction), Editing & Unlearning (Knowledge update, removal), LLM (Large Language Model backbone), Agent Modules (Agent capabilities, interaction), Environment (Agent operating context), Multi-agent Systems (Interacting agent entities), Evaluation (Safety, utility assessment), Attacks (Adversarial threats), and Defenses (Mitigation strategies).
The survey systematically examines safety issues from data preparation through deployment, covering attacks, defenses, and evaluation methods at each stage.
It highlights the unique challenges and research directions for LLM-based agents, emphasizing the security of external modules like tools and memory.

MR. Video: “MapReduce” is the Principle for Long Video Understanding

MR. Video: introduces a MapReduce principle for long video understanding, employing Captioning, Intention Analysis, and Goal-Aware Analysis stages, each with Map and Reduce steps, utilizing VLM (video perception model) and LLM (language reasoning model).
The framework performs sequence-parallel perception of short video segments in the Map steps and aggregates information for global comprehension in the Reduce steps.
This approach demonstrates significant accuracy improvement on challenging long video benchmarks compared to existing methods.

LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

RLFT (Reinforcement Learning Fine Tuning): fine-tunes a Pre-trained LLM (generates output tokens) using Reward (feedback from environment/shaping) from the Environment (provides states/rewards), storing data in a Buffer (stores interaction data), processing Input Template (structures input context) to produce Output (generated tokens (CoT + action)), and applying Update (policy optimization step).
The approach leverages self-generated Chain-of-Thought rationales to iteratively refine the LLM's reasoning process towards higher rewards in decision-making scenarios.
Experiments demonstrate that RLFT mitigates prevalent LLM failure modes like greediness and frequency bias, improving exploration and reducing the knowing-doing gap.

Towards Test Generation from Task Description for Mobile Testing with Multi-modal Reasoning

VISIDROID: introduces a multi-modal framework for mobile test generation, with Task Goal (Natural language task description), LLM Action Selector (Decides next action), Executor (Executes action on app), Screenshot (Captures GUI image), LMM Verifier (Checks task completion), Sequence of Actions (Generated action steps), Sequence Ranking (Ranks action sequences), Test Script Generator (Creates test script), Observer (Detects UI changes), UI Changes (Changes in GUI), Task Memory (Short-term context), Persistent Memory (Long-term experience), and LLM Reflector (Generates rules/steps).
The framework iteratively determines the next action using LLMs and leverages visual images of screens via a multi-modal verifier to detect task completeness.
It combines short-term task memory and long-term persistent memory to enhance decision-making and learn from past interactions.

A closer look at how large language models “trust” humans: patterns and biases

Experimental Framework: introduces, "a study on LLM implicit trust in humans", with LLMs (Agents studied), Simulated Scenarios (Contexts for trust), Prompting Procedure (Elicits LLM responses), Trustee Attributes (Manipulated input variables), Trust Measurement (Quantifies LLM trust), Analysis (Statistical evaluation), Simulation Environment (Experiment execution), and Data Storage (Results and code), where "the framework investigates how LLMs' trust in humans is influenced by perceived trustworthiness and demographic factors across various scenarios."
The study demonstrates that LLMs exhibit implicit trust behaviors sensitive to trustworthiness and demographics, showing both human-like patterns and model-specific variations and biases.
Understanding these LLM trust dynamics is crucial for integrating AI agents into sensitive decision-making processes and mitigating potential biases.

WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

WALL-E 2.0 (World Alignment by NeuroSymbolic Learning): introduces a training-free approach to align LLMs with environment dynamics, including Model-Predictive Control (Controls agent decisions), Agent Model (LLM) (Plans agent actions), World Model (LLM) (Predicts environment outcomes), World Model (Code Rules) (Verifies LLM predictions), NeuroSymbolic Learning (Learns symbolic knowledge), Symbolic Knowledge (Action Rules) (Captures action constraints), Symbolic Knowledge (Knowledge Graph) (Represents feasibility constraints), Symbolic Knowledge (Scene Graph) (Provides global scene info), Code Rules (Executable symbolic knowledge), Pruning (Selects impactful code rules), and Environment (Agent interaction space).
The framework iteratively learns symbolic knowledge from trajectories, translates it into executable code rules, and uses these rules to align the LLM world model's predictions with the environment.
This neurosymbolic world model enables the LLM agent to perform efficient and reliable planning through a model-predictive control loop, significantly improving performance in open-world environments.

IMPLEMENTING RATIONAL CHOICE FUNCTIONS WITH LLMS AND MEASURING THEIR ALIGNMENT WITH USER PREFERENCES

Proposed Methods: introduces design principles for implementing rational choice functions using LLMs, including Pairwise-Score (Scores alternatives from pairwise LLM comparisons) and Pairwise-SCC (Uses SCCs from pairwise LLM comparisons), and provides metrics Strict Preference Overlap (SPO) (Measures partial alignment) and Kendall distance with penalty (K(p)) (Measures full alignment) to measure alignment with user preferences, encompassing strict preferences and indifference.
The framework addresses the challenge of aligning LLM-based decision-making in intelligent user interfaces with user preferences, which is crucial for reliability and trustworthiness.
Empirical validation in an automotive domain use case demonstrates the applicability of the proposed principles and metrics, highlighting their distinct strengths for achieving partial or full alignment.

DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models

DianJin-R1: introduces a reasoning-augmented framework for financial reasoning, utilizing a Base Language Model, Supervised Fine-Tuning Module, and Reinforcement Learning Module.
The framework enhances reasoning by training on specialized data and refining performance with a Reward Module during reinforcement learning.
The resulting DianJin-R1 Model demonstrates improved performance on complex financial reasoning tasks.

A Multi-Agent Framework for Automated Qinqiang Opera Script Generation Using Large Language Models

Multi-Agent Framework: introduces a novel framework for automated Qinqiang opera script and performance generation, including Agent1 (Script Generation), Agent2 (Visual Content Generation), and Agent3 (Speech Synthesis).
The framework integrates LLMs for scriptwriting, visual generation models for scene creation, and TTS synthesis for vocal performance.
This multi-agent approach streamlines the production pipeline, achieving high expert ratings for script fidelity, visual coherence, and speech accuracy.

A Framework for Testing and Adapting REST APIs as LLM Tools

Framework for Tool Testing in Agentic Flows: introduces a novel framework for evaluating and enhancing the readiness of REST APIs to function as tools for LLM-based agents, utilizing Tool Builder, API to Tool Conversion, Tools Catalog, API test case generation, LLM based NL test case generation, NL test cases execution, API test cases execution, Agentic Framework Setup, Agentic Framework, Tool Evaluation and Error Analysis, NL Test cases execution report, and API Test Cases execution report components.
The framework transforms APIs into tools, generates comprehensive test cases, translates them into natural language instructions for agents, enriches tool definitions, and evaluates the agent's ability to correctly invoke APIs and process responses.
The work analyzes test case outcomes and presents an error taxonomy to provide actionable insights for improving tool definitions and integrations for agent-based applications.

21st April 2025

A SELF-IMPROVING CODING AGENT

SICA (Self-Improving Coding Agent): introduces, "a self-improving coding agent capable of editing its own codebase", with Agent (LLM wrapper taking actions), Base Agent (initial self-improvement agent), Meta-Agent (agent performing improvement), Archive (stores past agents/results), Evaluation Benchmarks (tasks measure performance), Utility Function (selects best agent), Tools (basic agent actions), Sub-Agents (specialized task handlers), Asynchronous Overseer (monitors agent behavior), LLM Context Window (LLM input structure), LLM Context Window System Prompt (agent setup instructions), LLM Context Window Core Prompt (problem and file context), LLM Context Window Assistant Messages (agent interaction history), Callgraph (agent execution tree), and Event Stream (detailed interaction log), where "SICA is designed to autonomously improve performance on coding tasks by modifying its own code".
The system operates via a meta-agent loop, where the best performing agent from an archive is selected to improve the current agent based on benchmark results.
Key components include a structured LLM context window, various tools for file manipulation and execution, specialized sub-agents for task decomposition, and an asynchronous overseer for monitoring and intervention.

In-context Ranking Preference Optimization

IRPO (In-context Ranking Preference Optimization): introduces a novel framework that directly optimizes LLMs based on ranking lists constructed during inference, incorporating graded relevance and positional importance within a differentiable objective.
The framework extends Direct Preference Optimization (DPO) to handle sparse, in-context ranking feedback by modeling positional preferences and aggregating them into a list preference model.
IRPO's optimization is linked to importance sampling gradient estimation, providing theoretical insights into its adaptive prioritization mechanism and efficiency.

Agent for User: Testing Multi-User Interactive Features in TikTok

Multi-agent LLMs framework: introduces an automated approach for testing multi-user interactive features in apps like TikTok, utilizing a Virtual Device Farm for device allocation and LLM-driven User Agents for task automation based on Task Description, Action Space, and GUI Screen Representation, executing actions via ADB.
The framework breaks down multi-user tasks into subtasks via Task Assignment, enabling collaborative simulation by multiple User Agents on allocated virtual devices.
This approach aims to overcome challenges in testing multi-user features by mimicking human-like interaction and coordination across multiple devices.

LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study

LLM-Assisted Translation Evaluation Workflow: introduces a process for evaluating LLM-based Fortran to C++ code translation, including Fortran Code (Input code), Prompt (Translation instructions), Prompt Builder (Combines code and prompt), LLM (Translates code), Translated C++ Code (LLM output), Ground Truth C++ (Human reference), CodeBLEU Computation (Code similarity metric), C++ Compilation (Checks for errors), C++ Execution (Runs compiled code), Output Comparison (Compares program outputs), and Evaluation Recording (Stores results).
The workflow evaluates translation quality by comparing LLM output to human ground truth, checking compilation success, and comparing the output of compiled translated code to the original Fortran code's output.
This platform-independent workflow aims to provide standardized evaluation measures for machine-generated code translation across different LLMs and computational platforms.

Interpretable Locomotion Prediction in Construction Using a Memory-Driven LLM Agent With Chain-of-Thought Reasoning

Locomotion Prediction Agent: introduces a system for predicting user locomotion modes in construction environments, comprising a Perception Module, Short-Term Memory (STM), Long-Term Memory (LTM), Refinement Module, and a Large Language Model (LLM).
The agent utilizes multimodal inputs, including spoken commands and visual data from smart glasses, processed by the Perception Module.
Memory systems (STM and LTM) provide context for prediction and refinement, enhancing accuracy and reliability, particularly for ambiguous or safety-critical scenarios.

DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models

DistilQwen2.5 (Distilled Open Lightweight Language Models): introduces a family of distilled lightweight LLMs derived from Qwen2.5 models, leveraging Teacher LLMs and a Knowledge Production Pipeline to generate augmented instruction-response data for Black-Box Distillation Trainer, and a Distillation Training Pipeline with White-Box Distillation Trainer using teacher logits to train Student LLMs.
The approach combines black-box and white-box knowledge distillation techniques for efficient training of smaller models.
The framework includes pipelines for data generation and student model training, utilizing different distillation methods.

EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework

EducationQ: introduces, with Student Agent (Simulates student), Teacher Agent (Provides teaching), Evaluator Agent (Assesses teaching), and Dataset (Provides questions), a multi-agent dialogue framework to evaluate LLMs' teaching capabilities through simulated dynamic educational scenarios.
The framework assesses teaching effectiveness by measuring student learning gains via pre/post-tests and analyzing pedagogical strategies using an automated evaluator agent.
EducationQ demonstrates that effective LLM teaching requires specialized optimization beyond simple scaling and highlights the need for interaction-based evaluation frameworks.

PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities

PLANET (A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities): introduces a survey categorizing benchmarks for evaluating LLMs' planning capabilities across seven domains, including embodied environments, web navigation, scheduling, games, everyday tasks, text reasoning, and agentic settings.
The paper identifies commonly used testbeds, highlights potential gaps in current benchmarks, and offers guidance for future development.
The survey aims to help researchers select suitable benchmarks and understand the challenges in evaluating LLM planning performance.

SWE-SYNTH: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

SWE-SYNTH: introduces a framework for synthesizing realistic, verifiable, and process-aware bug-fix datasets, including Original Program (Source code base), Component Selection (Chooses code part to modify), Masking (Removes component implementation), Large Language Model (LLM) (Re-implements masked component), Variant Integration (Inserts re-implemented component), Test Suite (Runs tests, verifies variants/fixes), Variant Filtering (Selects buggy variants), LLM Agent (Generates repair steps/patch), Intermediate Repair Steps (Sequence of agent actions), Patch (Code fix), and Ground-Truth Extraction (Derives patch/steps from rollouts).
The framework leverages LLM agents to simulate debugging workflows, producing bug-fix pairs, test cases, and structured repair trajectories.
SWE-SYNTH scales with minimal human effort and preserves contextual richness and correctness compared to manually curated datasets.

20th April 2025

AI with Emotions: Exploring Emotional Expressions in Large Language Models

LLM Agent with Emotional Expression: introduces using Large Language Models as AI agents to role-play with specified emotional states defined by Russell's Circumplex Model, generating text evaluated by a Sentiment Analysis Model trained on the GoEmotions Dataset.
The approach uses prompt design to control emotional expression via arousal and valence parameters.
Evaluation compares specified and generated emotional states using cosine similarity, demonstrating LLMs' capability for emotional expression.

An LLM-enabled Multi-Agent Autonomous Mechatronics Design Framework

LLM-enabled Multi-Agent Autonomous Mechatronics Design Framework: introduces a multi-agent system for autonomous mechatronics design, including High-Level Planning Agent, Mechanical Design Agent, Simulation & Validation Agent, Electronics Design Agent, Embedded Software Agent, Human Feedback, and Requirements, designed to generate functional prototypes with minimal direct human input.
The framework employs a hierarchical architecture where a High-Level Planning Agent decomposes tasks for specialized domain agents, integrating structured human feedback throughout the process.
Specialized agents handle mechanical design, simulation and validation, electronics design, and embedded software development, collaborating to address complex, interdisciplinary engineering challenges.

A Framework for Benchmarking and Aligning Task-Planning Safety in LLM-Based Embodied Agents

Safe-BeAl: introduces a framework for benchmarking and aligning task-planning safety in LLM-based embodied agents, with SafePlan-Bench (Benchmarking system) for evaluation and Safe-Align (Alignment method) for mitigation.
SafePlan-Bench evaluates safety using a Data generation (Creates safety data) pipeline to create the SafeRisks dataset and a Safety Detection (Evaluates safety) method based on mappings.
Safe-Align integrates physical-world safety knowledge by treating atomic actions as optimization units via Atomic Action Alignment (Optimizes action sequences) and using Training Data Construction (Builds preference dataset) for alignment.

Towards Optimal Circuit Generation: Multi-Agent Collaboration Meets Collective Intelligence

CircuitMind: introduces a hierarchical multi-agent framework for gate-level circuit design, with UserProxy (Translates requirements), Mediator (Orchestrates agent interactions), Reviewer (Provides PPA feedback), Summarizer (Updates knowledge database), CoderAgent (Generates netlists), Executor (Performs verification), Database (Stores circuit patterns), and LLM (Backend model) components.
The framework distributes complex reasoning tasks across specialized agents organized in strategic, coordination, and execution layers to overcome limitations in Boolean optimization.
CircuitMind incorporates Syntax Locking, Retrieval-Augmented Generation using a knowledge database, and Dual-Reward Optimization to balance functional correctness and physical efficiency.

Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction

Multi-Agent Framework: introduces, "Enhancing LLM-based Quantum Code Generation with Multi-Agent Optimization and Quantum Error Correction", with Orchestrator (Manages agents), Code Generation Agent (Generates initial code), Semantic Analysis Agent (Refines semantic accuracy), QEC Decoder Generation Agent (Adds error correction), RAG System (Provides external data), Multi-pass Inference (Iterative refinement process), where the framework proposes a novel multi-agent approach for generating accurate, fault-tolerant quantum code.
The framework utilizes iterative multi-pass inference and incorporates domain-specific optimizations like quantum error correction.
Experiments show that techniques like structured Chain-of-Thought significantly improve quantum algorithm generation accuracy.

BookWorld: From Novels to Interactive Agent Societies for Creative Story Generation

BookWorld: introduces a comprehensive system for constructing and simulating book-based multi-agent societies, leveraging Role Agent (Simulates characters, actions, memory) and World Agent (Manages environment, orchestrates simulation) within a Simulation (Agents interact in scenes/rounds) process.
The system includes Initialization (Extracts data, sets up agents) from source books and Rephrasing (Generates novel-style story) from simulation records.
Key components supporting agent behavior include Memory (Short-term and long-term for agents) and a Map (Discrete spatial environment).

Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey

Multi-Agent System: introduces a multi-agent system for meta-thinking in LLMs, with High Level Agent (Decides task breakdown, coordinates), Low Level Agents (Executes tasks, provides feedback), Theory of Mind (ToM) (Predicts, adjusts low-level strategies), Communication (Information sharing between agents), Meta-thinking (Makes strategic decisions), Reasoning (Handles task execution), and Reflection and Adaptation (Improves task execution, adapts).
The system enables LLMs to reflect on, evaluate, and regulate their own thought processes through multi-agent interaction and reinforcement learning.
This approach aims to enhance LLM robustness and trustworthiness by emulating human-like introspection and self-correction for complex tasks.

VIZTA: Enhancing Comprehension of Distributional Visualization with Visual-Lexical Fused Conversational Interface

VIZTA: introduces a web-based tool with an Interactive Reading Module including a Visualization Panel and Communication Panel, powered by a Semantic-Aware Conversational Agent using an LLM and Multi-source Structured Data (Chart Specification, Data Description, Chart Knowledge, Chart Data, Visual Features, ID List) and a Visual-Lexical Fusion Design (Drag-and-Drop, Inline Citations) with VLM.
The system aids chart readers in comprehending distributional visualizations by fusing visual and lexical feedback through a conversational interface.
A formative study and user study demonstrate VIZTA's effectiveness in improving understanding and reasoning with distributional visualizations.

19th April 2025

Diffusion-based Dynamic Contract for Federated AI Agent Construction in Mobile Metaverses

Edge-Cloud Collaboration-based Federated AI Agent Construction Framework: introduces an edge-cloud collaboration-based framework for constructing AI agents in mobile metaverses, featuring a Cloud Server that integrates and deploys agent modules constructed by distributed Edge Servers, enabling User Layer interaction with AI Agents composed of Agent Modules built using Local LLMs/AI Models.
The framework addresses challenges like latency and data privacy by distributing agent module creation to the edge.
A dynamic contract model incentivizes Edge Servers to participate in agent module creation.

FAIRGAME: a Framework for AI Agents Bias Recognition using Game Theory

FAIRGAME: introduces a framework to simulate AI agent interactions in game theory scenarios, including Configuration File, Prompt Template, Factory, Agents, Game Instances, Games Execution, Results, and Scoring System components.
The framework enables systematic simulation and comparison of LLM agent behavior in games to identify biases and inconsistencies.
FAIRGAME allows configuring agents with distinct traits and testing across different games, languages, and LLMs, providing quantitative results and evaluation metrics.

Template-Based Financial Report Generation in Agentic and Decomposed Information Retrieval

AgenticIR: introduces a multi-agent framework for template-based financial report generation, including user proxy, assistant, financial retrieval, financial manager, user, and task decompose agents, utilizing task decomposition and retrieval/generation functions with earnings call transcripts, financial statements, and a report template.
DecomposedIR: employs a prompt chaining workflow to break down the report template into subqueries, using an LLM and embedding model for retrieval and generation from earnings call transcripts and financial statements.
The paper compares AgenticIR and DecomposedIR for generating structured financial reports from earnings releases, evaluating their performance on financial and weather datasets using LLM-based metrics and readability scores.

tAlfa: Enhancing Team Effectiveness and Cohesion with AI-Generated Automated Feedback

TAIFA (Team AI Feedback Assistant): introduces an LLM-based agent that provides automated feedback to teams, including Retrieving and Pre-processing (Structures conversations), Communication Metrics (Evaluates team dynamics), Create Feedback Prompts (Prepares LLM input), LLM Feedback Generation (Generates feedback messages), and Deliver Feedback Messages (Sends feedback).
The system analyzes team interactions using text-analytic and contextual metrics to generate personalized feedback messages for individuals and the team.
TAIFA aims to enhance team effectiveness and cohesion by delivering timely, actionable feedback based on communication patterns.

TALES: Text Adventure Learning Environment Suite

TALES (Text Adventure Learning Environment Suite): introduces a unified benchmark for evaluating LLM-driven Agents in Text-Adventure Game Environments, utilizing a Game Engine that provides State/Observation and Feedback, processes Agent Actions, and can incorporate a Reasoning Model generating Thinking Traces based on a System Prompt.
The benchmark integrates existing text-adventure frameworks and introduces a new game mode to assess diverse reasoning skills required for sequential decision-making in grounded environments.
Evaluation results across various LLMs highlight challenges in complex, long-horizon tasks, particularly in applying composite reasoning skills like spatial, deductive, inductive, and grounded reasoning.

18th April 2025

DoomArena: A framework for Testing AI Agents Against Evolving Security Threats

DoomArena: introduces a modular, configurable, plug-in framework for security evaluation of AI agents, operating on the user-agent-environment loop and incorporating threat modeling, attack config, attacks, attack gateway, success filter, and defenses.
The framework facilitates realistic threat modeling and attack injection into agent-environment interactions to assess agent vulnerabilities.
It enables combining multiple attacks, fine-grained security analysis, and adaptive testing against evolving threats.

SCIENCE HIERARCHOGRAPHY: Hierarchical Organization of Science Literature

SCYCHIC: introduces SCIENCE HIERARCHOGRAPHY, a novel approach combining embedder (converts description to vector), clusterer (generates k clusters), summarizer (generates abstract summary), hierarchy layers (total number of layers), and target clusters (number clusters per layer) to construct a high-quality hierarchical structure for organizing scientific literature.
This method balances embedding efficiency with LLM semantic precision for scalability and quality.
The resulting hierarchy enhances interpretability and supports literature exploration beyond traditional search.

BADAPEX: BACKDOOR ATTACK BASED ON ADAPTIVE OPTIMIZATION MECHANISM OF BLACK-BOX LARGE LANGUAGE MODELS

BadApex (Backdoor Attack based on Adaptive Optimization Mechanism of Black-Box Large Language Models): introduces a novel backdoor attack leveraging LLMs to generate poisoned text via a refined prompt, including an Adaptive Optimization Mechanism (Refines initial prompt iteratively) and a Poisoned Text Generation Module (Generates poisoned data).
The Adaptive Optimization Mechanism uses a Generation Agent (Generates text candidates/poisoned text) and a Modification Agent (Evaluates text, refines prompt) to iteratively refine a Hand-crafted Prompt (Initial human-designed prompt) into a Refined Prompt (Iteratively improved prompt).
The Poisoned Text Generation Module takes Clean Data (Original unpoisoned training data) and the Refined Prompt to generate Poisoned Data (Output backdoor training data) using alternative black-box LLMs.

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation

OpenDeception: introduces a novel evaluation framework with a Scenario Dataset (contains scenarios), AI Deceiver Agent (simulates deceiver), AI User Agent (simulates user), Simulation Process (generates dialogue), and Thinking Process Separation (exposes deceiver thoughts), designed to benchmark AI deceptive behaviors via open-ended interaction simulation.
The framework uses agent-based simulation with predefined roles and goals for both AI deceiver and user agents to generate dialogue data for evaluating deception intention and capability.
A key feature is the separation of the AI deceiver agent's internal thoughts from its spoken output to uncover deceptive intentions during the simulation.

Going Whole Hog A Philosophical Defense of AI Cognition

Whole Hog Thesis: is introduced, with Observation Premise (LLMs understand, answer questions), Holistic Network Assumption (Mental/intentional features interconnected), Mental States (Beliefs, desires, knowledge, plans), Intentional Features (Understanding, answering, acting, goals), Whole Hog Thesis (LLMs are cognitive agents), arguing that observations of LLM behavior provide evidence for interconnected mental and intentional features, concluding LLMs are full cognitive agents.
The paper defends this thesis against skeptical arguments, including the "Just an X" fallacy and the "Performance-Existence Fallacy", employing a "Game of Lacks" methodology to counter objections based on alleged deficiencies in LLMs.
It advocates for a "look and see" approach to understanding LLM cognition, prioritizing observations of their high-level cognitive-like behaviors over analyses of low-level mechanisms or abstract philosophical theories.

Large Language Models for Validating Network Protocol Parsers

PARVAL (multi-agent framework): introduces, with Retrieval-Augmented Program Analysis Agent (retrieves code context), Module Isolation Agent (constructs isolated module), Protocol Code Base (parser source code), Isolated Parsing Module (standalone parsing logic), SpecAgent (extracts format specifications), Document (protocol standard text), CodeSpec (code-derived format spec), DocSpec (document-derived format spec), and Differential Analysis (compares specifications), a system to validate network protocol parsers by comparing code and standard specifications.
The framework leverages LLMs to transform natural language protocol standards and source code implementations into a unified intermediate representation called format specifications.
Differential analysis between the code-derived and document-derived specifications identifies inconsistencies, pointing to potential implementation bugs or issues in the standard.

CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

CodeVisionary: introduces an LLM-based agent framework for evaluating LLMs in code generation, including an LLM Agent (Central controller), a Multisource knowledge analysis stage (Gather knowledge), an Agent Runtime (Execution environment/tools), a Negotiation-based scoring stage (Score negotiation), and Multiple Judges (LLM agents).
The Multisource knowledge analysis stage gathers domain knowledge via a stepwise plan executed in the Agent Runtime, while the Negotiation-based scoring stage uses multiple LLM judges discussing to reach a consensus score.
The framework provides detailed evaluation reports and scores to help developers identify shortcomings and improve LLM code generation.

TRUST, BUT VERIFY

Gaia Network AVS: introduces a system for verifying decentralized LLM inference outputs using statistical analysis and cryptoeconomic incentives.
The system utilizes AVS validators to poll Gaia nodes running LLMs and knowledge bases, detecting outliers based on response distributions.
Built on EigenLayer and EigenDA, the AVS applies incentives and penalties to encourage honest behavior among network participants.

Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety

Pipeline: introduces a multi-agent vision-language system for zero-shot hazard detection in autonomous driving, with Driving Scene, Frame Extraction, Scene Understanding, Hazard Description Generation, Object Detection, Noun Extraction, Object List Generation, Hazard Ranking, Ranked Hazard List, Cross-Referencing, and Hazard Verification components.
The system utilizes VLMs for scene understanding and object detection, LLMs for ranking, cross-referencing, and verification, and CLIP for visual verification.
This pipeline processes video data through parallel tracks to identify, describe, and verify novel hazardous objects beyond predefined categories.

17th April 2025

Sleep-time Compute: Beyond Inference Scaling at Test-time

Sleep-time Compute: introduces sleep-time compute, which processes raw context offline using an LLM to generate a learned context, enabling more efficient test-time compute with the LLM to answer user queries.
This method reduces test-time compute and latency by pre-computing context-specific inferences before the user query is presented.
The learned context can be reused for multiple queries on the same context, amortizing the sleep-time compute cost and improving total cost efficiency.

Exploring Expert Failures Improves LLM Agent Tuning

EEF: introduces Exploring Expert Failures, a framework that improves LLM agent tuning by leveraging beneficial actions from failed expert trajectories.
The framework utilizes Behavior Cloning on positive expert data, followed by iterative Exploration and Reinforcement Fine-tuning.
Reinforcement Fine-tuning involves simulating from expert states, identifying important states, selecting successful solution trajectories, and training the LLM using Supervised Fine-Tuning Loss.

17th April 2025

Retrieval-Augmented Generation with Conflicting Evidence

MADAM-RAG (Multi-agent Debate for Ambiguity and Misinformation in RAG): introduces, "a unified multi-agent approach", with LLM Agents (process document), Multi-round Debate (iterative discussion), and Aggregator Module (synthesize final answer), designed to handle diverse sources of conflict in retrieved documents.
The framework assigns each retrieved document to an independent LLM agent which debates with other agents across multiple rounds to filter misinformation and address ambiguity.
An aggregator module synthesizes the final response by considering agent discussions and resolving inconsistencies.

InstructRAG: Leveraging Retrieval-Augmented Generation on Instruction Graphs for LLM-Based Task Planning

InstructRAG: introduces a novel multi-agent meta-reinforcement learning framework for LLM-based task planning, integrating an Instruction Graph (Organizes instruction paths), RL-Agent (Retrieves candidate paths), and ML-Agent (Selects path, generates prompt) to guide an LLM (Generates thoughts and actions) via a Prompt (Guides LLM generation) within the TAO Process (Thought-Action-Observation cycle).
The framework addresses enlargeability by using the Instruction Graph and RL-Agent for path retrieval and transferability via the ML-Agent's meta-learning approach for rapid adaptation.
The two agents collaborate, with the RL-Agent providing candidate paths and the ML-Agent providing feedback as reward, optimizing end-to-end planning performance.

QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

QLLM: introduces, with Coder-Evaluator Framework (Generates TFCAF), Coder LLM (Generates candidates), Evaluator LLM (Evaluates candidates), Prompts (Guide LLMs), Candidate Functions (Intermediate TFCAFs), Feedback (Refines generation), Training-Free Credit Assignment Function (TFCAF) (Replaces mixing network), Individual Agent Q-value Functions (Agent utilities), Global Q-value Function (Aggregated value), Agents (Execute actions), Environment (Provides state/reward), and Buffer (Stores transitions), a novel multi-agent reinforcement learning algorithm that leverages LLMs to automatically construct a training-free credit assignment function.
The Coder-Evaluator Framework iteratively generates and refines the TFCAF using two LLMs guided by task and role prompts, mitigating hallucination and improving robustness.
The TFCAF replaces the traditional mixing network, directly aggregating individual agent Q-values and state information to produce the global Q-value for credit assignment.

Are Retrials All You Need? Enhancing Large Language Model Reasoning Without Verbalized Feedback

Retrials without feedback: introduces a simple mechanism to enhance LLM reasoning by retrying problem-solving attempts upon identifying incorrect answers, evaluating its impact on IO, CoT, ToT, and Reflexion methods using Base Models.
This approach simplifies the refinement process by not requiring explicit self-reflection or verbalized feedback, contrasting with methods like Reflexion.
The study finds that applying retrials often makes simpler methods like IO and CoT more cost-efficient than complex ones like ToT and Reflexion within a budget.

Customizing Emotional Support: How Do Individuals Construct and Interact With LLM-Powered Chatbots

ChatLab (LLM-Powered Chatbots): introduces a research prototype website with Onboarding Page, FAQs Page, Customization and Conversation Playground, and Experience Diary Page, enabling users to construct and interact with LLM-powered chatbots for emotional support.
The Customization and Conversation Playground includes Chatbot customization and Additional interaction settings tabs for defining persona, output modality, avatar, LLM model, and temperature, alongside Chatting interface and Conversation history.
Built using Streamlit and LangChain, powered by GPT models and TTS APIs, and storing data in Firebase, ChatLab was used in a study to explore user customization practices and gather design ideas for enhancing personalized emotional support.

DashChat: Interactive Authoring of Industrial Dashboard Design Prototypes through Conversation with LLM-Powered Agents

DashChat: introduces an interactive system for authoring industrial dashboard design prototypes, featuring User Input and Task Creation (processes user input), Task Planning and Knowledge Integration (plans tasks, adds knowledge), Task Implementation (executes tasks), Composition Agent (creates visual elements), Assembly Agent (arranges layout), Stylization Agent (adds aesthetics), and Result Evaluation and Iterative Adjustment (refines prototypes).
The system leverages a multi-agent pipeline powered by LLMs to translate natural language requirements into practical and aesthetic dashboard designs.
Functionally distinct, parallel-operating agents handle composition, layout assembly, and stylization to enable efficient prototype generation and iterative refinement.

Pandora: A Code-Driven Large Language Model Agent for Unified Reasoning Across Diverse Structured Knowledge

Pandora (PANDas cOde-dRiven Agent): introduces a unified structured knowledge reasoning framework, with LLM (fe) (Generates reasoning steps and code), Memory (M) (Stores demonstrations), PYTHON interpreter (I) (Executes code, provides feedback), BOXes (B*) (Unified knowledge representation), and LLM (go) (Calculates query similarity), where it leverages an LLM to generate reasoning steps and executable Python code for answering natural language questions over diverse structured knowledge sources represented as BOXes.
The framework utilizes a memory of training examples for in-context learning and employs a Python interpreter to execute generated code and provide feedback for self-correction.
Pandora unifies reasoning across tables, databases, and knowledge graphs by converting them into a standardized BOX representation based on the PANDAS library.

WebLists: Extracting Structured Information From Complex Interactive Websites Using Executable LLM Agents

BardeenAgent: introduces a novel framework for web data extraction, with all Recording Phase (records agent actions), Replay Phase (executes recorded program), Executable Program (set of recorded operations), Selector Generation (creates robust CSS selectors), and Data Extraction (methods to get data) components, enabling web agents to convert execution into repeatable programs for scalable data extraction.
The framework operates in two phases: recording user actions and generating CSS selectors, followed by replaying the generated executable program to extract data at scale.
By leveraging the structured nature of HTML and generating reusable programs, the approach improves recall and reduces cost compared to existing web agents on data extraction tasks.

METASYNTH: Meta–Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

METASYNTH: introduces a meta-prompting framework using a Meta-LM orchestrating Agents with Memory and Seed Data to generate diverse synthetic data.
The Meta-LM manages the workflow, invokes specialized Agents for subtasks, and uses Memory to ensure generated instances are distinct from previous ones.
The framework supports generating diverse documents and complex instructions by iteratively refining outputs based on agent feedback and conditional instance generation.

16th April 2025

Towards Conversational AI for Human-Machine Collaborative MLOps

Swarm Agent: introduces a Large Language Model-based conversational agent system, with Swarm Agent Core (LLM controller), Chat UI (User interface), Session Manager (Manages context/state), Message History (Stores conversation), Intent Recognition (Infers user goals), Task Dispatcher (Activates agents), Iterative Reasoning (Refines responses), Contextual Memory (Maintains history), Router (Routes tool calls), Tool Mapper (Matches tools), Specialized Agents (Domain-specific functions), KFP Agent (Manages Kubeflow), MinIO Agent (Manages MinIO data), RAG Agent (Integrates documentation), External Services (MLOps platforms/storage/DB), and Knowledge Indexing Pipeline (Processes documentation), designed to enhance human-machine collaboration in MLOps through natural language interaction.
The system leverages a modular, extensible architecture integrating specialized agents for Kubeflow pipeline orchestration, MinIO data management, and domain-specific knowledge retrieval via a vector database.
The Swarm Agent facilitates conversational management of complex MLOps environments, reducing technical barriers and making advanced ML tools accessible to users with varying technical backgrounds.

ARCER: an Agentic RAG for the Automated Definition of Cyber Ranges

ARCER (Agentic RAG for the Automated Definition of Cyber Ranges): introduces automated Cyber Range generation and deployment from natural language descriptions, utilizing a Large Language Model (LLM) (Reasoning engine), RAG subsystem (Retrieval tool), Checker Tool (Syntax verification), and Memory (Context management).
The system processes user prompts, retrieves relevant knowledge from User documents stored in a Vector Store, generates Cyber Range description files, and can automatically deploy them.
ARCER adapts to different Cyber Range frameworks by changing external documents and improves generation accuracy and integrity through agentic capabilities.

Multilingual Contextualization of Large Language Models for Document-Level Machine Translation

DocMT-LLMs: introduces a method to improve LLM-based long-document translation through supervised fine-tuning on the DOCBLOCKS dataset, integrating high-quality instructions using a specific instruction format.
The approach employs Multi-Resolutional Document-to-Document Training (MRD2D) and Context-Aware Prompt Tuning (CAPT) techniques during fine-tuning to capture document structure and inter-sentence relationships.
Fine-tuning existing sentence-level LLMs on DOCBLOCKS enhances document-level translation capabilities while maintaining strong sentence-level performance.

Towards LLM Agents for Earth Observation

LLM Agents for Earth Observation: introduces UnivEARTH, a benchmark evaluating LLM agents' ability to answer Earth observation questions by generating and executing Google Earth Engine code using satellite data.
The approach involves LLM agents performing code generation, execution, and optional reflection to interact with the Google Earth Engine platform and its diverse satellite data collections.
Benchmarking reveals limitations in current LLMs' ability to reliably generate executable code and navigate Earth observation data sources, while a specialized fine-tuned model shows promise.

Large Language Models as Quasi-crystals: Coherence Without Repetition in Generative Text

LLM (Large Language Model): proposes an analogy with quasicrystals to analyze the structural coherence of generated text, suggesting it arises from local constraints within the model's architecture.
The paper argues that LLM outputs exhibit long-range order without periodic repetition, similar to quasicrystals, despite lacking explicit rules or symbolic intent.
This perspective suggests a structural evaluation of LLMs, focusing on how well outputs propagate constraint, variation, and order across spans of text.

Evaluating the Goal-Directedness of Large Language Models

Goal-Directedness Evaluation Framework: introduces a method to evaluate the goal-directedness of LLM agents in a Blocksworld environment using composite tasks and subtasks, assessing capabilities and comparing actual task performance (returns) to potential performance via a goal-directedness metric.
The framework utilizes Monte Carlo simulations and statistical analysis to compute the goal-directedness metric, which indicates the propensity of an agent to use its capabilities to achieve a given goal.
The evaluation involves testing various LLM models on tasks requiring information gathering, cognitive effort, and plan execution, revealing that most models are not fully goal-directed.

On the Feasibility of Using MultiModal LLMs to Execute AR Social Engineering Attacks

SEAR (Social Engineering Augmented Reality): introduces a framework for AR-driven social engineering attacks, integrating AR Glasses (Capture raw multimodal data), AR-based Social Context Synthesis (Process raw AR data), Multimodal LLM (Process multimodal data, generate dialogue), Role-based Multimodal RAG (Build, update social profiles), Vector Stores (Store profile data embeddings), ReInteract SE Agent (Execute adaptive attack strategies), SE Strategy Templates (Predefined attack phases, objectives), and Social Profile (Target identity, behavior, context).
The framework processes multimodal AR data and social information to build dynamic target profiles and execute adaptive, phased attack strategies.
SEAR demonstrates the feasibility of using AR and multimodal LLMs to enhance social engineering efficacy through personalized, context-aware interactions.

Progent: Programmable Privilege Control for LLM Agents

Progent: introduces a programmable privilege control framework for LLM agents, with Policy Language (defines privilege control policies), Policy Enforcement (applies policies to tool calls), and Policy Management (initializes and updates policies) components.
The framework enforces the principle of least privilege by controlling tool calls based on dynamic, domain-specific policies.
Progent leverages LLMs for automated policy generation and update, demonstrating effectiveness in reducing attack success rates across various agent use cases.

STEERING PROSOCIAL AI AGENTS: COMPUTATIONAL BASIS OF LLM'S DECISION MAKING IN SOCIAL SIMULATION

Method for Steering LLM Agents: introduces a technique to probe, quantify, and modify large language model behavior in social simulations by analyzing residual streams, identifying steering vectors, orthogonalizing them, projecting them onto a decision vector, and injecting scaled projections into the residual streams.
This approach allows for targeted manipulation of LLM decisions based on specific input variables like persona attributes and game framing.
The study demonstrates that injecting variable-specific steering vectors into residual streams can effectively alter an LLM agent's decision-making in a Dictator Game setting.

15th April 2025

GRAPHICBENCH: A Planning Benchmark for Graphic Design with Language Agents

GRAPHICTOWN: introduces a language agent framework for graphic design planning and execution, including Design Outline (generate design outline), Expert Recruitment (recruit expert agents), Workflow Generation (generate expert workflows), Workflow Supervision (integrate expert workflows), Action Retrieval (retrieve actions for steps), Action Execution (execute plan), Photo Editor agent (image editing expert), Vector Graphic Editor agent (vector illustration expert), Layout Designer agent (layout and text expert), and Actions (Tools) (executable operations).
The framework utilizes a hierarchical agentic structure with a supervisor agent directing specialized expert agents (Photo Editor, Vector Graphic Editor, Layout Designer) to generate and execute design workflows based on user queries and image inputs.
GRAPHICTOWN operates on the GRAPHICBENCH benchmark, evaluating LLM agents' ability to plan and execute creative design tasks by decomposing high-level goals into sequences of actions executable within web-based design tools.

REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

REAL: introduces a benchmark and framework for evaluating autonomous web agents, featuring deterministic Environments (Deterministic website simulations), an Agent (System under evaluation) interacting via Observation (Agent input) and Action (Agent output) through an Agent Harness (Interface for agent interaction) managing a Browser Instance (Dedicated browser for task) with State Management (Persistent website state storage), evaluated by a Reward Module (Evaluates task success) using an LLM Judge (Evaluates information retrieval) and State Diff Check (Verifies state changes), controlled by a Configuration Framework (Controls environment settings) for completing a Task (Goal for the agent).
The framework provides 11 high-fidelity website simulations and 112 tasks, supporting flexible agent integration via Playwright, CDP, or URL control.
Task success is determined programmatically for action-based tasks and via an LLM judge for information retrieval tasks, with configurations enabling reproducible evaluation and edge case testing.

15th April 2025

TEXTARENA

TextArena: introduces a comprehensive framework for evaluating language models through competitive gameplay, featuring an Agent (LLM agent) interacting with an Environment (Text-based games) via a Wrapper (Observation processing), supported by an Evaluation System (Leaderboard/Scoring).
The framework provides a Gym-like interface for diverse text-based environments, enabling training and evaluation of agentic behavior in dynamic scenarios.
An online evaluation system tracks model performance against other models and humans using a TrueSkill leaderboard.

Reimagining Urban Science: Scaling Causal Inference with Large Language Models

AutoUrbanCI: introduces a modular, LLM-powered framework for urban causal inference, structured into Hypothesis Generation, Urban Data, CI Experiment, and Evaluation Agents.
The framework employs specialized agents like Reader, Data Engineer, Data Scientist, Experimenter, Validator, Urban Scientist, and Writer to handle distinct stages of the causal analysis pipeline.
AutoUrbanCI aims to address limitations in current urban causal research, such as data complexity and reproducibility, by leveraging LLM/MLLM capabilities for automation and collaboration.

Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions

Cancer-Myth approach: introduces a methodology to create a dataset and evaluate LLMs, utilizing Myths, Valid Examples, Invalid Examples, an LLM Generator, an LLM Responder, an LLM Verifier, and Hematology Oncology Physicians to produce the Cancer Myth dataset.
This approach systematically generates and verifies patient questions containing false presuppositions to test LLMs' ability to identify and correct misconceptions.
The pipeline involves iterative generation and evaluation steps, with expert physician review ensuring the medical validity of the adversarial examples.

DataSentinel: A Game-Theoretic Detection of Prompt Injection Attacks

DataSentinel: introduces a game-theoretic method to detect prompt injection attacks by fine-tuning a Detection LLM (g) using a Minimax Optimization Problem, which simulates a game between fine-tuning the Detection LLM (g) and adaptive attacks.
The detection mechanism leverages a Detection LLM (g) and a Detection Instruction (sd) with a Secret Key (k), classifying data as contaminated if the Secret Key (k) is not in the Detection LLM's (g) output when prompted with the Detection Instruction (sd) and target data.
The Minimax Optimization Problem is solved iteratively by alternating between the Inner Max Problem, which optimizes contaminated target data (simulating an Adaptive Attack), and the Outer Min Problem, which updates the Detection LLM (g) parameters.

Learning to Be A Doctor: Searching for Effective Medical Agent Architectures

Workflow Evolution Framework: introduces a dynamic, graph-based workflow (Workflow) composed of nodes (Nodes) with attributes (Node Attributes), which evolves iteratively (Workflow Evolution Process) guided by diagnostic feedback (Diagnostic Feedback) and suggestions (Suggestions) generated from process perception (Process Perception).
The framework defines a hierarchical search space (Search Space) encompassing node-level (Node-Level Operations), structural-level (Structural-Level Operations), and framework-level (Framework-Level Design) operations, enabling modifications through actions (Actions) like adding, removing, or modifying components.
This iterative evolution process allows the workflow to adapt its structure and parameters, incorporating elements like conditional (Conditional Structures), loop (Loop Structures), and parallel (Parallel Structures) logic to improve diagnostic accuracy and robustness over time.

The Obvious Invisible Threat: LLM-Powered GUI Agents' Vulnerability to Fine-Print Injections

LLM-Powered GUI Agents: introduces, with LLM (Powers agent capabilities), UI Interpretation & Interaction (Perceives and interacts with GUIs), and Agent's Mental Model (Guides decision-making) components, a study evaluating the vulnerability of GUI agents to adversarial manipulations embedded in graphical user interfaces.
The paper proposes Fine-Print Injection (FPI), a novel attack exploiting agents' tendency to process low-salience content, and evaluates it alongside other attack types against six GUI agents and a human baseline.
Findings reveal that GUI agents are highly susceptible to contextually embedded attacks like FPI and Deceptive Defaults (DD), highlighting a privacy-utility trade-off in agent design and limited human awareness of these risks.

Towards Automated Safety Requirements Derivation Using Agent-based RAG

Agent-based RAG: introduces an approach for automated safety requirements derivation, processing Domain-Specific Knowledge into Vector and Summary Indices, utilizing a Top-level Agent to orchestrate retrieval via Document Agents and their Query Engines, providing Refined Context to an LLM (Large Language Model) for generating responses.
This architecture enhances context relevance compared to default RAG by employing a multi-step agentic retrieval process based on document content and query type.
The agent-based system facilitates incorporating domain-specific knowledge and aims to mitigate hallucinations by grounding outputs in retrieved, refined context.

Exploring Backdoor Attack and Defense for LLM-empowered Recommendations

BadRec: introduces a new attack framework that injects backdoors into LLM-based RecSys by poisoning the training set with Attackers, Trigger, Malicious Retailer, Poisoned Item Pool, Fake Users, and Poisoned Datasets, resulting in Open Backdoors in the LLM-empowered RecSys.
The framework perturbs item titles with triggers and generates fake users to create adversarial examples for training data poisoning.
Poisoning just 1% of training data can successfully implant backdoors, enabling manipulation of recommendation outcomes.

Dynamic Compressing Prompts for Efficient Inference of Large Language Models

LLM-DCP: introduces Dynamic Compressing Prompts, a task-agnostic method modeling prompt compression as a Markov Decision Process, including a DCP-Agent, Critic, Reward Function, Hierarchical Prompt Compression Training Strategy, Distribution-aligned Small Model, and Replay Buffer.
The DCP-Agent iteratively removes redundant tokens from a prompt, guided by a reward function that balances compression, output quality, and information retention.
The Hierarchical Prompt Compression strategy uses curriculum learning to train the agent, progressively increasing compression difficulty.

Timing Analysis Agent: Autonomous Multi-Corner Multi-Mode (MCMM) Timing Debugging with Timing Debug Relation Graph

Timing Analysis Agent: introduces an autonomous multi-corner multi-mode timing debugging system with MCMM Planner Agent (Hierarchical task planning), TDRG Traversal Agent (Plans report retrieval), Expert Report Agent (Retrieves specific data), Structural Report Database (Structured timing reports), and Timing Debug Relation Graph (TDRG) (Connects reports debug knowledge).
The system integrates hierarchical plan solving and multi-agent collaboration to automate the analysis of MCMM timing reports.
It employs a novel Agentic Retrieval Augmented Generation approach leveraging LLM coding capabilities for accurate data retrieval from structured reports.

Can Large Language Models Trade? Testing Financial Theories with LLM Agents in Market Simulations

Simulation Framework: introduces an open-source simulation framework with Market Design (simulates stock market environment), Agent Design (manages LLM trading agents), and Analysis Module (collects and analyzes data) components, designed to test large language models as heterogeneous competing trading agents in a realistic simulated stock market.
The framework incorporates a persistent order book, various order types, stochastic dividends, and heterogeneous information sets for agents.
Agents submit standardized decisions using structured outputs and function calls while expressing their reasoning in natural language, enabling systematic analysis of their trading behavior and market dynamics.

14th April 2025

LLM-based AI Agent for Sizing of Analog and Mixed Signal Circuit

AI Agent: introduces an LLM-based agent for AMS circuit sizing, with Task Decomposition, LLM, Action, Observation, Comparison, External Tools, and Context components, designed to optimize transistor sizing iteratively.
The agent employs a ReAct loop (Action, Observation, Comparison) integrating an LLM with external simulation and analysis tools for iterative optimization.
Prompt engineering, including Chain-of-Thought, guides the LLM's reasoning and action selection based on performance metrics and historical context.

IEA-Plugin: An AI Agent Reasoner for Test Data Analytics

IEA-Plugin (AI Agent Reasoner): introduces an AI agent-based reasoning module designed to generate a stable API specification for test data analytics from user queries.
The system leverages LLMs and an agentic platform to process complex user queries into structured workflows and distill them into a stable API specification.
IEA-Plugin addresses knowledge acquisition and scalability challenges by using user interactions to build a query-workflow database and automatically generating API functions.

Introducing Large Language Models as the Next Challenging Internet Traffic Source

Experimental Setup: introduces, "an experimental setup", with User/Client Application (Interacts with agent), Querying Agent (Initiates query), Responding Agent (Local server, forwards query), and LLM API (External model service), where "the setup simulates user-agent and agent-LLM interactions to measure network traffic".
The paper explores the Internet of Agents paradigm, where AI agents interact with users, devices, and other agents, identifying LLMs as a significant new source of Internet traffic.
Traffic measurements per prompt for various LLMs are provided, estimating the potential impact on network infrastructure.

Characterizing LLM-driven Social Network: The Chirper.ai Case

Chirper.ai: introduces a large-scale analysis of an LLM-driven social network, Chirper.ai, with LLM Agents (Autonomous social entities), Social Network Platform (Hosts agents and interactions), Underlying AI Models (Power agent capabilities), and Community-based Reward System (Influences agent behavior), characterizing agent behavior and network structure.
The study compares Chirper.ai agent behavior and network structure to human and bot users on Mastodon.
Findings reveal distinct patterns in posting, self-disclosure, abusive content, and network positions, highlighting challenges for moderation.

Can Competition Enhance the Proficiency of Agents Powered by Large Language Models in the Realm of News-driven Time Series Forecasting?

CM (Complete Competition Mechanism): introduces a multi-agent framework for news-driven time series forecasting, incorporating News Filtering, Time Series Forecasting, Multi-Indicator Evaluation (MIE), Information Asymmetry (IA), Opponent-Oriented Self-Reflection (OOSR), Multi-Stage Reflection (MSR), Survival of the Fittest (SF), LLM₁, LLMs, and Memory Bank components.
The framework embeds a competition mechanism within multi-agent discussion to enhance innovative thinking and uses MSR with a fine-tuned small LLM for identifying misleading logic.
Experimental results show competition boosts agents' innovative thinking and significantly improves time series prediction performance compared to baselines.

C-FAITH: A Chinese Fine-Grained Benchmark for Automated Hallucination Evaluation

HaluAgent: introduces an agentic framework for automated hallucination evaluation dataset generation, featuring a Generation Module (Generates QA data), Verification Module (Checks data correctness), and Optimization Module (Refines generation prompt).
The framework processes Knowledge Documents (Input source) to generate Generated Data (Raw output), which is validated by the Verification Module (Checks data correctness) using Manual Rules (Verification criteria).
The Optimization Module (Refines generation prompt) refines the generation prompt based on Error Feedback (Verification errors) from the Verification Module (Checks data correctness), producing Qualified Data (Validated data) that forms the final Dataset (Final evaluation data).

Fact-Checking with Contextual Narratives: Leveraging Retrieval-Augmented LLMs for Social Media Analysis

CRAVE (Cluster-based Retrieval Augmented Verification with Explanation): introduces a novel framework that processes Input (Social media post), performs Evidence Retrieval (Get external evidence) via Reverse Image Search (Find image evidence) and Text-Based Search (Find text evidence), applies Clustering (Group evidence narratives) and Narrative Extraction (Select representative text), uses Agent-Based Evidence Refinement (Refine evidence iteratively), and employs an LLM-Based Judge (Determine veracity, explain) for Reasoning (Assess narratives, decide verdict) to produce Output (Explanation, veracity verdict).
The framework clusters multimodal evidence into distinct narratives and uses LLM reasoning based on 5W1H to generate interpretable explanations and veracity verdicts.
CRAVE integrates retrieval-augmented LLMs with clustering techniques to handle diverse and potentially contradictory evidence for fact-checking social media posts.

SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users

SocioVerse: introduces a world model for social simulation powered by LLM agents and a 10 million real-world user pool.
The framework includes four powerful alignment modules: Social Environment, User Engine, Scenario Engine, and Behavior Engine.
SocioVerse addresses alignment challenges in environment, user, scenario, and behavior to achieve diverse and trustworthy simulations.

A Survey of Personalization: From RAG to Agent

Personalized Agent: introduces a system designed to dynamically incorporate user context, memory, and external tools or APIs to support highly personalized and goal-oriented interactions, including Personalized Understanding (interpreting user input/context), Personalized Planning and Execution (integrating memory/tools), and Personalized Generation (creating tailored output).
This framework evolves from Retrieval-Augmented Generation (RAG) by integrating agentic capabilities like Memory and Tool/API utilization.
Memory components store historical user data, while Tool/API components enable interaction with external knowledge sources for task execution.

CodeRAG: Supportive Code Retrieval on Bigraph for Real-World Code Generation

CODERAG (retrieval-augmented code generation framework): introduces, "comprehensively retrieve supportive codes for real-world code generation", with Requirement Graph (Models requirement relationships), DS-Code Graph (Models code relationships), Bigraph Mapping (Maps requirements to code), Code-oriented Agentic Reasoning (LLM-driven retrieval and generation), Programming Tools (Assist LLM retrieval/testing), and LLMs (Generate code using retrieved info).
The framework constructs a requirement graph and a DS-code graph, maps between them, and uses an agentic process with programming tools and LLMs for code generation.
CODERAG aims to improve real-world repo-level code generation by providing LLMs with relevant context from the code repository and external sources.

DataMosaic: Explainable and Verifiable Multi-Modal Data Analytics through Extract-Reason-Verify

DataMosaic: introduces an agentic workflow with Question Decomposition (Decomposes question), Structure Selection (Selects data structure), Seek (Locates relevant data), Extraction (Extracts structured data), Reasoning (Performs reasoning), and Thinker (Evaluates, directs workflow) components.
The framework aims to make LLM-powered multi-modal data analytics explainable and verifiable by transforming data into structured formats for step-by-step processing.
The Thinker component dynamically adapts the workflow based on evaluation of intermediate results, enhancing accuracy and efficiency.

A Survey of Large Language Model-Powered Spatial Intelligence Across Scales: Advances in Embodied Agents, Smart Cities, and Earth Science

Taxonomy of Large Language Model-Empowered Spatial Intelligence: introduces a structured framework with Foundational Capabilities (Underlying spatial abilities), Spatial Memory and Knowledge (Recall spatial information), Abstract Spatial Reasoning (Simplify spatial problems), Spatial Intelligence for Real World (Apply spatial intelligence), Embodied Spatial Intelligence (Agents in physical environments), Urban Spatial Intelligence (Spatial tasks in cities), Earth Spatial Intelligence (Spatial tasks in Earth science), Spatial Memory and Knowledge Sources (Internal or external data), Spatial Memory and Knowledge Down-stream Tasks (Specific spatial applications), and Abstract Spatial Reasoning Mental Models (Types of spatial logic).
The framework categorizes LLM spatial intelligence into foundational abilities like memory and reasoning, and real-world applications across embodied, urban, and earth science domains.
This taxonomy provides a structured view of LLM-powered spatial intelligence, highlighting key components and their relationships across different scales and disciplines.

Training Small Reasoning LLMs with Cognitive Preference Alignment

CRV+CogPO: introduces a multi-agent system with a Critic (evaluates reasoning process), Rethinker (rewrites reasoning process), and Verifier (validates reasoning process) combined with the CogPO (aligns reasoning preferences) algorithm to train smaller reasoning LLMs.
The approach refines training data by critiquing, rethinking, and verifying reasoning processes from larger models, then uses preference optimization tailored to smaller models' capacities.
This method demonstrates improved performance on challenging reasoning benchmarks compared to other training techniques for smaller models.

Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning

RC (Reasoning Court): introduces a framework for multi-hop reasoning that includes LLM Agents (Generate candidate solutions), Reasoning Steps (Internal thought process), Retrieval Actions (Gather external information), Retrieved Evidence (Information from external sources), and LLM Judge (Evaluates trajectories and determines answer).
The framework employs multiple LLM agents to generate diverse reasoning paths and candidate answers by interleaving reasoning and external retrieval.
A dedicated LLM judge evaluates the agents' reasoning trajectories and retrieved evidence to select the most accurate answer or synthesize a new one.

Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

Adaptive MAS: introduces an adaptive multi-agent framework with a CEO agent to enhance collaborative reasoning through model fine-tuning and system-level coordination.
The framework includes a CEO agent that dynamically manages agent collaboration, resource allocation, and reasoning depth based on task progress.
The system utilizes specialized agents (Expert Recruiter, Problem Solvers, Executor, Evaluator) within the MAS to collaboratively solve complex tasks.

13th April 2025

Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025

Review Feedback Agent: introduces a multi-LLM system, with Paper (Input), Review (Input), Actor 1 (Generate initial feedback), Actor 2 (Generate initial feedback), Aggregator (Merge feedback lists), Critic (Evaluate and filter feedback), Formatter (Format feedback pairs), Reliability tests (Ensure feedback quality), and Feedback (Output to reviewer), designed to improve peer review quality by providing automated feedback to reviewers.
The system uses parallel Actors to generate initial feedback, which is then aggregated, critically evaluated, and formatted before being posted to the reviewer.
Reliability tests act as guardrails, ensuring the generated feedback is constructive, accurate, and properly formatted before delivery.

AGENTIC WORKFLOWS FOR ECONOMIC RESEARCH: DESIGN AND IMPLEMENTATION

Agentic Workflow Framework: introduces a methodology leveraging LLMs and multimodal AI for economic research, featuring Specialized Agents (perform specific tasks), Inter-Agent Communication (structured data exchange), Error and Escalation Pathways (handle issues), Adaptive Mechanisms (switch strategies), Human-in-the-Loop (HITL) Checkpoints (human oversight), and a Multi-phase Workflow (coordinates stages).
The framework enhances research efficiency and reproducibility by automating tasks across the economic research lifecycle while integrating strategic human oversight.
Specialized agents handle distinct responsibilities, communicating through structured protocols, with built-in mechanisms for error handling and adaptation across interconnected workflow stages.

AGENTA/B: Automated and Scalable Web A/B Testing with Interactive LLM Agents

AGENTA/B: introduces a system for automated and scalable web A/B testing using interactive LLM agents, including LLM Agent Generation, Testing Preparation, Agent-Environment Interaction, and Post-Testing Analysis modules.
The Agent-Environment Interaction loop involves an Environment Parsing Module, LLM Agent (Action Prediction), Action Execution Module, and Agent Profiling Module to simulate realistic user behavior on live websites.
AGENTA/B enables rapid, risk-free behavioral piloting for UX evaluation by generating diverse agent personas and analyzing their interactions across different design variants.

MLRC-BENCH: Can Language Agents Solve Machine Learning Research Challenges?

MLRC-BENCH: introduces a benchmark to evaluate language agents on machine learning research challenges, including Language Agent, Task Description, Starter Code, Human Idea, Implementation, LLM Explainer, Underlying Idea, LLM Judge, and Scorer.
The benchmark provides a task environment with detailed descriptions, starter code, and optional human ideas to the Language Agent.
The agent's Implementation is evaluated by an evaluation pipeline consisting of an LLM Explainer, LLM Judge, and Scorer using objective and subjective metrics.

EMOAGENT: ASSESSING AND SAFEGUARDING HUMAN-AI INTERACTION FOR MENTAL HEALTH SAFETY

EmoAgent: introduces a multi-agent AI framework designed to evaluate and mitigate mental health hazards in human-AI interactions, with EmoEval simulating virtual users and EmoGuard providing real-time interventions.
EmoEval assesses psychological states using clinically proven tools and simulates large-scale human-AI conversations with a Character-based Agent and Dialog Manager Agent.
EmoGuard acts as a real-time intermediary layer with a Safeguard Agent comprising an Emotion Watcher, Thought Refiner, Dialog Guide, and Manager, which iteratively trains to mitigate risks.

AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations

AgentDynEx: introduces a LLM-based system for setting up multi-agent simulations, including a Configuration Matrix (structured setup framework), Initializing Mechanics (defines simulation world), Tracking Dynamics (monitors simulation progress), Nudging (intervenes in running simulation), Dynamic Reflection (automatic nudge suggestion), Manual Intervention (human-driven nudging), Holistic Reflection (post-run error identification), Debugging Lists (problem-solution repository), GPTeam (multi-agent simulation engine), LLMs (language models), Run Logs (simulation event records), Intermediate Summaries (runtime progress updates), and Updated Configuration (refined simulation setup).
AgentDynEx balances simulation mechanics and dynamics through a structured configuration phase, dynamic runtime nudging based on reflection, and post-run holistic reflection for configuration updates.
The system uses LLMs and the GPTeam engine to enable users to define scenarios, monitor progress via logs and summaries, intervene manually or automatically, and iteratively refine simulation setups.

Fine-tuning a Large Language Model for Automating Computational Fluid Dynamics Simulations

Multi-agent system: introduces an approach for automating computational fluid dynamics simulations using a fine-tuned Large Language Model.
The system orchestrates a workflow with a pre-checker for input validation, a fine-tuned LLM for configuration generation using Chain-of-Thought, a runner for simulation execution, and a corrector for error resolution.
The fine-tuned LLM, trained on the NL2FOAM dataset, translates natural language descriptions into executable OpenFOAM configurations, achieving high performance on diverse CFD tasks.

HM-RAG: Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation

HM-RAG: introduces a novel Hierarchical Multi-Agent Multimodal Retrieval Augmented Generation framework with Decomposition Agent (Decomposes complex queries), Vector-based Retrieval Agent (Retrieves from vector database), Graph-based Retrieval Agent (Retrieves from graph database), Web-based Retrieval Agent (Retrieves from web sources), Decision Agent (Synthesizes and refines answers), and LLM (Processes queries and generates text), designed for collaborative multimodal knowledge synthesis.
The framework employs a three-tiered architecture with specialized agents for query decomposition, multi-source retrieval, and answer refinement.
HM-RAG achieves superior performance by integrating diverse data sources and leveraging multi-agent collaboration for complex query handling.

CheatAgent: Attacking LLM-Empowered Recommender Systems via LLM Agent

CheatAgent: introduces a novel attack framework, with Insertion Positioning, LLM Agent-Empowered Perturbation Generation, LLM-Based Agent, and Trainable Prefix Prompt components, designed to attack LLM-empowered recommender systems in a black-box setting.
The framework leverages an LLM-based agent to generate adversarial perturbations by identifying optimal insertion positions and iteratively refining the attack strategy via prompt tuning based on victim feedback.
CheatAgent aims to demonstrate the safety vulnerability of LLM-empowered recommender systems to subtle adversarial attacks crafted by simulating human-like decision processes.

UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

UXAgent: introduces a system for simulating usability testing of web design with LLM agents, including a Persona Generator, LLM Agent, Universal Browser Connector, Agent Interview Interface, and Simulation Replay Interface.
The LLM Agent features a two-loop architecture with Fast and Slow Loops, supported by Perceive, Planning, Action, Reflection, Wonder Modules, and a Memory Stream.
The Universal Browser Connector provides the Observation Space and Action Space for the LLM Agent to interact with real-world web environments.

12th April 2025

Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

SEMANTICCOMMIT: introduces a system for managing AI agent memory updates, featuring a UI, backend, knowledge graph, information retrieval pipeline with retrieval and conflict classification stages, and an LLM.
The system helps users detect and resolve semantic conflicts in natural language intent specifications using a knowledge graph-based RAG pipeline and LLMs for suggestions.
The interface provides global and local conflict detection and resolution options, allowing users to review, edit, and validate AI-proposed changes.

Langformers: Unified NLP Pipelines for Language Models

Langformers: introduces an open-source Python library designed to streamline NLP pipelines through a unified, factory-based interface, including tasks (Central interface), generators (LLM interaction), labellers (Automated text annotation), classifiers (MLM fine-tuning), mlms (MLM training/pretraining), embedders (Text embedding generation), searchers (Vector database integration), rerankers (Search result reordering), and mimickers (Knowledge distillation).
The library consolidates various NLP tasks for LLMs and MLMs into a cohesive API, supporting platforms like Hugging Face and Ollama.
Key innovations include task-specific factories, built-in memory and streaming for conversational agents, and a lightweight, modular design.

Tell-XR: Conversational End-User Development of XR Automations

Tell-XR: introduces a conversational end-user development system for XR automations, with User Interface (Handles multimodal input), User Interface (Handles multimodal input), Tell-XR Bot (Core authoring system), Tell-XR Bot (Routes requests), Tell-XR Bot (Manages dialogue phases), Tell-XR Bot (Generates JSON rule), Tell-XR Bot (External tool access), Tell-XR Bot (Stores dialogue history), Automation Engine (Manages XR state), Automation Engine (Tracks object states), and Automation Engine (Stores/executes rules) components, enabling users to define event-condition-action rules via natural language and multimodal interaction.
The system leverages large language models within the Tell-XR Bot to interpret user intent and guide them through distinct dialogue phases for defining and refining automations.
The architecture integrates a multimodal user interface for VR and AR, the LLM-based bot for conversation, and an automation engine managing the XR environment state and executing rules.

11th April 2025

MCP Bridge: A Lightweight, LLM-Agnostic RESTful Proxy for Model Context Protocol Servers

MCP Bridge: introduces a lightweight, LLM-agnostic RESTful proxy system with Client Applications, RESTful API, MCP Bridge, MCP Servers, MCP-Gemini Agent, and LLM components, designed to connect resource-constrained clients to MCP servers via a unified API.
The system decouples client applications from underlying MCP server processes, enabling access to MCP functionality without local process execution constraints.
MCP Bridge implements a risk-based execution model for security and supports various MCP server transports while maintaining backward compatibility.

DocAgent: A Multi-Agent System for Automated Code Documentation Generation

DocAgent: introduces a multi-agent system for automated code documentation generation, which includes Navigator Module, Repository AST Parsing, Dependency DAG, Topological Traversal, Topological Sorting, Dependency-Aware Processing Order, Multi-Agent Documentation Generation, Reader, Searcher, Writer, Verifier, and Orchestrator.
DocAgent uses a Navigator Module to establish dependency-aware processing order and a Multi-Agent Documentation Generation module with specialized agents to collaboratively generate documentation.
The system aims to address challenges in automated code documentation by ensuring completeness, helpfulness, and truthfulness through topological processing and multi-agent collaboration.

SEAVIEW: Software Engineering Agent Visual Interface for Enhanced Workflow

SEAVIEW: introduces a visualization framework for software engineering agent experiments, comprising a web frontend for user interaction, a backend for data processing, PostgreSQL for structured data storage, object storage for large files, and external environment for running experiments.
SEAVIEW framework aims to assist researchers in debugging and improving software engineering agents by providing experiment health, comparison, summarization, and reporting capabilities.
The tool is designed to analyze agent trajectories and experiment results, offering insights into agent behavior and performance across different experimental setups and parameters.

A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems

LLM Reasoning System: introduces, Reasoner (generates reasoning steps), Verifier (evaluates reasoning quality), and Refiner (improves reasoning trajectories), which are key components for effective reasoning in large language models.
The Reasoner proposes responses, the Verifier judges their quality, and the Refiner revises flawed outputs based on feedback.
These components can be organized in standalone LLMs, single-agent systems interacting with environments, or multi-agent systems communicating with each other.

AGENTREWARDBENCH: Evaluating Automatic Evaluations of Web Agent Trajectories

AGENTREWARDBENCH: introduces a benchmark for evaluating LLM judges for web agent trajectories, including a Web Agent (Performs tasks on web), Web Environment (Simulated or real websites), Trajectory (Agent's sequence of actions), Human Annotator (Provides ground truth labels), LLM Judge (Evaluates agent trajectories), Judge Model (Specific LLM judge implementation), and Input Representation (Trajectory data for judge).
The benchmark contains over 1300 trajectories from various web agents and environments, annotated by experts for success, side effects, and repetition.
Evaluation shows that simpler LLM judge input representations can achieve higher agreement with human experts than prior methods, and rule-based evaluation often underestimates agent success.

TP-RAG: Benchmarking Retrieval-Augmented Large Language Model Agents for Spatiotemporal-Aware Travel Planning

TP-RAG (Travel Planning - Retrieval-Augmented Generation): introduces benchmark for retrieval-augmented spatiotemporal-aware travel planning with Inputs, Agent, Plan, and Evaluate components.
TP-RAG benchmark dataset includes real-world travel queries, fine-grain annotated Points of Interest, and high-quality travel trajectory references for context-aware planning.
TP-RAG benchmark facilitates evaluation of LLM agents in generating spatiotemporally coherent travel plans utilizing trajectory-level knowledge for improved travel practicality.

Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing

LLM-powered Conversational Agent for Writing Reflection: introduces a system designed with LLM-powered Conversational Agent, Voice Input, Written Output, Feedback, Questions, Advice, and UI Affordances to investigate voice interaction for writing reflection.
This system emphasizes Contextualization and Control to improve user experience and maintain writer's ownership during revision process.
The research aims to evaluate how voice input modality affects reflection depth and revision quality compared to text input when using conversational agents.

Do LLMs trust AI regulation? Emerging behaviour of game-theoretic LLM agents

FAIRGAME (Framework for AI Agents Bias Recognition using Game Theory): introduces user, developer, and regulator components to model regulatory ecosystem.
Framework uses evolutionary game theory and LLMs to investigate strategic choices under different regulatory scenarios.
FAIRGAME aims to identify emerging behaviors of strategic AI agents in game-theoretic settings and compare them with game-theoretic predictions.

MOOSEAGENT: A LLM BASED MULTI-AGENT FRAMEWORK FOR AUTOMATING MOOSE SIMULATION

MooseAgent: introduces an automated framework for MOOSE simulation, integrating Requirement, Alignment, Architect, Vector knowledge base, Error Correction, and Runner components.
MooseAgent framework uses LLMs to understand user needs, generate MOOSE input files, and iteratively refine them using a vector database and error correction.
This multi-agent system aims to simplify finite element simulation by automating pre-processing, solver configuration, and post-processing stages in MOOSE.

Task Memory Engine (TME): Enhancing State Awareness for Multi-Step LLM Agent Tasks

Task Memory Engine (TME): introduces a memory framework for LLM agents, with Task Memory Tree (hierarchical task state representation), Task Relationship Inference Module (reasons about task relationships), and Prompt Synthesizer (generates context-aware prompts).
TME enhances state awareness by tracking task execution using Task Memory Tree, inferring task relationships with Task Relationship Inference Module, and generating adaptive prompts with Prompt Synthesizer.
This framework enables robust, interpretable, and token-efficient execution of complex multi-step tasks by providing structured memory and intelligent prompt construction.

Adopting Large Language Models to Automated System Integration

Compositio Prompto (Compositio Prompto): introduces an architecture employing Large Language Models for automated service composition, utilizing task specifications, service documentation, input/output schemas to create a prompt for the LLM, which then generates executable service compositions.
The architecture aims to mitigate complex formal modeling in service composition by using natural language input and OpenAPI specifications, focusing on generating reusable service compositions as program code.
Compositio Prompto architecture is evaluated for service composition and discovery using Retrieval Augmented Generation (RAG) and benchmarks like RestBench and SOCBench-D to address limitations of input token length and improve service discovery in automated system integration.

Beyond Self-Reports: Multi-Observer Agents for Personality Assessment in Large Language Models

Multi-Observer LLM Personality Assessment Framework: introduces a novel method for evaluating LLM personality by utilizing multiple observer agents, simulating interactive scenarios, and aggregating observer reports for robust assessment.
This framework incorporates agent configuration to define agent profiles and relationships, interactive scenario simulation to generate dialogues, and personality reports to collect self- and observer- assessments.
By aggregating multiple observer reports, the framework aims to reduce individual biases and achieve a more context-sensitive and reliable personality evaluation of LLMs compared to self-report methods.

Evaluating the Bias in LLMs for Surveying Opinion and Decision Making in Healthcare

LLM-based Survey Simulation Framework: introduces a framework for evaluating LLMs in healthcare decision-making, with Survey Dataset, Demographics features, Prompt Construction Module, General prompt, Prompt with context, Prompts, LLM models, and Generated Vaccination decision.
This framework compares LLM-generated vaccination decisions with real-world survey data to assess alignment and biases across demographic groups.
The framework helps understand LLMs' capabilities and limitations in simulating healthcare behaviors and decision-making under different pandemic contexts.

10th April 2025

Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI

Blueprint Architecture: introduces a blueprint for compound AI systems, with Agent (maps models and APIs), Agent Registry (metadata store for agents), Task Planner (creates agentic workflows), Task Coordinator (coordinates workflow execution), Budget (records QoS stats), Data Registry (metadata store for data), Data Planner (generates query plans), Optimizer (performs multi-objective optimization), Streams (facilitate data and control flow), and Session (provides context for agents).
Blueprint Architecture focuses on orchestrating agents and data using streams to manage data and instructions flow, aiming for seamless integration and optimized workflows in enterprise AI applications.
The architecture emphasizes key components like registries for agents and data, planners for tasks and data queries, and coordinators for execution, all designed to enhance observability, controllability, and optimization in compound AI systems.

Test Amplification for REST APIs via Single and Multi-Agent LLM Systems

Agentic LLM systems: introduces single-agent approach with OpenAPI Retriever and Local Executor components for REST API test amplification.
Agentic LLM systems: also introduces multi-agent approach with specialized agents like Header-, Parameter-, Value-, Planner-, Writer-, Executor- and Repair-agents to improve test generation.
Agentic LLM systems: demonstrates that multi-agent system achieves higher API coverage and bug detection compared to single-agent system, but with increased computational cost.

Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge](http://arxiv.org/abs/2504.07887v1)

CLEAR-Bias (Corpus for Linguistic Evaluation of Adversarial Robustness against Bias): introduces a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation, with CLEAR-Bias-dataset, jailbreak prompts, base prompts, control set, judge selection, candidate LLMs, collect judgments, evaluate agreement, selected judge, two-step safety evaluation, initial assessment with base prompts, compute bias-specific safety score, adversarial analysis with jailbreak prompts, overall LLM safety score, and LLM vulnerability analysis.
The framework employs LLM-as-a-Judge paradigm for automated assessment, utilizing a two-step safety evaluation process involving initial assessment with base prompts and subsequent adversarial analysis with jailbreak techniques.
The methodology aims to systematically probe models across sociocultural dimensions, quantify robustness through safety scores, and investigate vulnerabilities in safety mechanisms, ultimately revealing critical trade-offs between model size and safety.

An LLM-Driven Multi-Agent Debate System for Mendelian Diseases

MD2GPS (Medical Doctor 2 GPS): introduces LLM-driven multi-agent debate system, with Data Agent, Knowledge Agent, and Debate Agent, for Mendelian diseases diagnosis.
MD2GPS system utilizes Data Agent to process genetic variants and phenotypes, Knowledge Agent with GPT-4 for gene analysis, and Debate Agent to integrate and refine diagnostic outcomes.
The multi-agent debate framework of MD2GPS enhances diagnostic accuracy and interpretability by leveraging diverse perspectives and evidence consistency evaluation.

Deceptive Automated Interpretability: Language Models Co-ordinating to Fool Oversight Systems

SAEs (Sparse Autoencoders): introduces framework with Labeling Agent, Simulating Agent, Overseer, Monitoring, Visible Communication, and Hidden Communication to investigate deceptive interpretability in language models.
The framework uses Labeling Agent to create feature labels, Simulating Agent to predict activations, and Overseer to detect deceptive labels, with agents communicating visibly and hiddenly.
This setup explores how language models can coordinate to deceive oversight systems by employing steganography for hidden communication and generating deceptive explanations.

MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations

MOSAIC (Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations): introduces a multi-agent social simulation framework, with Human Persona Survey, Persona Generation, Agent Network, Agent Memory, Reflection, Interaction, BEFORE Action, Comment AFTER Action, Agent Daily News, Fact-Checking Types, Community Notes, Third Party Fact-Checking, and Hybrid Fact-Checking, for modeling content diffusion, user engagement, and misinformation propagation in social networks.
MOSAIC framework utilizes LLM-powered agents with memory and reflection capabilities to simulate realistic social behaviors and evaluate content moderation strategies like community-based, third-party, and hybrid fact-checking.
The framework allows for analyzing the effectiveness of different fact-checking mechanisms in mitigating misinformation spread while preserving user engagement in simulated social media environments.

Synthesizing High-Quality Programming Tasks with LLM-based Expert and Student Agents

PYTASKSYN introduces a novel synthesis technique for generating programming tasks, which includes Generation (task creation stage) and Validation (task quality check stage) stages, performed by SIMEXPERT (expert agent for task generation), SIMTUTOR (tutor agent for test suite and context validation), and SIMSTUDENT (student agent for comprehensibility validation).
PYTASKSYN employs a multi-agent approach with specialized roles, where SIMEXPERT generates Task Description (task explanation) and Test suite (code verification tests), while SIMTUTOR and SIMSTUDENT assess Context relevance (theme and concepts alignment) and Comprehensibility (task clarity).
PYTASKSYN aims to improve the quality of AI-generated programming tasks by automating validation through simulated agents, ensuring tasks are relevant, correct, and comprehensible for students, thus reducing the need for human intervention.

Boosting Universal LLM Reward Design through Heuristic Reward Observation Space Evolution

ROS Evolution Framework: introduces heuristic reward observation space evolution for LLM-driven reward design, incorporating user description structuring, LLM for reconciliation, LLM for reward design, state history memory, performance summarization, reward space mapping, simulation environment, state usage tracker, relevant state space, state selection, and internal operation.
ROS Evolution Framework utilizes State Execution Table to track historical state usage and success contributions, overcoming Markovian constraint in LLM dialogues for effective exploration.
ROS Evolution Framework reconciles user-provided task descriptions with expert-defined success criteria using structured prompts, ensuring alignment in reward design objectives and improving reward generation stability.

A taxonomy of epistemic injustice in the context of AI and the case for generative hermeneutical erasure

Taxonomy of epistemic injustice in AI: introduces a novel taxonomy of epistemic injustice in the context of Artificial Intelligence, focusing on generative AI and detailing Generative Hermeneutical Ignorance, Generative Hermeneutical Access, Generative Manipulative Testimonial Injustice, Generative Amplified Testimonial Injustice, and Generative Conceptual Erasure.
This taxonomy explores how AI systems can perpetuate and amplify epistemic injustices, particularly through generative models, by misrepresenting marginalized experiences, obstructing information access, spreading disinformation, amplifying existing biases, and ultimately eroding diverse epistemological frameworks.
The paper highlights the concept of Generative Hermeneutical Erasure as a novel form of epistemic injustice, emphasizing the risk of AI-driven erosion of non-Western epistemologies and the importance of decolonial AI approaches to mitigate these harms.

Kimi-VL Technical Report

Kimi-VL (Vision Language Model): introduces MoonViT, MLP Projector, and MoE Language Decoder for efficient multimodal reasoning and long-context understanding.
Kimi-VL utilizes MoonViT for native-resolution image processing, MLP Projector to align visual features, and MoE Language Decoder for parameter-efficient language generation.
Kimi-VL-Thinking, an advanced variant, enhances long-horizon reasoning through long chain-of-thought and reinforcement learning, building upon Kimi-VL's architecture.

Enhanced Question-Answering for Skill-based learning using Knowledge-based AI and Generative AI

Ivy (intelligent agent): introduces an architecture for skill-based learning question answering, with Classify Answerability, Knowledge Retrieval Module, TMK Knowledge Base, Response Generation Module, and Response Optimizer Module components.
Ivy leverages TMK (Task-Method-Knowledge) models to represent skills and Generative AI to enhance explanations for learners' questions in online AI courses.
The framework aims to provide deeper, more relevant feedback compared to agents relying on unstructured text, improving learners' understanding of procedural knowledge and reasoning in skill-based learning.

Achilles Heel of Distributed Multi-Agent Systems

DMAS (Distributed Multi-Agent System): introduces distributed architecture with Control System managing third-party Agents through API Interfaces and receiving Responses.
DMAS framework addresses challenges of heterogeneity, scalability, and computational constraints in multi-agent systems by utilizing remotely hosted agents.
The distributed nature of DMAS raises trustworthiness concerns, including free riding, malicious attacks, communication delays and unstable connections, which are systematically analyzed in the paper.

Beyond LLMs: A Linguistic Approach to Causal Graph Generation from Narrative Texts

DA framework (Causal Graph Generation Framework): introduces a novel method for generating causal graphs from narrative texts, incorporating Vertices Extraction, Expert Index Extraction, STAC Categorization, and Diagram Formulation components.
This framework leverages linguistic feature extraction and a quaternary classification system (STAC) to enhance the precision and interpretability of causal link identification compared to LLM-only approaches.
The system employs a hybrid model combining ROBERTa embeddings with an Expert Index of linguistic features, followed by a structured prompting process for refining and constructing the final causal graph.

Enhancing Player Enjoyment with a Two-Tier DRL and LLM-Based Agent System for Fighting Games

TTA (Two-Tier Agent): introduces a two-tier system with DRL game-playing agents tier, utilizing a network architecture with CNN and RNN feature extractors and actor-critic networks, and Hyper-agent tier, employing a LLM Hyper Agent for dynamic opponent selection based on player data and feedback.
The DRL game-playing agents tier consists of Input (game pixels, scalar info, action sequence)/Features Extractor (CNN, LSTM)/Agent's Network (Actor Net, Critic Net)/Value (value function)/Output (action distribution), while the Hyper-agent tier includes Agent Archive (DRL agent storage)/LLM Hyper Agent (opponent selector)/Game Manager (data and game management)/Player's Feedback (human input)/Playing Data (game history).
TTA aims to enhance player enjoyment in fighting games by providing diverse and adaptive AI opponents, leveraging DRL for agent skill and LLMs for personalized opponent selection, demonstrating improvements in advanced skill execution and player satisfaction.

AGENTADA: Skill-Adaptive Data Analytics for Tailored Insight Discovery

AGENTADA (skill-informed data analytics agent): introduces dataset-to-insight extraction strategy with Question Generation, RAG-Based Skill Matcher, Code Generation, Answer Generation, and Insight Generation components.
AGENTADA leverages hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose best data analytics skill from skill library.
AGENTADA is evaluated using KAGGLEBENCH benchmark and SCORER evaluation framework, demonstrating improved performance over existing tools.

Automating quantum feature map design via large language models

Agentic System: introduces an autonomous system for quantum feature map design, incorporating Human input, LLM, Generation, Storage, Validation, Evaluation, and Review components.
Agentic System iteratively refines quantum feature maps through Feedback from The Experimental Result, utilizing LLM for idea Generation and external knowledge in Storage for Validation and Review.
The framework leverages components like Storage for academic papers and PennyLane library documentation, and Evaluation for performance assessment, to automate quantum feature map research workflow.

TALE: A Tool-Augmented Framework for Reference-Free Evaluation of Large Language Models

TALE (Tool-Augmented LLM Evaluation): introduces a reference-free framework for evaluating LLM responses, with Query Generation, Web Search, Evidence Summarizer, Reflector, Query Refiner, Judge, and Short-Term Memory components.
TALE iteratively refines web queries, collects and summarizes external information, and reflects on findings to evaluate LLM outputs without relying on pre-annotated references.
The framework enhances the reliability of LLM evaluations in dynamic real-world scenarios by grounding judgments in external, verifiable evidence through tool-augmented approach.

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents

This paper introduces a queuing-theoretic framework for LLM inference scheduling, encompassing Batch Processing, Prefill, Decode, Processed, Processing stages, and extends to AI agent workloads with Orchestrator, Agents, Tools, Global History, LLM Serving, Load Balancer, LLM Engine, Scheduler, KV Cache, and LLM components.
The framework analyzes throughput optimality of work-conserving scheduling algorithms for both individual LLM requests and complex AI-agent systems, highlighting the importance of token budget and batching strategies for efficient LLM inference.
Evaluations using real-world systems like Orca and Sarathi-serve demonstrate throughput optimality, while FasterTransformer and vanilla vLLM are shown to be potentially suboptimal under certain workloads, emphasizing the practical implications of queuing theory in LLM system design.

Modeling Response Consistency in Multi-Agent LLM Systems: A Comparative Analysis of Shared and Separate Context Approaches

RCI (Response Consistency Index): introduces a probabilistic framework for analyzing shared and separate context configurations in multi-agent LLM systems, focusing on centralized memory, distributed memory, context retention duration, incorrect statements, accurate statements, consistency evaluation, and latency measurement.
The framework evaluates the impact of memory limitations and noise on response consistency and response time in LLM-based MAS.
RCI metric quantifies the trade-offs between scalability, response consistency, and performance in different context configurations.

9th April 2025

REVIEW OF CASE-BASED REASONING FOR LLM AGENTS: THEORETICAL FOUNDATIONS, ARCHITECTURAL COMPONENTS, AND COGNITIVE INTEGRATION

CBR-GDA (Case-Based Reasoning - Goal-Driven Autonomy): introduces a framework integrating Case-Based Reasoning with Goal-Driven Autonomy to enhance LLM agents by incorporating Case Representation and Indexing Strategies, Hybrid Retrieval Mechanisms, Adaptation Mechanisms, LLM Reasoning Processes Integration, Cognitive Dimensions Integration, Planning Case Base and Mismatch-Goal Case Base.
This framework leverages CBR for persistent memory and structured reasoning, while utilizing LLMs for language understanding, aiming to improve reasoning transparency, domain adaptation, and solution quality in complex problem-solving scenarios.
The CBR-GDA framework facilitates continuous learning and adaptation through case acquisition and refinement, enabling agents to dynamically adjust objectives and improve goal reasoning capabilities in dynamic environments.

REVIEW OF CASE-BASED REASONING FOR LLM AGENTS: THEORETICAL FOUNDATIONS, ARCHITECTURAL COMPONENTS, AND COGNITIVE INTEGRATION

CBR-GDA Framework: introduces architecture for CBR-enhanced LLM agents, with Case Representation and Indexing, Hybrid Retrieval Mechanisms, Adaptation Mechanisms, LLM Reasoning Processes Integration, Cognitive Dimensions, Goal-Driven Autonomy, Planning Case Base, and Mismatch-Goal Case Base.
This framework integrates Case-Based Reasoning and Goal-Driven Autonomy to enhance LLM agents' reasoning, adaptability, and transparency by leveraging past experiences and dynamic goal adjustment.
The architecture utilizes two case bases, Planning Case Base and Mismatch-Goal Case Base, to manage planning and goal reformulation based on discrepancies between expected and actual outcomes.

FamilyTool: A Multi-hop Personalized Tool Use Benchmark

KGETool: introduces KG-augmented LLM tool use pipeline, with Query, Full KG, Tools, LLM for KG, KG Extraction, Relation Path, Path Extraction, Sub KG, LLM for Tool Use and Tool Call, to evaluate LLMs in personalized multi-hop tool use scenarios.
KGETool framework extracts sub-KG from Full KG using KG Extraction module composed of Relation Path and Path Extraction, then utilizes Sub KG and Tools with LLM for Tool Use to generate Tool Call based on user Query.
The pipeline emphasizes generalization in inductive KG settings, where KGETool leverages LLMs' ability to handle evolving knowledge graphs without retraining by dynamically adapting to unseen user preferences and relationships.

AgentFM: Role-Aware Failure Management for Distributed Databases with LLM-Driven Multi-Agents

AgentFM (Role-Aware Failure Management Framework): introduces a role-aware failure management framework for distributed databases, with Meta-Agent (Orchestrates agents), Task Agents (Manage failure tasks), Data Agents (Handle data sources), System Agents (Represent node roles), and Standalone Agents (Agents on each node) components.
AgentFM leverages LLM-driven multi-agents to address failure management by considering system roles, data roles, and task roles, using a Meta-Agent (Orchestrates agents) for orchestration and specialized Task Agents (Manage failure tasks) like Detection Agent (Identifies anomalies), Diagnosis Agent (Classifies issues), and Mitigation Agent (Proposes solutions).
AgentFM integrates multimodal data sources through Data Agents (Handle data sources) such as Metric Agent (Metrics data extraction) and Log Agent (Logs data extraction), employing specialized System Agents (Represent node roles) like Config Agent (Configuration management), Coordinator Agent (Coordination management), and Storage Agent (Storage management) to enhance failure management in distributed databases.

Right Prediction, Wrong Reasoning: Uncovering LLM Misalignment in RA Disease Diagnosis

Framework for RA patients diagnosis: introduces a system employing PreRAID dataset, Texts, Embeddings, Vector DB, Knowledge Base, Medical Expert Guided Prompt, LLM, RAG, Prompt, Output, Prediction, and Reasoning to investigate LLM's diagnostic capabilities and reasoning for Rheumatoid Arthritis.
This framework utilizes patient symptom Texts converted to Embeddings and stored in Vector DB, leveraging Knowledge Base and Medical Expert Guided Prompt for LLM with RAG to generate Output, Prediction of RA, and Reasoning.
The framework explores different architectures with varying numbers of LLM agents and knowledge base integration to assess diagnostic accuracy and reasoning quality in RA disease prediction.

NEEDLEINATABLE: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables

NEEDLEINATABLE (NIAT): introduces NIAT benchmark and data synthesis method to evaluate and improve large language models on long-structured tables.
NIAT benchmark assesses large language models' ability to extract specific cells from long tables using location-based and question-based queries.
Data synthesis method uses chain-of-thought reasoning to generate training data for enhancing large language models' long-table comprehension.

8th April 2025

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

FEABench: introduces benchmark for evaluating LLMs and LLM agents in multiphysics reasoning, using ControllerAgent, Evaluator, CorrectorSubAgent, and ToolLookupAgent components to solve engineering problems with FEA software.
FEABench framework employs multi-agent system with specialized tools and feedback mechanisms to enhance LLMs' ability to generate executable code for COMSOL Multiphysics API.
FEABench benchmark and agentic framework aim to advance automation in engineering by augmenting LLMs with numerical solvers and physics reasoning capabilities.

CAI: An Open, Bug Bounty-Ready Cybersecurity AI

CAI (Cybersecurity AI): introduces an open-source framework for democratizing security testing, with HITL, Turns, Patterns, Handoffs, Agents, Tools, Extensions, and Tracing components.
CAI framework combines modular agent design, seamless tool integration, and human oversight for AI-powered bug bounty testing.
CAI aims to dismantle the lock-in of dominant platforms, offering a democratized alternative for vulnerability discovery.

AGENT GUIDE: A SIMPLE AGENT BEHAVIORAL WATERMARKING FRAMEWORK

Agent Guide: introduces a behavioral watermarking framework for intelligent agents, with Memory Module, Event Generation Module, Behavior Probability Generation Module, Agent Guide Module, and Action Execution Module.
Agent Guide embeds watermarks by biasing agent's high-level behavior decisions while preserving the naturalness of specific action executions.
The framework operates in rounds, simulating agent interactions and uses statistical analysis for watermark extraction, ensuring reliable detection.

Are Generative AI Agents Effective Personalized Financial Advisors?

LLM-advisor: introduces User, Advisor, Preference Elicitation Stage, and Advisory Discussion Stage to provide personalized financial advice.
The framework uses Preference Elicitation Stage to understand user needs before offering asset guidance in Advisory Discussion Stage.
This approach aims to evaluate the effectiveness of LLM-based agents in complex financial advisory tasks.

Single-Agent vs. Multi-Agent LLM Strategies for Automated Student Reflection Assessment

Single-Agent Assessment: introduces single LLM evaluator, Scoring Criteria, and LLM, where single LLM evaluates student reflection using score level descriptions.
Single-Agent Assessment employs zero-shot and few-shot prompting to guide LLM's evaluation process based on scoring criteria for reflection assessment.
This approach automates student reflection assessment by transforming qualitative responses into quantitative scores using a single LLM evaluator.

Automated Archival Descriptions with Federated Intelligence of LLMs

Agentic AI-driven system: introduces an agentic AI-based metadata generation system, with User Input and Document (Provides archival material), Context Agent (Retrieves context information), LLM Instructor (Constructs instructions for LLMs), LLM Ensemble (Generates metadata descriptions), Validator Agent (Checks metadata descriptions), and LLM Federator (Synthesizes optimal metadata) to produce Metadata (Final metadata output) for archival descriptions.
The system employs federated intelligence of multiple LLMs to automatically create complete and precise metadata descriptions, leveraging context and validation agents for consistency and quality.
The federated optimization approach synthesizes metadata from an ensemble of LLMs, demonstrating superior performance compared to single-model solutions in metadata quality and reliability for archival materials.

FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction

FactGuard (Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction): introduces a multi-agent framework for automated data augmentation, with Preparation-, QA Generation-, and Negative Example Generation-Stages, to create answerable and unanswerable question-answer pairs.
FactGuard (Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction): employs agents like Quality-, Topic-, QA-, MRC-, and Rewrite-Agents, managed by Agent Console, to synthesize datasets for evaluating LLMs in long-context question answering.
FactGuard (Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction): aims to address limitations of current LLMs in handling unanswerable questions within extended contexts by developing the FactGuard-Bench benchmark dataset.

7th April 2025

Mixture-of-Personas Language Models for Population Simulation

MoP (Mixture of Personas): introduces a probabilistic prompting framework, with Persona Synthesizer, Persona Gate, Exemplar Gate, Exemplar, and LLM Agent, that aligns LLM responses to target population characteristics.
MoP framework uses Persona Gate to probabilistically select personas and Exemplar Gate to select exemplars, guiding LLM Agent to generate customized outputs.
This approach enhances response diversity and relevance by incorporating persona descriptions and in-context examples without requiring model fine-tuning.

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

AI control: introduces a framework for adapting red teams affordances using capability profile, deployment context, threat models, threat-model-specific capabilities, example rules of control evaluation, example safety measures, and example safety case to systematically evaluate and improve AI safety measures as AI capabilities advance.
The framework defines AI Control Levels (ACLs) based on threat model-specific capabilities, providing tailored control evaluation rules, measures, and safety cases for fictional models with increasing capabilities, aiming for practical and cost-effective control measures.
This approach contrasts with traditional methods by considering model capability limitations in control evaluations, suggesting a path towards scalable risk management and highlighting the evolving nature of AI control safety cases from current models to superintelligent systems.

DoCIA: An Online Document-Level Context Incorporation Agent for Speech Translation

DoCIA (Document-level Context Incorporation Agent): introduces an online framework for speech translation that incorporates document-level context through ASR Refining, MT and MT Refining stages to enhance translation performance.
DoCIA framework refines both ASR transcriptions and machine translations using auxiliary LLM-based modules and multi-level context integration strategy to improve discourse coherence.
The framework employs a refinement determination mechanism to ensure reliability by preventing hallucinations during context-aware refinement stages in speech translation pipeline.

AI for Climate Finance: Agentic Retrieval and Multi-Step Reasoning for Early Warning System Investments

EW4All Financial Tracking AI-Assistant (Early Warning for All Financial Tracking AI-Assistant): introduces an AI-driven system for automating Early Warning System investment classification from multilateral development bank reports, utilizing PDF parsing, context augmentation, vector database storage/retrieval, classification/budget allocation, and expert verification.
The framework employs multi-modal processing and agent-based retrieval-augmented generation to handle heterogeneous financial documents and improve accuracy in tracking climate finance investments.
This AI-assistant aims to enhance financial transparency and decision-making in climate finance by providing structured insights into investment data and supporting resource allocation for climate resilience initiatives.

Debate Only When Necessary: Adaptive Multiagent Collaboration for Efficient LLM Reasoning

DOWN (Debate Only When Necessary): introduces adaptive multiagent debate framework, with Initial Response Generation (Agent creates initial answer), Debate Engagement Check (Checks confidence score against threshold), Confidence-Guided Multi-agent Collaboration (Agents refine responses in rounds), Final Answer Generation (Selects final answer via voting or judge).
DOWN framework uses Confidence Score (Model's certainty in answer) and Threshold (Confidence score limit for debate) to selectively activate debate among Agents (LLMs collaborating in debate) in Rounds (Iterative debate exchanges) for efficient reasoning.
DOWN framework determines final answer via Voting-based Selection (Majority vote for final answer) or Judge-based Generation (Judge agent generates final answer), optimizing multiagent collaboration systems.

The Dream Within Huang Long Cave: AI-Driven Interactive Narrative for Family Storytelling and Emotional Reflection

The Dream Within Huang Long Cave: introduces AI-driven interactive narrative art project, employing Analytic-Critical Method for character design, featuring LLM agent (YELL), CAVE environment, interactive narrative, MacGuffin, memory fragments and family documentary.
This project utilizes Analytic-Critical Method's psychobiography data, discourse analysis, paranoiac-critical method and practice-based iteration to construct LLM agent YELL within CAVE environment for family storytelling and emotional reflection.
The interactive narrative in CAVE installation uses MacGuffin and memory fragments to engage audience in dialogue with AI-driven virtual father figure, culminating in family documentary to deconstruct familial relationships and symbolic authority.

Simulating Persuasive Dialogues on Meat Reduction with Generative Agents

Generative Agent-based Persuasion Dialogue Framework: introduces a simulation framework for persuasive dialogues using Persuader Agent, Recipient Agent, Recipient Persona, Internal Reflection, Questionnaire, Response Generation, and Conversation Transcript to explore meat reduction strategies.
This framework utilizes generative agents to model persuasive conversations and validate them against psychological theory and human data, aiming to identify effective meat reduction strategies.
The use of generative agents allows for cost-effective and scalable exploration of diverse persuasion strategies and participant groups, facilitating the development of targeted interventions for meat reduction.

BIASINSPECTOR: Detecting Bias in Structured Data through LLM Agents

BIASINSPECTOR (Bias Inspector): introduces a multi-agent framework with Primary Agent, Advisor Agent, Toolset, and Bias Detection Method Library for automated bias detection in structured data based on user requirements.
BIASINSPECTOR employs Primary Agent to formulate plans and execute tools, while Advisor Agent provides guidance and optimization, leveraging Toolset and Bias Detection Method Library for comprehensive bias analysis.
The framework facilitates iterative interactions and delivers detailed reports with explanations and visualizations, addressing the limitations of existing methods in diversity, generalizability, and interpretability of bias detection in structured data.

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

ELT-Bench: introduces an end-to-end benchmark for evaluating AI agents in building ELT pipelines, encompassing AI Agent, configuration codes, scripts, SQL queries, codebase, environment, data sources, data warehouse, Airbyte, DBT, and pipeline stages.
ELT-Bench framework assesses AI agents' capability to construct ELT pipelines from scratch, involving data extraction and loading (Stage 1) and data transformation (Stage 2) using tools like Airbyte and DBT.
This benchmark addresses the gap in evaluating AI agents for end-to-end ELT pipeline generation, providing a comprehensive assessment of AI in complex data engineering workflows.

SciSciGPT: Advancing Human-AI Collaboration in the Science of Science

SciSciGPT (Science of Science GPT): introduces a modular AI system designed as research collaborator, which includes User interacting via Web Interface, Research Manager orchestrating tasks, Literature Specialist for literature analysis, Database Specialist for data handling, Analytics Specialist for data analytics, and Evaluation Specialist for quality control, utilizing SciSciCorpus, SciSciNet and Sandbox Environment with various Tools.
SciSciGPT employs a hierarchical multi-agent architecture to automate complex research workflows, enhance research efficiency, and facilitate human-AI collaboration in the science of science domain.
The system's modular design with specialist agents and a central Research Manager allows for flexible task decomposition, iterative refinement, and comprehensive quality assessment throughout the research process.

Bridging Industrial Expertise and XR with LLM-Powered Conversational Agents

RAG-enhanced LLMs with XR Integration (Retrieval-Augmented Generation enhanced Large Language Models with Extended Reality Integration): introduces a system embedding industrial knowledge into XR environments, featuring XR Application, Middleware, LLM Chat Engine, Document Processing, VECTOR DB, XR SYSTEM, and LLM ENGINE components.
This framework utilizes a LLM Chat Engine with components like Router Agent, RAG Tools, and specialized agents such as PdM, XAI, and IoT Agents, to provide context-aware expert guidance through voice-driven XR interfaces.
The system enhances industrial workflows by integrating RAG techniques and XR, enabling hands-free access to domain-specific knowledge and improving training, remote assistance, and operational efficiency in Industry 5.0 settings.

EduPlanner: LLM-Based Multi-Agent Systems for Customized and Intelligent Instructional Design

EduPlanner (LLM-Based Multi-Agent System): introduces multi-agent system with evaluator-, optimizer- and analyst-agents, and Skill-Tree component.
EduPlanner employs Skill-Tree (models student knowledge background) to personalize instructional design and uses evaluator-agent (assesses design quality) and optimizer-agent (improves lesson content) for iterative optimization.
Analyst-agent (identifies error-prone examples) further enhances EduPlanner by incorporating error analysis into lesson plan refinement, and Lesson Plan Queue (prioritizes effective designs) manages design iterations.

Prism: Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search

Prism (Dynamic and Flexible Benchmarking of LLMs Code Generation with Monte Carlo Tree Search): introduces a dynamic benchmarking framework for LLM code generation assessment, incorporating tree-based state representation (models evaluation as MDP), Monte Carlo Tree Search (algorithm for exploration), and multi-agent evaluation pipeline (simultaneous assessment of capabilities).
Prism framework utilizes Markov Decision Process to model evaluation states and Monte Carlo Tree Search algorithm for adaptive exploration of evaluation scenarios.
The framework employs a multi-agent system with Problem Generator, Solution Evaluator, and Pattern Analyzer agents to enable comprehensive and structured LLM evaluation.

Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

W4S (Weak-for-Strong Harnessing): introduces a novel framework that trains a weak meta-agent to iteratively optimize workflows for harnessing strong language models through Workflow Generation, Execution and Feedback, and Refinement within an Environment, utilizing RLAO for meta-agent training.
The framework formulates workflow design as a Markov Decision Process, enabling the meta-agent to learn effective workflow strategies by interacting with the environment and receiving performance feedback, thus improving performance of strong models.
W4S offers an efficient and high-performing alternative to direct fine-tuning of strong models, demonstrating strong generalization capabilities across various tasks and outperforming existing methods in workflow optimization.

BEYOND SINGLE-TURN: A SURVEY ON MULTI-TURN INTERACTIONS WITH LARGE LANGUAGE MODELS

Taxonomy of Improvements Methodologies in Multi-turn LLM Interactions: introduces a structured categorization of methods to enhance multi-turn interactions in Large Language Models, encompassing model-centric, external integration, and agent-based approaches.
Model-centric improvements directly refine LLMs, external integration leverages external knowledge, and agent-based methods employ proactive agents for complex dialogues.
This taxonomy covers techniques like in-context learning, fine-tuning, reinforcement learning, memory augmentation, Retrieval Augmented Generation (RAG), and multi-agent systems, providing a comprehensive overview of advancements in conversational AI.

Generalising from Self-Produced Data: Model Training Beyond Human Constraints

Generalising Agent Framework: introduces a system with interdependent AI agents designed for autonomous knowledge generation through environment interaction and self-improvement, comprising code generation, testing, training, environment understanding, strategy formulation, and safety infrastructure components.
The framework utilizes a closed-loop process where an Environment Module gathers data, a Strategy Module plans actions, a Code Generation Module implements strategies, and Testing and Training Modules refine the system based on empirical results, aiming for continuous learning and adaptation.
Key components ensure robustness and safety through code validation, resource monitoring, and controlled execution, facilitating the development of artificial superintelligence by overcoming limitations of human-derived data and enabling autonomous discovery and verification of knowledge.

scAgent: Universal Single-Cell Annotation via a LLM Agent

scAgent (Universal Single-Cell Annotation Agent): introduces a universal cell annotation framework, with Planning Module, Action Space, and Memory Module, for annotating single-cell RNA sequencing data.
scAgent leverages a Planning Module to formulate plans, an Action Space with scRNA models and MoE-LORA plugins, and a Memory Module for knowledge management, enabling universal cell type annotation and novel cell discovery.
The framework's modular design with an extensible Action Space and dynamic Memory Module facilitates cross-tissue generalization, novel cell type extension, and efficient incremental learning for single-cell data analysis.

Autono: A ReAct-Based Highly Robust Autonomous Agent Framework

Autono (Robust Autonomous Agent Framework): introduces a ReAct-based agent framework for complex tasks, incorporating Thought Engine, Tools, Step Estimator, Penalty, Memory, Request Resolver, Next Move Scheduler, Executor, and Introspection components.
Autono framework enhances robustness through dynamic action generation based on prior trajectories and a timely abandonment strategy using probabilistic penalties.
The framework supports multi-agent collaboration with a memory transfer mechanism and is compatible with the Model Context Protocol (MCP) for tool integration.

6th April 2025

Building LLM Agents by Incorporating Insights from Computer Systems

Framework F: introduces structured framework for LLM agents, with Perception(interpreting environment inputs), Cognition(decision making and reasoning), Memory(storing and retrieving information), Tool(interacting with external tools), and Action(executing actions in environment) components.
Framework F draws analogy from von Neumann architecture to propose modular design for LLM agents, emphasizing distinct modules and dynamic interaction with environment.
Framework F aims to provide foundation for systematic LLM agent design by incorporating insights from computer systems, offering guidance for future research and development.

VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

VideoAgent2: introduces an uncertainty-aware CoT framework for long-form video understanding, with Video Input, Question Input, General context acquisition, Answer assessment, Information retrieval plan creation or adjustment, Information retrieval, Information Memory, and Answer Output components.
VideoAgent2 enhances LLM reasoning by iteratively refining information retrieval plans and incorporating uncertainty from both LLM and tools to improve answer reliability.
The framework mimics human video understanding by first acquiring general context, then creating and adjusting information retrieval plans based on question complexity and information adequacy.

Human-Level Competitive Pokémon via Scalable Offline Reinforcement Learning with Transformers

Metamon (offline RL workflow platform): introduces a platform for offline RL workflow, with Policy (agent decision making), Offline Dataset (human gameplay data), Replay Parser (extracts game data), Local Battle Simulator (simulates battles locally), Pokémon Showdown (online battle platform), and Online Battles (battles against humans) components.
Metamon: reconstructs first-person perspective from spectator logs, unlocking human battle dataset.
Metamon: enables training sequence models for opponent adaptation without explicit search.

AutoPDL: Automatic Prompt Optimization for LLM Agents

AutoPDL (Automatic Prompt Optimization for LLM Agents): introduces an automated approach to discover good LLM agent configurations with Search Space Specification, Pattern Library, Successive Halving Optimizer, and Solution components.
AutoPDL frames prompt optimization as structured AutoML problem over agentic and non-agentic prompting patterns, efficiently navigating the search space using successive halving.
AutoPDL generates human-readable and executable PDL programs, enabling source-to-source optimization and facilitating human-in-the-loop refinement and reuse.

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning

OmniDrive: introduces a holistic vision-language dataset and framework for autonomous driving, with Infos, Rules, Simulated Trajectory, Actual Trajectory, QA Generation, Decision making & Planning, Scene Description, General Traffic Rule, 3D Grounding, Counterfactual Reasoning, 3D Perception, Omni-Q, Q-Former, Multi-view Images, Multi-view Image Features, Large Language Model, MLP Projector, Omni-L, 3D Position, and Counterfactual & Reasoning components, for generating high-quality question-answering data and exploring vision-language models for 3D understanding in autonomous driving.
OmniDrive framework explores two baseline models, Omni-Q focusing on vision-language models from a 3D perception perspective and Omni-L building upon vision-language models to enhance 3D integration, utilizing counterfactual reasoning to improve decision-making by evaluating potential scenarios.
The framework leverages a counterfactual-based synthetic data annotation process to create large-scale datasets, providing denser supervision signals for bridging planning trajectories and language-based reasoning in autonomous driving scenarios.

Geo-OLM: Enabling Sustainable Earth Observation Studies with Cost-Efficient Open Language Models & State-Driven Workflows

Geo-OLM (Geospatial Open Language Model): introduces a state-driven geospatial agentic framework, with User Prompt, Database Load, DataOps, Satellite Vision, Map, Error, and Self-Reflect components, for cost-efficient Earth Observation studies using open language models.
Geo-OLM framework structures geospatial workflows as state machines, decoupling task progression from tool calling, enabling effective geospatial analysis with low-resource open language models.
The state-driven approach of Geo-OLM facilitates error handling and task completion validation, leading to improved agentic performance and significant cost reduction compared to existing geospatial solutions.

CO-Bench: Benchmarking Language Model Agents in Algorithm Search for Combinatorial Optimization

CO-Bench (Combinatorial Optimization Benchmark): introduces evaluation environment for AI agents, with Problem Description, Development Dataset, LLM Agent, Workflow Reasoning, Search Tool Use, Submit Dev, Feedback, Dev Evaluator, Test Evaluator, Solution (Code), Sandboxed running, and Score.
CO-Bench benchmark facilitates systematic evaluation of LLM agents in combinatorial optimization algorithm development by providing diverse real-world problems and rigorous evaluation framework.
The framework enables reproducible assessment of agent performance against human baselines under time constraints, highlighting strengths and limitations of current LLM-driven approaches.

5th April 2025

Among Us: A Sandbox for Agentic Deception

Among Us Sandbox: introduces "Among Us" as a controlled sandbox environment for studying agentic deception using LLM Agents, evaluated with Deception ELO and Detection ELO metrics, within a Game State defined by Observation Space and Action Space across Task Phase and Meeting Phase, and analyzed using Linear Probes and Sparse Autoencoders.
This sandbox facilitates the study of deceptive behaviors emerging naturally in LLM agents playing the game "Among Us", offering a rich environment to analyze agent-human interactions and evaluate AI safety techniques for deception detection.
The research leverages "Among Us" game dynamics to create a benchmark for advancing AI safety by focusing on detecting and mitigating agentically-motivated deception in LLMs, using metrics like Deception ELO and Detection ELO to quantify deceptive capabilities.

ADAPT: Actively Discovering and Adapting to Preferences for any Task

Reflection-DPO: introduces a novel training approach for adapting LLMs, with Teacher Planner (knowledge about preferences), Student Planner (no privileged knowledge), Reflection-DPO Data Generation (candidate questions) and DPO training (student LLM finetuning), to the task of active questioning.
Reflection-DPO uses a privileged LLM teacher to train a student LLM to adhere to user preferences by learning to acquire necessary information through active questioning.
The framework includes a reflection step that generates candidate questions to help the student predict the teacher's action, enabling it to fulfill ambiguous goals while adhering to user preferences.

AdaCoder: An Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation

AdaCoder (Adaptive Planning and Multi-Agent Framework for Function-Level Code Generation): introduces a two-phase framework with Programming Assistant, Code Evaluator, Debug Specialist, and Prompt Engineer for adaptive code generation.
AdaCoder initially uses Programming Assistant and Code Evaluator in Phase-1 for fast code generation, and then employs Debug Specialist and Prompt Engineer in Phase-2 for iterative refinement with planning.
AdaCoder's adaptive planning approach enhances generalizability and reduces computational cost compared to other multi-agent frameworks by selectively applying planning and rule-based debugging.

GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill

GROVE (Generalized Reward for Learning Open-Vocabulary Physical Skill): introduces a generalized reward framework for open-vocabulary physical skill learning, integrating VLM-based Reward (semantic correctness evaluation), Pose2CLIP (pose to semantic mapper), LLM-based Reward Generator (precise constraints formulation), and Target Control Policy (agent action controller) components.
GROVE framework combines LLM-generated constraints for task requirements with VLM-based semantic evaluation for motion naturalness, utilizing Pose2CLIP to efficiently map poses to semantic feature space and bridge the simulation-to-reality gap.
The framework employs an iterative reward design process with VLM feedback to dynamically refine LLM-generated constraints, establishing a self-improving reward system for scalable physical skill acquisition across diverse agents and tasks.

AttackLLM: LLM-based Attack Pattern Generation for an Industrial Control System

AttackLLM (LLM-based Attack Pattern Generation): introduces a multi-agent framework, with Process Data, Design Specification, LLM Agent 1, LLM Agent 2, Control Invariants, LLM Agent 3 Validate, Validated Control Invariants, Expert Designed Attacks, New Attack Patterns, and Comparison, for automated generation of attack patterns in industrial control systems.
AttackLLM leverages LLMs to analyze process data and design specifications to derive and validate control invariants, subsequently generating novel attack patterns that are compared against expert-designed attacks for performance evaluation.
The framework aims to enhance ICS security by automating the generation of diverse and stealthy attack scenarios, addressing the limitations of traditional methods relying on manual expertise and scarce testbed data.

4th April 2025

Agentic Knowledgeable Self-awareness

KnowSelf (Agentic Knowledgeable Self-awareness): introduces a data-centric approach with Self-awareness Data Construction, Self-awareness Learning, Self-awareness Inference, Selection mechanism and Knowledge base, enabling agents to regulate knowledge utilization autonomously.
KnowSelf framework employs a two-stage training process involving Supervised Fine-Tuning and Reinforcement Preference Optimization to equip agents with situational self-awareness for optimal planning.
The framework utilizes a heuristic situation judgement criterion to categorize situations and generate special tokens, facilitating selective knowledge incorporation during inference with minimal costs.

Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective

LLM-based MAS (Large Language Model-based Multiagent System): introduces a multiagent system architecture with principal delegating tasks to an orchestrator agent, which coordinates different agent teams on an agent platform, supported by safety, compliance, and security agents.
This framework illustrates a delegation hierarchy and supporting agent roles within a plausible LLM-based MAS deployment on an agent platform.
The architecture emphasizes the structured organization of agents and the inclusion of supporting agents for governance and security within the multiagent system.

DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments

DeepResearcher: introduces comprehensive framework for end-to-end reinforcement learning training of LLM-based research agents, incorporating distributed cluster, browsing agent, search engine, real-world environment, user, assistant, think, search, browse, answer, and memory.
DeepResearcher: enables agents to navigate noisy, unstructured open web environments, utilizing multi-agent architecture with specialized browsing agents for extracting information and addressing technical challenges.
DeepResearcher: demonstrates substantial performance improvements over prompt engineering and RAG-based baselines, showcasing emergent cognitive behaviors through end-to-end reinforcement learning in real-world web environments.

SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement

SynWorld (Virtual Scenario Synthesis): introduces a framework for agents to synthesize virtual scenarios and refine action knowledge through exploration within these environments.
SynWorld utilizes Monte Carlo Tree Search (MCTS) for exploration and action knowledge refinement, leveraging environment feedback from synthesized virtual scenarios.
The framework enables agents to learn how to execute actions and plan tasks in new environments by optimizing workflows through interaction with simulated scenarios.

Adaptation of Large Language Models

Adaptation of Large Language Models Framework: introduces Domain-Adaptive Pre-Training (DAPT), Instruction Tuning (IT), Preference Learning (PL), Model Editing, Retrieval-Augmented Generation (RAG), and Agent-based Integration for adapting Large Language Models.
This framework explores both parametric and semi-parametric adaptation techniques to improve Large Language Models performance in specialized domains and tasks.
Parametric adaptation refines model parameters through methods like domain pre-training and instruction tuning, while semi-parametric adaptation leverages external knowledge via retrieval and agent-based systems.

OLAF: An Open Life Science Analysis Framework for Conversational Bioinformatics Powered by Large Language Models

OLAF (Open Life Science Analysis Framework): introduces an open-source platform leveraging LLMs for bioinformatics code generation and execution, comprising User, Angular Frontend, Firebase Backend, Router, Agents, LLM Code Generation, Pipes, Execution Engine, and Results.
OLAF enables end-to-end bioinformatics analyses via natural language, automating code generation and execution within an integrated environment for researchers.
The agent-pipe-router architecture of OLAF ensures modularity and transparency, facilitating complex bioinformatics workflows and bridging the gap between user intent and computational execution.

YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization

Mixture-of-Agents (MoA) framework: introduces Agent, Aggregator, Verification Layer, and Hallucination Detection Layer components for multi-perspective healthcare question answering summarization.
MoA framework employs multiple LLM Agents to generate perspective-specific partial responses, which are then aggregated and refined through Verification and Hallucination Detection Layers.
MoA framework explores layered configurations with verification and hallucination detection to improve summarization accuracy and reliability in the medical domain.

APIGen-MT: Agentic PIpeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay

APIGen-MT (Agentic Pipeline for Multi-Turn Data Generation): introduces a two-phase framework for generating multi-turn agent data, with Context, LLM based Data Generator, Format & Execution Checker, Review Committee, Feedback Generator, Validated Tasks, Simulated Human, Test Agent, Environment Config, Groundtruth Actions & Outputs, Interaction Traces, and Successful Trajectory components.
APIGen-MT framework first generates verified task blueprints using an agentic pipeline with feedback loops, then transforms blueprints into interaction trajectories via simulated human-agent interplay.
This approach ensures high-quality training data by separating task design from conversational dynamics, enhancing both structural correctness and naturalness of generated interactions for training AI agents.

Talk2X - AN OPEN-SOURCE TOOLKIT FACILITATING DEPLOYMENT OF LLM-POWERED CHATBOTS ON THE WEB

Talk2X: introduces an open-source toolkit for deploying LLM-powered chatbots, with agent, vector database, website collection, and asset collection components.
Talk2X facilitates efficient information retrieval by leveraging a vector database for website and asset content, enabling function calling agent to answer user queries.
This approach enhances energy efficiency and transparency compared to closed-source solutions, offering developers a generalizable tool for website integration.

Do Large Language Models Solve the Problems of Agent-Based Modeling? A Critical Review of Generative Social Simulations

Generative ABMs (Generative Agent-Based Models): introduces a novel approach for social simulations, integrating Persona, Memory Modules, Planning Modules, and Actions components.
This framework equips agents with human-like capabilities by using LLMs for reasoning, memory, and planning within agent-based models.
Generative ABMs aim to address limitations of traditional ABMs by enhancing agent realism and enabling more complex social simulations, but validation challenges remain.

Enhancing Personalized Multi-Turn Dialogue with Curiosity Reward

IM-UM-RLHF (Intrinsic Motivation in User Modeling for Multi-Turn RLHF): introduces intrinsic curiosity reward to multi-turn RLHF, with Conversation History (dialogue turn history), Belief on User Type (probabilistic user preference model), Per Turn Curiosity Reward (belief improvement based reward), Agent's Utterance (agent generated dialogue), User's Response (user dialogue response), End-of-Conversation Reward (dialogue completion reward), and User's Final Response (user end feedback).
IM-UM-RLHF framework enhances personalization by incentivizing the agent to actively learn user preferences during conversation through curiosity reward based on belief improvement.
The framework aims to balance helpfulness and inquisitiveness in conversational agents, enabling more personalized and adaptive interactions compared to traditional RLHF methods.

Learning Natural Language Constraints for Safe Reinforcement Learning of Language Agents

NLCL (Natural Language Constraint Learning): introduces a framework for safe language alignment, with CLIRL Phase, CAPO Phase, Positive Demonstrations, Negative Demonstrations, Policy, Reward Function, Constraint Functions, Transition Function, and CVaR.
NLCL learns natural language constraints from demonstrations using inverse reinforcement learning and optimizes policy with constraint-aware policy optimization for safe language agent behavior.
NLCL framework aims to improve robustness and generalization of language agents by explicitly learning and enforcing safety constraints in dynamic environments.

Multi-lingual Multi-turn Automated Red Teaming for LLMs

MM-ART (Multi-lingual Multi-turn Automated Red Teaming): introduces an automated approach for multi-lingual and multi-turn red-teaming of LLMs, with Conversation Starters Generation, Automated Multi-turn Conversation, and Multi-lingual Conversations components.
MM-ART framework aims to address limitations of human-driven and existing automated red-teaming methods by enabling scalable and efficient safety evaluation across multiple languages and conversation turns.
The framework leverages machine translation to handle multi-lingual aspects and automated conversation continuation to explore vulnerabilities in multi-turn interactions, enhancing the detection of unsafe responses in LLMs.

Les Dissonances: Cross-Tool Harvesting and Polluting in Multi-Tool Empowered LLM Agents

Chord: introduces a dynamic scanning tool, with Hijacker, Hijacking Optimizer, Harvester, Polluter, and Testing Agent components, designed to automatically detect agent tools susceptible to XTHP attacks.
Chord systematically analyzes task control flows in multi-tool LLM agents, identifying Cross-Tool Harvesting and Polluting (XTHP) threats.
The framework evaluates real-world tools from LangChain and Llama-Index, revealing vulnerabilities to hijacking and data manipulation attacks.

3rd April 2025

Ontologies in Design: How Imagining a Tree Reveals Possibilites and Assumptions in Large Language Models

Generative Agents Architecture: introduces Memory Stream (summarizes prompt histories), Reflection (extracts insights from memories), Planning (generates action plans), and Cognitive Architecture (simulates human functions) to organize information and simulate human-like behavior in LLM-based agents.
Generative Agents architecture aims to create believable proxies of human behavior in virtual avatars by building cognitive models on top of LLMs.
The framework uses memory stream, reflection, and planning components to manage information and generate realistic and interesting action sequences for agents in a simulated environment.

Affordable AI Assistants with Knowledge Graph of Thoughts

Knowledge Graph of Thoughts (KGoT): introduces an AI assistant architecture integrating LLM reasoning with dynamically constructed knowledge graphs, with Graph Store Module, LLM Graph Executor, Controller, LLM Tool Executor, Integrated Tools Module, and Backend components.
KGoT enhances task comprehension by structuring task-relevant knowledge into dynamic knowledge graphs, iteratively improved using external tools and enabling cost-effective models to solve complex tasks.
The modular KGoT architecture improves task-solving ability by operating with a rich, structured knowledge base, reducing operational costs and enhancing performance across diverse tasks.

Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions

MMTB (Multi-Mission Tool Bench): introduces a controllable data generation framework simulating mission execution through dialogic interactions among user, planner, tool, AI, and checker agents.
MMTB framework evaluates agent robustness in related and dynamic multi-mission scenarios, addressing challenges of real-world complexity.
The framework utilizes a novel evaluation method based on dynamic decision trees to assess accuracy and efficiency of agent decisions.

Design of Al-Powered Tool for Self-Regulation Support in Programming Education

CodeRunner Agent (LLM-based programming assistant): introduces an integrated programming support environment, with Lecture Viewer (displays lecture slides), CodeRunner plugin (code execution), Learning Analytics Context Engine (learner data analysis), and Knowledge Context Engine (knowledge management), to enhance self-regulated learning.
This framework utilizes Moodle LMS (learning platform) and Learning Record Store (learning data storage) for context-aware feedback and personalized programming education.
By integrating SRL phases (learning cycle stages) and instructor configuration (customization interface), CodeRunner Agent aims to improve student learning and AI application understanding in education.

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Multi-SWE-bench: introduces a multilingual benchmark for issue resolving, comprising Repository Selection, PR Crawling, Environment Determination, PR Filtering, and Manual Verification components.
Multi-SWE-bench utilizes a five-phase pipeline to create a robust benchmark for assessing agent capabilities in resolving real-world software issues across multiple programming languages.
Multi-SWE-bench provides a diverse and rigorously validated dataset to overcome limitations of existing benchmarks and facilitate comprehensive LLM evaluation in realistic software engineering scenarios.

Exploring Individual Factors in the Adoption of LLMs for Specific Software Engineering Tasks

UTAUT2 (Unified Theory of Acceptance and Use of Technology 2): introduces framework with Performance Expectancy, Effort Expectancy, Social Influence, Hedonic Motivation, Habit, Facilitating Conditions, Behavioral Intention, Usage Behavior, Manipulate Artifacts, Generate Alternatives, Information Retrieval, Decision Support, and Training components, for exploring factors influencing LLM adoption for software engineering tasks.
The framework investigates how individual attributes related to technology adoption and UTAUT2 constructs impact the task-specific adoption of LLMs across five key software engineering tasks.
The study uses structural equation modeling to analyze survey data from software engineers, revealing task-specific adoption is influenced by distinct factors and providing actionable recommendations for effective LLM integration.

A Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders

LLM-Driven Method (Memory-Augmented LLM-Driven Method for Autonomous Merging of 3D Printing Work Orders): introduces an autonomous 3D printing work order merging framework, with Production line condition, Agent, LLM, Tools, Memory, Print implementation and Monitoring, leveraging Order Matching Tool, Interference Checking Tool, Answer generator and Job Consolidation components.
The framework utilizes a memory-augmented learning strategy, enabling the agent to accumulate experience and improve decision-making accuracy over iterative autonomous operations.
The method models printer and order features into LLM-readable prompts, facilitating intelligent order-device matching and merging while reducing LLM hallucination in industrial applications.

ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback

REUSEDROID (REUSEDROID): introduces a multi-agent framework for GUI test migration, with Test Analyzer Agent, Test Skeleton, Planner Agent, Completeness Checker, Action Generator, Feedback Agent, Oracle Generator, and Execution Agent, to address operational logic differences in GUI testing.
REUSEDROID employs a Test Analyzer Agent to generalize source test logic and create a Test Skeleton, guiding a Planner Agent with Completeness Checker, Action Generator, and Oracle Generator, while a Feedback Agent refines actions and an Execution Agent performs them.
The framework leverages visual and textual information with VLMs in each agent to improve understanding of GUI elements and context, aiming to enhance the accuracy and efficiency of GUI test migration across different applications.

Parallel Market Environments for FinRL Contests

FinRL (Financial Reinforcement Learning): introduces VecEnv (manages parallel environments) with SubEnv (simulates market scenarios), State (market conditions), Action (trading action), Reward (incentive signal), Market Constraints (realistic market conditions), and Features (market indicators and LLM signals).
The framework addresses sampling bottleneck in financial RL by using GPU-based parallel market environments.
It incorporates LLM-generated signals for sentiment analysis and risk assessment to enhance trading agent's decision-making.

2nd April 2025

Self-Resource Allocation in Multi-Agent LLM Systems

MAS (Multi-Agent Systems): introduces three methods for task allocation in multi-agent systems: Individual, Orchestrator, and Planner, within a simulated CuisineWorld environment, utilizing LLM-based agents for task execution and control.
The Individual method represents a decentralized approach where each agent acts independently, while the Orchestrator method employs a centralized LLM to control all agents' actions, and the Planner method uses a plan-generating LLM to guide independent agent actions.
The paper evaluates these methods in terms of efficiency and cost-effectiveness, finding that the Planner method achieves better performance in handling concurrent actions and resource allocation compared to the Orchestrator and Individual methods.

LLM-mediated Dynamic Plan Generation with a Multi-Agent Approach

ANA (Agent Network Architecture): introduces a method for dynamic plan generation using a multi-agent approach, incorporating Status Collection, Network Construction, and Network Optimization stages, with Agents coordinating through Activation Spreading and leveraging GPT for agent generation.
The framework utilizes Agents, defined by Add list, Condition list, and Delete list, and Statuses to build a network capable of both reactive and deliberative planning in dynamic environments.
This approach aims to automate agent creation and network construction, reducing design costs and enhancing flexibility and scalability for robot planning and other complex systems.

LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks

Deceptive Puzzle Generation Framework: introduces a comparative study of Zero-Shot and Role-Injected prompting strategies for Large Language Models (puzzle generator) in generating deceptive puzzles, utilizing Rule Prompt (basic instructions), Rule + Prompt (deceptive instructions), JSON File (output format), and Game (generated puzzle).
This framework assesses how embedding adversarial intent through role-injected prompts modulates semantic ambiguity and puzzle difficulty compared to puzzles generated via zero-shot prompts.
The framework employs HateBERT for computational analysis and human evaluations, demonstrating that role-injected prompts generally increase semantic ambiguity, leading to higher cognitive load and reduced fairness in puzzle solving.

An Approach to Technical AGI Safety and Security

Frontier Safety Framework (FSF) introduces a multi-layered approach to mitigating misuse risks through Training for model-level mitigations, Evaluation of dangerous capabilities, Deployment of system-level mitigations, Security for model weights, and Red Teaming to assess mitigation effectiveness, involving components like Safety Training, Capability Suppression, Dangerous Capability Evaluations, Monitoring, Access Restrictions, Inference, Prompts, Responses, and User interactions.
FSF's misuse mitigation strategy combines model-level training with system-level deployment controls, utilizing dangerous capability evaluations to determine mitigation needs and red teaming to validate mitigation robustness against potential threat actors.
The framework emphasizes a proactive and iterative approach to AGI safety, incorporating security measures and evaluations to address potential misuse of dangerous AI capabilities by malicious actors.

A Survey of Scaling in Large Language Model Reasoning

Scaling in LLM Reasoning Taxonomy: introduces a taxonomy categorizing scaling strategies for large language model reasoning into input sizes, reasoning steps, reasoning rounds, and model optimization, exploring how these dimensions enhance reasoning capabilities.
The taxonomy details input size scaling with In-Context Learning, Retrieval-Augmented Generation, and Memory-Augmented LLMs; reasoning step scaling with Chain-of-Thought and Meta-Reasoning & Calibration; reasoning round scaling with Multi-Agent Collaboration, Debate-based Reasoning, Human-LLM Interaction, Reinforcement Learning, and Latent-Space Reasoning; and model optimization through Reinforcement Learning and Latent-Space Reasoning.
This survey aims to bridge the gap between empirical scaling strategies and reasoning improvements, providing insights into when and why scaling enhances reasoning and addresses limitations, guiding future AI development.

On Simulation-Guided LLM-based Code Generation for Safe Autonomous Driving Software

Simulation-Guided LLM-based Code Generation: introduces a closed-loop pipeline using Pipeline Input, Code Generator, Simulation Model, Baseline Selector, and Report Generator components to iteratively refine Generated Code for autonomous driving functions based on Test Report and Simulation Numerical Logs, guided by Specification Prompt and Correction Prompt.
This framework employs a simulation environment to automatically evaluate LLM-generated code against safety requirements, using feedback from test reports to guide the LLM in Correction Prompt for iterative code improvement and Baseline Selector for performance comparison.
The iterative Simulation-Guided LLM-based Code Generation pipeline aims to enhance the quality and safety of LLM-generated code for safety-critical automotive software by incorporating automated testing and feedback within the code generation process, utilizing Specification Prompt and Correction Prompt strategies.

Achieving Unanimous Consensus in Decision Making Using Multi-Agents

Deliberation-based consensus mechanism: introduces a novel approach for achieving unanimous consensus in blockchain using a layered architecture composed of Blockchain Layer, Deliberation Layer, and LLMs Layer.
The Blockchain Layer provides secure infrastructure, the Deliberation Layer structures the multi-agent discussion, and the LLMs Layer utilizes language models for generating arguments and refining opinions through iterative rounds.
This framework leverages graded consensus and multi-round deliberation to ensure unanimous agreement for critical decisions in blockchain networks, addressing limitations of majority-based consensus mechanisms.

Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

IAD (Iterative Agent Decoding): introduces iterative refinement framework, with USER, SKETCH, LLM, HTML, VERIFIER/REWARD, SELECTED HTML, Feedback to improve, and NEW HTML components, where paper proposes iterative decoding for AI agents using verifier-guided feedback.
Iterative Agent Decoding framework refines responses through iterative feedback, enabling improved performance in black box structured generation tasks.
The framework leverages verifier quality for effective inference-time optimization and demonstrates robustness under sparse and noisy feedback.

GEN-C: POPULATING VIRTUAL WORLDS WITH GENERATIVE CROWDS

Gen-C (generative framework): introduces a generative model for authoring high-level crowd behaviors, utilizing components like LLM for scenario generation, VGAE Graph and VGAE Features for learning latent spaces of graph structures and node features, and a Condition Net for text-conditional generation.
Gen-C employs an LLM to create initial crowd scenarios, which are expanded and simulated to build time-expanded graphs, subsequently used to train variational graph auto-encoders for learning agent behaviors and interactions.
The framework facilitates text-conditioned synthesis of diverse crowd behaviors by sampling from learned latent spaces, enabling automated population of virtual environments with complex and dynamic agent interactions.

Advancing AI-Scientist Understanding: Making LLM Think Like a Physicist with Interpretable Reasoning

Augmented Reasoning with Interpretation Module: introduces an interpretation module to enhance interpretability and verifiability of LLM-based physics reasoning, incorporating reasoning-, interpretation-, and AI-scientist interaction-modules, with summarizer-, model builder-, UI builder-, and tester-components.
The framework refines raw AI outputs into structured science models and executable code, facilitating human oversight through interactive tools and automated checks, thereby improving transparency and validation in AI-augmented scientific discovery.
By employing specialized agents within the interpretation module, the system aims to bridge the gap between automated AI reasoning and human scientific intuition, fostering more reliable and understandable AI-driven scientific exploration.

INTERPRETING EMERGENT PLANNING IN MODEL-FREE REINFORCEMENT LEARNING

DRC (Deep Repeated ConvLSTM): introduces mechanistic evidence for emergent planning in model-free RL agents by employing concept-based interpretability to analyze internal plan formation, evaluation, and adaptation within a Sokoban-playing agent, revealing components like Convolutional encoder, ConvLSTM layers, Cell state, Internal ticks, Agent evaluates plan, Agent adapts plan, Agent plans forwards from boxes, Agent plans backwards from targets, and Agent extends routes in parallel.
The framework demonstrates that DRC agents utilize learned concept representations to formulate internal plans, predict long-term action effects, and influence behavior, resembling parallelized bidirectional search and benefiting from additional test-time compute.
The study highlights the emergent planning capabilities in model-free RL, suggesting that agents can learn complex internal mechanisms for decision-making without explicit world models, advancing the understanding of emergent reasoning in LLMs through RL.

PaperBench: Evaluating AI's Ability to Replicate AI Research

PaperBench: introduces a benchmark evaluating AI agents' ability to replicate AI research, with Agent (AI system for replication), Submission (Agent's codebase repository), Task (Replicate paper contributions), Reproduction (Execution to verify results), Rubric (Hierarchical assessment criteria), Judge (LLM-based grading system), Grading (Evaluation against rubric), and Score (Numerical replication performance).
PaperBench uses rubrics co-developed with paper authors and LLM-based judge to automatically grade replication attempts.
The benchmark evaluates agents on understanding paper contributions, developing codebase, and executing experiments for complete replication of ML research papers.

Are Autonomous Web Agents Good Testers?

PinATA (Planned INcentive ATA): introduces orchestrator, actor, assertor, memory and profile components for advanced autonomous test agent.
PinATA employs orchestrator for planning and monitoring, actor for action execution using grounding, and assertor for verification using judge approach, all sharing memory and profile modules.
PinATA aims to improve upon basic ATA by incorporating state-of-the-art techniques for perception, reasoning, evaluation, and grounding to enhance testing capabilities.

An Illusion of Progress? Assessing the Current State of Web Agents

WebJudge: introduces an automatic evaluation method for web agents, with Key Point Identification, Key Screenshot Identification, and Outcome Judgement components.
WebJudge framework identifies key task requirements, selects relevant screenshots from agent's trajectory, and judges task completion based on gathered information.
The framework aims to improve upon existing LLM-as-a-judge methods by preserving critical intermediate steps while mitigating token overload for more reliable web agent evaluation.

STRATEGIZE GLOBALLY, ADAPT LOCALLY: A MULTI-TURN RED TEAMING AGENT WITH DUAL-LEVEL LEARNING

GALA (Global and Local Learning Agent): introduces a multi-turn red-teaming agent, with Planning Module, Belief Update Module, and Learning Module, for emulating human attackers via global and local learning.
GALA employs Initial Knowledge Base and Selection Framework for tactic selection, and Generate Prompt Suggestion and On-the-fly for dynamic prompt creation, leveraging Accumulated Knowledge.
GALA's dual learning of global tactics and local prompts enhances attack success and diversity in multi-turn red-teaming scenarios.

1st April 2025

AGENTNET: DECENTRALIZED EVOLUTIONARY COORDINATION FOR LLM-BASED MULTI-AGENT SYSTEMS

AgentNet: introduces decentralized framework for LLM-based multi-agent system, with Agent-components, that includes Executor (executes tasks), Router (makes routing decisions), Router Memory (stores routing experiences), Trajectory Memory (stores execution experiences), DAG Task Routing (directed acyclic graph for routing), and RAG Pools (retrieval augmented generation knowledge).
AgentNet framework facilitates autonomous agent specialization and dynamic network topology evolution by leveraging retrieval-based memory and directed acyclic graph for task routing, enhancing scalability and fault tolerance.
AgentNet's decentralized design eliminates central orchestrator, enabling privacy-preserving collaboration and efficient resource allocation in dynamic multi-agent environments, improving adaptability and performance.

Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks

Continual Instruction Fine-tuning (CIF) framework evaluates catastrophic forgetting in Large Language Models by sequentially fine-tuning a base model on Natural Language Understanding tasks using prompt engineering and continual evaluation.
The CIF framework assesses model retention of prior knowledge after learning new tasks by comparing performance across different fine-tuning episodes and various Large Language Models.
This research highlights the impact of prompt engineering and model size on continual learning capabilities in Large Language Models, offering insights into mitigating catastrophic forgetting.

First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution

AutoLight (LLM-Powered Multi-Agent System): introduces a hierarchical multi-agent system with Planner Agent, Task Agent, and ReAct Agent, utilizing Chain of Identity for agent interaction, to manage autonomous optical networks across components like Domain Controller, Physical Layer Controller, Network Layer Controller, DCI Metro, Backbone Domain, Digital Twin, Failure Handler, Knowledge Retriever, Resource Allocator, Training Orchestrator, Network-layer Planner, and Physical-layer Planner.
AutoLight employs Planner Agents for high-level coordination and Task Agents for specialized operations, while ReAct Agents are LLM-powered, and Chain of Identity ensures effective agent communication.
The framework components facilitate autonomous network management by handling tasks such as resource allocation, failure management, and network planning across different network layers and domains.

Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

MLLM Policy: introduces a vision-language-action model for embodied agents, with SigLIP, Perceiver IO, MultiModal Large Language Model, Vision Adapter, and Action Tokens, to resolve ambiguity through clarification questions and reinforcement learning.
The framework uses SigLIP for visual encoding and Perceiver IO to handle long observation histories by downsampling visual tokens before feeding into a fine-tuned MultiModal Large Language Model.
Action Tokens represent the output space, enabling the agent to perform predefined skills or ask natural language questions to clarify ambiguous instructions in embodied tasks.

Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute

ModelSwitch: introduces a multi-LLM repeated sampling framework with LLM 1, Query 1, Majority Voting, Answer, LLM 2, Consistent?, and Concat components.
ModelSwitch leverages consistency as a signal to dynamically switch between LLM 1 and LLM 2, aiming to improve performance and efficiency in test-time compute.
The framework optimizes sample efficiency by reducing samplings when LLM 1 generates consistent answers, and enhances accuracy by incorporating LLM 2 when consistency is low.

Accelerated Inorganic Materials Design with Generative AI Agents

MatAgent: introduces an AI-driven framework for inorganic material design, employing LLM as central engine with planning and proposition stages, integrated with structure estimation and property evaluation modules, and external tools like short-term memory, long-term memory, periodic table, and materials knowledge base to iteratively refine material compositions towards target properties.
MatAgent framework leverages LLM's reasoning capabilities for interpretable material design, mimicking human expert reasoning through strategic tool use and feedback-driven refinement, enabling exploration of broader compositional spaces and achieving high compositional validity and novelty.
The framework's iterative approach, combining generative and predictive models with external knowledge, demonstrates effectiveness in accelerating the discovery of next-generation inorganic materials by guiding exploration towards desired properties and allowing for natural language integration.

Personality-Driven Decision-Making in LLM-Based Autonomous Agents

Personality-Driven Decision-Making Framework: introduces a method for LLM-based agent decision-making, incorporating Personality Context, Task Instruction, Current Time, Remaining To-Do List, Completed List, LLM Response, Update Time, and Next Decision-Cycle components.
This framework evaluates how induced personality traits influence task selection and prioritization in LLM agents through iterative decision cycles.
The framework uses prompt-based persona induction and analyzes movement deltas in task order to measure the impact of personality on agent behavior.

GRAPHMASTER: AUTOMATED GRAPH SYNTHESIS VIA LLM AGENTS IN DATA-LIMITED ENVIRONMENTS

GraphMaster (GraphMaster): introduces a multi-agent framework for graph data synthesis in data-limited environments, with Manager-, Perception-, Enhancement- and Evaluation-agents.
GraphMaster orchestrates specialized agents to iteratively refine graph synthesis, ensuring semantic coherence and structural integrity by modular reasoning and feedback cycles.
The framework decomposes the synthesis task into specialized sub-tasks handled by collaborative LLM-powered agents, addressing challenges like context window limitations and hallucination.

Exploring the Impact of an LLM-Powered Teachable Agent on Learning Gains and Cognitive Load in Music Education

Chat Melody (LLM-powered Teachable Agent): introduces problem statement area, interactive music sheet, and LLMs-based dialogue window to assist music learners in music analysis tasks.
Chat Melody facilitates structured dialogues for music theory learning, providing interactive feedback and guiding students through music analysis.
The teachable agent aims to reduce cognitive load and enhance learning gains in music education by employing learning-by-teaching principles.

Automated detection of atomicity violations in large-scale systems

CLOVER: introduces code extractor, expert agent, and judge agent for atomicity violation detection.
CLOVER combines static analysis for code extraction with LLM agents for violation detection and validation.
CLOVER's hybrid approach enhances accuracy and efficiency in detecting atomicity violations in interrupt-driven programs.

HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents

HERA (Hybrid Edge-cloud Resource Allocation): introduces a lightweight scheduler for AI agents that partitions subtasks between local SLM and cloud LLM based on subtask features and position.
HERA framework includes User Request Classifier, Subtask Similarity Evaluator, S-L Similarity Evaluator, Convergence Detection, and Subtask Decomposition to optimize resource allocation.
By strategically using SLM for suitable subtasks and LLM for complex ones, HERA aims to reduce operational costs while maintaining accuracy in AI agent applications.

When Persuasion Overrides Truth in Multi-Agent LLM Debates: Introducing a Confidence-Weighted Persuasion Override Rate (CW-POR)

CW-POR (Confidence-Weighted Persuasion Override Rate): introduces single-turn multi-agent debate framework with Agent A (Correct) (provides factual answer), Agent B (Persuasive) (defends falsehood), Judge Model (evaluates responses), Combine Confidences (combines confidence scores), Final Decision (judge's answer choice), and CW-POR (persuasion override metric).
The framework investigates how persuasive arguments can override truthful answers in LLMs, even with high confidence from the judge.
CW-POR metric quantifies not only the frequency of persuasion override but also the judge's confidence in the incorrect choice, highlighting the severity of being misled.

31st March 2025

Do Large Language Models Exhibit Spontaneous Rational Deception?

Signaling Games Framework: introduces a study examining spontaneous deception in Large Language Models (LLMs) within signaling games, incorporating components like LLM Agent, Opponent Agent, Signaling Game, Message, Action Choice, Reward Structure, Prompt Instructions, and Deception Guardrails.
This framework evaluates LLMs' context-sensitive deception by manipulating reward structures and turn orders in 2x2 games, measuring rational adaptation to game incentives and communication opportunities.
The research demonstrates that LLMs exhibit unsolicited, context-aware deception influenced by potential benefits and ethical prompts, suggesting a link between reasoning capabilities and strategic deceptive behavior.

SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Sci-Reproducer: introduces a dual-agent framework for algorithmic reproduction, with Literature Context, Target Paper, Relevant Literature, Paper Agent, Agent Strategy, Search Paper-Extract Section, Search Literature, Literature Report, Code Context, Code Repository, Python Environment, Website, Code Agent, Agent Strategy, Search Web, Search Code, and Code Interpreter components.
Sci-Reproducer framework uses Paper Agent to understand algorithmic logic from papers and Code Agent to retrieve dependencies and implement solutions, enabling comprehensive paper reproduction.
The framework aims to address the challenge of generating code from scientific papers by decomposing the task into literature understanding and code implementation with specialized agents and actions.

Large Language Models in Numberland: A Quick Test of Their Numerical Reasoning Abilities

Framework names: introduces Numberland, a 100-problem test, to evaluate numerical reasoning abilities of LLM agents including OpenAI ChatGPT o1-mini, OpenAI ChatGPT o1, Google Gemini, Anthropic Claude, and Microsoft Copilot.
The paper assesses basic operations, advanced calculations, prime number checks, and the 24 game to test elementary skills and integration in complex problem-solving.
The study highlights limitations in LLMs' numerical reasoning, especially in trial-and-error search tasks, despite their proficiency in deterministic tasks.

Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks

PIEL (Permutation-Invariant Evasion Loss): introduces a novel adversarial attack framework, with MFMC Problem, Permutation-Invariant Loss, Topological Optimization, Optimized Path, Chunk, Memory Bank, and Sampling Space components, that breaks pragmatic multi-agent LLM systems.
PIEL framework optimizes prompt distribution across network topologies, considering bandwidth and detection risk constraints to bypass safety mechanisms.
The framework leverages graph-based optimization and permutation-invariant loss to maximize attack success rate while minimizing detection risk in multi-agent LLM systems.

ADVANCES AND CHALLENGES IN FOUNDATION AGENTS FROM BRAIN-INSPIRED INTELLIGENCE TO EVOLUTIONARY, COLLABORATIVE, AND SAFE SYSTEMS

Brain-Inspired AI Agent Framework: introduces modular architecture for intelligent agents, integrating brain-inspired components like Sensor, Cognition, Actor, Memory, World Model, Reward, Emotion, Goal, and Tool.
This framework maps cognitive, perceptual, operational modules to brain functionalities, emphasizing core components such as memory, world modeling, reward processing, and emotion systems.
The survey synthesizes modular AI architectures with insights from cognitive science and neuroscience to identify research gaps and opportunities for brain-inspired intelligent agents.

PAARS: Persona Aligned Agentic Retail Shoppers

PAARS (Persona Aligned Agentic Retail Shoppers): introduces a framework for simulating human shoppers using persona-driven LLM agents, incorporating human population, session histories, persona profile, agent population, retail tools, alignment suite, and potential applications.
PAARS framework synthesizes personas from historical shopping data to create agent population equipped with retail tools for simulating shopping sessions and evaluating alignment with human behavior.
The framework's alignment suite measures distributional differences between human and agent shopping behavior at group level, enabling applications like agentic A/B testing and surveying.

Output Constraints as Attack Surface: Exploiting Structured Generation to Bypass LLM Safety Mechanisms

CDA (Constrained Decoding Attack) framework introduces LLM Inference (processes input to logits), Grammar Rule (defines output structure), Lexer & Parser (processes grammar rules), Per-token Mask (filters tokens by grammar), Decoder Block (core LLM decoding process), Logit Processor (processes logits before decoding), Output Generation (generates output tokens), Content Auditing (checks output for safety), and External Safety Guardrails (external checks for safety) to weaponize structured output constraints for bypassing safety mechanisms.
CDA framework operates by embedding malicious intent within schema-level grammar rules (control-plane) while maintaining benign surface prompts (data-plane), contrasting with prior attacks focused on input prompts.
The framework highlights a critical security blind spot in current LLM architectures, urging a paradigm shift in LLM safety to address control-plane vulnerabilities beyond data-plane threats.

TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

TeleAntiFraud-30k Framework: introduces TeleAntiFraud-28k dataset creation and evaluation benchmark, with Real-Data ASR Processing (process real call recordings), LLM-Based Imitation and Augmentation (expand scenario coverage), Audio Synthesis (convert text to voice), Multi-Agent Adversarial Framework (simulate fraud tactics), TeleAntiFraud-Bench (evaluation benchmark), Think-LALM (slow-thinking fraud detection model), Model Training (training detection models), Reasoning Process Quality (assess reasoning quality), Scenario Classification (categorize call scenarios), Fraud Determination (determine fraudulent behavior), and Fraud Type Identification (identify fraud categories).
TeleAntiFraud-30k framework utilizes Real Audio Data (input audio recordings) processed into ASR Data (transcribed text from audio) and synthesized into Audio (synthesized voice data) and Text (annotated text data), employing User (caller agent), Cheater (fraudster agent), and Manager (conversation monitor agent) within Multi-Agent Adversarial Framework, while evaluation uses JSON (input data format) and potentially generates LR Audio (low resolution audio).
TeleAntiFraud-30k framework aims to address telecom fraud detection challenges by providing a multimodal dataset and benchmark for training and evaluating slow-thinking Large Audio Language Models (Think-LALM) in tasks like Scenario Classification, Fraud Determination, and Fraud Type Identification, ultimately enhancing reasoning and detection capabilities.

Grounding Agent Reasoning in Image Schemas: A Neurosymbolic Approach to Embodied Cognition

Neurosymbolic Framework for Grounding Agent Reasoning in Image Schemas: presents framework comprising language input, LLM parser, image schema formalizer, knowledge base, and neurosymbolic reasoner.
Framework utilizes LLM parser to translate language input into formal image schemas, stored in knowledge base, for neurosymbolic agent reasoning.
This framework grounds agent reasoning in embodied cognition by leveraging image schemas for enhanced interpretability and human-agent interaction.

Towards Scientific Intelligence: A Survey of LLM-based Scientific Agents

LLM-based scientific agents: introduces Planner, Memory, and Tool Set as core components for iterative, context-aware processing of complex scientific tasks.
LLM-based scientific agents architecture includes Planner for task decomposition, Memory for context and knowledge retention, and Tool Set for extending scientific capabilities with external tools.
The framework emphasizes the integration of Planner, Memory, and Tool Set to enable scientific agents to perform complex research tasks, ensuring reproducibility and driving scientific discovery.

Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics

CRE (Complete Rubric Evaluation): introduces a code evaluation framework, with Student Code, Javac, Error Dictionary, Question, Rubric, Prompt, LLM, System Message, Pointwise Marks, Logical Marks, Syntax Marks, Total Marks, Pointwise Feedback, and CRE GRADER, that uses LLM to assess code logic and a compiler for syntax, combining scores for final grade.
Complete Rubric Evaluation framework employs a detailed rubric and prompt to guide a Large Language Model in evaluating student code submissions, focusing on logical correctness while separately handling syntax errors via a compiler.
The CRE framework aims to simulate human-like grading by prioritizing conceptual understanding over minor syntax errors, providing a comprehensive evaluation through combined logical and syntactical assessments.

SchemaAgent: A Multi-Agents Framework for Generating Relational Database Schema

SchemaAgent (Schema Agent) introduces a LLM-based multi-agent framework for automated database schema generation, incorporating Product manager, Conceptual model designer, Conceptual model reviewer, Logical model designer, QA engineer, and Test executor components.
SchemaAgent framework employs Error detection and correction mechanism to refine schema quality through iterative feedback loop among agents.
This multi-agent system aims to enhance accuracy and efficiency in relational database schema design process by emulating manual workflow.

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

TTC (Test-Time Compute) scaling framework: introduces internal and external strategies to enhance software engineering agents by scaling computation time, incorporating development-contextualized trajectory synthesis, rejection sampling, reasoning training, process and outcome reward models, and execution verification.
Internal TTC leverages trajectory synthesis and rejection sampling to improve reasoning depth, while external TTC employs a process-based search strategy with reward models and execution verification for targeted computation allocation.
The framework aims to achieve comparable performance to larger models using smaller, personally deployable LLMs by strategically increasing inference-time computation and focusing on critical decision points in software engineering tasks.

DebFlow: Automating Agent Creation via Agent Debate

DebFlow: introduces a framework for automated agent creation, with Search Space, Self-reflection, Workflow, and Agent Debate components.
DebFlow employs agent debate to optimize workflows and integrates self-reflection for iterative performance improvement based on past experiences.
The framework utilizes LLM-invoking nodes as basic units, optimizing agent workflows through debate and reflection mechanisms for enhanced efficiency and performance.

Detecting Functional Bugs in Smart Contracts through LLM-Powered and Bug-Oriented Composite Analysis

PROMFUZZ (PROMFUZZ): introduces automated system to detect functional smart contract bugs, with LLM-driven Multi-Perspective Analysis, Invariant Checker Generation, Bug-oriented Analysis Engine and Functional Bug Detection components.
PROMFUZZ employs dual-agent approach with Auditor Agent and Attacker Agent, and generates invariant checkers using Critical Variable Extraction, Principal Statement Extraction and Template-based Checker Generation.
PROMFUZZ utilizes Bug-oriented Analysis Engine with Strategically Invariant Checker Insertion and Bug-oriented Fuzzing for effective functional bug detection and provides Bug Report component.

30th March 2025

GIScience in the Era of Artificial Intelligence: A Research Agenda Towards Autonomous GIS

Autonomous GIS: introduces a conceptual framework for next-generation geographic information systems, integrating decision-making, data preparation, data operation, memory-handling, and core-updating functions, supported by geo-data retrieval, spatial analysis, cartography, and modeling agents, across routine-aware, workflow-aware, data-aware, result-aware, and knowledge-aware levels, aiming for self-generating, self-executing, self-verifying, self-organizing, and self-growing goals, scalable across local, centralized, and infrastructure scales.
Autonomous GIS framework envisions a paradigm shift in GIScience by leveraging generative AI to automate geospatial problem-solving with minimal human intervention, enhancing accessibility and democratizing spatial analysis for broader applications.
The framework outlines key research challenges and future directions for autonomous GIS, emphasizing the need for benchmarks, enhanced AI understanding of geospatial concepts, and addressing ethical and societal impacts of AI-driven geospatial technologies.

Exploring GPT-4 for Robotic Agent Strategy with Real-Time State Feedback and a Reactive Behaviour Framework

LLM-Director Framework: introduces a robotic control approach integrating LLM for high-level task planning within Director reactive behaviour framework, utilizing NUClear for real-time message passing and sensor modules for environmental feedback.
This framework uses LLM to generate tasks based on user requests and world information, which are then executed by the Director Tree, ensuring safety and smooth transitions through skills and joint commands to actuators, guided by real-time feedback from sensor modules.
The integration of LLM with Director framework allows for dynamic reactive task layer construction, addressing safety, task transitions, and real-time feedback for improved robotic agent performance in complex environments.

If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs

LIFESTATE-BENCH (Lifelong State Benchmark): introduces cumulative experience, fact checking, memory testing, and judge model components for evaluating lifelong learning in LLMs.
LIFESTATE-BENCH assesses state evolution in LLMs through episodic interactions and fact-based questions focusing on self-awareness, memory, and relationship shifts.
The benchmark employs non-parametric and parametric memory testing methods and LLM-as-judge for comprehensive evaluation of lifelong learning capabilities.

RE-ALIGNING LANGUAGE TO VISUAL OBJECTS WITH AN AGENTIC WORKFLOW

Real-LOD (Re-Aligning Language to Visual Objects with an Agentic Workflow): introduces agentic workflow for refining language descriptions using planning, tool use, and reflection components.
Real-LOD leverages LLM for reasoning and reflection, and VLM for tool use to iteratively improve language alignment with visual objects.
The framework enhances data quality for language-based object detection by reducing hallucinations in automatically generated descriptions.

VideoGen-Eval: Agent-based System for Video Generation Evaluation

VideoGen-Eval: introduces agent-based system for video generation evaluation, with Structured Prompts, Advanced Models, Generated Videos, Human annotations, Prompt Structurer, Content Judger, Tools Pool, Temporal-sparse Content, Temporal-dense Content, MLLMs, and Human Alignment.
VideoGen-Eval benchmark includes structured prompts and large-scale video results for dynamic and flexible evaluation of video generation models.
The system employs LLM for content structuring, MLLM for content judgment, and patch Tools Pool for temporal dimension assessment, enhancing alignment with human preferences.

CoRanking: Collaborative Ranking with Small and Large Ranking Agents

CoRanking (Collaborative Ranking): introduces a collaborative ranking framework, with Small Listwise Reranker (SLR), Passage Order Adjuster (POA), and LLM Listwise Reranker (LLR), that combines small and large ranking models for efficient and effective passage ranking.
CoRanking framework utilizes S³ strategy for preference pair selection and Human Label enhanced ranking construction to improve training, addressing positional biases of LLMs and enhancing ranking performance.
The framework achieves significant efficiency gains by using SLR for pre-ranking and POA for order adjustment before applying LLR for final reranking, outperforming pure LLM listwise reranking in both speed and effectiveness.

An Analysis of Decoding Methods for LLM-based Agents for Faithful Multi-Hop Question Answering

ReAct (Reasoning and Acting): introduces systematic analysis of ReAct framework with Thought, Action, Observation components combined with Decoding Strategy and Retrieval to improve faithfulness in question answering.
The framework iteratively uses Thoughts to decide Actions, leading to Observations, employing Decoding Strategy and Retrieval for enhanced answer faithfulness.
Combining ReAct with faithful decoding methods significantly improves accuracy in multi-hop question answering tasks by enhancing contextual faithfulness.

A Multi-Agent Framework with Automated Decision Rule Optimization for Cross-Domain Misinformation Detection

MARO (Multi-Agent Framework for cross-domain misinformation detection with Automated Decision Rule Optimization): introduces a two-module framework with Multi-Dimensional Analysis Module, incorporating Linguistic Feature Analysis Agent, Comment Analysis Agent, Fact-Checking Agent Group with Fact-Questioning Agent and Fact-Checking Agent, Questioning Agent, and Multi-Dimensional Analysis Report, alongside Decision Rule Optimization Module, which includes Cross-Domain Validation Task, Judge Agent, Decision Rule Optimization Agent, Decision Rule Optimization Prompt, Demonstrations from Other Domains, Wikipedia, Google, LRS, and Top K decision rules, for cross-domain misinformation detection.
MARO's Multi-Dimensional Analysis Module employs multiple agents to analyze news from different perspectives, generating a comprehensive analysis report, while the Decision Rule Optimization Module automatically refines decision rules using feedback from cross-domain validation tasks.
The framework utilizes a question-reflection mechanism with a Questioning Agent to guide expert agents in Multi-Dimensional Analysis Module for enhanced analysis quality, and iteratively optimizes decision rules in Decision Rule Optimization Module to improve generalization across domains.

AI Agents in Engineering Design: A Multi-Agent Framework for Aesthetic and Aerodynamic Car Design

AI Design Agents Framework: introduces multi-agent system with Styling-, CAD-, Meshing- and Simulation-Agents, leveraging Foundation Models, Geometric Deep Learning Models and Engineering Tools, orchestrated by AutoGen, for accelerating car design process.
Framework automates conceptual sketching, styling, 3D shape retrieval, generative modeling, CFD meshing and aerodynamic simulations.
AI Design Agents Framework enhances design exploration, efficiency and collaboration between designers and engineers in automotive design.

SPIO: Ensemble and Selective Strategies via LLM-Based Multi-Agent Planning in Automated Data Science

SPIO (Sequential Plan Integration and Optimization): introduces multi-agent framework with fundamental code generation agents, sequential planning agent, plan optimization agent, and code write agent, leveraging memory for automated data science.
SPIO employs sequential planning and LLM-driven optimization across preprocessing, feature engineering, modeling, and hyperparameter tuning modules.
SPIO enhances predictive accuracy and robustness by exploring multiple candidate strategies and ensembling top-performing plans.

GRASP: Municipal Budget AI Chatbots for Enhancing Civic Engagement

GRASP (Generation with Retrieval and Action System for Prompts): introduces a municipal budget chatbot framework combining RAG Framework for document retrieval and ReAct Agent for action execution, utilizing Prompt Engineering and Domain Knowledge to enhance response accuracy.
GRASP framework incorporates LLM with System Instructions, Agent Scratchpad, and Intermediate Steps, processing user Prompt to interact with Budget Tool, Drawing tool, and Analysis tool through iterative Thoughts, Action, and Observation cycles for Final Response.
This approach aims to improve truthfulness and grounding of chatbot responses by leveraging external Budget Docs within Vector Database and employing Metadata filtering and Similarity Search for relevant information retrieval.

EncGPT: A Multi-Agent Workflow for Dynamic Encryption Algorithms

EncGPT (Encryption GPT): introduces multi-agent workflow for dynamic encryption, with rule-, encryption-, decryption-, source- and recipient-agents, and memory.
It dynamically generates encryption rules, applies them for encryption and decryption, and supports homomorphic operations on encrypted data.
This framework enhances communication security in LLM-MA systems by addressing dynamic algorithm generation and single encryption algorithm vulnerabilities.

Efficient Inference for Large Reasoning Models: A Survey

Efficient Inference for Large Reasoning Models: introduces survey on efficient inference methods for Large Reasoning Models, categorizing approaches into explicit compact CoT and implicit latent CoT, alongside taxonomy, empirical analyses, challenges, and improvements.
The survey explores token efficiency in Large Reasoning Models, addressing token consumption, memory overhead and inference time, while considering solutions like model merge, new architectures and agent routers.
This research emphasizes trade-offs between efficiency and interpretability, safety, and application scope within efficient reasoning methods for Large Reasoning Models.

29th March 2025

Agentic Large Language Models, a survey

Agentic LLM Taxonomy: introduces reasoning, acting, interacting, self-reflection, retrieval, multi-step, world models, VLA, robot, tools, assistants, social capabilities, open-ended societies, new data for categorizing agentic LLM research.
Agentic LLM Taxonomy categorizes agentic LLMs into reasoning for better decisions, acting for real-world tasks, and interacting for social behaviors.
Agentic LLM Taxonomy highlights the virtuous cycle where reasoning, acting, and interacting generate new data to improve LLMs continuously.

Factored Agents: Decoupling In-Context Learning and Memorization for Robust Tool Use

Factored Agent Architecture: introduces a two-component agent system with planner-LLM and memorizer-SLM, addressing limitations of single-agent designs for tool use by decoupling in-context learning and memorization.
The architecture separates planning using a larger LLM from tool-specific formatting handled by a smaller SLM, aiming to improve robustness against API errors and enhance planning accuracy in dynamic environments.
This decoupling strategy intends to mitigate trade-offs between in-context learning and static memorization, potentially leading to more adaptable and error-resilient agentic AI systems for tool utilization.

28th March 2025

WorkTeam: Constructing Workflows from Natural Language with Multi-Agents

WorkTeam (multi-agent NL2Workflow framework): introduces supervisor, orchestrator, and filler agents for collaborative natural language to workflow conversion.
WorkTeam framework enhances workflow construction accuracy through task specialization and collaboration among supervisor, orchestrator, and filler agents.
The framework utilizes supervisor agent for task planning and result reflection, orchestrator agent for component selection and orchestration, and filler agent for parameter population.

Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions

MedAgentSim (MedAgentSim): introduces a multi-agent framework with patient-, doctor-, and measurement-agents within conversation- and experience replay-phases, utilizing medical- and experience-records buffers, KNN few-shot retrieval, chain-of-thought reasoning, majority vote ensembling, and reflection-phase for enhanced diagnostic accuracy.
MedAgentSim framework simulates realistic clinical interactions by enabling doctor agents to actively gather patient information through multi-turn conversations and iteratively refine diagnostic strategies using self-improvement mechanisms.
The framework incorporates experience replay and memory buffers to facilitate progressive learning and improve the performance of LLM-powered agents in dynamic diagnostic settings, bridging the gap between static evaluations and real-world medical reasoning.

Unlocking LLM Repair Capabilities in Low-Resource Programming Languages Through Cross-Language Translation and Multi-Agent Refinement

LANTERN (LANguage Translation and multi-agEnt Refinement): introduces a novel program repair framework, with Analyzer, Translator, Repairer, Test Suites, Middleware, Historical Data Storage & Retrieval, Prompt Construction, Process Control and Translation Coordination, that leverages cross-language translation and multi-agent refinement to enhance LLM-based repair capabilities in low-resource programming languages.
LANTERN framework strategically translates buggy code to languages where LLMs exhibit stronger repair performance, utilizing a multi-agent iterative repair paradigm and incorporating historical feedback for informed decision-making.
The framework's key innovation lies in its LLM-based Analyzer that dynamically selects optimal target languages for translation based on bug characteristics and previous repair attempts, effectively bridging the performance gap across programming languages.

Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey

Multi-turn Conversational Agent: introduces agent architecture for multi-turn dialogues, with User Input-Agent Output, Task Planner, Tool Invoker, Agent Core, Conversation Memory, and Turn Memory components.
This framework manages conversation flow by decomposing user requests, invoking tools, maintaining memory, and generating responses.
The architecture enables coherent and context-aware interactions over extended dialogues by leveraging memory and planning.

UNDERSTANDING INEQUALITY OF LLM FACT-CHECKING OVER GEOGRAPHIC REGIONS WITH AGENT AND RETRIEVAL MODELS

ReAct-like agent: introduces agent-based fact-checking, with LLM, Wikipedia access, local cache, system prompt and user message, to evaluate factuality of statements.
This framework employs function calling LLM to query Wikipedia for external knowledge, caching results for subsequent use.
The agent-based method explores enhancing fact-checking via external information access, yet encounters performance limitations compared to RAG using verified documents.

COSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching

COSIL (Software Issue Localization): introduces a two-stage framework for issue localization, with file-level search space reduction and function-level iterative search, utilizing module and function call graphs, guided by a searcher and pruner, to identify suspicious code locations.
COSIL employs a module call graph enhanced reflection and iterative function call graph searching to refine search space and context, dynamically constructing graphs and using context pruning for effective issue localization.
The framework leverages a searcher agent with tools and a pruner to manage context and direction during iterative search, aiming for concise yet effective context for accurate issue localization without pre-built indexes.

Agent-Centric Personalized Multiple Clustering with Multi-Modal LLMs

Agent-Centric Framework (Agent-Centric Personalized Multiple Clustering Framework): introduces agent-centric personalized clustering using MLLM Agents to traverse Relational Graph built from MLLM-based Embedding Extractor and identify Searched Clusters based on User Interests.
The framework constructs Relational Graph via Embedding Similarity filtering of Image Embeddings and employs Agent-Centric Graph Traversal with Membership Assessment and Cluster Update mechanisms.
This approach leverages MLLM Agents for efficient graph exploration, starting from Seed Nodes within Connected Components and iteratively expanding clusters by evaluating Candidate Nodes and Neighbor Nodes.

PharmAgents: Building a Virtual Pharma with Large Language Model Agents

PharmAgents: introduces a virtual pharmaceutical ecosystem, driven by LLM-based multi-agent collaboration, that simulates drug discovery workflow with components including agents for disease expertise, target analysis, molecule design, and preclinical evaluation, alongside databases and computational tools.
PharmAgents decomposes drug discovery into target discovery, lead identification, lead optimization, and preclinical evaluation stages, employing specialized LLM-driven agents for each stage, enhanced with machine learning models and domain-specific tools, to achieve autonomous and explainable drug design.
The framework emphasizes interpretability and efficiency by integrating LLMs for reasoning and decision-making at each stage of drug discovery, ensuring transparency and enabling researchers to understand and validate the AI-driven process, ultimately accelerating drug development.

27th March 2025

MemInsight: Autonomous Memory Augmentation for LLM Agents

MemInsight (Autonomous Memory Augmentation): introduces autonomous memory augmentation framework with Attribute Mining, Annotation, Retrieval Pool, Memory Retriever and Memory Augmentation to enhance LLM agents' contextual performance.
MemInsight leverages attribute mining and annotation for structured memory representation, enabling efficient retrieval through attribute-based and embedding-based methods.
The framework improves memory retrieval by filtering irrelevant information and retaining key insights, demonstrated across conversational tasks.

Debate-Driven Multi-Agent LLMs for Phishing Email Detection

Multi-Agent Debate Framework: introduces multi-agent debate framework with pro-phishing Agent 1, anti-phishing Agent 2, debate adjudicating Judge Agent, and scripted Debate Procedure for phishing email detection.
This framework uses two debating LLM agents and judge LLM to improve phishing email classification via structured argument exchange.
Debate-driven approach enhances contextual analysis and reasoning for improved phishing detection accuracy compared to single-agent methods.

LEARNING TO LIE: REINFORCEMENT LEARNING ATTACKS DAMAGE HUMAN-AI TEAMS AND TEAMS OF LLMS

MBRL (Model-Based Reinforcement Learning): introduces adversarial agent, with Action, Team, State, Planner, Internal Model, Adversarial agent components, to study malicious AI in human-AI teams.
MBRL framework uses internal model of team dynamics and planner to decide AI agent's action to maximize damage to team performance.
This approach investigates vulnerabilities in human-AI collaboration and informs development of defense strategies against AI-driven attacks.

GateLens: A Reasoning-Enhanced LLM Agent for Automotive Software Release Analytics

GateLens: introduces a system for automotive software release analytics, utilizing Query Interpreter Agent, Relational Algebra Generation, RA to Pandas Code Conversion, Coder Agent, Code Execution, Analysis Results Output to User, Database, and Knowledge Base components.
GateLens employs Query Interpreter Agent to translate user queries into Relational Algebra, which is then converted to executable code by Coder Agent for analysis using Database and guided by Knowledge Base.
The framework enhances analytical reasoning by incorporating Relational Algebra, enabling precise handling of domain-specific queries and improving the interpretability of the analysis process.

ReFeed: Multi-dimensional Summarization Refinement with Reflective Reasoning on Feedback

ReFeed (Refinement with Reflective Reasoning on Feedback): introduces a summarization refinement pipeline enhancing multiple dimensions using reflective reasoning on feedback.
ReFeed pipeline incorporates detection, multi-dimensional feedback mapping, reflective reasoning, supervised fine-tuned LLM, SumFeed-CoT dataset, goal specification, LRM teacher, refinement guideline, and quality control components.
ReFeed framework aims to address trade-offs, ordering bias, and noisy feedback in multi-dimensional summarization refinement, improving robustness and performance.

COLLAB: CONTROLLED DECODING USING MIXTURE OF AGENTS FOR LLM ALIGNMENT

COLLAB (CONTROLLED DECODING using MIXTURE OF AGENTS): introduces mixture of agents-based decoding strategy with policy-switching and token-level selection.
Leverages implicit Q-function for optimal agent selection from pool of models during inference.
Enables collaborative alignment among LLMs without retraining by dynamic agent selection.

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Efficient Reasoning for LRMs (Large Reasoning Models): introduces pre-training, SFT, RL, LRM, and inference stages for efficient reasoning methods.
The survey categorizes efficient reasoning methods based on these stages in the LLM lifecycle.
Efficient reasoning in LRMs is crucial for deployment, scalability, and practical application, addressing the challenge of excessive reasoning traces.

26th March 2025

GAIA-2: A Controllable Multi-View Generative World Model for Autonomous Driving

GAIA-2 (Generative AI for Autonomy): introduces a generative world model for autonomous driving, with Video Tokenizer, Latent World Model, Space-Time Factorized Transformer, and Conditioning components.
GAIA-2 framework includes Encoder and Decoder within Video Tokenizer, various Conditioning inputs like Actions, 3D Bounding Boxes, Metadata, Embeddings, Camera Parameters, Positional Encodings, and Memory and Noise components for generation.
GAIA-2 framework utilizes Training Tasks and Inference modes to enable controllable video generation for autonomous driving simulation, addressing multi-camera consistency and fine-grained control.

Feature4X: Bridging Any Monocular Video to 4D Agentic AI with Versatile Gaussian Feature Fields

Feature4X: introduces a framework to create interactive 4D scenes from monocular video by distilling features from 2D Foundation Models (Extract initial features) into a unified 4D Feature Fields (Unified feature representation) using dynamic Gaussian Splatting, involving Input Monocular Video (Source data), 2D Priors (Initial features/constraints) like Dynamic Mask (Foreground/background separation) and Metric Depth (Geometric prior), Static Feature GS (Background Gaussian Splatting), Dynamic Feature GS (Foreground Gaussian Splatting) guided by a Motion Scaffold (Guides dynamic deformation), a Parallel N-Dimensional Gaussian Rasterizer (Renders RGB/features) producing a Unified Latent Feature Map (Compact shared features), task-specific Decoders (Map unified to task features), optimized with Photometric Loss (RGB reconstruction objective) and Feature Loss (Feature reconstruction objective), enabling interaction via an LLM (Language interaction/control) and User (Provides interaction/prompts) within a 4D Agentic AI (Overall interactive system).
The core representation is a dynamic 4D Gaussian feature field, separating Static Background (Scene component representation) and Dynamic Foreground (Scene component representation), where features are compactly represented using scaffold-based interpolation and rendered efficiently via a parallel rasterizer.
This approach integrates functionalities from diverse 2D models (e.g., SAM2, CLIP-LSeg, InternVideo2) into a single 4D representation, supporting tasks like segmentation, editing, and VQA across novel views and time steps via LLM-powered interaction.

Beyond Believability: Accurate Human Behavior Simulation with Fine-Tuned LLMs

Web Action Generation Model: introduces a framework for simulating human web actions by predicting the next action and reasoning based on current webpage context and history of user interactions.
The framework utilizes a fine-tuned LLM to generate both a natural language rationale and a browser action, focusing on process-centric accuracy in web behavior simulation.
Key components include Context representing webpage content, Rationale explaining action intent, and Action defining browser operations like click, type and submit, or terminate.

TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews

TAMA (Thematic Analysis): introduces a human-AI collaborative framework with Cardiac Expert, Interview Transcripts, Chunks, Generation Agent, Codes, Evaluation Agent, Themes, Refinement Agent, and Feedback for thematic analysis of clinical interviews using multi-agent LLMs.
TAMA framework leverages multi-agent LLMs to automate thematic analysis by generating, evaluating, and refining themes through structured conversations and expert feedback, enhancing scalability and coherence.
The framework aims to improve thematic analysis quality in healthcare settings by integrating human expertise with multi-agent LLM systems, reducing manual workload and enhancing the consistency of results.

A Theoretical Framework for Prompt Engineering: Approximating Smooth Functions with Transformer Prompts

Theoretical Framework for Prompt Engineering: introduces a framework with prompt, transformer, virtual network, input, layer, and output, describing how prompts configure transformers to emulate virtual neural networks.
The framework posits that prompts dynamically adjust transformer's internal computations to approximate smooth functions.
This approach provides theoretical grounding for prompt engineering techniques and AI agent design by framing LLMs as adaptable agents.

Knowledge-Based Multi-Agent Framework for Automated Software Architecture Design

MAAD (Multi-Agent Architecture Design) introduces multi-agent framework for automated software architecture design, involving Analyst, Modeler, Designer, and Evaluator agents collaborating based on input Software Requirements Specifications to produce architecture artifacts.
MAAD framework utilizes agents to simulate human roles in architecture design, leveraging knowledge from existing system designs, authoritative literature, and architecture experts to enhance automation.
MAAD framework aims to automate and enhance the efficiency, scalability, and consistency of software architecture design process by generating diagrams and reports, ultimately advancing full automation of application-level software development.

Exploring the Effect of Robotic Embodiment and Empathetic Tone of LLMs on Empathy Elicitation

Interaction Design System: investigates empathy elicitation using user voice input, speech recording, laptop processing, OpenAI LLM (ChatGPT-40) for response generation, and Pepper robot/chatbot agents.
This system compares robotic embodiment and empathetic tone by employing physical robot and chatbot agents, both driven by LLMs, to elicit empathy towards a fictional character.
The system utilizes speech-to-text, LLM, and text-to-speech modules for interaction, measuring participant volunteering hours and perceived agent empathy through questionnaires.

sudo rm -rf agentic_security

SUDO (SCREEN-BASED UNIVERSAL DETOX2TOX OFFENSE): introduces a novel attack framework, with Detoxifier, Instruction Generator, Toxifier, Dynamic Updater and Evaluation Criteria components, that systematically bypasses refusal-trained safeguards in computer-use agents.
SUDO framework employs DETOX2TOX mechanism to transform harmful requests into benign ones and then re-introduce malicious content before execution, iteratively refining attacks based on refusal feedback.
The framework demonstrates vulnerabilities in computer-use agents and emphasizes the need for robust, context-aware safeguards by successfully executing attacks in real-world computing environments.

Open Deep Search: Democratizing Search with Open-source Reasoning Agents

ODS (Open Deep Search): introduces open-source search framework with Base LLM, Open Reasoning Agent, Open Search Tool and Tools for democratizing search.
ODS framework uses Open Reasoning Agent to interpret query and orchestrate actions using Tools including Open Search Tool for web search and processing.
ODS framework aims to close gap between proprietary and open-source search solutions by augmenting reasoning capabilities of open-source LLMs.

A Reference Architecture for Autonomous Networks: An Agent-Based Approach

AN Agent Reference Architecture (Autonomous Networks Agent Reference Architecture): introduces Situation Awareness (perceives network state), Decision Making (determines actions), Self Awareness (recognizes risks), Choice Making (selects suitable goal), World Knowledge (knowledge repository), Human-Agent Interaction (human collaboration), Agent-Agent Interaction (agent collaboration), Reactive Behavior (responds to stimuli), and Proactive Behavior (addresses potential risks) for autonomous network agents.
AN Agent Reference Architecture facilitates autonomous network operation by integrating reactive and proactive behaviors with human and agent interactions, leveraging shared domain-specific knowledge for consistent decision execution.
The architecture emphasizes modularity and functional specification, aiming for implementation-independence and completeness to guide development of trustworthy autonomous network agents replacing human operation and maintenance.

25th March 2025

FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs

FALCONEye: introduces a meta-architecture for video answer search, integrating Pre-processing, VLM (vision-language model), Captions, Summary, Reason, LLM (large language model), Candidate Clips, Evaluation, Answer, Confidence Score, Decision, and Promising Clips to efficiently locate answers in long videos.
It employs an iterative exploration algorithm, using Captions and Confidence Scores to refine search and focus resources on relevant video segments.
The framework is designed for Video Answer Search (VAS) tasks in long videos, addressing limitations of VLMs in handling long context and pinpointing specific information.

Inducing Personality in LLM-Based Honeypot Agents: Measuring the Effect on Human-Like Agenda Generation

SANDMAN: introduces deceptive agent architecture for cyber deception, integrating Agent Profile, Decision Engine, Memory Space (Semantic, Episodic, Working, Retrieval, Learning), LLM Engine, Planning Space (Bootstrap Task, Task List), and Action Space (Channel, Generators).
SANDMAN architecture enables creation of plausible human simulacra by inducing personality traits within LLMs to govern agent behavior within digital environments.
The framework enhances cyber deception strategies by facilitating agents to produce varied realistic behaviors through persona-driven methodology.

Writing as a testbed for open ended agents

Framework for Benchmarking Autonomous Writing Agents: introduces a framework for benchmarking autonomous writing agents with exploration, evaluation, and goal alignment components.
This framework evaluates Large Language Models as collaborative co-writers by analyzing action diversity, human alignment, and iterative text improvement capabilities.
The framework highlights challenges and potential solutions for building systems capable in diverse open-ended domains through iterative refinement.

Agent-Initiated Interaction in Phone UI Automation

Approach: introduces a method for agent-initiated interaction in phone UI automation, with User Instruction, Screen Input, Session History, Interaction Detection, Message Generation, and Baseline Models components.
This approach focuses on detecting the necessity for user interaction during task execution and generating appropriate messages for clarification or confirmation.
The evaluation utilizes baseline models to assess the effectiveness of different input modalities and model architectures for interaction detection and message generation in UI automation tasks.

MARS: Memory-Enhanced Agents with Reflective Self-improvement

MARS (Memory-Enhanced Agents with Reflective Self-improvement): introduces User, Assistant, and Checker agents within an Environment, incorporating STM and LTM memory components, alongside Reflection and Feedback mechanisms for iterative self-improvement.
MARS framework enhances agent performance by utilizing iterative feedback from the Checker to refine the Assistant's policy, leveraging Reflection to store historical data in LTM and STM for improved decision-making.
The framework aims to address limitations of LLMs in continuous decision-making and long-term memory by integrating these components for effective task completion in dynamic environments.

DIRECT POST-TRAINING PREFERENCE ALIGNMENT FOR MULTI-AGENT MOTION GENERATION MODELS USING IMPLICIT FEEDBACK FROM PRE-TRAINING DEMONSTRATIONS

DPA-OMF (Direct Preference Alignment from Occupancy Measure Matching Feedback): introduces alignment approach with Multi-modal Scene Encoder, Motion Token Prediction Model, Preference Ranking via Occupancy Measure Matching Feedback, and Expert demo, aligning pre-trained motion model with human preferences.
DPA-OMF leverages implicit preferences from pre-training expert demonstrations to construct preference rankings among model generations using occupancy measure matching for nuanced alignment guidance.
DPA-OMF improves realism of traffic simulation behaviors, enabling lightweight models to achieve comparable performance to state-of-the-art imitation models without extra human annotations.

BugCraft: End-to-End Crash Bug Reproduction Using LLM Agents in Minecraft

BugCraft: introduces automated crash bug reproduction framework utilizing LLMs, encompassing Bug Report, S2R Synthesizer, and Action Model Agent components.
BugCraft framework employs two-stage approach: Step Synthesizer generates structured steps from bug reports, and Action Model Agent executes these steps within Minecraft environment.
Framework evaluation utilizes BugCraft-Bench, a curated dataset of Minecraft crash bugs, to assess end-to-end reproduction and step synthesis effectiveness.

OmniNova:A General Multimodal Agent Framework

OmniNova: introduces a modular framework, integrating multi-agent system, workflow engine, language model and tool integration, configuration and prompt template systems, for complex automation tasks.
OmniNova employs hierarchical multi-agent architecture with coordinator, planner, supervisor, research, code, browser and reporter agents, managed by workflow engine, utilizing multi-layered LLM and unified tool integration.
The framework optimizes resource utilization and task completion through dynamic task routing, specialized agents, and multi-layered LLM allocation, enhancing efficiency and result quality.

24th March 2025

A Survey of Large Language Model Agents for Question Answering

LLM Agent (Large Language Model Agent): introduces LLM-based Agent QA system, with Action Planning, Memory, Thinking, Action for External Environment, Observation, and Environment components, where the paper surveys the design of LLM agents for question answering tasks.
LLM Agent architecture incorporates memory to aggregate information, planning to decide actions, and thinking for reasoning and answer generation, enabling interaction with external environments for enhanced QA.
The framework addresses limitations of standalone LLMs by integrating modules for planning and external interaction, improving performance in complex QA tasks requiring external knowledge and reasoning.

LLM-Based Insight Extraction for Contact Center Analytics and Cost-Efficient Deployment

Topic Modeling Pipeline: introduces a multi-step process for contact center analytics, utilizing Call Transcripts to perform Call Driver Generation for extracting Call Drivers, which are then processed by Topic Clustering and Topic Labeling to produce Topics.
This pipeline leverages a fine-tuned Mistral model for call driver generation and all-MiniLM-L6-v2 with HDBSCAN for topic clustering, aiming for cost-efficient and accurate topic identification from customer interactions.
The generated topics and call drivers facilitate downstream tasks like trend detection and FAQ generation, ultimately improving contact center efficiency and customer service.

Verbal Process Supervision Elicits Better Coding Agents

CURA (Code Understanding and Reasoning Agent): introduces process-supervised reasoning framework for code generation with code understanding, test case generation, solution reasoning, code testing sandbox, and process reward models.
CURA utilizes verbal process supervision to iteratively guide reasoning steps and refine model behavior through reward signals at each stage.
The framework enhances code generation performance by integrating iterative feedback and verbal process supervision throughout the reasoning pipeline.

Safeguarding Mobile GUI Agent via Logic-based Action Verification

VSA (VeriSafe Agent): introduces a verification framework for Mobile GUI Agents, incorporating Intent Encoder, Logical Formula, Intent Verifier, Feedback Generator, and VSA Library, designed to ensure agent actions are consistent with user instructions.
VeriSafe Agent framework utilizes autoformalization to convert natural language instructions into a domain-specific language, enabling rule-based runtime verification of mobile agent actions.
VSA framework aims to bridge probabilistic LFM-driven automation with deterministic formal verification by providing pre-action verification and structured feedback to guide GUI agents towards correct task completion.

DeepFund: Will LLM be Professional at Fund Investment? A Live Arena Perspective

DeepFund: introduces a comprehensive arena platform, with Stock Pool, Web API, Trading Memory, Current Position, Agent Planner, Technical Analysts, Fundamental Analysts, Insider Analysts, Media Analysts, Agent Manager, Decision, Decision Log, Trading Simulation Environment, Model Integration Interface and Performance Monitoring, for evaluating LLM-based trading strategies in simulated live environment.
DeepFund platform employs multi-agent framework where Agent Planner orchestrates analysis from specialized Technical, Fundamental, Insider, and Media Analysts, and Agent Manager synthesizes insights for final investment Decision.
Trading Simulation Environment in DeepFund mitigates data leakage by providing real-time market data through Web API and evaluating models on data post-training cutoff, while Performance Monitoring visualizes model performance.

How to Capture and Study Conversations Between Research Participants and ChatGPT: GPT for Researchers (g4r.org)

G4R (GPT for Researchers): introduces a website platform with researcher interface, GPT interface creation, GPT interface customization, GPT interaction, data capture, data download, and data merging for studying participant-GPT conversations.
G4R enables researchers to create customizable GPT interfaces, integrate them into studies like Qualtrics surveys, capture conversation data, and download/merge data for analysis.
This tool addresses the lack of standardized methods for human-AI interaction research by providing an accessible platform to facilitate and analyze participant conversations with GPT models.

P3Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

P3Nav (A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction): introduces a unified framework for embodied navigation integrating perception, planning, and prediction with Visual Encoder, Adaptive 3D-aware History Sampling, Large Language Model, Action Head, Answer Head, Tokenizer, and Multitask Collaboration strategy.
P3Nav framework employs Multitask Collaboration strategy for joint training on navigation and embodied question answering tasks, enhancing navigation performance by leveraging perceptual and planning skills.
Adaptive 3D-aware History Sampling strategy in P3Nav effectively utilizes historical observations by selecting non-overlapping RGB frames and position-enhanced features to reduce redundancy and improve efficiency.

AgentDropout: Dynamic Agent Elimination for Token-Efficient and High-Performance LLM-Based Multi-Agent Collaboration

AgentDropout: introduces dynamic agent elimination, optimizing communication by removing redundant agents and links in multi-agent systems using Node Dropout, Edge Dropout, Communication Graph, Adjacency Matrix, Intra-Round Communication, Inter-Round Communication, and DAGSample.
AgentDropout employs Node Dropout to remove less contributing agents and Edge Dropout to prune redundant communication edges within Communication Graph represented by Adjacency Matrix.
The framework enhances token efficiency and task performance by dynamically adjusting communication topology through Intra-Round and Inter-Round Communication, finalized by DAGSample for acyclic graph generation.

EconEvals: Benchmarks and Litmus Tests for LLM Agents in Unknown Environments

EconEvals: introduces benchmarks and litmus tests for evaluating LLM agents in unknown economic environments, featuring an LLM Agent (Core decision-maker using LLM) that interacts via Tool Use (Interaction via API calls) and a Notes Module (Persistent text memory) within various Economic Environments (Simulated scenarios with unknowns) to perform Benchmark Tasks (Capability measurement tasks) or Litmus Tests (Tendency measurement tasks), assessed by a Benchmark Score (Capability metric) or Litmus Score (Tendency metric).
The framework assesses LLM agents on economic decision-making (procurement, scheduling, pricing) through multi-turn interactions where agents must learn environment specifications via exploration using tools; benchmarks measure capability, while litmus tests quantify behavioral tendencies in tradeoffs like efficiency vs. equality.
Agents operate over multiple periods within stationary or non-stationary environments, using tools to gather information (e.g., CODE_BLOCK_0, CODE_BLOCK_1), manage memory (CODE_BLOCK_2, CODE_BLOCK_3), and submit actions (CODE_BLOCK_4, CODE_BLOCK_5, CODE_BLOCK_6), receiving feedback to inform future decisions.

Defeating Prompt Injections by Design

CaMeL (CApabilities for MachinE Learning): introduces a defense against prompt injection by separating control and data flow using a Privileged LLM (Generates code from user query), a Quarantined LLM (Parses untrusted data), a CaMeL Interpreter (Executes code, enforces policies), Tools (External functions/APIs), Security Policies (Define allowed tool operations), Capabilities (Data provenance/permission tags), and a Data Flow Graph (Tracks value dependencies).
The Privileged LLM generates Python code representing the user's intent from trusted queries, while the separate Quarantined LLM processes potentially untrusted data under the interpreter's strict control, preventing direct influence on tool execution flow.
The CaMeL interpreter executes the generated code, maintains a data flow graph with capabilities tracking data provenance and permissions, and enforces security policies before tool execution to prevent data exfiltration or unauthorized actions.

AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

AGENTSPEC: introduces a domain-specific language and runtime framework for enforcing customizable safety constraints on LLM Agents (LLM planner/executor), intercepting planned actions based on Rules (constraint definitions) activated by a Trigger (rule activation event) corresponding to a monitored Event (monitored agent/env change), evaluating conditions via Check (condition evaluation) using Predicates (boolean condition function), and applying Enforce (intervention mechanism) actions like user_inspection (request user confirmation), llm_self_examine (trigger agent self-reflection), invoke_action (execute predefined action), or stop (terminate agent action) before interaction with Tools (external functions) or receiving Observation (feedback from environment/tools), ensuring alignment with safety policies defined by the User (initiates interaction) and recorded in the Trajectory (record of agent states/actions).
The framework integrates with agent execution loops by hooking into decision points, monitoring Events such as state changes, specific actions (e.g., 'Transfer', 'PythonREPL', 'pour'), or task completion to apply user-defined Rules at runtime.
This approach provides a modular and interpretable method for runtime safety enforcement in LLM agents operating across domains including code execution, embodied interaction, and autonomous driving, with demonstrated low overhead.

23rd March 2025

AgentRxiv: Towards Collaborative Autonomous Research

AgentRxiv: introduces a framework for collaborative autonomous research using LLM agents, comprising an AgentRxiv Server (Centralized preprint server for agent research) enabling multiple Agent Laboratory (Autonomous multi-agent research system) instances to share findings, guided by a Human Researcher (Provides initial guidance), where each lab performs Literature Review Phase (Retrieves and summarizes prior work), Experimentation Phase (Plans and executes experiments) with mle-solver (Module for ML code generation and repair), and Report Writing Phase (Synthesizes findings into reports) via paper-solver (Module for LaTeX report generation), coordinated by agents like PhD Student Agent (Agent role in multiple phases) and ML Engineer Agent (Agent role in data preparation code).
The framework facilitates iterative improvement by allowing agent laboratories to upload reports to the AgentRxiv Server and retrieve prior work from peers, enabling cumulative knowledge building across independent agent systems.
Each Agent Laboratory automates research stages using specialized agents (e.g., Postdoc, Professor) and tools (mle-solver, paper-solver), supporting both fully autonomous operation and a co-pilot mode with human checkpoints.

Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation

RAM (Rewriting-driven AugMentation): introduces a VLN data augmentation paradigm using Object-Enriched Observation Rewriting (generates diverse observations) involving a VLM (extracts scene descriptions), LLM (rewrites scene descriptions), T2IM (synthesizes panoramic observations), and Panorama-to-View (discretizes panoramas), plus Observation-Contrast Instruction Rewriting (creates aligned instructions) involving a VLM (extracts landmarks/descriptions) and LLM (rewrites instructions via contrast), trained with a Mixing-then-Focusing Training Mechanism (optimizes learning) including a Random Observation Cropping Scheme (augments data), where foundation models rewrite annotated data into unseen observation-instruction pairs without simulators or web-scraping.
The framework first performs Object-Enriched Observation Rewriting by using a VLM to get scene descriptions, an LLM to enrich these descriptions with new objects, a T2IM to generate corresponding panoramas, and a Panorama-to-View algorithm for single views.
Subsequently, Observation-Contrast Instruction Rewriting employs an LLM to generate new instructions by contrasting original landmarks/observations (via VLM) with rewritten observation descriptions (via VLM), enhancing data diversity for training the Embodied Agent using a two-stage strategy.

Metaphor-based Jailbreaking Attacks on Text-to-Image Models

MJA (Metaphor-based Jailbreaking Attack): introduces a framework with Metaphor Agent, Context Agent, Prompt Agent, Example Retrieval Tool, Shared Memory, Observed Set, Candidate Set, Surrogate Model, Text Encoder, PCA, Gaussian Process Regression, Acquisition Strategy, and Query T2I model, where MJA aims to jailbreak text-to-image models using metaphor-based prompts.
MJA framework employs multi-agent generation module to create diverse prompts and optimization module to efficiently select effective adversarial prompts.
The framework balances attack effectiveness and query efficiency by leveraging metaphor and context in prompt generation and surrogate model-based optimization.

WON: Establishing Best Practices for Korean Financial NLP

WON: introduces WON (Korean financial LLM), a transparent language model, evaluated using Benchmark (evaluation dataset) on Leaderboard (evaluation platform), utilizing Instruction Dataset (refined training data) derived from competition submissions.
WON framework employs SFT (supervised fine-tuning) and DPO (direct preference optimization) training methods, with LLM-as-a-Judge (evaluation using LLM) for assessment and Deepseek-R1 (response generation model) for data processing.
The framework aims to establish best practices for Korean financial NLP by releasing resources and insights gained from a large-scale evaluation and model development process.

An Empirical Study of the Role of Incompleteness and Ambiguity in Interactions with Large Language Models

Framework name here: introduces a neural symbolic framework to model human and LLM agent interactions, focusing on Message-String, Turn, and Interaction, to define Incomplete Question and Ambiguous Question based on Oracle Agent responses within a Context of prior messages.
This framework analyzes question-answer sequences to empirically study the role of question Incomplete Question and Ambiguous Question properties in multi-turn interactions using Human Agent and LLM Agent.
The framework utilizes the Oracle Agent as a ground truth to categorize questions and assess the impact of Context on resolving Incomplete Question and Ambiguous Question during interactions.

GeoBenchX: Benchmarking LLMs for Multistep Geospatial Tasks

GeoBenchX Framework: introduces benchmark for evaluating LLMs on geospatial tasks, with Task-solving agent, LLMs, Tools, Datasets, LLM-as-Judge evaluator agent, Reference solutions, and Benchmark set.
GeoBenchX uses Task-solving agent equipped with Geospatial functions and commercial LLMs to solve Benchmark set of multi-step geospatial tasks using provided Datasets.
LLM-as-Judge evaluator agent assesses Task-solving agent's performance by comparing generated solutions against Reference solutions within the GeoBenchX framework.

22nd March 2025

Metacognition in Content-Centric Computational Cognitive C4 Modeling

C4 Modeling (Content-Centric Computational Cognitive Modeling): introduces a framework for building metacognitive AI agents, with Knowledge Resources, Perception, Reasoning, Action, Explanation Module, Lifelong Learning, and LLM components for Language Generation and Learning Enhancement.
C4 modeling emphasizes content-centric approach using semantically interpretable knowledge to enable agents with transparency, adaptability, reasoning, perception and action capabilities for human-AI teams.
The framework integrates LLMs to improve language generation and learning efficiency, while maintaining focus on knowledge-based reasoning for trustworthy and explainable AI agents.

Building Resource-Constrained Language Agents: A Korean Case Study on Chemical Toxicity Information

Tox-chat: introduces a Korean chemical toxicity information agent, utilizing LLM / SLM Agent, BM25 Search, Summary LLM, Keyword Search, Read General, QA Specific, QA LLM, Carcinogen Database, Toxic Dose Database, and Toxic Info Database for resource-constrained environments.
Tox-chat employs hierarchical section search and scenario-based dialogue generation to reduce token consumption and distill tool-using capabilities from larger models.
The framework demonstrates effective performance with a fine-tuned 8B parameter model, outperforming untuned models and baselines in database faithfulness and preference.

A Survey on Mathematical Reasoning and Optimization with Large Language Models

Framework name here: introduces Instruction Learning, Tool-based Methods, Chain-of-Thought (CoT) Methods, and Advanced Chain-of-Thought (CoT) Methods for mathematical reasoning with Large Language Models.
Instruction Learning refines models through structured tasks, while Tool-based Methods integrate external solvers, and Chain-of-Thought (CoT) and Advanced CoT Methods enhance reasoning via step-by-step logic and self-verification.
These methods collectively aim to improve mathematical problem-solving capabilities of Large Language Models, addressing challenges in arithmetic, theorem proving and optimization tasks.

CP-AgentNet: Autonomous and Explainable Communication Protocol Design Using Generative Agents

CP-AgentNet (Communication Protocol Agent Network): introduces a framework employing offline- and online-modules with strategy-, observer-, node- and programming-agents, LLM ranker, strategy-, episodic- and trajectory-memory, self-reflection and evaluation for autonomous communication protocol design.
CP-AgentNet framework facilitates explainable protocol design by leveraging multi-agent role-play and progressive strategy augmentation to address limitations of deep reinforcement learning and handcrafted protocols.
CP-AgentNet utilizes self-reflection and LLM ranker to enhance strategy refinement and decision consistency, enabling efficient adaptation to dynamic network environments without extensive online learning.

RAIDER: Tool-Equipped Large Language Model Agent for Robotic Action Issue Detection, Explanation and Recovery

RAIDER (Tool-Equipped Large Language Model Agent for Robotic Action Issue Detection, Explanation and Recovery): introduces a novel agent architecture integrating System Prompt, LLM, Program Flow Manager, Tools, and Recovery for robotic action issue detection, explanation, and recovery.
RAIDER framework utilizes "Ground, Ask&Answer, Issue" procedure, incorporating Ground, Ask, Answer, and Issue components within Program Flow Manager to dynamically generate and resolve context-aware precondition questions using Tool calls and Tool responses/warnings.
This architecture achieves adaptable and efficient issue detection by leveraging LLM's reasoning with grounded Tools, enabling targeted information gathering and surpassing limitations of predefined models or full scene descriptions.

Can LLMs Automate Fact-Checking Article Writing?

QRAFT (QRAFT): introduces a multi-agent framework for automatic fact-checking article generation, incorporating Planner (outline planning assistant), Writer (draft article composer), and Editor (draft review and refine) agents.
QRAFT framework processes Evidence Set (input evidence documents) to generate Evidence Nuggets Set (extracted evidence points), utilizes Preferences (article structure guidelines) for Draft Outline (planned article structure), producing First Draft (initial article draft) and Improved Draft (refined article draft) through Question-Answering Interactions (conversational refinement process).
QRAFT framework aims to mimic human fact-checkers' writing workflow, addressing the gap in existing automatic fact-checking pipelines by generating full fact-checking articles suitable for public dissemination.

ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

ComfyGPT (Comprehensive ComfyUI Workflow Generation with Generative Pre-trained Transformer): introduces a self-optimizing multi-agent system for ComfyUI workflow generation, comprising ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent.
ComfyGPT leverages FlowDataset for training and FlowBench for evaluation, utilizing GRPO optimization and RAG to enhance workflow generation and refinement.
ComfyGPT focuses on individual node links for improved precision and introduces FlowBench as a comprehensive benchmark for workflow generation assessment.

OmniScience: A Domain-Specialized LLM for Scientific Reasoning and Discovery

OmniScience Framework: introduces a domain-specialized LLM for scientific reasoning, utilizing science literature corpus for domain adaptive pretraining, task and chat instructions for model alignment, and s1K reasoning dataset for reasoning distillation to create OmniScience Reasoning model from Foundation Model via OmniScience Base and OmniScience Chat.
The framework employs a three-stage training pipeline: domain adaptive pretraining to instill scientific knowledge, supervised fine-tuning for instruction following, and reasoning-based knowledge distillation to enhance inferential capabilities.
OmniScience Framework demonstrates a compute-efficient strategy for developing high-performing domain-specific models by combining pretraining, alignment and distillation techniques, achieving state-of-the-art results in scientific reasoning tasks.

Autonomous Radiotherapy Treatment Planning Using DOLA: A Privacy-Preserving, LLM-Based Optimization Agent

DOLA (Dose Optimization Language Agent): introduces privacy-preserving LLM agent for autonomous radiotherapy planning, comprising Model Service, Optimization Agent, Working Memory, TPS Interface, and LLaMa3.1 LLM.
DOLA framework integrates RAG and RL with chain-of-thought prompting within local infrastructure to optimize radiotherapy plans while maintaining patient privacy.
The system architecture enables iterative dose optimization using LLM for decision-making and reasoning within a secure, locally hosted environment, enhancing plan quality and efficiency.

21st March 2025

CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

CVE-Bench: introduces CVE-Bench with LLM Agents, Target Containers, Evaluation, and Results, which is a cybersecurity benchmark for evaluating AI agents exploiting web vulnerabilities.
CVE-Bench: framework offers a sandbox environment featuring isolated containers hosting web applications and an automated evaluation system to assess attack success.
CVE-Bench: benchmark addresses limitations of existing cybersecurity benchmarks by providing comprehensive real-world vulnerability coverage and diverse attack types.

LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language

LLM+MAP (LLM + Multi-Agent Planning with PDDL): introduces a bimanual robot task planning framework with Visual Detection, Scene Spatial Description, Bimanual Domain Knowledge, LLM, PDDL Problem + Domain, Symbolic Planning, Partial-order Plan, Action Parser and Execution.
LLM+MAP framework utilizes LLM to convert natural language task descriptions and scene information into PDDL, enabling symbolic planners to generate partial-order plans for efficient bimanual robot control.
The framework integrates LLM reasoning with multi-agent planning for effective spatial and temporal coordination in complex, long-horizon bimanual manipulation tasks, achieving logical correctness and higher efficiency.

When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making

Text-Only Training for VLM Enhancement: introduces text-only training approach, with Situation, Question, Answer, Text-Only Input, Multimodal Input, VLM, Answer Prediction, Text-Only Training for VLM Enhancement, and Transfer to Multimodal Inference components, where text-only training enhances visual language model decision-making for human-centered tasks.
This framework improves visual language models by text-only training using synthesized textual data, enabling enhanced multimodal inference capabilities without relying on image-text paired data.
Text-only training provides efficient and scalable method to enhance visual language models' reasoning and decision-making for complex human-centered scenarios.

ETVA: Evaluation of Text-to-Video Alignment via Fine-grained Question Generation and Answering

ETVA (Evaluation of Text-to-Video Alignment): introduces a framework with Element Extractor, Graph Builder, Graph Traverser, Question Generation, Knowledge Augmentation, Multi-Stage Reasoning, Question Answering, External Knowledge, Multimodal CoT, Video Reflection, General Reflection, Conclusion Stage, ETVA Score, Generated Video, Generated Questions, Scene Graph, and Core Elements for evaluating text-to-video alignment through fine-grained question generation and answering.
ETVA framework employs a multi-agent system for atomic question generation from text prompts and a knowledge-augmented multi-stage reasoning process for question answering using video LLMs.
ETVA demonstrates improved correlation with human judgment compared to existing metrics by systematically evaluating video-text relationships through structured question generation and knowledge integration.

WHEN DEBATE FAILS: BIAS REINFORCEMENT IN LARGE LANGUAGE MODELS

DReaMAD (Diverse Reasoning via Multi-Agent Debate with Refined Prompt): introduces Strategic Prior Knowledge Elicitation, Perspective Diversification, and Multi-Agent Debate to improve LLM reasoning.
DReaMAD refines prior knowledge and ensures diverse perspectives by using Game Situation Reinterpretation, General Strategy Formulation, and structured debate.
DReaMAD enhances LLMs' strategic reasoning by structuring knowledge retrieval and diversifying input perspectives to mitigate bias and improve decision-making.

A-IDE : AGENT-INTEGRATED DENOISING EXPERTS

A-IDE (Agent-Integrated Denoising Experts): introduces a denoising framework integrating BiomedCLIP for semantic analysis, semantic similarities for probability distribution, an LLM Agent for decision-making, specialized RED-CNN models (Model 0, Model 1, Model 2) for denoising, and RMSE, PSNR, SSIM for evaluation.
A-IDE framework utilizes BiomedCLIP to analyze CT images and employs an LLM agent to dynamically select among specialized RED-CNN models based on anatomical context for improved denoising performance.
The agent-driven approach of A-IDE eliminates manual intervention and enhances denoising performance across diverse anatomical regions by leveraging specialized models.

Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models

Bayesian Teaching: introduces User, LLM (Large Language Model), Bayesian Assistant, Supervised Fine-tuning, Flight Recommendation, Hotel Recommendation, Web Shopping, User Preferences, and Beliefs to teach LLMs probabilistic reasoning for user interaction tasks.
Bayesian Teaching framework employs Supervised Fine-tuning to train LLMs by mimicking Bayesian Assistant for inferring User Preferences and updating Beliefs in Flight Recommendation and generalizing to Hotel Recommendation and Web Shopping.
The framework enhances LLMs' probabilistic reasoning in interactive settings, enabling generalization to novel tasks beyond the training domain.

20th March 2025

Towards Agentic Recommender Systems in the Era of Multimodal Large Language Models

LLM-ARS (LLM-based Agentic RS): introduces a framework with LLM-Agent, Initialization, Planning, Execution, Reflection, Query, Ranker, Tool Using, and Memory Module components for agentic recommendation systems.
This framework utilizes an LLM-Agent as the central decision-making unit, incorporating modules for planning, execution, reflection, and memory to enhance recommendation adaptability and personalization.
The architecture emphasizes autonomous decision-making and continuous self-evolution by integrating external tools and reflecting on past interactions to optimize future recommendations.

Survey on Evaluation of LLM-based Agents

Agent Evaluation: introduces a survey framework for evaluating LLM-based agents, with Agent Capabilities Evaluation-component, Planning and Multi-Step Reasoning-component, Function Calling & Tool Use-component, Self-Reflection-component, Memory-component, Application-Specific Agent Evaluation-component, Web Agents-component, Software Engineering Agents-component, Scientific Agents-component, Conversational Agents-component, Generalist Agents Evaluation-component, Frameworks for Agent Evaluation-component, Development Frameworks-component, Gym-like Environments-component, Discussion-component, Current Trends-component, and Emergent Directions-component.
Agent Evaluation framework categorizes evaluation methodologies based on agent capabilities, application domains, general skills, and development frameworks, providing a structured overview of the field.
The framework highlights the shift towards realistic evaluations, identifies gaps in current methods like cost-efficiency and safety, and proposes future directions for agent evaluation research.

Issue2Test: Generating Reproducing Test Cases from Issue Reports

ISSUE2TEST (Issue Reproducing Test): introduces automated technique for generating issue-reproducing test cases utilizing root cause analysis, meta prompting, related files search, test generator, linter, test refiner - error fixing, run tests, error categorization, assertion match, and rank components.
ISSUE2TEST iteratively refines test cases through runtime feedback and error categorization to ensure generated tests accurately capture and reproduce the reported issue.
This approach enhances automated debugging and program repair workflows by providing executable test cases directly derived from issue descriptions, improving software reliability.

GREENIQ: A DEEP SEARCH PLATFORM FOR COMPREHENSIVE CARBON MARKET ANALYSIS AND AUTOMATED REPORT GENERATION

GreenIQ: introduces deep search platform with Main Researcher, Report Writing, Final Reviewer, Data Visualization, and Translator Agents for carbon market analysis and automated report generation.
GreenIQ leverages multi-agent architecture powered by Large Language Models to automate end-to-end workflow from data collection to multilingual reporting for carbon market intelligence.
GreenIQ enhances efficiency, accuracy, and scalability in carbon market research by integrating specialized agents for comprehensive analysis and validated reporting.

AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration

AutoRedTeamer: introduces a framework for automated red teaming, with Risk Analyzer (decomposes user inputs), Seed Prompt Generator (creates diverse test cases), Strategy Designer (selects attack combinations), Attack Memory (tracks attack performance), Attack Library (stores attack methods), Attack Judge (evaluates output harmfulness), Relevance Check (ensures test case relevance), Red-Teaming Agent (orchestrates evaluation), Target Model (LLM under evaluation), Validation Set (validates attack effectiveness), Attack Evaluation (assesses attack results), Initial Attack Library (starting attack methods), Attack Proposer (discovers new attacks), Attack Proposals (suggested new attacks), Attack Designer (implements attack proposals), Attack Implementation (concrete attack code), and Attack Strategy Proposer Agent (discovers and implements attacks).
AutoRedTeamer framework uses a dual-agent system comprising a red teaming agent for evaluation and a strategy proposer agent for discovering and integrating new attack methods.
The framework incorporates a memory-guided attack selection to learn from past attack attempts and refine strategies for improved red teaming effectiveness and adaptability to new vulnerabilities.

Automatic Generation of Safety-compliant Linear Temporal Logic via Large Language Model: A Self-supervised Framework

AutoSafeLTL (Automatic Safety-compliant Linear Temporal Logic): introduces a self-supervised framework for generating safety-compliant Linear Temporal Logic (LTL) specifications, incorporating Environmental Information & Desired Task, LTL Extraction, Safety Restrictions, Base Rules, Automated Verification, and producing Safety-compliant LTL.
The framework employs Automated Verification with Syntactic Check (Agent LLM1, AP Matching, Operator Matching) and Semantic Check (User LLM, Counterexample Analysis & Guidance, Agent LLM2) to ensure generated LTL adheres to predefined safety rules.
AutoSafeLTL framework leverages two Agent LLMs and User LLM within a pipeline to refine LTL generation through feedback and counterexample analysis, achieving safety compliance for cyber-physical systems.

DeepPsy-Agent: A Stage-Aware and Deep-Thinking Emotional Support Agent System

DeepPsy-Agent: introduces a psychological support system with deeppsy-chat dialogue model for response generation, stage awareness mechanism for context perception, deep thinking for multi-source reasoning, real-time stage transition detection model for signal capture, and state information update module for dynamic state tracking.
DeepPsy-Agent combines psychological theory and deep learning to achieve dynamic stage awareness and enhanced reasoning capabilities in emotional support conversations.
The system integrates stage-awareness and deep-thinking to improve dialogue management and reasoning, addressing limitations of traditional emotional support systems.

Advancing Mobile GUI Agents: A Verifier-Driven Approach to Practical Deployment

V-DROID (Verifier-Driven Robot for Interface Operations on Devices): introduces a verifier-driven mobile agent framework with Action Extractor, Verification Prompts, Verifier, Action Completion, and Working Memory components.
V-DROID decouples action decision-making into action extraction and verification, utilizing Discretized Action Space Construction and Prefilling-only Workflow for efficiency.
Pair-wise Progress Preference Training enhances Verifier's decision-making, and Scalable Human-Agent Joint Annotation Scheme facilitates data collection for V-DROID.

The Lighthouse of Language: Enhancing LLM Agents via Critique-Guided Improvement

CGI (Critique-Guided Improvement): introduces a two-player framework enhancing LLM agents, featuring an Actor Model(Generates actions, refines based on critique), a Critic Model(Generates structured critiques), an Action Buffer(Stores candidate actions), Critique(Structured feedback with assessment, revision), and a Refined Action(Final action post-critique), operating within an Environment(Interactive task setting) informed by History(Past interactions sequence) and refined using Training Data(Datasets for model fine-tuning), where the critic provides detailed natural language feedback to guide the actor's iterative improvement.
The framework comprises two stages: Critique Generation, where the Critic Model is trained to produce structured assessments and revisions based on expert examples, and Action Refinement, where the Actor Model is iteratively fine-tuned to utilize these critiques effectively alongside successful trajectory data.
This approach uses a dedicated critic for explicit, structured verbal feedback (assessing contribution, feasibility, efficiency, and suggesting revisions) and trains the actor to integrate this guidance, enhancing decision-making and exploration compared to methods relying solely on numerical rewards or self-correction.

Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents

RGB-D Fusion Architectures: introduces model architectures for autonomous driving, integrating RGB and depth data through Feature Extractor, Model Architecture, RNN Options and Offset Calculation Options components.
RGB-D Fusion Architectures: systematically compares early and late fusion strategies alongside depth-aware deformable convolution and geometric offset computation within Model Architecture for enhanced feature extraction.
RGB-D Fusion Architectures: evaluates recurrent neural networks like LSTM, LTC, CfC, and LRC within RNN Options to benchmark lightweight controllers for real-time, robust autonomous agent steering command prediction.

19th March 2025

Envisioning an AI-Enhanced Mental Health Ecosystem

AI-Enhanced Mental Health Ecosystem: introduces a multifaceted AI vision for mental health support, integrating AI-Simulated Client, Peer/Counsellor/Therapist, -Generated Suggestions, Decision-Support/Evaluation, Self-Help/Companionship, Proactive Monitoring, and Embedded/Ubiquitous Companion.
This ecosystem aims to enhance mental health care through proactive, responsive, adaptive AI paradigms complementing human interventions for growing global mental health crisis.
The framework emphasizes ethical AI deployment, user-centered design, and continuous evaluation ensuring supportive collaboration in mental health with human connection and cultural sensitivity.

ChatStitch: Visualizing Through Structures via Surround-View Unsupervised Deep Image Stitching with Collaborative LLM-Agents

ChatStitch: introduces a collaborative perception system, with Task Management Agent, Background Stitching Agent, Pose Measurement Agent, Perspective Measurement Agent, 3D Asset Management, 3D Asset View Change, Foreground Rendering, SV-UDIS, Language Input, Multi-views Input, Composed Images Output, Data, and Work flow, for visualizing obscured information via natural language.
ChatStitch employs multi-agent framework utilizing Large Language Models to process natural language commands and perform surround-view unsupervised deep image stitching.
The system achieves intuitive human perception by integrating language commands with external digital assets and generating photorealistic collaborative perception outcomes.

SPADE: Systematic Prompt Framework for Automated Dialogue Expansion in Machine-Generated Text Detection

SPADE (Systematic Prompt Framework for Automated Dialogue Expansion): introduces five data augmentation frameworks using structured prompting for synthetic dialogue generation, including Partial-Chatbot Data Augmentation (Generates partially synthetic dialogues) with Missing Sentence Completion (Fills system utterances) and Next Response Generation (Generates user utterances), and Full-Chatbot Data Augmentation (Generates fully synthetic dialogues) with Goal-to-Dialogue (Generates dialogue from goal), Paraphrase Dialogue (Rewrites utterances iteratively), and End-to-End Conversation (Simulates user-system interaction), utilizing components like LLM (Generates text), Goal (User task objective), Dialogue Input (Source conversation data), and Instructions (LLM task guidance) to address data scarcity for Machine-Generated Text detection.
These training-free frameworks generate 14 dialogue datasets by manipulating human dialogues or simulating conversations with LLMs based on user goals and specific prompts.
The study benchmarks these datasets against detection models, showing improved generalization with mixed datasets and analyzing detection accuracy based on chat history length in simulated online settings.

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

VIPER (Visual Perception and Explainable Reasoning): introduces multimodal instruction-based planning framework integrating VLM-based perception with LLM-based reasoning, including Perception- and Reasoning-Modules.
VIPER uses modular pipeline where Perception Module with frozen VLM generates textual descriptions of image observations processed by Reasoning Module with LLM policy to predict actions.
VIPER enhances explainability by leveraging text as intermediate representation enabling fine-grained analysis of Perception- and Reasoning-Modules for decision-making mechanisms.

LogiAgent: Automated Logical Testing for REST Systems with LLM-Based Multi-Agents

LOGIAGENT: introduces an LLM-driven multi-agent framework for logical testing of REST systems, with a Test Scenario Generator (Creates test scenarios), API Request Executor (Constructs and executes API requests), API Response Validator (Validates API responses using oracles), Scenario Scheduler (Manages scenario execution flow), Execution Memory (Stores historical execution data), API Relationship Graph (Models API relationships), OpenAPI Specification (Input API documentation), and Tested System (Target system under test), where agents collaboratively generate, execute, and validate API test scenarios focusing on business logic.
The framework utilizes logical oracles derived from API documentation, scenario context, and LLM knowledge to assess responses beyond status codes, identifying logical inconsistencies.
Execution Memory stores successful parameter values and failure reflections, enhancing contextual consistency and guiding future test generation by the API Request Executor and Test Scenario Generator.

Aligning Crowd-sourced Human Feedback for Reinforcement Learning on Code Generation by Large Language Models

cRLHF (crowd-sourced Reinforcement Learning with Human Feedback): introduces a novel framework for aligning crowd-sourced human feedback using Prompt (input description), Language Model (generates code), Output (code output), Multiple Annotators (feedback from many users), Annotated Output (code with human labels), cRLHF (proposed framework), and Aligned Output (improved code output).
cRLHF (crowd-sourced Reinforcement Learning with Human Feedback) framework, depicted in figures, contrasts traditional RLHF by incorporating Multiple Annotators (feedback from many users) to produce Annotated Output (code with human labels) and Aligned Output (improved code output), replacing single Annotator (evaluates code) and Ranked Output (ordered code by rank) feeding into Reward Model (learns reward function) of traditional RLHF.
The framework, utilizing Problem Description (task specification), Initial LLM (starting LLM), Tuned LLM (fine-tuned LLM), Generated Outputs (sampled code solutions), RL Update (policy adjustment), and Correction Rate (performance score), aims to improve code generation quality by leveraging diverse human feedback and Bayesian optimization without explicit reward model training.

Exploring Large Language Models for Word Games: Who is the Spy?

CoT-based scheduling framework (Chain-of-Thought based scheduling framework): introduces Judger, Player, and COT or TOT components to enable LLMs in word games through rule description, role and keyword assignments, compliance checks, keyword descriptions, reasoning, voting and player elimination.
The framework utilizes Judger to manage game flow and Player agents employing COT or TOT for strategic actions like describing keywords, reasoning about roles, and voting to identify the spy.
This approach aims to enhance LLM performance in social deduction games by structuring the interaction and decision-making process through distinct components and a chain-of-thought reasoning mechanism.

MAMM-REFINE: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration

MAMM-REFINE (Multi-Agent Multi-Model Refinement): introduces DETECT, CRITIQUE, and REFINE subtasks within a multi-agent debate framework to improve generation faithfulness.
MAMM-REFINE framework employs RERANK and GENERATE approaches for CRITIQUE and REFINE subtasks, enhancing performance through multi-agent and multi-model collaboration.
The framework demonstrates effectiveness in summarization and question answering by leveraging diverse LLMs and iterative refinement to reduce factual inconsistencies.

SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

SWEET-RL (RL with Step-WisE Evaluation from Training-time information): introduces a reinforcement learning framework that uses Training-Time History Information and Bradley-Terry objective to train a Critic, which provides rewards for Policy from RLHF optimization based on Chosen/Rejected trajectories and actions.
SWEET-RL leverages asymmetric information access for Critic and Actor, where Critic uses additional training-time information inaccessible to Actor, to improve credit assignment in multi-turn collaborative tasks.
This approach addresses limitations of standard RLHF methods in multi-turn settings by providing step-level rewards and improving generalization in complex reasoning tasks.

Safety Aware Task Planning via Large Language Models in Robotics

SAFER (Safety-Aware Framework for Execution in Robotics): introduces a multi-LLM framework composed of Planning-, Execution-, Feedback-Modules and LLM-as-a-Judge to enhance safety in LLM-driven task planning.
SAFER framework utilizes Task Planning LLM to generate plans, Safety Planning LLM to audit plans, Execution Module with Robot Agents to deploy tasks, Feedback Module to process outcomes and LLM-as-a-Judge to evaluate safety.
The framework ensures safety checks throughout planning and execution, integrating Control Barrier Functions for safety guarantees at the robotic control policy level.

Reinforcement Learning Environment with LLM-Controlled Adversary in D&D 5th Edition Combat

RL-LLM Adversary Framework (Reinforcement Learning with LLM-Controlled Adversary Framework): introduces reinforcement learning environment using D&D 5E combat scenarios, integrating LLM Agent-controlled adversary, and DQN-based RL Agent for strategic AI development within a simulated Environment.
This framework employs LLM Agent for strategic decision-making and RL Agent for adaptive learning, utilizing DQN Network composed of Input map, Conv2d layers, Flatten, Embedding layers, Concatenate, Linear layers, and Output - Q-value for value estimation.
The framework facilitates strategic decision-making research in complex rule-based games, demonstrating that LLM-trained RL Agents outperform rule-based and LLM-controlled adversaries, highlighting the potential of LLMs to enhance strategic depth and adaptability in AI systems.

18th of March 2025

MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration

MANTRA (Multi-AgeNT Code RefAactoring): introduces a comprehensive LLM agent-based framework for automated method-level refactoring, with Context-Aware Retrieval-Augmented Generation (RAG) constructing searchable Database of Pure Refactoring Code Examples using Code Description and Caller-Callee Relationships Incorporation, coordinated Multi-Agent Refactored Code Generation employing Developer Agent with Static Code Analysis Tool for RAG-based Refactoring Examples Retrieval and Chain-of-Thought Refactoring Code Generation, and Reviewer Agent for Refactoring Verification, Code Style Consistency Analysis and Compilation and Test Verification, alongside Self-Repair Using Verbal Reinforcement Learning with Repair Agent and Reflexion Framework through Initial Analysis, Self-Reflection, Planning and Acting phases.
MANTRA framework emulates human refactoring process by integrating retrieval-augmented generation, multi-agent collaboration, and verbal reinforcement learning to improve code correctness and readability for method-level refactoring tasks.
MANTRA significantly enhances automated refactoring success rate and code quality compared to baseline LLM and existing LLM-powered tools, demonstrating practical advantages for advancing software refactoring automation.

MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation

MDTeamGPT (Multi-Disciplinary Team Generative Pre-trained Transformer): introduces a multi-agent framework for medical consultation, incorporating Primary Care Doctor, Specialist Doctor Agents, Lead Physician, Chain-of-Thought Reviewer, Safety and Ethics Reviewer, Correct Answer Knowledge Base, Chain-of-Thought Knowledge Base, Historical Shared Pool, Shared Vector Database, and Patient.
This framework utilizes consensus aggregation and residual discussion structure to enhance diagnostic accuracy and reduce cognitive burden in multi-round, multi-agent medical consultations.
MDTeamGPT employs knowledge bases to accumulate consultation experience, enabling self-evolution and improved generalization in medical diagnosis tasks.

MoK-RAG: Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation for Embodied AI Environments

MoK-RAG (Mixture of Knowledge Paths Enhanced Retrieval-Augmented Generation): introduces multi-source retrieval framework with Splitting-, Constraint- and Generation-Modules to address cognitive-algorithmic discrepancy of single-source knowledge retrieval in current Retrieval-Augmented Generation systems.
MoK-RAG framework partitions knowledge base into multiple specialized paths via Splitting Module, organizes retrieved knowledge using Constraint Module, and generates response through Generation Module, enhancing contextual relevance and adaptability.
MoK-RAG framework mitigates "Reply Missing" problem, which refers to incomplete or lacking key details in generated responses due to single-source knowledge retrieval, by enabling simultaneous retrieval from multiple knowledge paths.

Gricean Norms as a Basis for Effective Collaboration

Normative Framework: introduces Lamoids, GPT-4-powered agents, integrating Gricean Norms, Inference Norm, Cognitive Frameworks, and Fs-CoT Prompting for effective human-AI collaboration through Response Generation.
Normative Framework enhances agent's pragmatic reasoning by incorporating Gricean maxims and cognitive theories into Fs-CoT prompting to interpret unclear instructions and generate context-aware responses.
By adhering to Gricean and Inference norms within the framework, Lamoids achieve improved task accuracy and clearer communication in collaborative grid world environment.

ENVBENCH: A BENCHMARK FOR AUTOMATED ENVIRONMENT SETUP

ENVBENCH: introduces a benchmark for automated environment setup, encompassing Repository, Environment Setup, Language Model, AI Agent, Generated Script, Evaluation Results, Evaluation Suite, and Metrics components.
ENVBENCH evaluates environment setup approaches by generating shell scripts and verifying environment configuration through static analysis and compilation checks.
This benchmark facilitates systematic assessment of environment setup strategies, addressing limitations of existing datasets and evaluation methods in the software engineering domain.

PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

PLAY2PROMPT (Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play): introduces automated framework for zero-shot tool learning, with tool-use example generation, tool documentation optimization, and task LLM components.
PLAY2PROMPT employs iterative beam search with sample proposal, sample evaluation, and down-sampling to refine documentation and generate examples through tool play.
This approach enhances LLM tool utilization by creating high-quality documentation and demonstrations without labeled data or manual effort.

DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal

DARS (Dynamic Action Re-Sampling): introduces an inference-time compute scaling method for coding agents, incorporating Generate, Reproduction, Localization, Bug Fixing, Evaluation, and Expansion components with Generator LLM, Reviewer LLM, and Selector LLM, to enhance performance by dynamically re-sampling actions.
DARS framework utilizes Expansion mechanism with Generator LLM for action candidates, Reviewer LLM for patch scoring based on Score Rubrics, and Selector LLM for best patch selection, processing Input and Feedback to improve coding agent's decision-making.
This approach aims to address limitations of sequential, multi-solution, and tree search methods by selectively branching at key decision points and employing depth-first strategy, achieving state-of-the-art performance on SWE-Bench Lite benchmark.

Conversational Agents as Catalysts for Critical Thinking: Challenging Social Influence in Group Decision-making

System Overview: introduces chat interface, server, and database components, with Summary Agent, Database, AI-message History, Conversation Agent, AI Duplicate Checker, and Cosine-similarity, where system processes direct and public chat messages through agents.
System architecture includes Summary Agent for public opinion analysis, Conversation Agent for generating contextual counterarguments, and AI Duplicate Checker for message novelty.
AI Duplicate Checker uses cosine-similarity to ensure novelty of generated messages compared to AI-message History stored in Database.

Empowering LLMs in Decision Games Through Algorithmic Data Synthesis

Mastermind-Dou Framework: introduces a three-stage reasoning process with Training Dataset, Opponent, Step-by-step Output, Possible Action Prediction, Opponent Strategy Prediction, and Final Action Selection to enable LLMs to play Doudizhu game.
Mastermind-Dou framework uses Possible Action Prediction to predict likely moves, Opponent Strategy Prediction to anticipate adversary actions, and Final Action Selection to choose the optimal game action.
The framework enhances LLMs' decision-making in imperfect information games by decomposing the reasoning into sequential prediction and selection stages.

FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks

FlexVLN (Flexible Vision-and-Language Navigation): introduces a hierarchical approach for vision-language navigation, integrating Environmental Perception, LLM Planner, MLLM Verification, Instruction Follower, and Object Localization components.
FlexVLN employs LLM Planner for high-level planning and guidance generation, Instruction Follower for low-level execution, and MLLM Verification to ensure guidance feasibility, enhancing generalization across diverse VLN tasks.
The framework utilizes Environmental Perception to understand surroundings and Object Localization to identify the target, achieving effective navigation through a combination of LLM planning and supervised learning execution.

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

MDocAgent (Multi-Modal Multi-Agent Framework for Document Understanding): introduces a novel RAG and multi-agent framework with text-based RAG, image-based RAG, general agent, critical agent, text agent, image agent, and summarizing agent.
MDocAgent framework addresses DocQA challenges by combining text and image RAG with specialized agents for refined processing and critical information extraction.
This approach enables improved DocQA performance through collaborative multi-agent architecture and cross-modal understanding of long documents.

Towards a Barrier-free GeoQA Portal: Natural Language Interaction with Geospatial Data Using Multi-Agent LLMs and Semantic Search

GeoQA Portal: introduces a multi-agent LLM framework with Router, Analyzer, Explainer, Visualizer, Mission Planner, Relation Analyzer, Region Selector, Entity Finder, and Geo Filter for natural language interaction with geospatial data.
GeoQA Portal decomposes user queries, assigns subtasks to specialized LLM agents, and presents task plans and visualizations to enhance transparency and user engagement.
The framework of GeoQA Portal supports flexible data inputs and semantic search, aiming to bridge the gap between complex GIS workflows and public data accessibility for non-expert users.

Retrieval-Augmented Simulacra: Generative Agents for Up-to-date and Knowledge-Adaptive Simulations

Retrieval-Augmented Simulacra: introduces system simulating social network service interactions with User Persona Generation LLM Module, RAG Module, and Post / Reply Generation LLM Module.
System uses Community Rule, Community Goal, and Samples of User Personas to generate User Persona-based Posts / Replies.
Framework simulates realistic social network interactions using web information retrieval and persona-based content generation.

Personalized Attacks of Social Engineering in Multi-turn Conversations - LLM Agents for Simulation and Detection

SE-VSim (Social Engineering - Victim Simulation): introduces a dual-agent framework with Attacker Agent, Victim Agent, and Conversation Generation Pipeline to simulate social engineering attack mechanisms in multi-turn conversations.
arxiv_paper_framework_name: models victim agents with varying personality traits and attacker agents with predefined attack goals to generate realistic chat-based social engineering scenarios.
arxiv_paper_framework_name: facilitates the study of victim vulnerabilities and attacker strategies by generating a dataset of simulated conversations for personalized social engineering defense.

TestForge: Feedback-Driven, Agentic Test Suite Generation

TestForge: introduces agentic framework for cost-effective test suite generation with Initial Test Generator, LLM Agent, Agent Actions, Environment Feedback, Test Refinement Loop, Code Repository Context, and Generated Test Suite.
TestForge reframes LLM-based test generation as iterative process, refining initial zero-shot tests using execution feedback and coverage reports to enhance test quality and coverage.
The framework leverages detailed execution feedback and operates at file-level to improve cost-efficiency and generate high-quality, readable test suites for complex real-world code.

17th March 2025

When Should We Orchestrate Multiple Agents?

Orchestration Framework: introduces a method to dynamically select the optimal agent from a set of Agents (Perform tasks, human/AI/hybrid) for tasks arriving via an Input Data Stream (Sequential task inputs), considering performance across different Regions (Data distribution partitions), costs via a Cost Estimator (Estimates agent cost per region), and feasibility via Constraints (Agent feasibility rules), using a Correctness Estimator (Estimates agent accuracy per region), Region Probability Estimator (Estimates region likelihood), and Total Empirical Utility (Cost-adjusted performance metric) for selection by the Orchestrator (Selects agent based on utility).
The framework utilizes online probabilistic inference to update agent correctness and region probabilities, calculates an Appropriateness Metric (Measures orchestration value) to determine when orchestration is beneficial, and is applied to simulations including resolving Rogers' Paradox by selecting Learning Strategies (Choices in Rogers' Paradox simulation).
A user study involving a User (Human decision-maker in study) choosing between task completion, outsourcing to an AI Agent (LLM agent in study) or a Human Agent (Agent representing human performance) demonstrates improved performance with constrained orchestration compared to baseline scenarios where users act as poor orchestrators.

Why Do Multi-Agent LLM Systems Fail?

MASFT (Multi-Agent System Failure Taxonomy): introduces taxonomy of failure modes in multi-agent systems, categorizing them into Pre Execution Failure Modes, Execution Failure Modes, Post Execution Failure Modes, and further groups into Failure Categories including Task Verification, Inter-Agent Misalignment, and Poor Specification.
MASFT framework organizes failure modes based on inter-agent conversation stages, spanning from pre-execution to post-execution phases, and classifies them into three main categories reflecting system design, agent coordination, and quality control issues.
MASFT taxonomy provides structured framework for understanding and mitigating failure modes in multi-agent LLM systems, serving as a foundation for future research towards building robust and reliable multi-agent systems.

Do Large Language Models Understand Performance Optimization?

Performance Optimization Agent: introduces a system integrating LLMs with profiling feedback for HPC code optimization, with Input Prompt, Evaluator, Codee, LLMs, Compilers, Results, HPC Commonsense, Code Correctness, Performance Benchmarking, Metrics, Memory, Profiling Tools, Profiling Plan, Metrics Annotation, System Prompt, Code Generation, Code Replacement, and Output Inspection components.
Performance Optimization Agent leverages profiling tools and LLMs iteratively to optimize HPC code by replacing hotspot functions and recompiling, while evaluating performance metrics and ensuring code correctness.
The agent aims to bridge the gap between traditional HPC optimization and AI-driven code assistants by incorporating human-like iterative refinement and memory of prior optimization attempts for enhanced performance gains.

A Comprehensive Survey on Multi-Agent Cooperative Decision-Making: Scenarios, Approaches, Challenges and Perspectives

LLMs-enhanced MARL Framework: introduces a structure integrating Large Language Models with Multi-Agent Reinforcement Learning, encompassing Feature Representation Extractor, Language Translator, Reward Models, Decision-makers, World Model Simulator, and Policy Interpreter.
This framework leverages LLMs for enhanced reasoning and language understanding within MARL agents, facilitating improved collaboration and decision-making in complex environments.
The architecture supports various roles for LLMs, including information processing, reward design, decision-making, and output generation, aiming to address challenges in multi-agent systems.

Toward Generative 6G Simulation: An Experimental Multi-Agent LLM and ns-3 Integration

Multi-Agent LLM framework: introduces a multi-agent system with Simulation Generation, Test Designer, Test Executor, and Result Interpretation Agents, leveraging External Tools and LLMs within a Feedback Loop managed by Agent Orchestration Layer for automated network simulation.
This framework integrates specialized agents to automate simulation lifecycle stages, from natural language input to actionable insights, using ns-3 and iterative refinement.
The framework enhances simulation accuracy and reduces manual coding by employing LLMs and external knowledge, facilitating rapid prototyping in complex network environments.

MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research

MicroVQA: introduces a benchmark, with Raw VQA creation, Exam-style MCQ generation, and RefineBot, for multimodal reasoning in microscopy-based research.
MicroVQA benchmark evaluates expert image understanding, hypothesis generation, and experiment proposal capabilities.
RefineBot component enhances MCQ difficulty by iteratively refining questions and distractors based on chain-of-thought analysis.

Agents Play Thousands of 3D Video Games

PORTAL (Policy Optimization and Reasoning for Tactical Artificial Learning): introduces a novel framework for game-playing AI agents, leveraging Strategy Description, BT DSL, Behavior Tree DSL, Blackboard Variables, Control Flows, Neural Nets, Hand-crafted Rules, Generated Codes, Task Nodes, BT Generator, Parser, Reflexion, Rollout, JSON, and AI C++ Server components.
PORTAL framework utilizes LLMs as policy architects to generate Behavior Tree DSL policies, which are then parsed and executed in a game environment, incorporating a reflexion mechanism for iterative policy refinement based on game feedback.
The framework's hybrid architecture combines strategic reasoning from LLMs with efficient execution through behavior trees and neural networks, enabling rapid deployment and adaptation of game agents across diverse 3D video game environments.

Goal2Story: A Multi-Agent Fleet based on Privately Enabled sLLMs for Impacting Mapping on Requirements Elicitation

Goal2Story (Impact Mapping framework): introduces multi-agent fleet for goal-driven requirements elicitation, with Alpha Captain, Intelligence Officer, Delivery Coordinator, Tactical Officer, Format Doctor, Validation Agent, and StorySeek dataset.
It leverages privately enabled small language models and Impact Mapping framework to automate requirements elicitation in agile development.
The framework aims to improve efficiency and quality of user story generation while addressing data privacy and cost concerns associated with large language models.

KNOWLEDGE-AWARE ITERATIVE RETRIEVAL FOR MULTI-AGENT SYSTEMS

Knowledge-Aware Iterative Retrieval for Multi-Agent Systems: introduces agent framework with Query Planning, Knowledge Update Mechanism, and Contextual Filtering for iterative knowledge retrieval.
Framework decouples external sources from internal knowledge cache, enabling dynamic search exploration and mitigating bias reinforcement loops.
System supports multi-agent extensions for competitive and collaborative knowledge sharing, enhancing reasoning and scalability in complex tasks.

DAgent: A Relational Database-Driven Data Analysis Report Generation Agent

DAgent (Relational Database-Driven Data Analysis Report Generation Agent) introduces a novel LLM agent framework for relational database analysis report generation, integrating Planning Module (processes input queries), Decomposition Tools (breaks down complex questions), Data Retrieval Tools (retrieves data from database), SQL Rewriting Tools (optimizes SQL queries), Report Generation Tools (generates analytical reports), Tools Module (collection of specialized components), and Memory Module (stores historical data).
DAgent framework utilizes planning to decompose complex questions, employs tools for data retrieval and report generation, and incorporates memory to enhance efficiency and contextual understanding for relational database-driven data analysis report generation.
The modular architecture of DAgent facilitates efficient task decomposition, flexible data retrieval strategies, and precise report synthesis, demonstrating strong potential for complex database analysis tasks.

MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways

MAP (Multi-Agent Inpatient Pathways): introduces a multi-agent framework with Triage-, Diagnosis-, Treatment-, and Chief-Agents, supported by Record Review-, Trainable REG-, and Expert Guidance-Modules, utilizing Medical Records and a Medical Knowledge Base, to enhance large language model performance in inpatient pathways.
MAP framework simulates inpatient pathway flow through collaborative agents, where each agent is empowered by specialized LLMs and modules for complex medical scenario processing and improved diagnostic accuracy.
The framework's modular design, incorporating record review, retrieval-enhanced generation, and expert guidance, aims to address limitations of current LLMs in complex inpatient diagnostic support by integrating diverse clinical data and knowledge.

MAP : Multi-user Personalization with Collaborative LLM-powered Agents

MAP (Multi-Agent system for Multi-user Personalization): introduces Planner, Rule Manager, Rule Retriever, and Storage components within Reflection, Analysis, and Feedback stages for multi-user personalization.
MAP framework orchestrates specialized agents to retrieve user data, reason about personalization tasks, resolve conflicts, and incorporate user feedback through iterative workflow.
MAP leverages multi-agent system to implement user-centered personalization workflow, emphasizing user involvement in resolution verification and failure management.

Identifying Cooperative Personalities in Multi-agent Contexts through Personality Steering with Representation Engineering

Personality Steering Framework: introduces personality steering via representation engineering to investigate LLM cooperation within Iterated Prisoner's Dilemma environment, utilizing LLM agents and rule-based players.
Framework employs Big Five personality traits steering through vectors and prompts to analyze impact on LLM behavior across different communication setups.
Key components include representation engineering for personality modulation, IPD environment for strategic interaction, and communication module for enhanced agent interaction.

Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective

Agentic HLS Optimization Framework: introduces agent-based methodology employing In-Context Learning, LLM, HLS Tool, Functional Test, Compiler, Agent Tasks, Inspect kernel, Solve ILP problem, Synthesize Solution, Select Solution, System Prompt, Config. Builder, and ILP Solver for automated hardware design optimization in High-Level Synthesis.
This framework explores LLMs' reasoning capabilities within High-Level Synthesis by automating code restructuring, pragma insertion, and design space exploration through iterative feedback loops and access to EDA tools.
The agentic approach aims to enhance design quality and efficiency by enabling LLMs to emulate expert system architects in navigating complex hardware optimization tasks, potentially improving upon current state-of-the-art methods.

Enforcing Cybersecurity Constraints for LLM-driven Robot Agents for Online Transactions

Security Architecture: introduces a cybersecurity framework for LLM agents in online transactions, integrating LLM-Driven Robot Agents Layer, Multi-Factor Authentication (MFA) Layer, Blockchain Layer, Anomaly Detection System (ADS) Layer, and User Interface Layer.
The framework enhances transaction security and integrity by combining multi-factor authentication for identity verification, blockchain for immutable records, and real-time anomaly detection for fraud prevention.
This architecture achieves improved fraud detection and reduced transaction latency compared to traditional systems, demonstrating enhanced security and efficiency for LLM-driven robotic agents.

Can Reasoning Models Reason about Hardware? An Agentic HLS Perspective

Agentic HLS optimization framework: introduces an automated optimization flow with In-Context Learning, HLS Tool, LLM, Functional Test, Compiler, Agent Tasks, ILP Solver, and Config. Builder for high-level synthesis.
This framework explores reasoning LLMs for automating code restructuring, pragma insertion, and design space exploration in HLS.
The agentic approach aims to improve design quality and efficiency by mimicking expert system architects in hardware optimization tasks.

Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents

PFI (Prompt Flow Integrity): introduces system security solution for LLM agents, featuring User Prompt, Agent Context, Plugins, Plugin Call, Plugin Result, Final Answer, Trusted Agent, Untrusted Agent, Proxy, Trusted Data, Untrusted Data, FlowCheck, TrustCheck, and GenerateQuery components.
PFI framework isolates untrusted data within Untrusted Agent, distinct from Trusted Agent, and employs Proxy to manage data flow between agents, utilizing FlowCheck and TrustCheck for validation and data classification.
PFI aims to mitigate privilege escalation risks in LLM agents by enforcing least privilege and ensuring data flow integrity through component-based architecture and security mechanisms.

16th March 2025

VeriLA: A Human-Centered Evaluation Framework for Interpretable Verification of LLM Agent Failures

VeriLA (Verifying LLM Agent failures): introduces a human-centered framework with Human-designed Agent Registry, Planning Agent, Task Plan, Agent Execution, Agent Outputs, Human-defined Agent Criteria, Agent Verifiers, Aggregator: Task Failure, and AI Practitioners / Agent Users for interpretable LLM agent failure verification.
VeriLA systematically evaluates agent failures using human-defined criteria, verifies execution outputs, and aggregates scores to identify task failures and guide revisions.
The framework enhances human-agent collaboration by providing interpretable failure analysis and reducing manual effort in debugging compound AI systems.

Facilitating Automated Online Consensus Building through Parallel Thinking

PTFA (Parallel Thinking-based Facilitation Agent): introduces an automated system for online consensus building, leveraging LLMs to embody Six Thinking Hats roles for structured discussions.
PTFA framework comprises an Agent module incorporating LLMs for diverse thinking roles and a Platform module for user interaction via Discourse interface and discussion data storage in a database.
The system utilizes Discourse forum system and OpenAI API to facilitate structured conversations based on Six Thinking Hats methodology and collect data for analysis of automated facilitation.

A Survey on the Optimization of Large Language Model-based Agents

LLM-based Agent Optimization Framework: introduces introduction, background, parameter-driven optimization, parameter-free optimization, datasets and benchmarks, application, challenges and future directions, and conclusion to comprehensively survey optimization strategies for agents based on large language models.
Parameter-driven optimization refines model parameters, while parameter-free optimization adjusts inputs and context to improve agent behavior without parameter changes.
The survey categorizes optimization into parameter-driven (fine-tuning, RL, hybrid) and parameter-free (prompt engineering, RAG, multi-agent) methods, further detailing data construction, evaluation, and applications.

SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially?

SPIN-Bench (Strategic Planning, Interaction, and Negotiation): introduces a multi-domain evaluation framework, with LLM Pool, Game Environment, Agent Engine, Agent Initialization, Phase Manager, and Evaluation Module, designed to measure strategic planning and social reasoning intelligence.
The framework combines PDDL tasks, competitive games, cooperative games, and strategic games to assess LLMs in diverse social settings.
SPIN-Bench is important for future research on robust multi-agent planning, social reasoning, and human-AI teaming by providing a unified and comprehensive evaluation platform.

GAMECHAT: Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments

GAMECHAT (Multi-LLM Dialogue for Safe, Agile, and Socially Optimal Multi-Agent Navigation in Constrained Environments): introduces a decentralized multi-agent navigation framework utilizing Initial Prompt, Another Agent Observed?, SMG Occurring?, MPC-CBF Update, LLM Conversation, Consensus or Comm. Limit Reached?, Was Consensus Reached?, Use LLM-Based Role Assignment, Strategy 1 Role Assignment, MPC-CBF w/ Role Constraints Update, and Goals Reached? components for safe, agile, and socially optimal navigation.
GAMECHAT framework leverages LLM Conversation for natural language-based priority negotiation and employs MPC-CBF Update for motion planning with safety and liveness guarantees in constrained environments.
The framework addresses spatial symmetry and deadlocks through explicit communication and game-theoretic strategies, achieving socially optimal navigation by prioritizing urgent tasks and ensuring subgame perfect equilibrium.

LLM-MEDIATED GUIDANCE OF MARL SYSTEMS

LLM-Mediated Guidance of MARL Systems: introduces a framework that integrates Rule-Based Controller, Natural Language Controller and LLM-Mediator to guide Agents within an Environment by generating Task-List based on Observations and Reward, overwriting Learned Policy to enhance Actions.
The framework uses LLM-Mediator to interpret interventions from Rule-Based Controller or Natural Language Controller, translating them into specific actions for agents in the Aerial Wildfire Suppression environment.
This approach aims to improve MARL performance by providing adaptive guidance through LLM-mediated interventions, accelerating learning and enhancing coordination in complex multi-agent scenarios.

Facilitating Automated Online Consensus Building through Parallel Thinking

PTFA (Parallel Thinking-based Facilitation Agent): introduces an automated system for online consensus building, leveraging LLMs to embody Six Thinking Hats roles for structured discussions.
PTFA framework comprises an Agent module incorporating LLMs for diverse thinking roles and a Platform module for user interaction via Discourse interface and discussion data storage in a database.
The system utilizes Discourse forum system and OpenAI API to facilitate structured conversations based on Six Thinking Hats methodology and collect data for analysis of automated facilitation.

Advancing Human-Machine Teaming: Concepts, Challenges, and Applications

QN-MHP (Queuing Network-Model Human Processor): introduces queuing networks and symbolic cognitive models to effectively model multitask human performance for cognitive modeling.
QN-MHP demonstrates potential in cognitive modeling but lacks accuracy under specific conditions and does not address speed control or complex road geometry adjustments.
QN-MHP represents an initial approach towards integrating cognitive and queuing models for human performance simulation.

15th March 2025

Agentic Search Engine for Real-Time IoT Data

IoT-ASE (IoT Agentic Search Engine): introduces a real-time search engine for IoT data, leveraging Classifier Agent, Retriever Node, Generator Agent, and Reviewer Agent, utilizing Service Description Vector Database and Real-Time IoT database, with components including Tokenizer, Embedding Model, Average Pooling, Normalize embeddings, and integrating SensorsConnect's Perception Layer, Edge Layer, Cloud Layer, Business Layer, and User Interface Layer.
IoT-ASE framework employs a Generic Agentic RAG approach to process IoT data queries, incorporating agents for classification, retrieval, generation, and review, ensuring context-aware and accurate responses by accessing real-time information and service descriptions.
The architecture of IoT-ASE is designed to address the challenges of fragmented and heterogeneous IoT data by utilizing a unified data model and standardized communication protocols within the SensorsConnect framework, facilitating efficient real-time data accessibility and decision-making.

TFHE-Coder: Evaluating LLM-agentic Fully Homomorphic Encryption Code Generation

Compiler-in-the-loop evaluator Framework: introduces iterative TFHE code generation process, with User Prompt initiating code creation by LLM, followed by Compiler TFHE compiling for Environment Output, using Compile Report feedback for LLM Revise if compilation is not OK.
This framework evaluates LLM's ability to generate compilable TFHE code through cycles of compilation and revision based on compiler diagnostics.
The iterative approach helps in systematically refining LLM output towards syntactically correct TFHE code generation.

Multi-Agent Systems Execute Arbitrary Malicious Code

MAS (Multi-Agent System): introduces control-flow hijacking attacks by manipulating metadata flow within User, Orchestrator, WebSurfer, FileSurfer, and CodeExecutor components, leading to Hijacked control flow and execution of Executable payload from Attack content.
MAS framework coordinates agents like WebSurfer, FileSurfer, and CodeExecutor under Orchestrator's direction to fulfill User requests, but is vulnerable to attacks rerouting Normal control flow.
Control-flow hijacking in MAS exploits metadata transmission to redirect agent invocations, causing execution of arbitrary code and system compromise, even with individual agent safety measures.

AgentDroid: A Multi-Agent Framework for Detecting Fraudulent Android Applications

AgentDroid (Multi-Agent Framework): introduces a multi-agent framework for Android fraudulent application detection, with Task Master, Certificate Checker, Link Analyst, Package Tracker, Permission Analyst, Icon Analyst, Content Analyst, Decision Maker agents, and Static Analysis, Feature Extraction, Agent Tools, Multi-Agent Detection modules.
AgentDroid leverages multimodal analysis and collaborative agents to improve fraud detection accuracy by analyzing APK files and extracting features.
The framework employs specialized agents and tools, including DeiT and T5 models, for detailed analysis of diverse APK characteristics to identify fraudulent applications.

ICCO: Learning an Instruction-conditioned Coordinator for Language-guided Task-aligned Multi-robot Control

ICCO (Instruction-Conditioned Coordinator) introduces a multi-agent reinforcement learning framework with Instructor providing Language Instruction to Coordinator, which uses Coordination Policy and LLM or Random Vector Generator to generate Task-Aligned and Consistent Instructions (TACI) for each Local Agent with Local Policy in Env.
ICCO framework balances language-instruction following and cooperative task execution by employing Coordinator to generate consistent instructions from global observations and language, improving coordination among Local Agents.
The framework utilizes Centralized Training with Decentralized Execution (CTDE) paradigm, training Coordinator and Local Agent policies jointly to optimize task efficiency and instruction following, enhanced by Consistency Enhancement Term.

Is Multi-Agent Debate (MAD) the Silver Bullet? An Empirical Analysis of MAD in Code Summarization and Translation

MAD (Multi-Agent Debate): introduces a multi-stage framework with Input, Stage 1, Stage 2, Stage 3, Agent Debate at each stage, Judge, and Evaluation: Debate Output, for structured debate among agents to solve software engineering tasks.
MAD framework utilizes iterative Agent Debate within each Stage to refine solutions and employs a Judge to evaluate agent responses and guide the process towards Evaluation: Debate Output.
The framework's modular design with distinct stages and agent roles facilitates structured problem-solving and leverages debate to enhance the quality of the final Evaluation: Debate Output.

SagaLLM: Context Management, Validation, and Transaction Guarantees for Multi-Agent LLM Planning

SagaLLM (Saga Language Learning Model): introduces Context Management-, Validation- and Transaction-Frameworks, addressing limitations in multi-agent LLM planning by ensuring context awareness and planning consistency.
SagaLLM integrates transactional processing with adaptive multi-agent intelligence, enhancing reliability and correctness in complex real-world applications.
The framework employs specialized agents and validation protocols to maintain critical constraints and state information throughout complex planning processes, improving decision-making robustness.

End-to-End Edge AI Service Provisioning Framework in 6G ORAN

Edge AI Service Provisioning Framework: introduces end-to-end system for edge AI service deployment using LLM-based Agent, RAN-Core Status Data, Edge Status Data, RIC Status Data, AI Model Repository Data, User Engagement Tool, Edge Management Tool, Core Management Tool, RIC Management Tool, AI Service Monitoring-Prediction, Other xApps, AI Service, Network Functions, and Databases.
This framework leverages LLM agent to automate AI model selection, service deployment, network adaptation, and QoS monitoring in 6G O-RAN.
The proposed framework aims to simplify edge AI deployment by abstracting low-level tasks and enabling intent-based service provisioning.

14th March 2025

CoLLMLight: Cooperative Large Language Model Agents for Network-Wide Traffic Signal Control

CoLLMLight (Cooperative Large Language Model Light): introduces a cooperative LLM agent framework for network-wide traffic signal control, with Observation Collection, Complexity-Aware Reasoning, Spatiotemporal-Aware Cooperative Decision-making, and Simulation-Driven Fine-tuning components.
CoLLMLight uses spatiotemporal graph and complexity-aware reasoning to dynamically adapt reasoning depth for optimal computational efficiency and decision quality.
Simulation-driven fine-tuning and environmental feedback enhance CoLLMLight's decision-making and efficiency in diverse traffic scenarios.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

CoT Monitoring Framework: introduces CoT Monitor and Action Monitor, to monitor reasoning models for misbehavior, in agentic coding environments.
CoT Monitor observes agent's chain-of-thought, actions, and outputs, while Action Monitor observes only tool calls and final outputs.
Using CoT monitoring can be more effective than action-only monitoring for detecting reward hacking in reasoning models.

Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery

Cerebrum (AIOS SDK): introduces a modular four-layer architecture comprising LLM Layer, Memory Layer, Storage Layer, and Tool Layer, alongside Overrides Layer, Agent Hub, Agent Chat, Context Manager, Scheduler, LLM Core(s), Tool Manager, Memory Manager, Storage Manager, Agent Manager, Planning Module, Action Module, Memory Module, Storage Module, AIOS Kernel, User Device, Exposed Ports, LLM Queue, Memory Queue, Tool Queue, Storage Queue, Agent Applications, AIOS System Call, and Thread Binding for agent development, deployment, distribution, and discovery.
Cerebrum framework provides a comprehensive SDK with a community-driven Agent Hub for sharing agents and an interactive web interface Agent Chat for agent testing and evaluation, aiming to standardize agent development and promote collaboration.
The platform's architecture facilitates both fine-grained control over agent behavior and rapid development through high-level abstractions, supporting diverse agent methodologies and user-created agent distribution within a centralized hub.

Alstorian lets AI be a historian: A KG-powered multi-agent system for accurate biography generation

Alstorian: introduces a knowledge graph powered retrieval-augmented generation system for biography creation, integrating KG-based index, two-step training, prompt, retrieval, aligned model, biography, verifier, error-aware generation, and error-aware solvers components.
Alstorian employs KG-based index for structured knowledge retrieval and two-step training to enhance stylistic consistency of generated biographies, alongside multi-agent system for real-time error detection and correction.
The framework achieves improved factual accuracy and reduced hallucination in biography generation through error-aware mechanisms and knowledge graph integration, demonstrating advancements over existing methods.

[GNNs as Predictors of Agentic Workf

Name		Name	Last commit message	Last commit date
Latest commit History 1,256 Commits
Autonomous_Agents_Resources.md		Autonomous_Agents_Resources.md
Autonomous_agent_logo.png		Autonomous_agent_logo.png
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers

20th May 2025

14th May 2025

13th May 2025

12th May 2025

11th May 2025

10th May 2025

9th May 2025

8th May 2025

7th May 2025

6th May 2025

5th May 2025

4th May 2025

3rd May 2025

2nd May 2025

1st May 2025

30th April 2025

29th April 2025

28th April 2025

27th April 2025

26th April 2025

25th April 2025

24th April 2025

23rd April 2025

22nd April 2025

21st April 2025

20th April 2025

19th April 2025

18th April 2025

17th April 2025

17th April 2025

16th April 2025

15th April 2025

15th April 2025

14th April 2025

13th April 2025

12th April 2025

11th April 2025

10th April 2025

9th April 2025

8th April 2025

7th April 2025

6th April 2025

5th April 2025

4th April 2025

3rd April 2025

2nd April 2025

1st April 2025

31st March 2025

30th March 2025

29th March 2025

28th March 2025

27th March 2025

26th March 2025

25th March 2025

24th March 2025

23rd March 2025

22nd March 2025

21st March 2025

20th March 2025

19th March 2025

18th of March 2025

17th March 2025

16th March 2025

15th March 2025

14th March 2025

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Packages