[Feature Request] Custom prompt and enhanced metadata for AgenticChunking #3402


Open

yangxg opened this issue May 29, 2025 · 1 comment

Labels
enhancement New feature or request

yangxg commented May 29, 2025

Problem Description

Hi authors, thanks for the amazing product!

I am exploring the AgenticChunking feature of agno, and I like the design very much. The following is my test code.

from pathlib import Path

from agno.document.chunking.agentic import AgenticChunking
from agno.knowledge.markdown import MarkdownKnowledgeBase
from agno.models.openai import OpenAIChat
from agno.vectordb.lancedb import LanceDb

# --- Configuration ---

MD_FILE_PATH = "./Auto_extract_topics.md" # Path to your Markdown file
LANCEDB_URI = "./agno_lancedb_agentic" # Path for LanceDB data
TABLE_NAME = "auto_md_agentic_chunks"
# Ensure OpenAI API key is set, or configure for your LLM

llm_instance = OpenAIChat(
    id='gpt-4.1-mini'
)

# --- Load the Knowledge Base ---

# Setup LanceDB Vector Store
print(f"Initializing LanceDB at: {LANCEDB_URI}")
vector_db = LanceDb(
    table_name=TABLE_NAME,
    uri=LANCEDB_URI
)

# Setup Knowledge Base with Agentic Chunking
print(f"Setting up KnowledgeBase with AgenticChunking for file: {MD_FILE_PATH}")

agentic_chunker = AgenticChunking(
    model=llm_instance   
)

knowledge_base = MarkdownKnowledgeBase(
    path=Path(MD_FILE_PATH),  # path to the local Markdown file
    vector_db=vector_db,
    chunking_strategy=agentic_chunker,
)

# Load the knowledge base
knowledge_base.load(recreate=True)

# --- View the populated KB---

import lancedb

DB_PATH = "./agno_lancedb_agentic"
TABLE_NAME = "auto_md_agentic_chunks"

# Connect to LanceDB
db = lancedb.connect(DB_PATH)

# Open the table
table = db.open_table(TABLE_NAME)

# Convert to a pandas DataFrame
data = table.to_pandas()

# Drop the vector column for readability
data = data.drop(columns='vector')

The code works well and gives a decent chunking result. For example, below is the first chunk (the beginning sections of a journal paper; I cut some of the content for readability).

print(data['payload'][0])

{"name": "Auto_extract_topics", 
 "meta_data": {"chunk": 1, "chunk_size": 1921}, 
 "content": "RESEARCH Open Access
           Automatic extraction of informal topics from online suicidal ideation   Reilly N. Grant1, David Kucher2, Ana M. Le\u00f3n3, Jonathan F. Gemmell4*, Daniela S. Raicu4 and Samah J. Fodeh5
           From The 11th International Workshop on Data and Text Mining in Biomedical Informatics Singapore, Singapore. 10 November 2017     
            Abstract\n\nBackground: Suicide is an alarming public health problem accounting for a considerable number of deaths each year worldwide. .......\n\nConclusions: These informal topics topics can be more... ... and precision of language.\n\nKeywords: Suicidal ideation, Word2Vec, Text mining", 
"usage": {"prompt_tokens": 363, "total_tokens": 363}}

I have two additional requirements that could further improve the knowledge base, and I hope they are feasible.

  1. Customize the prompt in AgenticChunking so that the chunking can meet users' personalized needs. For example, separating the author information from the abstract text in the example above.

  2. Incorporate more meta_data items based on the chunked pieces. For example, the LLM could identify the section type (guided by a tailored custom prompt), and it would be ideal to include that in meta_data. A hypothetical API sketch follows below.
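To make the request concrete, here is a rough sketch of what such an API might look like. The chunking_prompt and metadata_prompts parameters are purely hypothetical, do not exist in the current AgenticChunking, and only illustrate the desired behavior:

# Hypothetical API sketch -- none of these extra parameters exist today
agentic_chunker = AgenticChunking(
    model=llm_instance,
    # custom instruction steering how the document is split
    chunking_prompt=(
        "Split the paper into semantically coherent chunks. "
        "Keep the author/affiliation block separate from the abstract."
    ),
    # extra meta_data fields the LLM should fill in for each chunk
    metadata_prompts={
        "chunk_type": (
            "Label the chunk as one of: general info, Abstract, "
            "Methods, Results, Discussion, References."
        )
    },
)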

My ideal chunking would look like the following:

print(data['payload'][0])

{"name": "Auto_extract_topics", 
 "meta_data": {"chunk": 1, "chunk_size": 1921, "chunk_type": "general info"}, 
 "content": "RESEARCH Open Access Automatic extraction of informal topics from online suicidal ideation   Reilly N. Grant1, David Kucher2, Ana M. Le\u00f3n3, Jonathan F. Gemmell4*, Daniela S. Raicu4 and Samah J. Fodeh5\n\nFrom The 11th International Workshop on Data and Text Mining in Biomedical Informatics Singapore, Singapore. 10 November 2017 ", 
"usage": {"prompt_tokens": xx, "total_tokens": xx}}

print(data['payload'][1])

{"name": "Auto_extract_topics", 
 "meta_data": {"chunk": 1, "chunk_size": 1921, "chunk_type": "Abstract"}, 
 "content": "Abstract\n\nBackground: Suicide is an alarming public health problem accounting for a considerable number of deaths each year worldwide. .......\n\nConclusions: These informal topics topics can be more... ... and precision of language.\n\nKeywords: Suicidal ideation, Word2Vec, Text mining", 
"usage": {"prompt_tokens": xx, "total_tokens": xx}}

I think this could provide a more robust knowledge base for downstream applications (for example, filtering chunks by section type, as sketched below), and the current AgenticChunking is already quite a good starting point. But I don't know whether this can be achieved directly in AgenticChunking, or through some other tweak?
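To illustrate the downstream benefit, here is a minimal filtering sketch. It assumes the payload column holds JSON strings, as the printouts above suggest; if your LanceDB export already returns dicts, drop the json.loads:

import json

# Pull the (hypothetical) chunk_type label out of each payload
data["chunk_type"] = data["payload"].map(
    lambda p: json.loads(p)["meta_data"].get("chunk_type")
)

# Keep only the abstracts, e.g. for targeted retrieval or summarization
abstracts = data[data["chunk_type"] == "Abstract"]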

Many thanks!

By the way, I attached the md file from my example above. It is a journal article.
Auto_extract_topics.md

Proposed Solution

No ideal solution yet, but a rough workaround idea is sketched below.
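One untested possibility while a native feature is discussed: subclass AgenticChunking, reuse its splitting, and add a second LLM pass that labels each chunk. This assumes the base class exposes chunk(document) returning a list of Document objects, stores the model on self.model, and that Document carries a mutable meta_data dict; all of this should be verified against the current agno source.

from typing import List

from agno.agent import Agent
from agno.document.base import Document
from agno.document.chunking.agentic import AgenticChunking


class LabeledAgenticChunking(AgenticChunking):
    """Untested sketch: add a chunk_type label to each chunk's meta_data."""

    def chunk(self, document: Document) -> List[Document]:
        # Reuse the built-in agentic splitting
        chunks = super().chunk(document)
        # Second pass: classify each chunk with the same model
        labeler = Agent(model=self.model)
        for c in chunks:
            run = labeler.run(
                "Classify this journal-paper chunk as one of: "
                "general info, Abstract, Methods, Results, Discussion, "
                "References. Answer with the label only.\n\n" + c.content
            )
            c.meta_data["chunk_type"] = run.content.strip()
        return chunks

The extra LLM call per chunk roughly doubles token usage, but it would produce exactly the chunk_type field shown in the ideal payloads above.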

Alternatives Considered

No response

Additional Context

No response

Would you like to work on this?

  • Yes, I’d love to work on it!
  • I’m open to collaborating but need guidance.
  • No, I’m just sharing the idea.
yangxg added the enhancement label May 29, 2025

linear bot commented May 29, 2025