Problem Description
Hi authors, thanks for the amazing product!
I am probing into the AgenticChunking feature of agno, and I like the design very much. The following is my test code.
from pathlib import Path
from agno.document.chunking.agentic import AgenticChunking
from agno.knowledge.markdown import MarkdownKnowledgeBase
from agno.vectordb.lancedb import LanceDb
from agno.models.openai import OpenAIChat
# --- Configuration ---
MD_FILE_PATH = "./Auto_extract_topics.md" # Path to your Markdown file
LANCEDB_URI = "./agno_lancedb_agentic" # Path for LanceDB data
TABLE_NAME = "auto_md_agentic_chunks"
# Ensure OpenAI API key is set, or configure for your LLM
llm_instance = OpenAIChat(
id='gpt-4.1-mini'
)
# --- Load Knowledge Database---
# Setup LanceDB Vector Store
print(f"Initializing LanceDB at: {LANCEDB_URI}")
vector_db = LanceDb(
table_name=TABLE_NAME,
uri=LANCEDB_URI
)
# Setup Knowledge Base with Agentic Chunking
print(f"Setting up KnowledgeBase with AgenticChunking for file: {MD_FILE_PATH}")
agentic_chunker = AgenticChunking(
model=llm_instance
)
knowledge_base = MarkdownKnowledgeBase(
path=Path(MD_FILE_PATH), # Path to the local Markdown file
vector_db=vector_db,
chunking_strategy=agentic_chunker,
)
# Load the knowledge base
knowledge_base.load(recreate=True)
# --- View the populated KB---
import lancedb
DB_PATH = "./agno_lancedb_agentic"
TABLE_NAME = "auto_md_agentic_chunks"
# Connect to LanceDB
db = lancedb.connect(DB_PATH)
# Open the table
table = db.open_table(TABLE_NAME)
# Transfer into pandas dataframe
data = table.to_pandas()
# Drop vector column for illustration
data = data.drop(columns='vector')
The code works well and gives a decent chunking result. For example, below is the first chunk (the beginning sections of a journal paper; I cut some of the content for better reading).
print(data['payload'][0])
{"name": "Auto_extract_topics",
"meta_data": {"chunk": 1, "chunk_size": 1921},
"content": "RESEARCH Open Access
Automatic extraction of informal topics from online suicidal ideation Reilly N. Grant1, David Kucher2, Ana M. Le\u00f3n3, Jonathan F. Gemmell4*, Daniela S. Raicu4 and Samah J. Fodeh5
From The 11th International Workshop on Data and Text Mining in Biomedical Informatics Singapore, Singapore. 10 November 2017
Abstract\n\nBackground: Suicide is an alarming public health problem accounting for a considerable number of deaths each year worldwide. .......\n\nConclusions: These informal topics topics can be more... ... and precision of language.\n\nKeywords: Suicidal ideation, Word2Vec, Text mining",
"usage": {"prompt_tokens": 363, "total_tokens": 363}}
I have two more requirements which might further improve the knowledge base, and hopefully they are possible:
Customizing the prompt in AgenticChunking so that the chunking can meet the user's personalized needs. For example, separating the author information and the abstract text in the example above.
Incorporating more meta_data items based on the chunked pieces. For example, the LLM could identify the section type (with tailored custom prompts), so it would be ideal to incorporate it into meta_data.
My ideal chunking would look like this:
print(data['payload'][0])
{"name": "Auto_extract_topics",
"meta_data": {"chunk": 1, "chunk_size": 1921, "chunk_type": "general info"},
"content": "RESEARCH Open Access Automatic extraction of informal topics from online suicidal ideation Reilly N. Grant1, David Kucher2, Ana M. Le\u00f3n3, Jonathan F. Gemmell4*, Daniela S. Raicu4 and Samah J. Fodeh5\n\nFrom The 11th International Workshop on Data and Text Mining in Biomedical Informatics Singapore, Singapore. 10 November 2017 ",
"usage": {"prompt_tokens": xx, "total_tokens": xx}}
print(data['payload'][1])
{"name": "Auto_extract_topics",
"meta_data": {"chunk": 1, "chunk_size": 1921, "chunk_type": "Abstract"},
"content": "Abstract\n\nBackground: Suicide is an alarming public health problem accounting for a considerable number of deaths each year worldwide. .......\n\nConclusions: These informal topics topics can be more... ... and precision of language.\n\nKeywords: Suicidal ideation, Word2Vec, Text mining",
"usage": {"prompt_tokens": xx, "total_tokens": xx}}
I think this could provide a more robust knowledge base for downstream applications, and the current AgenticChunking gives quite a good starting point. But I don't know if this can be achieved directly in AgenticChunking, or by some other tweaking method?
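As a stopgap while this is explored, one tweak is to post-process the stored payloads and attach a chunk_type to meta_data after loading. The sketch below is hypothetical: classify_chunk is a stand-in keyword heuristic (in practice it could be replaced by an LLM call with a tailored prompt), and the payload layout follows the example shown earlier.

```python
import json

def classify_chunk(content: str) -> str:
    # Hypothetical heuristic classifier; in practice this could be
    # an LLM call using a tailored section-typing prompt.
    text = content.lstrip().lower()
    if text.startswith("abstract") or "keywords:" in text:
        return "Abstract"
    return "general info"

def add_chunk_type(payload_json: str) -> str:
    # Parse one payload (as stored in the LanceDB 'payload' column),
    # classify its content, and write chunk_type into meta_data.
    payload = json.loads(payload_json)
    payload["meta_data"]["chunk_type"] = classify_chunk(payload["content"])
    return json.dumps(payload)
```

Applied to the dataframe above, this could be something like `data['payload'] = data['payload'].apply(add_chunk_type)`, though splitting author info from the abstract into separate chunks would still need support inside AgenticChunking itself.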
Many thanks!
BTW, I attached the md file from my example above. It is a journal article: Auto_extract_topics.md
Proposed Solution
No ideal solution yet.
Alternatives Considered
No response
Additional Context
No response
Would you like to work on this?
Yes, I’d love to work on it!
I’m open to collaborating but need guidance.
No, I’m just sharing the idea.