ConardLi/easy-dataset


A powerful tool for creating fine-tuning datasets for Large Language Models

简体中文 | English

Features | Quick Start | Documentation | Contributing | License

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

Overview

Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.
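Compatibility with "all LLM APIs that follow the OpenAI format" means question and answer generation ultimately comes down to a standard chat-completions request. A minimal sketch of that wire format (the endpoint URL, model name, and environment variable below are placeholders for illustration, not Easy Dataset's actual internals):

```javascript
// Build a chat-completions request body in the OpenAI wire format.
// Any provider following this format (OpenAI, Ollama's /v1 endpoint,
// vLLM, etc.) accepts the same shape.
function buildChatRequest(model, systemPrompt, userPrompt) {
  return {
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userPrompt },
    ],
    temperature: 0.7,
  };
}

// Example usage (placeholder endpoint and key):
// const res = await fetch('https://api.example.com/v1/chat/completions', {
//   method: 'POST',
//   headers: {
//     'Content-Type': 'application/json',
//     Authorization: `Bearer ${process.env.API_KEY}`,
//   },
//   body: JSON.stringify(
//     buildChatRequest('my-model', 'You are a domain expert.', 'Hello')
//   ),
// });
```

Because the request shape is the only contract, pointing the tool at a different provider is just a matter of changing the base URL, model name, and API key in the project's LLM settings.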

Features

  • Intelligent Document Processing: Supports intelligent recognition and processing of multiple formats including PDF, Markdown, DOCX, etc.
  • Intelligent Text Splitting: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
  • Intelligent Question Generation: Extracts relevant questions from each text segment
  • Domain Labels: Intelligently builds global domain labels for datasets, with global understanding capabilities
  • Answer Generation: Uses LLM API to generate comprehensive answers and Chain of Thought (COT)
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Custom System Prompts: Add custom system prompts to guide model responses

Quick Demo

(Demo video: `ed3.mp4`)

Local Run

Download Client

  • Windows: Setup.exe
  • macOS: Intel / Apple Silicon (M-series)
  • Linux: AppImage

Install with NPM

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Install dependencies:
   npm install
  3. Build and start the application:
   npm run build

   npm run start
  4. Open your browser and visit http://localhost:1717

Build with Local Dockerfile

If you want to build the image yourself, you can use the Dockerfile in the project root:

  1. Clone the repository:

    git clone https://github.com/ConardLi/easy-dataset.git
    cd easy-dataset
  2. Build the Docker image:

    docker build -t easy-dataset .
  3. Run the container:

    docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset

    Note: Please replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.

  4. Open your browser and visit http://localhost:1717
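The `docker run` command above can equivalently be expressed as a Compose file. A sketch (the host-side volume path is yours to choose; the service and image names are just examples):

```yaml
services:
  easy-dataset:
    image: easy-dataset        # the image built in step 2
    ports:
      - "1717:1717"
    volumes:
      - ./local-db:/app/local-db   # persist the local database on the host
```

With this saved as `docker-compose.yml`, `docker compose up -d` replaces the manual `docker run` invocation.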

How to Use

Create a Project

  1. Click the "Create Project" button on the homepage;
  2. Enter a project name and description;
  3. Configure your preferred LLM API settings

Process Documents

  1. Upload your files in the "Text Split" section (supports PDF, Markdown, TXT, and DOCX);
  2. View and adjust the automatically split text segments;
  3. View and adjust the global domain tree

Generate Questions

  1. Batch construct questions based on text blocks;
  2. View and edit the generated questions;
  3. Organize questions using the label tree

Create Datasets

  1. Batch construct datasets based on questions;
  2. Generate answers using the configured LLM;
  3. View, edit, and optimize the generated answers

Export Datasets

  1. Click the "Export" button in the Datasets section;
  2. Choose your preferred format (Alpaca or ShareGPT);
  3. Select the file format (JSON or JSONL);
  4. Add custom system prompts as needed;
  5. Export your dataset
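The two export formats differ mainly in how each record is shaped: Alpaca uses flat instruction/input/output fields, while ShareGPT wraps each Q&A pair in a conversations array. A rough sketch of one record in each format (field values are invented examples, and the exact keys in Easy Dataset's export may differ slightly):

```javascript
// One Q&A pair rendered in the two common export formats.
const pair = {
  question: 'What is fine-tuning?',
  answer: 'Adapting a pre-trained model to a specific task or domain.',
  system: 'You are a helpful domain expert.',
};

// Alpaca style: flat instruction/input/output fields.
const alpacaRecord = {
  instruction: pair.question,
  input: '',
  output: pair.answer,
  system: pair.system,
};

// ShareGPT style: a conversations array of role/content turns.
const shareGptRecord = {
  system: pair.system,
  conversations: [
    { from: 'human', value: pair.question },
    { from: 'gpt', value: pair.answer },
  ],
};

// JSONL output is simply one JSON object per line;
// JSON output is an array of such objects.
const jsonlLine = JSON.stringify(alpacaRecord);
```

Which format to pick usually depends on the fine-tuning framework: single-turn instruction tuning pipelines commonly expect Alpaca, while multi-turn conversation tuning commonly expects ShareGPT.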

Project Structure

easy-dataset/
├── app/                                # Next.js application directory
│   ├── api/                            # API routes
│   │   ├── llm/                        # LLM API integration
│   │   │   ├── ollama/                 # Ollama API integration
│   │   │   └── openai/                 # OpenAI API integration
│   │   ├── projects/                   # Project management API
│   │   │   ├── [projectId]/            # Project-specific operations
│   │   │   │   ├── chunks/             # Text chunk operations
│   │   │   │   ├── datasets/           # Dataset generation and management
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/          # Question management
│   │   │   │   └── split/              # Text splitting operations
│   │   │   └── user/                   # User-specific project operations
│   ├── projects/                       # Frontend project pages
│   │   └── [projectId]/                # Project-specific pages
│   │       ├── datasets/               # Dataset management UI
│   │       ├── questions/              # Question management UI
│   │       ├── settings/               # Project settings UI
│   │       └── text-split/             # Text processing UI
│   └── page.js                         # Homepage
├── components/                         # React components
│   ├── datasets/                       # Dataset-related components
│   ├── home/                           # Homepage components
│   ├── projects/                       # Project management components
│   ├── questions/                      # Question management components
│   └── text-split/                     # Text processing components
├── lib/                                # Core libraries and tools
│   ├── db/                             # Database operations
│   ├── i18n/                           # Internationalization
│   ├── llm/                            # LLM integration
│   │   ├── common/                     # Common LLM tools
│   │   ├── core/                       # Core LLM clients
│   │   └── prompts/                    # Prompt templates
│   │       ├── answer.js               # Answer generation prompts (Chinese)
│   │       ├── answerEn.js             # Answer generation prompts (English)
│   │       ├── question.js             # Question generation prompts (Chinese)
│   │       ├── questionEn.js           # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                  # Text splitting tools
├── locales/                            # Internationalization resources
│   ├── en/                             # English translations
│   └── zh-CN/                          # Chinese translations
├── public/                             # Static resources
│   └── imgs/                           # Image resources
└── local-db/                           # Local file database
    └── projects/                       # Project data storage

Documentation

Community Practice

Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Star History

Star History Chart

Built with ❤️ by ConardLi • Follow me: WeChat Official Account · Bilibili · Juejin · Zhihu · YouTube
