ConardLi/easy-dataset


A powerful tool for creating fine-tuning datasets for Large Language Models

简体中文 | English

Features | Quick Start | Documentation | Contributing | License

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

Overview

Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.
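Compatibility with "all LLM APIs that follow the OpenAI format" means question and answer generation ultimately comes down to a standard chat-completions request. A minimal sketch of that wire format (the endpoint URL, model name, and environment variable below are placeholders for illustration, not Easy Dataset's actual internals):

```javascript
// Build a chat-completions request body in the OpenAI wire format.
// Any provider following this format (OpenAI, Ollama's /v1 endpoint,
// vLLM, etc.) accepts the same shape.
function buildChatRequest(model, systemPrompt, userPrompt) {
  return {
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userPrompt },
    ],
    temperature: 0.7,
  };
}

// Example usage (placeholder endpoint and key):
// const res = await fetch('https://api.example.com/v1/chat/completions', {
//   method: 'POST',
//   headers: {
//     'Content-Type': 'application/json',
//     Authorization: `Bearer ${process.env.API_KEY}`,
//   },
//   body: JSON.stringify(
//     buildChatRequest('my-model', 'You are a domain expert.', 'Hello')
//   ),
// });
```

Because the request shape is the only contract, pointing the tool at a different provider is just a matter of changing the base URL, model name, and API key in the project's LLM settings.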

Features

  • Intelligent Document Processing: Supports intelligent recognition and processing of multiple formats including PDF, Markdown, DOCX, etc.
  • Intelligent Text Splitting: Supports multiple intelligent text splitting algorithms and customizable visual segmentation
  • Intelligent Question Generation: Extracts relevant questions from each text segment
  • Domain Labels: Intelligently builds global domain labels for datasets, with global understanding capabilities
  • Answer Generation: Uses LLM API to generate comprehensive answers and Chain of Thought (COT)
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL)
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Custom System Prompts: Add custom system prompts to guide model responses

Quick Demo

(Demo video: `ed3.mp4`)

Local Run

Download Client

  • Windows: Setup.exe
  • macOS: Intel / Apple Silicon (M-series)
  • Linux: AppImage

Install with NPM

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Install dependencies:
   npm install
  3. Build and start the application:
   npm run build

   npm run start
  4. Open your browser and visit http://localhost:1717

Build with Local Dockerfile

If you want to build the image yourself, you can use the Dockerfile in the project root:

  1. Clone the repository:

    git clone https://github.com/ConardLi/easy-dataset.git
    cd easy-dataset
  2. Build the Docker image:

    docker build -t easy-dataset .
  3. Run the container:

    docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset easy-dataset

    Note: Please replace {YOUR_LOCAL_DB_PATH} with the actual path where you want to store the local database.

  4. Open your browser and visit http://localhost:1717
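The `docker run` command above can equivalently be expressed as a Compose file. A sketch (the host-side volume path is yours to choose; the service and image names are just examples):

```yaml
services:
  easy-dataset:
    image: easy-dataset        # the image built in step 2
    ports:
      - "1717:1717"
    volumes:
      - ./local-db:/app/local-db   # persist the local database on the host
```

With this saved as `docker-compose.yml`, `docker compose up -d` replaces the manual `docker run` invocation.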

How to Use

Create a Project

  1. Click the "Create Project" button on the homepage;
  2. Enter a project name and description;
  3. Configure your preferred LLM API settings

Process Documents

  1. Upload your files in the "Text Split" section (supports PDF, Markdown, TXT, and DOCX);
  2. View and adjust the automatically split text segments;
  3. View and adjust the global domain tree

Generate Questions

  1. Batch construct questions based on text blocks;
  2. View and edit the generated questions;
  3. Organize questions using the label tree

Create Datasets

  1. Batch construct datasets based on questions;
  2. Generate answers using the configured LLM;
  3. View, edit, and optimize the generated answers

Export Datasets

  1. Click the "Export" button in the Datasets section;
  2. Choose your preferred format (Alpaca or ShareGPT);
  3. Select the file format (JSON or JSONL);
  4. Add custom system prompts as needed;
  5. Export your dataset
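The two export formats differ mainly in how each record is shaped: Alpaca uses flat instruction/input/output fields, while ShareGPT wraps each Q&A pair in a conversations array. A rough sketch of one record in each format (field values are invented examples, and the exact keys in Easy Dataset's export may differ slightly):

```javascript
// One Q&A pair rendered in the two common export formats.
const pair = {
  question: 'What is fine-tuning?',
  answer: 'Adapting a pre-trained model to a specific task or domain.',
  system: 'You are a helpful domain expert.',
};

// Alpaca style: flat instruction/input/output fields.
const alpacaRecord = {
  instruction: pair.question,
  input: '',
  output: pair.answer,
  system: pair.system,
};

// ShareGPT style: a conversations array of role/content turns.
const shareGptRecord = {
  system: pair.system,
  conversations: [
    { from: 'human', value: pair.question },
    { from: 'gpt', value: pair.answer },
  ],
};

// JSONL output is simply one JSON object per line;
// JSON output is an array of such objects.
const jsonlLine = JSON.stringify(alpacaRecord);
```

Which format to pick usually depends on the fine-tuning framework: single-turn instruction tuning pipelines commonly expect Alpaca, while multi-turn conversation tuning commonly expects ShareGPT.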

Project Structure

easy-dataset/
├── app/                                # Next.js application directory
│   ├── api/                            # API routes
│   │   ├── llm/                        # LLM API integration
│   │   │   ├── ollama/                 # Ollama API integration
│   │   │   └── openai/                 # OpenAI API integration
│   │   ├── projects/                   # Project management API
│   │   │   ├── [projectId]/            # Project-specific operations
│   │   │   │   ├── chunks/             # Text chunk operations
│   │   │   │   ├── datasets/           # Dataset generation and management
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/          # Question management
│   │   │   │   └── split/              # Text splitting operations
│   │   │   └── user/                   # User-specific project operations
│   ├── projects/                       # Frontend project pages
│   │   └── [projectId]/                # Project-specific pages
│   │       ├── datasets/               # Dataset management UI
│   │       ├── questions/              # Question management UI
│   │       ├── settings/               # Project settings UI
│   │       └── text-split/             # Text processing UI
│   └── page.js                         # Homepage
├── components/                         # React components
│   ├── datasets/                       # Dataset-related components
│   ├── home/                           # Homepage components
│   ├── projects/                       # Project management components
│   ├── questions/                      # Question management components
│   └── text-split/                     # Text processing components
├── lib/                                # Core libraries and tools
│   ├── db/                             # Database operations
│   ├── i18n/                           # Internationalization
│   ├── llm/                            # LLM integration
│   │   ├── common/                     # Common LLM tools
│   │   ├── core/                       # Core LLM clients
│   │   └── prompts/                    # Prompt templates
│   │       ├── answer.js               # Answer generation prompts (Chinese)
│   │       ├── answerEn.js             # Answer generation prompts (English)
│   │       ├── question.js             # Question generation prompts (Chinese)
│   │       ├── questionEn.js           # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                  # Text splitting tools
├── locales/                            # Internationalization resources
│   ├── en/                             # English translations
│   └── zh-CN/                          # Chinese translations
├── public/                             # Static resources
│   └── imgs/                           # Image resources
└── local-db/                           # Local file database
    └── projects/                       # Project data storage

Documentation

Community Practice

Easy Dataset × LLaMA Factory: Enabling LLMs to Efficiently Learn Domain Knowledge

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Star History

Star History Chart

Built with ❤️ by ConardLi • Follow me: WeChat Official Account · Bilibili · Juejin · Zhihu · YouTube
