Releases: ConardLi/easy-dataset
[1.3.5] 2025-05-21
If downloads from GitHub are slow, you can use the network drive mirror instead: https://pan.quark.cn/s/194b7eedf16e
🔧 Fixes
- Dataset confirmation/saving failures
  → Fixed dataset save failures caused by permission check errors or network fluctuations, improving operational stability.
- Filter conditions ignored after text block modification
  → Fixed filter conditions (e.g., labels, status) not refreshing after a text block's content was updated.
- Default API error for SiliconFlow
  → Corrected the default API endpoint and authentication parameters for SiliconFlow so that model calls work correctly.
- Missing labels in custom-format dataset exports
  → Restored extraction of the label field in custom-format exports so that the complete metadata is preserved.
⚡ Optimizations
- Windows installation path customization
  → The installer now includes a path selection step, so users can choose an installation directory instead of being forced to install to the C: drive.
- Alpaca dataset export configuration
  - Field selection: The question can be written to either the instruction or the input field, to suit different model training needs.
  - Custom instruction: The instruction content can be entered or edited manually for more flexible data generation.
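The Alpaca export options in this release can be pictured with a small sketch. It is illustrative only and is not taken from the project's code: the function name `to_alpaca` and its parameters `question_field` and `custom_instruction` are hypothetical. The only fixed point is the usual Alpaca record shape with `instruction`, `input`, and `output` keys.

```python
import json


def to_alpaca(question, answer, question_field="instruction", custom_instruction=""):
    """Build one Alpaca-style record (hypothetical helper, not the project's API).

    question_field selects which key receives the question text.
    custom_instruction is used as the instruction when the question goes to "input".
    """
    if question_field not in ("instruction", "input"):
        raise ValueError("question_field must be 'instruction' or 'input'")
    if question_field == "instruction":
        # Question is the instruction; the input field stays empty.
        return {"instruction": question, "input": "", "output": answer}
    # Question is the input; the instruction is supplied by the user.
    return {"instruction": custom_instruction, "input": question, "output": answer}


if __name__ == "__main__":
    record = to_alpaca(
        "What is a text block?",
        "A chunk of a source document.",
        question_field="input",
        custom_instruction="Answer the question concisely.",
    )
    print(json.dumps(record, ensure_ascii=False, indent=2))
```

Which layout is preferable depends on the prompt template of the training framework in use, which is presumably why the export makes it selectable.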
[1.3.4] 2025-05-20
If downloads from GitHub are slow, you can use the network drive mirror instead: https://pan.quark.cn/s/ef8d0ef3785a
🔧 Fixes
- Questions not displayed in domain tree view
  → Fixed blank question lists after expanding domain tree nodes, so the hierarchy renders correctly.
- Custom visual model parsing failure
  → Restored parsing of PDFs and images with custom visual models and optimized the model loading logic.
- Multi-file text block sorting errors
  → Fixed the scrambled order of text blocks when blocks from multiple files are sorted together, supporting sorting by file priority or creation time.
- Database synchronization failure after version upgrade
  → Fixed synchronization errors between the local database and backend data during upgrades, preserving data integrity across versions.
[1.3.3] 2025-05-20
🔧 Fixes
- Fixed the filter for text blocks awaiting question generation not working.
- Fixed text blocks being sorted in the wrong order.
- Fixed the page refreshing immediately after a document upload without waiting for the API response.
⚡ Optimizations
- Excluded invalid text blocks containing "distill content" from text block queries.
✨ New Feature: Background Asynchronous Tasks
Context: Batch tasks previously ran synchronously in the frontend and were limited by browser concurrency (typically 6-8 connections), causing page freezes.
Improvement: Tasks now run asynchronously in the background, making large-scale data operations more efficient.
- Supported Asynchronous Tasks
  - Automatic Question Extraction: After a task is created, the background processes text blocks without generated questions in batches, with configurable concurrency.
  - Automatic Dataset Generation: The background batch-generates answers for questions that have none, freeing up frontend resources.
- Interaction Enhancements
  - Task Status Icon: A real-time progress indicator (e.g., 🔄) in the top-right corner can be clicked to view task details, logs, and exception handling options.
  - Resilient Processing: Failed tasks are retried automatically, and tasks can be terminated or restarted manually to cope with complex network conditions.
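The combination of configurable concurrency and automatic retries described for background tasks can be sketched as follows. This is a minimal illustration of the general pattern using Python's `asyncio`, not the project's implementation: `run_batch`, `make_flaky_worker`, and the `chunk-N` identifiers are hypothetical names, and the worker simulates a single network error.

```python
import asyncio


async def run_batch(items, worker, concurrency=3, max_retries=2):
    """Run worker(item) for every item, at most `concurrency` at a time.

    Each failing item is retried up to `max_retries` times before being
    recorded as failed. Returns {item: (status, value_or_error_message)}.
    """
    semaphore = asyncio.Semaphore(concurrency)
    results = {}

    async def run_one(item):
        last_error = None
        async with semaphore:  # caps the number of items in flight
            for _attempt in range(max_retries + 1):
                try:
                    results[item] = ("ok", await worker(item))
                    return
                except Exception as exc:  # retry on any worker error
                    last_error = exc
        results[item] = ("failed", str(last_error))

    await asyncio.gather(*(run_one(item) for item in items))
    return results


def make_flaky_worker():
    """Hypothetical worker that fails once for 'chunk-2', then succeeds."""
    attempts = {}

    async def worker(chunk_id):
        attempts[chunk_id] = attempts.get(chunk_id, 0) + 1
        if chunk_id == "chunk-2" and attempts[chunk_id] == 1:
            raise ConnectionError("simulated network error")
        await asyncio.sleep(0)  # stand-in for a model API call
        return f"questions for {chunk_id}"

    return worker, attempts


if __name__ == "__main__":
    worker, _ = make_flaky_worker()
    chunks = ["chunk-1", "chunk-2", "chunk-3"]
    print(asyncio.run(run_batch(chunks, worker, concurrency=2, max_retries=1)))
```

Running the batch outside the browser removes the per-host connection cap as the bottleneck; the semaphore then becomes the single place where concurrency is tuned.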
[1.3.2] 2025-05-18
⚡ Optimizations
- Project Management Improvements
  - Projects in the "Pending Upgrade" or "Upgrade Failed" state can now be deleted.
  - Added an "Open Project Folder" feature to quickly locate the project directory.
- Domain Tree Performance
  - Question nodes are now loaded on demand, significantly improving query speed in the domain tree view.
- Top Navigation Bar Redesign
  - Optimized layout and visual design for better usability.
- Dataset Details Page Rendering
  - Answers are now rendered as Markdown for better readability.
- Storage Optimization
  - Datasets no longer store the original content of their associated text blocks, reducing storage usage by ~90%.
✨ New Features
- One-Click Dataset Upload to Huggingface
  - Push datasets directly to Huggingface for model training and sharing.
- Distillation Module
  - Document-Free Distillation: Generate datasets directly from large language models without relying on existing documents.
  - Documentation: https://docs.easy-dataset.com/ed/en/advanced/images-and-media (Chinese: https://docs.easy-dataset.com/jin-jie-shi-yong/images-and-media)
[1.3.1] 2025-05-14
🔧 Fixes
- Fixed chain-of-thought (COT) output being generated unintentionally during dataset optimization.
- Fixed an error on the text processing page caused by files that had been removed before upload still being processed.
⚡ Optimizations
- Refactored local file storage to local database storage, significantly improving performance with large amounts of data.
- Added a configurable option to randomly remove question marks from generated questions.
- Improved the user experience of multiple features.
✨ New Features
- Flexible Domain Tree Management
  - Three modes for handling added or deleted documents:
    - Revise Mode: Only update the domain tree nodes related to the added or deleted documents, minimizing impact on the existing structure.
    - Rebuild Mode: Regenerate the domain tree from the catalogs of all documents (the current logic).
    - Lock Mode: Freeze the domain tree; document changes trigger no updates.
- Multiple Text Chunking Strategies
  - Markdown Chunking: Split automatically by document headings to preserve semantic integrity (for structured Markdown).
  - Recursive Delimiter Chunking: Try multiple levels of delimiters recursively (configurable); suited to complex documents.
  - Fixed-Length Delimiter Chunking: Split by a specified delimiter (configurable) and combine the pieces into fixed-length chunks.
  - Token Chunking: Split by token count rather than character count, to match model input limits.
  - Code Intelligence Chunking: Split by programming language syntax to avoid incomplete code segments.
- Visual Custom Chunking
  - Adjust chunk boundaries manually in a graphical interface with real-time preview.
- Client Tool Enhancements
  - Added local log storage, with one-click access to the log directory for troubleshooting.
  - Added a cache clearing function that cleans up historical logs and database backup files.
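Two of the chunking strategies named in this release, Markdown chunking and fixed-length delimiter chunking, can be illustrated with a short sketch. It shows the general techniques only and makes no claim about how the application implements them; the function names, the default delimiter, and the `max_chars` parameter are assumptions made for the example.

```python
import re


def chunk_by_headings(markdown):
    """Markdown chunking: start a new chunk at every heading line (# to ######)."""
    sections = re.split(r"^(?=#{1,6} )", markdown, flags=re.MULTILINE)
    return [section.strip() for section in sections if section.strip()]


def chunk_by_delimiter(text, delimiter="\n\n", max_chars=200):
    """Fixed-length delimiter chunking: split on `delimiter`, then merge
    neighbouring pieces while the merged chunk stays within `max_chars`.

    A single piece longer than `max_chars` is kept as its own chunk;
    this sketch does not split it further.
    """
    chunks = []
    current = ""
    for piece in text.split(delimiter):
        if not piece.strip():
            continue  # skip empty pieces produced by repeated delimiters
        if not current:
            current = piece
        elif len(current) + len(delimiter) + len(piece) <= max_chars:
            current = current + delimiter + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "# Intro\nSome text.\n## Details\nMore text."
    print(chunk_by_headings(doc))
    print(chunk_by_delimiter("aaaa\n\nbbbb\n\ncccc", max_chars=10))
```

Heading-based splitting keeps each section intact, while the delimiter-based variant trades that structure for more uniform chunk sizes.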
[1.3.0-beta.1] 2025-05-06
In addition to fixing bugs, this update makes a major change to storage: local file storage has been refactored into local database storage, which greatly improves the experience when working with large amounts of data. Because the change is substantial, it is being released as a beta version for everyone to try. If you run into any problems with this version, please report them through Issues to help us improve the product.
🔧 Fixes
- Fixed chain-of-thought (COT) output being generated unintentionally during dataset optimization.
- Fixed an error on the text processing page caused by files that had been removed before upload still being processed.
⚡ Optimizations
- Refactored local file storage to local database storage, significantly improving the user experience with large amounts of data.
- Question marks are now randomly removed from generated questions (configurable).
- Improved the user experience of multiple features.
✨ New Features
- The client now stores logs locally; the log directory can be opened for troubleshooting.
- The client now has a cache clearing function that cleans up historical log files and backed-up database files.
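The configurable random removal of question marks could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's code: the function name, the `probability` parameter, and the choice to strip only a trailing question mark (ASCII or full-width) are all assumptions.

```python
import random


def maybe_strip_question_mark(question, probability=0.5, rng=random):
    """With the given probability, drop a trailing '?' or '？' from a question.

    Assumption: only a trailing mark is removed; the release note does not
    say how question marks elsewhere in the text are treated.
    """
    if question.endswith(("?", "？")) and rng.random() < probability:
        return question[:-1].rstrip()
    return question


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for a repeatable demonstration
    for q in ["What is a domain tree?", "How are text blocks sorted?"]:
        print(maybe_strip_question_mark(q, probability=0.5, rng=rng))
```

A likely motivation for such an option is to vary the surface form of questions in the training data, so that a fine-tuned model does not depend on a trailing question mark.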
[1.2.5] - 2025-04-13
🔧 Fixes
- Fixed an error that occurred when configuring a model for the first time.
- Fixed an error when building the Docker image.
[1.2.4] - 2025-04-12
⚡ Optimizations
- Refactored the model interaction interface using the OpenAI SDK to improve compatibility.
✨ New Features
- Added visual model configuration.
- PDFs can now be parsed with custom visual models for higher accuracy.
- Model testing now supports sending images, so visual models can be tested.
- The dataset details page now shows the text block each entry belongs to.
- Users can now edit text blocks directly.
- Parsed Markdown files can now be downloaded and previewed.
[1.2.3] - 2025-03-30
⚡ Optimizations
- Raised the model's default maximum output token limit.
- Removed the update-failure pop-up.
- Removed some distracting error log output.
✨ New Features
- Added one-click opening of the client data directory.
- Added configuration of model temperature and maximum number of generated tokens.
- Added two PDF parsing methods (basic parsing and MinerU parsing).
- Added dataset export in CSV format.
[1.2.2] 2025-03-24
🔧 Fixes
- Fixed questions not being selectable, and question deletion failing, in the domain tree view.
- Fixed the upgrade link to the new version possibly being inaccurate.
⚡ Optimizations
- Removed redundant line breaks from answers and chains of thought.
- Removed the update-failure pop-up and updated the download address of the latest installation package.
✨ New Features
- Document management now supports filtering by whether questions have been generated.