
Releases: ConardLi/easy-dataset

[1.3.5] 2025-05-21

21 May 15:11
55a7ec9

If GitHub downloads are slow, you can download from the network drive instead: https://pan.quark.cn/s/194b7eedf16e

🔧 Fixes

  1. Dataset confirmation/saving failures
    → Fixed dataset save failures caused by permission-check errors or network fluctuations, improving operational stability.
  2. Filter conditions lost after text block modification
    → Fixed filter conditions (e.g., labels, status) not refreshing after a text block's content was updated.
  3. Default SiliconFlow API error
    → Corrected the default SiliconFlow API endpoint and authentication parameters so that model calls work.
  4. Missing labels in custom-format dataset exports
    → Restored extraction of the label field in custom-format exports so that complete metadata is preserved.

⚡ Optimizations

  1. Custom Windows installation path
    → The installer now includes a path selection step. Installation is no longer forced onto the C: drive by default, and users can choose the install directory.
  2. Alpaca dataset export configuration
    • Field selection: The question can be placed in either the instruction or the input field, to suit different model training needs.
    • Custom instruction: The instruction content can be entered or edited manually, for more flexible data generation.
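
To illustrate the Alpaca export options above, here is a minimal sketch of how a question/answer pair might be mapped into an Alpaca-style record. The function name `to_alpaca` and its parameters are hypothetical and are not taken from the easy-dataset codebase; only the `instruction`, `input`, and `output` keys follow the common Alpaca format.

```python
def to_alpaca(question, answer, question_field="instruction", custom_instruction=""):
    """Build one Alpaca-style record from a question/answer pair (illustrative only)."""
    if question_field not in ("instruction", "input"):
        raise ValueError("question_field must be 'instruction' or 'input'")
    if question_field == "instruction":
        # The question itself acts as the instruction; input stays empty.
        return {"instruction": question, "input": "", "output": answer}
    # The question goes into `input`; the instruction is user-supplied text.
    return {"instruction": custom_instruction, "input": question, "output": answer}


record = to_alpaca(
    "What is a domain tree?",
    "A hierarchy of topics extracted from the documents.",
    question_field="input",
    custom_instruction="Answer the question concisely.",
)
```

Putting the question in `input` is mainly useful when every record shares one fixed instruction, which is where a custom instruction comes in.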

[1.3.4] 2025-05-20

20 May 15:34
bd5b93c

If GitHub downloads are slow, you can download from the network drive instead: https://pan.quark.cn/s/ef8d0ef3785a

🔧 Fixes

  1. Questions not displayed in domain tree view
    → Fixed blank question lists after expanding domain tree nodes, so that the hierarchy renders correctly.
  2. Custom vision model parsing failure
    → Restored PDF/image parsing with custom vision models and optimized the model loading logic.
  3. Multi-file text block sorting errors
    → Fixed scrambled ordering when text blocks from multiple files are sorted together, supporting sorting by file priority or creation time.
  4. Database synchronization failure after version upgrade
    → Fixed data synchronization errors between the local database and the backend during upgrades, ensuring data integrity across versions.

[1.3.3] 2025-05-20

19 May 16:03
7c75bb9

🔧 Fixes

  1. Fixed the "questions to be generated" filter for text blocks not working.
  2. Fixed scrambled text block sorting.
  3. Fixed the page refreshing immediately after a document upload without waiting for the API response.

⚡ Optimizations

  1. Text block queries now exclude invalid text blocks containing "distill content".

✨ New Feature: Background Asynchronous Tasks

Context: Batch tasks previously ran synchronously in the frontend, where they were limited by browser concurrency (typically 6-8 connections) and caused page freezes.
Improvement: Tasks are now processed asynchronously in the background, improving the efficiency of large-scale data operations.

  1. Supported Asynchronous Tasks

    • Automatic Question Extraction: Once a task is created, the background processes text blocks that have no generated questions in batches, with configurable concurrency.
    • Automatic Dataset Generation: The background batch-generates answers for questions that have none, freeing up frontend resources.
  2. Interaction Enhancements

    • Task Status Icon: A real-time progress indicator (e.g., 🔄) in the top-right corner. Click it to view task details, logs, and exception handling options.
    • Resilient Processing: Failed tasks are retried automatically, and tasks can be terminated or restarted manually in complex network scenarios.
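
As a rough illustration of batch processing with configurable concurrency, the sketch below caps the number of in-flight jobs with a semaphore and records failures instead of aborting the batch. It is not the project's implementation: `run_batch` and `fake_extract_questions` are hypothetical names, and the worker is a stand-in for a real model call.

```python
import asyncio


async def run_batch(items, worker, concurrency=3):
    """Run `worker` over `items` with at most `concurrency` jobs in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            try:
                return ("ok", await worker(item))
            except Exception as exc:  # record the failure and keep going
                return ("failed", str(exc))

    # gather() preserves the order of `items` in its result list.
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_extract_questions(chunk):
    # Stand-in for a model call; a real worker would contact an LLM API.
    await asyncio.sleep(0)
    if not chunk:
        raise ValueError("empty chunk")
    return f"question about: {chunk}"


results = asyncio.run(
    run_batch(["alpha", "", "beta"], fake_extract_questions, concurrency=2)
)
```

Each result carries its own status, so a task detail view could list failed items and offer a retry without rerunning the whole batch.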

[1.3.2] 2025-05-18

18 May 14:09
7ae363d

⚡ Optimizations

  1. Project Management Improvements
    • Projects in the "Pending Upgrade" or "Upgrade Failed" state can now be deleted
    • Added an "Open Project Folder" feature to quickly locate the project directory
  2. Domain Tree Performance
    • Question nodes are now loaded on demand, significantly improving query speed in the domain tree view
  3. Top Navigation Bar Styling
    • Optimized layout and visual design for easier operation
  4. Dataset Details Page Rendering
    • Answers are now rendered as Markdown for better readability
  5. Storage Optimization
    • Datasets no longer store the original content of their associated text blocks, reducing storage usage by ~90%

✨ New Features

  1. New module: Distillation module
  2. One-Click Dataset Upload to Hugging Face
    • Push datasets directly to Hugging Face for model training and sharing

[1.3.1] 2025-05-14

14 May 15:03
d0a758e

🔧 Fixes

  1. Fixed accidental COT generation during dataset optimization.
  2. Fixed an error on the text processing page where files removed before upload were still processed.

⚡ Optimizations

  1. Refactored local file storage to local database storage, significantly improving the experience with large amounts of data.
  2. Added a configurable option to randomly remove question marks from generated questions.
  3. Improved the user experience of multiple functions.

✨ New Features

  1. Flexible Domain Tree Management

    • Three modes are available when adding/deleting documents:
      • Revise Mode: Only update the domain tree nodes related to the added/deleted documents, minimizing the impact on the existing structure.
      • Rebuild Mode: Regenerate the domain tree from the tables of contents of all documents (the existing logic).
      • Lock Mode: Freeze the current domain tree. Adding/deleting documents does not trigger updates.
  2. Multiple Text Chunking Strategies

    • Markdown Chunking: Split automatically by document headings to preserve semantic integrity (for structured Markdown).
    • Recursive Delimiter Chunking: Try multiple levels of delimiters recursively in priority order (configurable), suited to complex documents.
    • Fixed-Length Delimiter Chunking: Split by a specified delimiter, then combine the pieces into fixed-length chunks (configurable).
    • Token Chunking: Split by token count (not character count) to match model input requirements.
    • Code-Aware Chunking: Split according to the syntax structure of the programming language to avoid breaking code mid-construct.
  3. Visual Custom Chunking

    • Chunk boundaries can be adjusted manually in a graphical interface, with a real-time preview of the result.
  4. Client Tool Enhancements

    • Added local log storage, with one-click access to the log directory for troubleshooting.
    • Added a cache clearing function that cleans up historical logs and database backup files.
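
To make the recursive delimiter strategy listed above more concrete, here is a minimal sketch under stated assumptions: the delimiter order, the character-based length limit, and the name `recursive_split` are illustrative choices, not the project's actual implementation.

```python
def recursive_split(text, delimiters=("\n\n", "\n", " "), max_len=40):
    """Split text into chunks of at most max_len characters, trying
    coarse delimiters first and finer ones only for oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not delimiters:
        # No delimiter left: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    head, rest = delimiters[0], delimiters[1:]
    chunks, current = [], ""
    for part in text.split(head):
        candidate = part if not current else current + head + part
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces
            continue
        if current.strip():
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # This piece alone is too long: retry with the next delimiter.
            chunks.extend(recursive_split(part, rest, max_len))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks


sample = (
    "Title\n\nFirst paragraph is short.\n\n"
    "Second paragraph is quite a bit longer than the limit allows."
)
chunks = recursive_split(sample, max_len=40)
```

Short neighbouring pieces are merged back together, so paragraphs stay intact where they fit and only the oversized paragraph is broken at word boundaries.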

[1.3.0-beta.1] 2025-05-06

06 May 15:03
1758189
Pre-release

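In addition to fixing existing issues, this update makes a major change to storage: local file storage has been refactored into local database storage, which greatly improves the experience when working with large amounts of data. Because the change is substantial, it is being released as a beta version for everyone to try. If you run into any problems with this version, please report them via Issues to help us improve the product.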

🔧 Fixes

  1. Fixed accidental COT generation during dataset optimization.
  2. Fixed an error on the text processing page where files removed before upload were still processed.

⚡ Optimizations

  1. Refactored local file storage to local database storage, significantly improving the experience with large amounts of data.
  2. Question marks are now randomly removed from questions (configurable).
  3. Improved the user experience of multiple functions.

✨ New Features

  1. The client now stores logs locally, and the log directory can be opened for troubleshooting.
  2. The client now has a cache clearing function that cleans up historical log files and backed-up database files.
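
The configurable question-mark removal listed under Optimizations could be sketched as follows. The function name, the `probability` parameter, and its default value are assumptions made for illustration. The release notes do not say how the option is configured.

```python
import random


def maybe_strip_question_mark(question, probability=0.6, rng=random):
    """Drop a trailing '?' or '？' with the given probability (illustrative only)."""
    if question.endswith(("?", "？")) and rng.random() < probability:
        return question[:-1]
    return question


always = maybe_strip_question_mark("What is chunking?", probability=1.0)
never = maybe_strip_question_mark("What is chunking?", probability=0.0)
```

Varying the punctuation this way makes generated questions look less uniform, which is presumably the motivation for the option.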

[1.2.5] - 2025-04-13

13 Apr 07:31
1758189

🔧 Fixes

  1. Fixed an error when configuring a model for the first time.
  2. Fixed an error when building the Docker image.

[1.2.4] - 2025-04-12

12 Apr 14:18
07404fc

⚡ Optimizations

  1. Refactored the model interaction interface using the OpenAI SDK to improve compatibility.

✨ New Features

  1. Support for vision model configuration.
  2. Support for parsing PDFs with custom vision models, with higher accuracy.
  3. Model testing now supports sending images to test vision models.
  4. The dataset details page now shows the associated text block.
  5. Users can edit text blocks themselves.
  6. Parsed Markdown files can be downloaded and previewed.

[1.2.3] - 2025-03-30

30 Mar 15:07
13470d7

⚡ Optimizations

  1. Raised the model's default maximum output token limit.
  2. Removed the update-failure pop-up window.
  3. Removed some distracting error log output.

✨ New Features

  1. Support for opening the client data directory with one click.
  2. Support for configuring model temperature and the maximum number of generated tokens.
  3. Support for two PDF parsing methods (basic parsing, MinerU parsing).
  4. Support for exporting datasets in CSV format.
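
As a sketch of what CSV export of a dataset can involve, the example below serializes records with Python's standard `csv` module. The column names `question`, `answer`, and `cot` are assumed for illustration and may not match the columns easy-dataset actually writes.

```python
import csv
import io


def export_csv(records, fieldnames=("question", "answer", "cot")):
    """Serialize dataset records to CSV text; missing fields become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # silently skip keys that are not exported
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


csv_text = export_csv([
    {"question": "Q1, part a", "answer": "A1", "label": "demo"},
    {"question": "Q2", "answer": "line one\nline two", "cot": "step 1"},
])
```

Using a CSV writer rather than joining strings by hand matters here because answers often contain commas, quotes, and line breaks that must be quoted correctly.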

[1.2.2] 2025-03-24

24 Mar 15:54
09163ec

🔧 Fixes

  1. Fixed bugs in the domain tree view where questions could not be selected and deleting questions failed.
  2. Fixed the link for upgrading to the new version possibly being inaccurate.

⚡ Optimizations

  1. Removed redundant line breaks from answers and chains of thought.
  2. Removed the update-failure pop-up window and updated the download address of the latest installation package.

✨ New Features

  1. Document management now supports filtering by whether questions have been generated.
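
The line-break cleanup listed under Optimizations might look like the sketch below. What counts as "redundant" is an assumption here (leading/trailing whitespace and runs of three or more newlines), and `collapse_blank_lines` is a hypothetical name.

```python
import re


def collapse_blank_lines(text):
    """Trim the text and collapse runs of 3+ newlines into a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", text.strip())


cleaned = collapse_blank_lines("\n\nFirst point.\n\n\n\nSecond point.\n")
```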