kraken 5.3.0 segmentation Aborted (core dumped) #693

Open
johnlockejrr opened this issue Mar 30, 2025 · 1 comment

johnlockejrr commented Mar 30, 2025

This has never happened to me before, and it occurs on two different systems.

(kraken-5.3.0) incognito@DESKTOP-FVRLETC:~/kraken-train/catmus-medieval$ ketos segtrain -d cuda:0 -f alto -t output.txt -q early --resize both --schedule reduceonplateau -i /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/kraken/blla.mlmodel -o catmus_seg/catmus_med_seg_v1
Training line types:
  HeadingLine   2       2081
  MusicLine     3       167
  DefaultLine   4       95194
  DropCapitalLine       5       1280
  InterlinearLine       13      2835
  TironianSignLine      14      282
  default       20      39
Training region types:
  MainZone      6       1523
  NumberingZone 7       613
  GraphicZone   8       134
  DropCapitalZone       9       689
  MusicZone     10      16
  MarginTextZone        11      364
  RunningTitleZone      12      398
  TitlePageZone 15      5
  QuireMarksZone        16      94
  DigitizationArtefactZone      17      28
  StampZone     18      39
  DamageZone    19      13
  text  21      39
  SealZone      22      3
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[03/30/25 17:31:47] WARNING  Setting baseline location to baseline from unset model.                                                                                                                                                                                       train.py:1032
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃ Mode  ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  1.3 M │ train │             [1, 3, 1800, 300] │  [[1, 23, 450, 75], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  9.5 K │ train │      [[1, 3, 1800, 300], '?'] │ [[1, 64, 900, 150], '?'] │
│ 2  │ net.Gn_1          │ GroupNorm                │    128 │ train │ [[1, 64, 900, 150], '?', '?'] │ [[1, 64, 900, 150], '?'] │
│ 3  │ net.C_2           │ ActConv2D                │ 73.9 K │ train │ [[1, 64, 900, 150], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 4  │ net.Gn_3          │ GroupNorm                │    256 │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 5  │ net.C_4           │ ActConv2D                │  147 K │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 6  │ net.Gn_5          │ GroupNorm                │    256 │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │  295 K │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 8  │ net.Gn_7          │ GroupNorm                │    512 │ train │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 9  │ net.C_8           │ ActConv2D                │  590 K │ train │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 10 │ net.Gn_9          │ GroupNorm                │    512 │ train │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 11 │ net.L_10          │ TransposedSummarizingRNN │ 74.2 K │ train │ [[1, 256, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 12 │ net.L_11          │ TransposedSummarizingRNN │ 25.1 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 13 │ net.C_12          │ ActConv2D                │  2.1 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 14 │ net.Gn_13         │ GroupNorm                │     64 │ train │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │ 16.9 K │ train │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 16 │ net.L_15          │ TransposedSummarizingRNN │ 25.1 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 17 │ net.l_16          │ ActConv2D                │  1.5 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 23, 450, 75], '?'] │
│ 18 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │ train │                             ? │                        ? │
│ 19 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │ train │                             ? │                        ? │
│ 20 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │ train │                             ? │                        ? │
│ 21 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │ train │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 1.3 M
Non-trainable params: 0
Total params: 1.3 M
Total estimated model params size (MB): 5
Modules in train mode: 39
Modules in eval mode: 0
stage 0/∞ ━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/1374 0:00:32 • 0:18:51 1.18it/s  early_stopping: 0/10 -inf
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/incognito/miniconda3/envs/kraken-5.3.0/bin/ketos:8 in <module>                             │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1161 in  │
│ __call__                                                                                         │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1082 in  │
│ main                                                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1697 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1443 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:788 in   │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/decorators.py:33 │
│ in new_func                                                                                      │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/kraken/ketos/segmentat │
│ ion.py:366 in segtrain                                                                           │
│                                                                                                  │
│   363 │   │   │   │   │   │   │   **val_check_interval)                                          │
│   364 │                                                                                          │
│   365 │   with threadpool_limits(limits=threads):                                                │
│ ❱ 366 │   │   trainer.fit(model)                                                                 │
│   367 │                                                                                          │
│   368 │   if model.best_epoch == -1:                                                             │
│   369 │   │   logger.warning('Model did not improve during training.')                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/kraken/lib/train.py:12 │
│ 9 in fit                                                                                         │
│                                                                                                  │
│    126 │   │   with warnings.catch_warnings():                                                   │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,                │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')                            │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                                                  │
│    130                                                                                           │
│    131                                                                                           │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:538 in fit                                                                        │
│                                                                                                  │
│    535 │   │   self.state.fn = TrainerFn.FITTING                                                 │
│    536 │   │   self.state.status = TrainerStatus.RUNNING                                         │
│    537 │   │   self.training = True                                                              │
│ ❱  538 │   │   call._call_and_handle_interrupt(                                                  │
│    539 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    540 │   │   )                                                                                 │
│    541                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/call.py:47 in _call_and_handle_interrupt                                                     │
│                                                                                                  │
│    44 │   try:                                                                                   │
│    45 │   │   if trainer.strategy.launcher is not None:                                          │
│    46 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,    │
│ ❱  47 │   │   return trainer_fn(*args, **kwargs)                                                 │
│    48 │                                                                                          │
│    49 │   except _TunerExitException:                                                            │
│    50 │   │   _call_teardown_hook(trainer)                                                       │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:574 in _fit_impl                                                                  │
│                                                                                                  │
│    571 │   │   │   model_provided=True,                                                          │
│    572 │   │   │   model_connected=self.lightning_module is not None,                            │
│    573 │   │   )                                                                                 │
│ ❱  574 │   │   self._run(model, ckpt_path=ckpt_path)                                             │
│    575 │   │                                                                                     │
│    576 │   │   assert self.state.stopped                                                         │
│    577 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:981 in _run                                                                       │
│                                                                                                  │
│    978 │   │   # ----------------------------                                                    │
│    979 │   │   # RUN THE TRAINER                                                                 │
│    980 │   │   # ----------------------------                                                    │
│ ❱  981 │   │   results = self._run_stage()                                                       │
│    982 │   │                                                                                     │
│    983 │   │   # ----------------------------                                                    │
│    984 │   │   # POST-Training CLEAN UP                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:1025 in _run_stage                                                                │
│                                                                                                  │
│   1022 │   │   │   with isolate_rng():                                                           │
│   1023 │   │   │   │   self._run_sanity_check()                                                  │
│   1024 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                 │
│ ❱ 1025 │   │   │   │   self.fit_loop.run()                                                       │
│   1026 │   │   │   return None                                                                   │
│   1027 │   │   raise RuntimeError(f"Unexpected state {self.state}")                              │
│   1028                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:205 in run                                                                         │
│                                                                                                  │
│   202 │   │   while not self.done:                                                               │
│   203 │   │   │   try:                                                                           │
│   204 │   │   │   │   self.on_advance_start()                                                    │
│ ❱ 205 │   │   │   │   self.advance()                                                             │
│   206 │   │   │   │   self.on_advance_end()                                                      │
│   207 │   │   │   │   self._restarting = False                                                   │
│   208 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:363 in advance                                                                     │
│                                                                                                  │
│   360 │   │   │   )                                                                              │
│   361 │   │   with self.trainer.profiler.profile("run_training_epoch"):                          │
│   362 │   │   │   assert self._data_fetcher is not None                                          │
│ ❱ 363 │   │   │   self.epoch_loop.run(self._data_fetcher)                                        │
│   364 │                                                                                          │
│   365 │   def on_advance_end(self) -> None:                                                      │
│   366 │   │   trainer = self.trainer                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:140 in run                                                              │
│                                                                                                  │
│   137 │   │   self.on_run_start(data_fetcher)                                                    │
│   138 │   │   while not self.done:                                                               │
│   139 │   │   │   try:                                                                           │
│ ❱ 140 │   │   │   │   self.advance(data_fetcher)                                                 │
│   141 │   │   │   │   self.on_advance_end(data_fetcher)                                          │
│   142 │   │   │   │   self._restarting = False                                                   │
│   143 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:212 in advance                                                          │
│                                                                                                  │
│   209 │   │   │   batch_idx = data_fetcher._batch_idx                                            │
│   210 │   │   else:                                                                              │
│   211 │   │   │   dataloader_iter = None                                                         │
│ ❱ 212 │   │   │   batch, _, __ = next(data_fetcher)                                              │
│   213 │   │   │   # TODO: we should instead use the batch_idx returned by the fetcher, however   │
│   214 │   │   │   # fetcher state so that the batch_idx is correct after restarting              │
│   215 │   │   │   batch_idx = self.batch_idx + 1                                                 │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:133 in __next__                                                                    │
│                                                                                                  │
│   130 │   │   │   │   self.done = not self.batches                                               │
│   131 │   │   elif not self.done:                                                                │
│   132 │   │   │   # this will run only when no pre-fetching was done.                            │
│ ❱ 133 │   │   │   batch = super().__next__()                                                     │
│   134 │   │   else:                                                                              │
│   135 │   │   │   # the iterator is empty                                                        │
│   136 │   │   │   raise StopIteration                                                            │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:60 in __next__                                                                     │
│                                                                                                  │
│    57 │   │   assert self.iterator is not None                                                   │
│    58 │   │   self._start_profiler()                                                             │
│    59 │   │   try:                                                                               │
│ ❱  60 │   │   │   batch = next(self.iterator)                                                    │
│    61 │   │   except StopIteration:                                                              │
│    62 │   │   │   self.done = True                                                               │
│    63 │   │   │   raise                                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:341 in __next__                                                         │
│                                                                                                  │
│   338 │                                                                                          │
│   339 │   def __next__(self) -> _ITERATOR_RETURN:                                                │
│   340 │   │   assert self._iterator is not None                                                  │
│ ❱ 341 │   │   out = next(self._iterator)                                                         │
│   342 │   │   if isinstance(self._iterator, _Sequential):                                        │
│   343 │   │   │   return out                                                                     │
│   344 │   │   out, batch_idx, dataloader_idx = out                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:78 in __next__                                                          │
│                                                                                                  │
│    75 │   │   out = [None] * n  # values per iterator                                            │
│    76 │   │   for i in range(n):                                                                 │
│    77 │   │   │   try:                                                                           │
│ ❱  78 │   │   │   │   out[i] = next(self.iterators[i])                                           │
│    79 │   │   │   except StopIteration:                                                          │
│    80 │   │   │   │   self._consumed[i] = True                                                   │
│    81 │   │   │   │   if all(self._consumed):                                                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:630 in __next__                                                                         │
│                                                                                                  │
│    627 │   │   │   if self._sampler_iter is None:                                                │
│    628 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    629 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  630 │   │   │   data = self._next_data()                                                      │
│    631 │   │   │   self._num_yielded += 1                                                        │
│    632 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    633 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1344 in _next_data                                                                      │
│                                                                                                  │
│   1341 │   │   │   │   self._task_info[idx] += (data,)                                           │
│   1342 │   │   │   else:                                                                         │
│   1343 │   │   │   │   del self._task_info[idx]                                                  │
│ ❱ 1344 │   │   │   │   return self._process_data(data)                                           │
│   1345 │                                                                                         │
│   1346 │   def _try_put_index(self):                                                             │
│   1347 │   │   assert self._tasks_outstanding < self._prefetch_factor * self._num_workers        │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1370 in _process_data                                                                   │
│                                                                                                  │
│   1367 │   │   self._rcvd_idx += 1                                                               │
│   1368 │   │   self._try_put_index()                                                             │
│   1369 │   │   if isinstance(data, ExceptionWrapper):                                            │
│ ❱ 1370 │   │   │   data.reraise()                                                                │
│   1371 │   │   return data                                                                       │
│   1372 │                                                                                         │
│   1373 │   def _mark_worker_as_unavailable(self, worker_id, shutdown=False):                     │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/_utils.py:706 in │
│ reraise                                                                                          │
│                                                                                                  │
│    703 │   │   │   # If the exception takes multiple arguments, don't try to                     │
│    704 │   │   │   # instantiate since we don't know how to                                      │
│    705 │   │   │   raise RuntimeError(msg) from None                                             │
│ ❱  706 │   │   raise exception                                                                   │
│    707                                                                                           │
│    708                                                                                           │
│    709 def _get_available_device_type():                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 38, in do_one_step
    data = pin_memory(data, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 69, in pin_memory
    clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 69, in <dictcomp>
    clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 59, in pin_memory
    return data.pin_memory(device)
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Aborted (core dumped)

This happened on a GeForce RTX 3090 Founders Edition with 24 GB of VRAM. I was watching the GPU with nvitop and its memory usage never reached 100%.
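
Since the traceback dies in data.pin_memory(device), i.e. while page-locking a CPU tensor rather than allocating on the GPU, here is a minimal standalone check (a hypothetical script, not part of kraken) that tries the same operation in isolation. If this also fails, the limit being hit is pinned host memory, which can be far smaller than GPU memory (for example under WSL2), rather than the 24 GB of VRAM:

import torch

# One tensor roughly the size of a training patch from the model summary above.
x = torch.empty(1, 3, 1800, 300)
try:
    x = x.pin_memory()  # the same call that fails inside the DataLoader's pin-memory thread
    print("pin_memory OK, is_pinned =", x.is_pinned())
except RuntimeError as e:
    print("pin_memory failed:", e)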

(nvitop) incognito@DESKTOP-FVRLETC:~$ nvidia-smi
Sun Mar 30 17:36:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77.01              Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   27C    P8              8W /  350W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python modules:

(kraken-5.3.0) incognito@DESKTOP-FVRLETC:~/kraken-train/catmus-medieval$ pip list
Package                   Version
------------------------- -----------
aiohappyeyeballs          2.4.6
aiohttp                   3.11.13
aiosignal                 1.3.2
attrs                     25.1.0
cattrs                    24.1.2
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
click                     8.1.8
coremltools               8.2
filelock                  3.17.0
frozenlist                1.5.0
fsspec                    2025.2.0
idna                      3.10
imageio                   2.37.0
importlib_resources       6.5.2
Jinja2                    3.1.5
joblib                    1.4.2
jsonschema                4.23.0
jsonschema-specifications 2024.10.1
kraken                    5.3.0
lazy_loader               0.4
lightning                 2.4.0
lightning-utilities       0.12.0
lxml                      5.3.1
markdown-it-py            3.0.0
MarkupSafe                3.0.2
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.1.0
networkx                  3.4.2
numpy                     2.0.2
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.8.61
nvidia-nvtx-cu12          12.1.105
packaging                 24.2
pillow                    11.1.0
pip                       25.0
propcache                 0.3.0
protobuf                  5.29.3
pyaml                     25.1.0
pyarrow                   19.0.1
pycparser                 2.22
Pygments                  2.19.1
python-bidi               0.6.6
pytorch-lightning         2.5.0.post0
pyvips                    2.2.3
PyYAML                    6.0.2
referencing               0.36.2
regex                     2024.11.6
requests                  2.32.3
rich                      13.9.4
rpds-py                   0.23.1
scikit-image              0.24.0
scikit-learn              1.5.2
scipy                     1.13.1
setuptools                75.8.0
shapely                   2.0.7
sympy                     1.13.3
threadpoolctl             3.5.0
tifffile                  2025.2.18
torch                     2.4.1
torchmetrics              1.6.1
torchvision               0.19.1
tqdm                      4.67.1
triton                    3.0.0
typing_extensions         4.12.2
urllib3                   2.3.0
wheel                     0.45.1
yarl                      1.18.3

The dataset is CATMuS/medieval-segmentation.

On the same GPU and the same dataset I trained all YOLO11 segmentation models (from nano to extra-large) without any issue.
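
For reference, the crash originates in the DataLoader's pin-memory thread, i.e. the page-locked staging copy made when pin_memory=True. As a generic PyTorch sketch (not kraken's API; the dataset and device below are placeholders), the knobs that relieve that pressure are disabling pinning, or lowering num_workers so fewer prefetched batches are pinned at once, at the cost of slightly slower host-to-GPU copies:

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(8, 3, 1800, 300))  # stand-in for the real dataset
loader = DataLoader(
    ds,
    batch_size=1,
    num_workers=2,     # fewer workers -> fewer pinned batches in flight
    pin_memory=False,  # skip page-locking entirely
)

for (batch,) in loader:
    batch = batch.to("cuda:0")  # plain pageable copy instead of pinned + async transfer
    break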


johnlockejrr commented Mar 31, 2025

The exact same dataset on the exact same GPU ran through PyLaia training without a crash:

(pylaia-py3.10) incognito@DESKTOP-H1BS9PO:~/pylaia-latest-train/Catmus_medieval$ pylaia-htr-train-ctc --config config_train_model.yaml
Global seed set to 74565
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type     | Params
---------------------------------------
0 | model     | LaiaCRNN | 5.4 M
1 | criterion | CTCLoss  | 0
---------------------------------------
5.4 M     Trainable params
0         Non-trainable params
5.4 M     Total params
21.678    Total estimated model params size (MB)
Global seed set to 74565
TR - E0:  10%|████████████████████▋                                                                                                                                                                                              | 1875/19102 [04:06<37:40,  7.62it/s, running_loss=107]
