kraken 5.3.0 segmentation Aborted (core dumped) #693

Open
johnlockejrr opened this issue Mar 30, 2025 · 1 comment

johnlockejrr commented Mar 30, 2025

This has never happened to me before, and it occurs on two different systems.

(kraken-5.3.0) incognito@DESKTOP-FVRLETC:~/kraken-train/catmus-medieval$ ketos segtrain -d cuda:0 -f alto -t output.txt -q early --resize both --schedule reduceonplateau -i /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/kraken/blla.mlmodel -o catmus_seg/catmus_med_seg_v1
Training line types:
  HeadingLine   2       2081
  MusicLine     3       167
  DefaultLine   4       95194
  DropCapitalLine       5       1280
  InterlinearLine       13      2835
  TironianSignLine      14      282
  default       20      39
Training region types:
  MainZone      6       1523
  NumberingZone 7       613
  GraphicZone   8       134
  DropCapitalZone       9       689
  MusicZone     10      16
  MarginTextZone        11      364
  RunningTitleZone      12      398
  TitlePageZone 15      5
  QuireMarksZone        16      94
  DigitizationArtefactZone      17      28
  StampZone     18      39
  DamageZone    19      13
  text  21      39
  SealZone      22      3
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[03/30/25 17:31:47] WARNING  Setting baseline location to baseline from unset model.                                                                                                                                                                                       train.py:1032
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
┏━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    ┃ Name              ┃ Type                     ┃ Params ┃ Mode  ┃                      In sizes ┃                Out sizes ┃
┡━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 0  │ net               │ MultiParamSequential     │  1.3 M │ train │             [1, 3, 1800, 300] │  [[1, 23, 450, 75], '?'] │
│ 1  │ net.C_0           │ ActConv2D                │  9.5 K │ train │      [[1, 3, 1800, 300], '?'] │ [[1, 64, 900, 150], '?'] │
│ 2  │ net.Gn_1          │ GroupNorm                │    128 │ train │ [[1, 64, 900, 150], '?', '?'] │ [[1, 64, 900, 150], '?'] │
│ 3  │ net.C_2           │ ActConv2D                │ 73.9 K │ train │ [[1, 64, 900, 150], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 4  │ net.Gn_3          │ GroupNorm                │    256 │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 5  │ net.C_4           │ ActConv2D                │  147 K │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 6  │ net.Gn_5          │ GroupNorm                │    256 │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 128, 450, 75], '?'] │
│ 7  │ net.C_6           │ ActConv2D                │  295 K │ train │ [[1, 128, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 8  │ net.Gn_7          │ GroupNorm                │    512 │ train │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 9  │ net.C_8           │ ActConv2D                │  590 K │ train │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 10 │ net.Gn_9          │ GroupNorm                │    512 │ train │ [[1, 256, 450, 75], '?', '?'] │ [[1, 256, 450, 75], '?'] │
│ 11 │ net.L_10          │ TransposedSummarizingRNN │ 74.2 K │ train │ [[1, 256, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 12 │ net.L_11          │ TransposedSummarizingRNN │ 25.1 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 13 │ net.C_12          │ ActConv2D                │  2.1 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 14 │ net.Gn_13         │ GroupNorm                │     64 │ train │  [[1, 32, 450, 75], '?', '?'] │  [[1, 32, 450, 75], '?'] │
│ 15 │ net.L_14          │ TransposedSummarizingRNN │ 16.9 K │ train │  [[1, 32, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 16 │ net.L_15          │ TransposedSummarizingRNN │ 25.1 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 64, 450, 75], '?'] │
│ 17 │ net.l_16          │ ActConv2D                │  1.5 K │ train │  [[1, 64, 450, 75], '?', '?'] │  [[1, 23, 450, 75], '?'] │
│ 18 │ val_px_accuracy   │ MultilabelAccuracy       │      0 │ train │                             ? │                        ? │
│ 19 │ val_mean_accuracy │ MultilabelAccuracy       │      0 │ train │                             ? │                        ? │
│ 20 │ val_mean_iu       │ MultilabelJaccardIndex   │      0 │ train │                             ? │                        ? │
│ 21 │ val_freq_iu       │ MultilabelJaccardIndex   │      0 │ train │                             ? │                        ? │
└────┴───────────────────┴──────────────────────────┴────────┴───────┴───────────────────────────────┴──────────────────────────┘
Trainable params: 1.3 M
Non-trainable params: 0
Total params: 1.3 M
Total estimated model params size (MB): 5
Modules in train mode: 39
Modules in eval mode: 0
stage 0/∞ ━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35/1374 0:00:32 • 0:18:51 1.18it/s  early_stopping: 0/10 -inf
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/incognito/miniconda3/envs/kraken-5.3.0/bin/ketos:8 in <module>                             │
│                                                                                                  │
│   5 from kraken.ketos import cli                                                                 │
│   6 if __name__ == '__main__':                                                                   │
│   7 │   sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])                         │
│ ❱ 8 │   sys.exit(cli())                                                                          │
│   9                                                                                              │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1161 in  │
│ __call__                                                                                         │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1082 in  │
│ main                                                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1697 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:1443 in  │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/core.py:788 in   │
│ invoke                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/click/decorators.py:33 │
│ in new_func                                                                                      │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/kraken/ketos/segmentat │
│ ion.py:366 in segtrain                                                                           │
│                                                                                                  │
│   363 │   │   │   │   │   │   │   **val_check_interval)                                          │
│   364 │                                                                                          │
│   365 │   with threadpool_limits(limits=threads):                                                │
│ ❱ 366 │   │   trainer.fit(model)                                                                 │
│   367 │                                                                                          │
│   368 │   if model.best_epoch == -1:                                                             │
│   369 │   │   logger.warning('Model did not improve during training.')                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/kraken/lib/train.py:12 │
│ 9 in fit                                                                                         │
│                                                                                                  │
│    126 │   │   with warnings.catch_warnings():                                                   │
│    127 │   │   │   warnings.filterwarnings(action='ignore', category=UserWarning,                │
│    128 │   │   │   │   │   │   │   │   │   message='The dataloader,')                            │
│ ❱  129 │   │   │   super().fit(*args, **kwargs)                                                  │
│    130                                                                                           │
│    131                                                                                           │
│    132 class KrakenFreezeBackbone(BaseFinetuning):                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:538 in fit                                                                        │
│                                                                                                  │
│    535 │   │   self.state.fn = TrainerFn.FITTING                                                 │
│    536 │   │   self.state.status = TrainerStatus.RUNNING                                         │
│    537 │   │   self.training = True                                                              │
│ ❱  538 │   │   call._call_and_handle_interrupt(                                                  │
│    539 │   │   │   self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule,  │
│    540 │   │   )                                                                                 │
│    541                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/call.py:47 in _call_and_handle_interrupt                                                     │
│                                                                                                  │
│    44 │   try:                                                                                   │
│    45 │   │   if trainer.strategy.launcher is not None:                                          │
│    46 │   │   │   return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer,    │
│ ❱  47 │   │   return trainer_fn(*args, **kwargs)                                                 │
│    48 │                                                                                          │
│    49 │   except _TunerExitException:                                                            │
│    50 │   │   _call_teardown_hook(trainer)                                                       │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:574 in _fit_impl                                                                  │
│                                                                                                  │
│    571 │   │   │   model_provided=True,                                                          │
│    572 │   │   │   model_connected=self.lightning_module is not None,                            │
│    573 │   │   )                                                                                 │
│ ❱  574 │   │   self._run(model, ckpt_path=ckpt_path)                                             │
│    575 │   │                                                                                     │
│    576 │   │   assert self.state.stopped                                                         │
│    577 │   │   self.training = False                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:981 in _run                                                                       │
│                                                                                                  │
│    978 │   │   # ----------------------------                                                    │
│    979 │   │   # RUN THE TRAINER                                                                 │
│    980 │   │   # ----------------------------                                                    │
│ ❱  981 │   │   results = self._run_stage()                                                       │
│    982 │   │                                                                                     │
│    983 │   │   # ----------------------------                                                    │
│    984 │   │   # POST-Training CLEAN UP                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/trai │
│ ner/trainer.py:1025 in _run_stage                                                                │
│                                                                                                  │
│   1022 │   │   │   with isolate_rng():                                                           │
│   1023 │   │   │   │   self._run_sanity_check()                                                  │
│   1024 │   │   │   with torch.autograd.set_detect_anomaly(self._detect_anomaly):                 │
│ ❱ 1025 │   │   │   │   self.fit_loop.run()                                                       │
│   1026 │   │   │   return None                                                                   │
│   1027 │   │   raise RuntimeError(f"Unexpected state {self.state}")                              │
│   1028                                                                                           │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:205 in run                                                                         │
│                                                                                                  │
│   202 │   │   while not self.done:                                                               │
│   203 │   │   │   try:                                                                           │
│   204 │   │   │   │   self.on_advance_start()                                                    │
│ ❱ 205 │   │   │   │   self.advance()                                                             │
│   206 │   │   │   │   self.on_advance_end()                                                      │
│   207 │   │   │   │   self._restarting = False                                                   │
│   208 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fit_loop.py:363 in advance                                                                     │
│                                                                                                  │
│   360 │   │   │   )                                                                              │
│   361 │   │   with self.trainer.profiler.profile("run_training_epoch"):                          │
│   362 │   │   │   assert self._data_fetcher is not None                                          │
│ ❱ 363 │   │   │   self.epoch_loop.run(self._data_fetcher)                                        │
│   364 │                                                                                          │
│   365 │   def on_advance_end(self) -> None:                                                      │
│   366 │   │   trainer = self.trainer                                                             │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:140 in run                                                              │
│                                                                                                  │
│   137 │   │   self.on_run_start(data_fetcher)                                                    │
│   138 │   │   while not self.done:                                                               │
│   139 │   │   │   try:                                                                           │
│ ❱ 140 │   │   │   │   self.advance(data_fetcher)                                                 │
│   141 │   │   │   │   self.on_advance_end(data_fetcher)                                          │
│   142 │   │   │   │   self._restarting = False                                                   │
│   143 │   │   │   except StopIteration:                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/training_epoch_loop.py:212 in advance                                                          │
│                                                                                                  │
│   209 │   │   │   batch_idx = data_fetcher._batch_idx                                            │
│   210 │   │   else:                                                                              │
│   211 │   │   │   dataloader_iter = None                                                         │
│ ❱ 212 │   │   │   batch, _, __ = next(data_fetcher)                                              │
│   213 │   │   │   # TODO: we should instead use the batch_idx returned by the fetcher, however   │
│   214 │   │   │   # fetcher state so that the batch_idx is correct after restarting              │
│   215 │   │   │   batch_idx = self.batch_idx + 1                                                 │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:133 in __next__                                                                    │
│                                                                                                  │
│   130 │   │   │   │   self.done = not self.batches                                               │
│   131 │   │   elif not self.done:                                                                │
│   132 │   │   │   # this will run only when no pre-fetching was done.                            │
│ ❱ 133 │   │   │   batch = super().__next__()                                                     │
│   134 │   │   else:                                                                              │
│   135 │   │   │   # the iterator is empty                                                        │
│   136 │   │   │   raise StopIteration                                                            │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/loop │
│ s/fetchers.py:60 in __next__                                                                     │
│                                                                                                  │
│    57 │   │   assert self.iterator is not None                                                   │
│    58 │   │   self._start_profiler()                                                             │
│    59 │   │   try:                                                                               │
│ ❱  60 │   │   │   batch = next(self.iterator)                                                    │
│    61 │   │   except StopIteration:                                                              │
│    62 │   │   │   self.done = True                                                               │
│    63 │   │   │   raise                                                                          │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:341 in __next__                                                         │
│                                                                                                  │
│   338 │                                                                                          │
│   339 │   def __next__(self) -> _ITERATOR_RETURN:                                                │
│   340 │   │   assert self._iterator is not None                                                  │
│ ❱ 341 │   │   out = next(self._iterator)                                                         │
│   342 │   │   if isinstance(self._iterator, _Sequential):                                        │
│   343 │   │   │   return out                                                                     │
│   344 │   │   out, batch_idx, dataloader_idx = out                                               │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/lightning/pytorch/util │
│ ities/combined_loader.py:78 in __next__                                                          │
│                                                                                                  │
│    75 │   │   out = [None] * n  # values per iterator                                            │
│    76 │   │   for i in range(n):                                                                 │
│    77 │   │   │   try:                                                                           │
│ ❱  78 │   │   │   │   out[i] = next(self.iterators[i])                                           │
│    79 │   │   │   except StopIteration:                                                          │
│    80 │   │   │   │   self._consumed[i] = True                                                   │
│    81 │   │   │   │   if all(self._consumed):                                                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:630 in __next__                                                                         │
│                                                                                                  │
│    627 │   │   │   if self._sampler_iter is None:                                                │
│    628 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    629 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  630 │   │   │   data = self._next_data()                                                      │
│    631 │   │   │   self._num_yielded += 1                                                        │
│    632 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    633 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1344 in _next_data                                                                      │
│                                                                                                  │
│   1341 │   │   │   │   self._task_info[idx] += (data,)                                           │
│   1342 │   │   │   else:                                                                         │
│   1343 │   │   │   │   del self._task_info[idx]                                                  │
│ ❱ 1344 │   │   │   │   return self._process_data(data)                                           │
│   1345 │                                                                                         │
│   1346 │   def _try_put_index(self):                                                             │
│   1347 │   │   assert self._tasks_outstanding < self._prefetch_factor * self._num_workers        │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/datal │
│ oader.py:1370 in _process_data                                                                   │
│                                                                                                  │
│   1367 │   │   self._rcvd_idx += 1                                                               │
│   1368 │   │   self._try_put_index()                                                             │
│   1369 │   │   if isinstance(data, ExceptionWrapper):                                            │
│ ❱ 1370 │   │   │   data.reraise()                                                                │
│   1371 │   │   return data                                                                       │
│   1372 │                                                                                         │
│   1373 │   def _mark_worker_as_unavailable(self, worker_id, shutdown=False):                     │
│                                                                                                  │
│ /home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/_utils.py:706 in │
│ reraise                                                                                          │
│                                                                                                  │
│    703 │   │   │   # If the exception takes multiple arguments, don't try to                     │
│    704 │   │   │   # instantiate since we don't know how to                                      │
│    705 │   │   │   raise RuntimeError(msg) from None                                             │
│ ❱  706 │   │   raise exception                                                                   │
│    707                                                                                           │
│    708                                                                                           │
│    709 def _get_available_device_type():                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 38, in do_one_step
    data = pin_memory(data, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 69, in pin_memory
    clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 69, in <dictcomp>
    clone.update({k: pin_memory(sample, device) for k, sample in data.items()})
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/incognito/miniconda3/envs/kraken-5.3.0/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 59, in pin_memory
    return data.pin_memory(device)
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Aborted (core dumped)

This happened on a GeForce RTX 3090 Founders Edition with 24 GB of VRAM. I was watching the GPU with nvitop and its memory usage never reached 100%.
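
Since the traceback dies in data.pin_memory(device), i.e. while page-locking a CPU tensor rather than allocating on the GPU, here is a minimal standalone check (a hypothetical script, not part of kraken) that tries the same operation in isolation. If this also fails, the limit being hit is pinned host memory, which can be far smaller than GPU memory (for example under WSL2), rather than the 24 GB of VRAM:

import torch

# One tensor roughly the size of a training patch from the model summary above.
x = torch.empty(1, 3, 1800, 300)
try:
    x = x.pin_memory()  # the same call that fails inside the DataLoader's pin-memory thread
    print("pin_memory OK, is_pinned =", x.is_pinned())
except RuntimeError as e:
    print("pin_memory failed:", e)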

(nvitop) incognito@DESKTOP-FVRLETC:~$ nvidia-smi
Sun Mar 30 17:36:12 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77.01              Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:04:00.0 Off |                  N/A |
|  0%   27C    P8              8W /  350W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Python modules:

(kraken-5.3.0) incognito@DESKTOP-FVRLETC:~/kraken-train/catmus-medieval$ pip list
Package                   Version
------------------------- -----------
aiohappyeyeballs          2.4.6
aiohttp                   3.11.13
aiosignal                 1.3.2
attrs                     25.1.0
cattrs                    24.1.2
certifi                   2025.1.31
cffi                      1.17.1
charset-normalizer        3.4.1
click                     8.1.8
coremltools               8.2
filelock                  3.17.0
frozenlist                1.5.0
fsspec                    2025.2.0
idna                      3.10
imageio                   2.37.0
importlib_resources       6.5.2
Jinja2                    3.1.5
joblib                    1.4.2
jsonschema                4.23.0
jsonschema-specifications 2024.10.1
kraken                    5.3.0
lazy_loader               0.4
lightning                 2.4.0
lightning-utilities       0.12.0
lxml                      5.3.1
markdown-it-py            3.0.0
MarkupSafe                3.0.2
mdurl                     0.1.2
mpmath                    1.3.0
multidict                 6.1.0
networkx                  3.4.2
numpy                     2.0.2
nvidia-cublas-cu12        12.1.3.1
nvidia-cuda-cupti-cu12    12.1.105
nvidia-cuda-nvrtc-cu12    12.1.105
nvidia-cuda-runtime-cu12  12.1.105
nvidia-cudnn-cu12         9.1.0.70
nvidia-cufft-cu12         11.0.2.54
nvidia-curand-cu12        10.3.2.106
nvidia-cusolver-cu12      11.4.5.107
nvidia-cusparse-cu12      12.1.0.106
nvidia-nccl-cu12          2.20.5
nvidia-nvjitlink-cu12     12.8.61
nvidia-nvtx-cu12          12.1.105
packaging                 24.2
pillow                    11.1.0
pip                       25.0
propcache                 0.3.0
protobuf                  5.29.3
pyaml                     25.1.0
pyarrow                   19.0.1
pycparser                 2.22
Pygments                  2.19.1
python-bidi               0.6.6
pytorch-lightning         2.5.0.post0
pyvips                    2.2.3
PyYAML                    6.0.2
referencing               0.36.2
regex                     2024.11.6
requests                  2.32.3
rich                      13.9.4
rpds-py                   0.23.1
scikit-image              0.24.0
scikit-learn              1.5.2
scipy                     1.13.1
setuptools                75.8.0
shapely                   2.0.7
sympy                     1.13.3
threadpoolctl             3.5.0
tifffile                  2025.2.18
torch                     2.4.1
torchmetrics              1.6.1
torchvision               0.19.1
tqdm                      4.67.1
triton                    3.0.0
typing_extensions         4.12.2
urllib3                   2.3.0
wheel                     0.45.1
yarl                      1.18.3

The dataset is CATMuS/medieval-segmentation.

On the same GPU and the same dataset I trained all YOLO11 segmentation models (from nano to extra-large) without any issue.
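
For reference, the crash originates in the DataLoader's pin-memory thread, i.e. the page-locked staging copy made when pin_memory=True. As a generic PyTorch sketch (not kraken's API; the dataset and device below are placeholders), the knobs that relieve that pressure are disabling pinning, or lowering num_workers so fewer prefetched batches are pinned at once, at the cost of slightly slower host-to-GPU copies:

import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(8, 3, 1800, 300))  # stand-in for the real dataset
loader = DataLoader(
    ds,
    batch_size=1,
    num_workers=2,     # fewer workers -> fewer pinned batches in flight
    pin_memory=False,  # skip page-locking entirely
)

for (batch,) in loader:
    batch = batch.to("cuda:0")  # plain pageable copy instead of pinned + async transfer
    break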


johnlockejrr commented Mar 31, 2025

The exact same dataset on the exact same GPU ran through PyLaia training without a crash:

(pylaia-py3.10) incognito@DESKTOP-H1BS9PO:~/pylaia-latest-train/Catmus_medieval$ pylaia-htr-train-ctc --config config_train_model.yaml
Global seed set to 74565
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name      | Type     | Params
---------------------------------------
0 | model     | LaiaCRNN | 5.4 M
1 | criterion | CTCLoss  | 0
---------------------------------------
5.4 M     Trainable params
0         Non-trainable params
5.4 M     Total params
21.678    Total estimated model params size (MB)
Global seed set to 74565
TR - E0:  10%|████████████████████▋                                                                                                                                                                                              | 1875/19102 [04:06<37:40,  7.62it/s, running_loss=107]
