Vicente Rodríguez

Sept. 17, 2025

Testing and workflow to train PyTorch models.

This post covers what, in my opinion, are the essential good practices and implementations for building a baseline codebase to train PyTorch models. Although the focus is PyTorch, most of it is also useful for other frameworks like TensorFlow or even scikit-learn.

The code for this post is available in the accompanying GitHub repository.

Jupyter Notebooks

Most, if not all, deep learning projects begin inside a Jupyter Notebook because of how practical it is for prototyping and testing different ideas. Yet, as a project grows in complexity and the number of lines of code increases, notebooks become harder to maintain.

A simple workaround is to transform notebooks into separate files and folders. For my part, I created a library to achieve this, though several similar libraries likely exist. The idea is to wrap pieces or blocks of code with specific tags. All tags are defined by the user in a JSON file, where folders and modules are also defined:

{
    "sections": {
        "Blocks": "model/blocks",
        "Model": "model/model",
        "Train": "train",
        "TrainUtils": "utils/train",
        "TestUtils": "utils/test",
        "ManageUtils": "utils/manage",
        "HyperParameters": "hparameters",
        "DataLoader": "data/loader",
        "TestBlocks": "test/test_blocks",
        "TestModel": "test/test_model",
        "TestDataLoader": "test/test_data_loader",
        "TestTrain": "test/test_train"
    }
}

The library then creates the needed folders and files.

However, some issues remain: imports must still be written by hand. To avoid cluttering the generated modules with imports, the library recognizes specific comments in the notebook, which are translated into regular code:

#! from hparameters import data_config
#! from utils.train import create_model
#! import torch

Code is converted from notebooks to files, but not the other way around, although this could be implemented.

The previous behavior leads to a specific workflow: begin prototyping ideas in notebooks and, once the code reaches a more complex state, switch to working with files and folders.

Furthermore, if the user wants to keep notebooks in their workflow but still translate the code into files and folders from time to time, git recognizes the changes in the generated files, so the project can still be versioned as a regular repository.

Code best practices

Testing is covered in the next section of the post, but testability is commonly a good indicator of how well the code is implemented and designed. For instance, logging modules such as wandb are often used directly, which introduces a dependency in the code and, of course, a flow that is harder to test. This can easily be fixed by implementing a decoy component and a training module that is decoupled from the logging module:

epoch_trainer = EpochTrainer(
    train_loader=train_loader,
    batch_trainer=batch_trainer,
    device=device,
    wandb=wandb,
    ckp_manager=ckp_manager,
    metrics=train_metrics,
)

The EpochTrainer module requires a wandb module to log metrics. This is essential at training time, but not always needed at testing time.

The EpochTrainer module receives the logger through its constructor instead of importing wandb directly, so it is not coupled to the wandb module. This allows us to create a toy/decoy module like:

class DecoyLogger:
    # Drop-in replacement for the wandb module at test time
    def __init__(self) -> None:
        self.id = -1

    def init(self, *args, **kwargs):
        # Mirrors wandb.init(); DecoyRunner is a similar decoy standing in
        # for the run object that wandb.init() returns
        return DecoyRunner()

    def log(self, *args, **kwargs):
        # Silently discard metric logging calls
        pass

Now, when EpochTrainer is being tested, one can simply replace wandb with DecoyLogger, avoiding calls to the real wandb methods at test time.
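
As a sketch, the test setup can reuse the constructor call shown earlier and simply pass the decoy (the remaining arguments come from the test fixtures):

epoch_trainer = EpochTrainer(
    train_loader=train_loader,
    batch_trainer=batch_trainer,
    device=device,
    wandb=DecoyLogger(),  # the decoy stands in for the real wandb module
    ckp_manager=ckp_manager,
    metrics=train_metrics,
)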

Most of the presented code follows this principle: modules should not depend on, or need to fully understand, the attributes and methods of other modules. For instance:

from itertools import chain

import torch
# Accuracy and Precision are assumed to come from torchmetrics
from torchmetrics import Accuracy, Precision

from hparameters import data_config


class MetricsManager(torch.nn.Module):
  def __init__(
      self, 
      name='Training',
      num_classes=data_config.NUM_CLASSES,
    ):
    super().__init__()

    self.accuracy = Accuracy(
      task="multiclass",
      num_classes=num_classes
    )
    self.precision = Precision(
      task="multiclass",
      num_classes=num_classes,
      average="macro"
    )

    ...

    self._metrics = {
      f"{name} accuracy": self.accuracy,
      f"{name} precision": self.precision,
      f"{name} recall": self.recall,
    }

    self._graph = {
      f"{name} precision_recall_curve": self.precision_recall_curve,
    }

  def reset_metrics(self):
    for metric in chain(self._metrics.values(), self._graph.values()):
      metric.reset()

  def update_metrics(self, predictions, targets):
    for metric in chain(self._metrics.values(), self._graph.values()):
      metric.update(predictions, targets)

  @property
  def log_metrics(self):
    for name, metric in self._metrics.items():
      yield name, metric.compute()

The previous class implements a metrics manager to record different metrics. If new metrics are added, this is the only place that needs to be updated, even though modules like EpochTrainer employ the MetricsManager:

for name, metric in self.metrics.log_metrics:
    metric_value = metric.item()
    g_string.append(f"{name.split(' ')[1]}: {metric_value:.3f}| ")
    self.wandb.log({f"{name}": metric_value}, commit=False)

To obtain the results for each metric, one can simply iterate over the metrics dictionary and log the information. Even when new metrics are added, EpochTrainer does not change.

An additional improvement could be a logger wrapper. Currently, some of the modules need to be aware of the different methods of the wandb logger. With a wrapper, changing the logger, for instance replacing wandb with mlflow or a custom logger, would not require changes in the rest of the code, only in the wrapper. Besides, blocks of code like the one below, used to log the metrics information, could be implemented inside the wrapper, preventing code duplication.

for name, metric in self.metrics.log_metrics:
  metric_value = metric.item()
  log_string.append(f"{name}: {metric_value:.2f}| ")
  self.wandb.log({f"{name}": metric_value}, commit=False)

This block of code, although short, is repeated in different modules.
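
A minimal sketch of such a wrapper, with an illustrative class name and assuming any backend that exposes a wandb-like log method, could look like:

class MetricsLogger:
    """Thin wrapper that hides the concrete logging backend (wandb, mlflow, a custom logger)."""

    def __init__(self, backend):
        # Any object exposing a wandb-like `log(dict, commit=...)` method
        self.backend = backend

    def log_metrics(self, metrics_manager, commit=False):
        # Centralizes the metric-logging loop that is currently repeated across modules
        for name, metric in metrics_manager.log_metrics:
            self.backend.log({name: metric.item()}, commit=commit)

Modules like EpochTrainer would then receive this wrapper instead of wandb, so swapping the backend only touches the wrapper.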

Another set of useful functions can be defined as:

import torch

from model.model import CNNet  # assumed import path, based on the project layout (model/model)


def create_model(config):
    return CNNet(
        num_layers=config.NUM_LAYERS,
        initial_channels=config.INITIAL_CHANNELS,
        feat_size=config.FEAT_SIZE,
    )


def create_optimizer(model, config):
    return torch.optim.AdamW(
        model.parameters(),
        lr=config.LEARNING_RATE,
        weight_decay=config.WEIGHT_DECAY,
        betas=(config.BETA1, config.BETA2),
        eps=config.EPSILON,
    )


def get_loss_function():
    return torch.nn.CrossEntropyLoss()

Wrapping the model, optimizer, and loss function definitions allows us to import these functions in both the training and test scripts. If any modification is needed, such as switching to a different optimizer, this is the only place where it has to be made.

Testing

Testing is essential in any project, including machine learning ones. Moreover, the stochastic nature of machine learning adds complexity to the testing workflow.

Model Blocks Testing

To test the blocks that compose a model definition, one can follow different approaches. At a minimum, each block should accept inputs of the expected shape and produce outputs of the expected shape.

The unittest library allows defining code that is shared across the different tests of a class inside the setUpClass method.
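
As an illustration, a block test could look like the following sketch, where ConvBlock and its constructor arguments are assumptions rather than the actual repository API:

import unittest

import torch

from model.blocks import ConvBlock  # hypothetical block; the repository defines its own


class TestConvBlock(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Shared by every test in the class: build the block and a dummy batch once
        cls.block = ConvBlock(in_channels=3, out_channels=16)
        cls.inputs = torch.randn(4, 3, 32, 32)

    def test_output_shape(self):
        outputs = self.block(self.inputs)
        # Assumes the block keeps the spatial size and only changes the channels
        self.assertEqual(outputs.shape, (4, 16, 32, 32))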

Model Testing

Once the model blocks are tested, the model itself is tested in an integration-testing fashion. As with the blocks, correct input and output shapes should be verified. Additionally, specific layers, such as dropout or normalization, behave differently during training and evaluation; this behavior should also be tested to avoid unexpected results.
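
A sketch of such a check, reusing the create_model helper from the previous section and assuming the model contains dropout with a nonzero rate (model_config is an illustrative name):

import torch

from hparameters import model_config  # hypothetical name for the model configuration
from utils.train import create_model


def test_train_eval_behaviour():
    model = create_model(model_config)
    inputs = torch.randn(8, 3, 256, 256)

    # In training mode, active dropout should make two passes over the same input differ
    model.train()
    assert not torch.allclose(model(inputs), model(inputs))

    # In evaluation mode, dropout is disabled and the output is deterministic
    model.eval()
    with torch.no_grad():
        assert torch.allclose(model(inputs), model(inputs))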

Dataset Testing

Dataset testing is also important to integrate, to ensure the correct format of the input data and labels, which can involve data loading, transformations, augmentations, and so on.

The recommended approach is to create temporary data and not test the training/validation/test splits directly yet. In this set of tests it is important to assert the types and format of inputs and labels. For instance, images can be expected to be in the range 0 to 1, to have a specific size (3, 256, 256), and to be float32 tensors.
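
A sketch of this kind of test, where ImageDataset, its root argument, and the integer labels are assumptions about the dataset class:

import tempfile
from pathlib import Path

import torch
from PIL import Image

from data.loader import ImageDataset  # hypothetical dataset class name


def test_dataset_item_format():
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Write a temporary RGB image instead of touching the real data
        image_path = Path(tmp_dir) / "sample_0.jpg"
        Image.new("RGB", (300, 300), color=(128, 64, 32)).save(image_path)

        dataset = ImageDataset(root=tmp_dir)
        image, label = dataset[0]

        assert image.dtype == torch.float32
        assert image.shape == (3, 256, 256)  # size expected after the transforms
        assert 0.0 <= image.min() and image.max() <= 1.0
        assert isinstance(label, int)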

A data loader (torch.utils.data.DataLoader) is expected to wrap the dataset and should also be tested.
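
A minimal sketch, reusing the temporary dataset from the previous test and assuming it holds enough samples to fill a batch:

from torch.utils.data import DataLoader

# dataset is built from temporary data, as in the previous test
loader = DataLoader(dataset, batch_size=4, shuffle=True)
images, labels = next(iter(loader))

assert images.shape == (4, 3, 256, 256)
assert labels.shape == (4,)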

Test Training

When testing the training step, it is important not only to execute it without errors, but also to validate gradient computations, expected loss and metric values, dropout and normalization layers, memory leaks, and even the model design itself. All these areas involve a more complex analysis of the code.
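
A sketch of a single-step test covering gradient computation and a finite loss value, reusing the factory helpers from the previous section (assumed to live in utils.train; the config names are illustrative):

import torch

from hparameters import data_config, model_config, optimizer_config  # the last two are illustrative names
from utils.train import create_model, create_optimizer, get_loss_function


def test_single_training_step():
    model = create_model(model_config)
    optimizer = create_optimizer(model, optimizer_config)
    loss_fn = get_loss_function()

    inputs = torch.randn(4, 3, 256, 256)
    targets = torch.randint(0, data_config.NUM_CLASSES, (4,))

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # The loss should be a finite scalar
    assert torch.isfinite(loss).item()

    # Every trainable parameter should have received a gradient
    for name, parameter in model.named_parameters():
        assert parameter.grad is not None, f"{name} has no gradient"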

Dry/test runs

Given the nature of machine learning models, even after all the previously discussed tests, one still needs to execute a dry run to assert the correct integration between the training code, the dataset, and the expected results.

The two recommended runs are:

A dataset run

Confirm the structure of the training/validation/test datasets: the correct type of input data (JPG, PNG), the correct representation (RGB), labels matching the inputs, the shape of the inputs, etc.

A practical implementation is to load the images one by one, record their information in a data frame, and visualize the distributions. Perhaps a bunch of images are in an incorrect PNG format while the rest are JPEGs.
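
A sketch of this idea using PIL and pandas, assuming the folder only contains image files ("data/train" is an illustrative path):

from pathlib import Path

import pandas as pd
from PIL import Image

records = []
for image_path in sorted(Path("data/train").rglob("*")):
    if not image_path.is_file():
        continue
    with Image.open(image_path) as image:
        records.append({
            "path": str(image_path),
            "format": image.format,  # e.g. JPEG, PNG
            "mode": image.mode,      # e.g. RGB, RGBA, L
            "width": image.width,
            "height": image.height,
        })

frame = pd.DataFrame(records)
# A quick look at the distributions surfaces unexpected formats or representations
print(frame["format"].value_counts())
print(frame["mode"].value_counts())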

A training run

A well-designed model should be able to overfit a small subset of data. Gather a few samples (around 10) and run the training script. This not only gives information about the design of the model, but also validates that the training script runs correctly. The reported metrics should also be consistent with the training, for example a loss close to zero on the overfit subset.
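
A sketch of such a run, reusing the factory helpers, where train_dataset comes from the project's data loading code and the config names are illustrative:

from torch.utils.data import DataLoader, Subset

from hparameters import model_config, optimizer_config  # illustrative names
from utils.train import create_model, create_optimizer, get_loss_function

# Overfit a tiny, fixed subset of the training data
small_subset = Subset(train_dataset, indices=list(range(10)))
loader = DataLoader(small_subset, batch_size=10, shuffle=True)

model = create_model(model_config)
optimizer = create_optimizer(model, optimizer_config)
loss_fn = get_loss_function()

model.train()
for epoch in range(200):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

# A well-designed model should drive the loss close to zero on 10 samples
print(f"final loss: {loss.item():.4f}")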

Debug training

Debug training goes beyond preventing errors in the execution of the training step and focuses more on the results of the model. A debugging approach prevalent in classification problems is to contrast the class predictions made by the model with the class labels annotated in the dataset.

Incorrectly labeled data often sneaks into datasets. In these situations, models tend to predict a class different from the annotation with high confidence. By contrasting the dataset labels with the model's predictions, one can often uncover mislabeled samples.
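
A sketch of this kind of check, collecting the samples where a trained model confidently disagrees with the annotated label (model and dataset come from the project, and the 0.9 threshold is an arbitrary choice):

import torch

suspicious = []

model.eval()
with torch.no_grad():
    for index in range(len(dataset)):
        image, label = dataset[index]
        probabilities = torch.softmax(model(image.unsqueeze(0)), dim=1).squeeze(0)
        prediction = int(probabilities.argmax())

        # High confidence on a class that differs from the annotated label
        if prediction != label and probabilities[prediction] > 0.9:
            suspicious.append((index, label, prediction, float(probabilities[prediction])))

print(f"{len(suspicious)} samples to review by hand")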

Experiment tracking

Experiment tracking can be considered part of the testing or debugging of an ML project. It involves testing different hyper-parameters, such as the learning rate, the number of layers, or the strength of the regularization, and choosing between various components, like the type of optimizer, normalization or activation layers, or design decisions such as replacing pooling layers with strided convolutions to reduce spatial dimensionality.

Seeking the best design requires ablation studies, where components are added or removed to measure their impact on the final metrics. Ablation studies should not only measure the response quality of the model, but also the response time, the size of the model, and its complexity.

One can follow this approach: first train the model with all its components. Then remove components, one at a time, using at least 3 different fixed random seeds, keeping the same hyper-parameters, and train. Finally, compute the mean of the metrics and loss across seeds to determine the best version of the model, with or without certain components.
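
A minimal helper to fix the random seeds between runs, assuming PyTorch and NumPy as the sources of randomness:

import random

import numpy as np
import torch


def set_seed(seed):
    # Fix every common source of randomness between runs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


for seed in (0, 1, 2):
    set_seed(seed)
    # ...train the ablated model with unchanged hyper-parameters and record its metrics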

Often, certain components require a change in the hyper-parameters to perform better, or are sensitive to random seeds. For instance, batch normalization layers benefit from a larger batch size. In these situations, further tests may be required (no change vs. a change in the hyper-parameters).

A must is the use of a logging tool, like Weights & Biases, to record all the different experiments easily.