Training

Training is the process by which Applio learns to clone a voice or sound, using self-supervised speech representations.

⚠️

Training is only available on NVIDIA GPUs. If none is available, the Training tab will be disabled.
If you don’t have a compatible GPU, we offer alternatives to run Applio in the cloud.
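As a quick pre-check before opening Applio, the sketch below tests whether an NVIDIA GPU is visible on the machine. It is only a heuristic based on the `nvidia-smi` tool that ships with the NVIDIA driver, and it is an assumption that this matches what Applio itself checks; the function name `has_nvidia_gpu` is made up for this example.

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Return True if `nvidia-smi` is on PATH and lists at least one GPU.

    Heuristic only: this is not Applio's own detection logic.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False  # driver tools are not installed or not on PATH
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per GPU, each starting with "GPU <n>:"
    return result.returncode == 0 and "GPU" in result.stdout


if __name__ == "__main__":
    print("NVIDIA GPU detected" if has_nvidia_gpu() else "No NVIDIA GPU detected")
```

If this reports no GPU, the cloud alternatives mentioned above are probably the better route.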

Training a Model

Step 1: Dataset Preparation

Before you can start the training process, you need a set of audio recordings of the desired voice. It is recommended to have:

A minimum of 10 minutes of clean audio, without noise or silences.
A lossless audio format, either .wav or .flac.

The recommended duration for a good dataset is 30 minutes. If you have less audio than that, pretrained models can still help you get a good model from limited data.
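The duration guidelines above can be checked with a small script. This is a minimal sketch that sums the length of the `.wav` files in a folder using Python's standard `wave` module. It assumes PCM `.wav` files only (`.flac` would need a third-party library), and the function names and folder layout are hypothetical.

```python
import wave
from pathlib import Path

MIN_MINUTES = 10          # minimum suggested in this guide
RECOMMENDED_MINUTES = 30  # recommended in this guide


def wav_seconds(path: Path) -> float:
    """Duration of one PCM .wav file, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_minutes(folder: Path) -> float:
    """Total duration, in minutes, of the .wav files directly inside `folder`."""
    return sum(wav_seconds(p) for p in sorted(folder.glob("*.wav"))) / 60


def describe(minutes: float) -> str:
    """Compare a total duration against the guideline values."""
    if minutes < MIN_MINUTES:
        return "below the 10-minute minimum"
    if minutes < RECOMMENDED_MINUTES:
        return "usable, but below the recommended 30 minutes"
    return "meets the recommended 30 minutes"
```

For example, `describe(dataset_minutes(Path("my_dataset")))` would report where a hypothetical `my_dataset` folder stands. The script measures length only; it cannot tell whether the audio is clean.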

Step 2: Dataset Processing

Once you have the audio ready, be sure to select the sample rate that matches your files (32k, 40k, or 48k). After that, name your model and run the Preprocess Dataset step.
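If you are unsure which option matches your files, you can read the sample rate from a `.wav` header. The sketch below does that with the standard `wave` module. The selection rule (take the highest supported rate that does not exceed the file's own rate, so the audio is never upsampled) is an assumption for this example, not documented Applio behaviour.

```python
import wave

# The three options offered in the interface, keyed by rate in Hz.
SUPPORTED_RATES = {32000: "32k", 40000: "40k", 48000: "48k"}


def wav_sample_rate(path: str) -> int:
    """Sample rate in Hz stored in a PCM .wav header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate()


def pick_sample_rate(file_rate_hz: int) -> str:
    """Assumed rule: highest supported rate <= the file's rate.

    Falls back to the lowest option when the file is below 32 kHz.
    """
    candidates = [rate for rate in SUPPORTED_RATES if rate <= file_rate_hz]
    best = max(candidates) if candidates else min(SUPPORTED_RATES)
    return SUPPORTED_RATES[best]
```

Under this rule a 44.1 kHz recording maps to 40k and a 48 kHz recording maps to 48k.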

Step 3: Feature Extraction

Now you are at the second-to-last step! But what is feature extraction?

Feature extraction is an essential step before training. It converts each audio fragment produced by the preprocessing step into files the trainer can read, including the fragment's F0 (fundamental frequency, that is, its pitch) as estimated by an F0 model.

Several F0 models are available to choose from, but the best performer is RMVPE.

Once you have selected your F0 model, press Extract Features to start the process. Keep an eye on the CMD window until you see a message indicating that the process is complete.
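To make the idea of F0 concrete, here is a deliberately naive pitch estimator based on autocorrelation. It is not the method used by any of the F0 models offered in Applio, which are far more robust; it only illustrates what "fundamental frequency" means for a short audio fragment. The function name and the default search range are choices made for this example.

```python
import math


def estimate_f0(samples, sample_rate, fmin=50.0, fmax=1000.0):
    """Estimate F0 in Hz as the lag with the highest autocorrelation.

    Naive illustration only: works on clean, steady tones and will
    fail on noisy or unvoiced audio.
    """
    n = len(samples)
    min_lag = max(1, int(sample_rate / fmax))   # shortest period searched
    max_lag = min(int(sample_rate / fmin), n - 1)  # longest period searched
    if max_lag < min_lag:
        raise ValueError("fragment too short for the requested F0 range")

    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Similarity between the signal and a copy of itself shifted by `lag`.
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


if __name__ == "__main__":
    rate = 16000
    # 0.1 s of a synthetic 200 Hz sine wave standing in for a voiced fragment.
    tone = [math.sin(2 * math.pi * 200 * i / rate) for i in range(1600)]
    print(estimate_f0(tone, rate))
```

Run on the synthetic tone, the script should print a value close to 200 Hz. Real voices are much harder, which is why a dedicated F0 model is used instead.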

Final Step: Model & Index Training

This is where the real work begins. Before you start training your model, you will need to make a few small adjustments.

Save Every Epoch: Set this value between 10 and 50 to determine how often the model's state is saved during training.
Total Epochs: The number of epochs needed varies based on your dataset. Monitor progress using TensorBoard; typically, models perform well around 200-400 epochs.
Batch Size: Adjust based on your GPU's VRAM. For 8 GB VRAM, use a batch size between 6 and 8. Consider CUDA cores when experimenting with higher batch sizes.
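The three settings above interact in a simple way, which the following arithmetic sketch makes explicit. It assumes one optimizer step per batch and that the final partial batch of an epoch is kept; Applio's data loader may behave differently. `num_segments` (the number of audio fragments after preprocessing) is a hypothetical input.

```python
import math


def training_plan(num_segments, batch_size, total_epochs, save_every_epoch):
    """Rough step and checkpoint counts for a training run."""
    # Assumption: the last, smaller batch of each epoch still counts as a step.
    steps_per_epoch = math.ceil(num_segments / batch_size)
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * total_epochs,
        # One save each time a full 'Save Every Epoch' cycle completes.
        "checkpoints_saved": total_epochs // save_every_epoch,
    }


if __name__ == "__main__":
    # 900 fragments, batch size 8, 300 epochs, saving every 25 epochs.
    print(training_plan(900, 8, 300, 25))
```

With these example numbers the run has 113 steps per epoch, 33,900 steps in total, and 12 saves. A smaller Save Every Epoch value gives more restore points but uses more disk space.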

ℹ️ Index Generation: Generating an index is a must. Just click the “Train Index” button to run the process.

Other Options

Pitch Guidance: Trains the model with pitch (F0) information so that the output can follow the pitch variation of the input voice.
Pretrained: Uses the RVC pretrained models, which is the usual choice for training common models. Uncheck it only if you want to create a pretrained model yourself.
Save Only Latest: Overwrites the same D/G files with newer data instead of keeping every checkpoint. This helps to prevent storage from filling up.
Save Every Weights: Saves the weights of the model each time a 'Save Every Epoch' cycle is completed.
Custom Pretrained: Uses the custom pretrained models that you have loaded.
GPU Settings: Lets you choose which GPUs to use (only for users who have more than one GPU).
Overtraining Detector: Enable it only if you will train for more than 200 epochs.
Overtraining Threshold: Set the maximum number of epochs without detected improvement after which training stops.
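The Overtraining Detector and Overtraining Threshold options describe what is commonly called patience-based early stopping. The sketch below shows that general technique on a list of per-epoch loss values. It is an assumption that Applio's detector follows this exact logic or uses this kind of loss value; the function name is made up for this example.

```python
def epoch_training_stops(losses, threshold):
    """Return the 1-based epoch at which training would stop, or None.

    Generic early stopping: stop once `threshold` consecutive epochs
    pass without the loss dropping below the best value seen so far.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0  # improvement resets the counter
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= threshold:
                return epoch
    return None  # never stalled long enough to stop early


if __name__ == "__main__":
    history = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.70, 0.70]
    print(epoch_training_stops(history, threshold=3))
```

A low threshold stops sooner but risks quitting during a temporary plateau; a high threshold is more tolerant but spends more epochs before giving up.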

(Optional) Resume Training

Open Applio if you have closed it.
Then, in the Applio interface, enter your model name, select the same sample rate, and go to the last part of the Train tab. Set the same batch size and the same pretrained option (if you used one), and increase the total number of epochs to the new target.
Once configured, press 'Start Training' to start the process. Progress is logged in the CMD window.
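The resume rules above amount to a small consistency check: everything stays the same except the epoch count, which must go up. The sketch below expresses that check in code. `TrainSettings` and `resume_problems` are hypothetical names for this example and do not correspond to anything in Applio's code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainSettings:
    """Hypothetical record of the values entered in the Train tab."""
    model_name: str
    sample_rate: str   # "32k", "40k" or "48k"
    batch_size: int
    pretrained: bool
    total_epochs: int


def resume_problems(original: TrainSettings, resumed: TrainSettings) -> list:
    """List the ways `resumed` breaks the resume rules; empty means OK."""
    problems = []
    # These values must match the original run exactly.
    for name in ("model_name", "sample_rate", "batch_size", "pretrained"):
        before, after = getattr(original, name), getattr(resumed, name)
        if before != after:
            problems.append(f"{name} changed from {before!r} to {after!r}")
    # The epoch target must be raised, otherwise there is nothing left to train.
    if resumed.total_epochs <= original.total_epochs:
        problems.append("total_epochs must be higher than in the original run")
    return problems
```

For instance, resuming a 200-epoch run with an identical configuration and a 400-epoch target yields an empty list, while changing the sample rate or leaving the epochs at 200 is reported.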
