# Ultimate Vocal Remover GUI

This is a deep-learning-based tool that extracts the instrumental track from a track containing vocals. This project is a GUI version of the vocal remover created and posted by tsurumeso. This would not have been possible without tsurumeso's work and dedication! You can find the command line version [here](https://github.com/tsurumeso/vocal-remover)

## Installation
### Notes on Training
- FIRST AND FOREMOST! The notes I've provided below are from my own experience; I've learned as I went along. All technical questions regarding the training process should be directed to tsurumeso's Vocal Remover GitHub [here](https://github.com/tsurumeso/vocal-remover)
- Training can take a very long time, anywhere from a day to several weeks straight, depending on how many pairs you're using and your PC specs.
- The better your PC specifications are, the quicker training will be. Also, CPU training is a lot slower than GPU training.
- Training using your GPU is a must! CPU training is possible but prohibitively slow. The more VRAM you have, the bigger you can make your batch size. 4 is the default.
- Keep in mind, not every official studio instrumental will align with its official mix counterpart. If a pair doesn't align, it shouldn't be used in your dataset. What I found is that if the timing differs even slightly between the two tracks, it can be impossible to align them. If you don't know how to align tracks, know that it's the exact same process people use to extract vocals from an instrumental and a mix. Vocal extraction isn't necessary for this, but checking whether vocals can be extracted with that classic method with MINIMAL to no artifacts makes for a good litmus test of whether or not the pair is a good candidate for training (see the alignment-check sketch after this list). There are tons of tutorials on YouTube that show how to perfectly align tracks. You can also bypass some of that work by making an instrumental and mix out of multi-tracks/stems; using multi-tracks to create pairs is the most concrete way to build a perfectly aligned dataset. However, carefully aligning waveforms and spectrograms for instrumentals and mixes is a pretty good way too, because it lets you easily expand your dataset.
- From my own experience, I advise against using instrumentals with background vocals or TV tracks, as they can undermine the effectiveness of the resulting models.
- I have found that you can use tracks of any quality, as long as the tracks in the pair are natively the exact same quality (use Spek or Audacity to confirm this). For example, if you have a lossless WAV mix but the instrumental is only a 128kbps MP3, you'll probably want to convert the lossless mix down to 128kbps so the spectrograms can match up. Don't convert a file up from 128kbps to anything higher like 320kbps, as that won't help at all. If you have an instrumental that's 320kbps and no lossless version of it exists, I recommend converting the lossless mix WAV file down to 320kbps (see the conversion sketch after this list). With that being said, using high-quality tracks does make training more efficient, but 320kbps versus a lossless WAV file doesn't appear to make much of a difference at all. However, I suggest not using ANY pair below 128kbps. You're better off keeping the dataset between 128-320kbps.
- When you start training, you'll see it go through what are called "epochs". 100 epochs is the standard, but more can be set if you feel it's necessary. However, if you train for more than 100 epochs, you run the risk of the training session stagnating (basically wasting training time). You need to keep a close eye on the "training loss" and "validation loss" numbers. The goal is to get those numbers as low as you can. However, the "training loss" should always be slightly (and I mean slightly) lower than the "validation loss". If the "validation loss" is ever significantly higher than the "training loss" after more than 10 full epochs, you might be overfitting the model. If the "training loss" and "validation loss" numbers are both high and stay high after 2 or 3 epochs, then either the model needs more time to train, your dataset is of poor quality, or your dataset is too small (you should be training with a bare minimum of 50-75 pairs). A sketch of this loss check appears after this list.
- A new model spawns after each epoch that achieves a new best validation loss (see the checkpointing sketch after this list), so you can get a good idea as to whether or not it's getting better on the data you're training it on. However, I recommend not running conversions during the training process. Load the model onto an external device and test it on a separate PC, or via Google Colab if you must. (Please note: Google Colab is not compatible with GUIs. However, you can use the command line version of the AI for inferences.)
- I recommend dedicating a PC to training. Close all applications and free up as much RAM as you can, even if you're using your GPU. A.I. training like this is a computationally intensive process, so make sure your PC is properly cooled, and check its temperature every so often to keep it from overheating if you're not using a commercial-grade server or a high-end cooling system.
- Be patient! The bigger the dataset, the longer training will take. I've yielded the best results using at least 300 pairs or more. The multi-genre model took over a week and a half to train.
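For the alignment litmus test mentioned above, here is a minimal sketch in Python, assuming librosa and soundfile are installed; the file names are hypothetical, and the energy-ratio score is just a rough indicator you'd calibrate by ear:

```
import librosa
import numpy as np
import soundfile as sf

# Load the pair at a common sample rate (file names are hypothetical).
mix, sr = librosa.load("mix.wav", sr=44100, mono=True)
inst, _ = librosa.load("instrumental.wav", sr=44100, mono=True)

# Trim to the shorter of the two so the arrays line up sample-for-sample.
n = min(len(mix), len(inst))
mix, inst = mix[:n], inst[:n]

# The classic vocal-extraction trick: subtract (phase-invert and sum) the
# instrumental from the mix. A well-aligned pair leaves mostly vocals;
# a misaligned pair leaves smeared instrumental bleed.
residual = mix - inst

# Crude alignment score: how much of the mix's energy failed to cancel.
ratio = float(np.sum(residual ** 2) / np.sum(mix ** 2))
print(f"residual energy ratio: {ratio:.3f}")

# Listen to the residual and judge the artifacts for yourself.
sf.write("residual.wav", residual, sr)
```

If the residual sounds like a clean a cappella, the pair is a good training candidate; if the whole song bleeds through, fix the alignment or drop the pair.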
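For matching the quality of a pair, a sketch using pydub (which shells out to ffmpeg, so ffmpeg must be on your PATH; the paths and bitrate are illustrative):

```
from pydub import AudioSegment

# Bring the lossless mix down to the instrumental's native bitrate so the
# spectrograms of the pair match up (file names are hypothetical).
mix = AudioSegment.from_file("mix_lossless.wav")
mix.export("mix_320.mp3", format="mp3", bitrate="320k")
```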
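The loss bookkeeping described in the epoch note boils down to a check like this; the 10% slack margin is an arbitrary illustration, not a value from this project:

```
def loss_health_check(epoch: int, train_loss: float, val_loss: float,
                      slack: float = 1.10) -> None:
    """Rough per-epoch verdict on the numbers the training log prints."""
    # slack = 1.10 means "val loss more than 10% above train loss";
    # the margin is arbitrary and only here for illustration.
    if epoch > 10 and val_loss > train_loss * slack:
        print(f"epoch {epoch}: validation loss is pulling away from "
              f"training loss -- possible overfitting")
    elif val_loss >= train_loss:
        print(f"epoch {epoch}: healthy -- train loss {train_loss:.4f} "
              f"sits slightly below val loss {val_loss:.4f}")
    else:
        print(f"epoch {epoch}: train loss above val loss -- fine early on, "
              f"keep watching")
```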
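And the save-on-best-validation-loss behavior is standard checkpointing; a sketch assuming a PyTorch-style setup, where train_one_epoch and validate are placeholders for whatever your training script actually runs:

```
import torch

def train_with_checkpoints(model, train_one_epoch, validate, epochs=100):
    """Save a checkpoint only when validation loss improves.

    train_one_epoch and validate are hypothetical callables that run one
    pass over the training/validation data and return a float loss.
    """
    best_val_loss = float("inf")
    for epoch in range(1, epochs + 1):
        train_loss = train_one_epoch(model)
        val_loss = validate(model)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            # The newest file on disk is always the best model seen so far.
            torch.save(model.state_dict(), f"model_epoch{epoch}.pth")
            print(f"epoch {epoch}: new best val loss {val_loss:.4f}, saved")
```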
### Offline data augmentation
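The exact command for this step lives in tsurumeso's repo. As a hedged illustration of what offline augmentation does, here is a sketch that pitch-shifts both halves of a pair by the same amount so they stay aligned; librosa and soundfile are assumed, and the file names and one-semitone shift are arbitrary examples:

```
import librosa
import soundfile as sf

# Shift instrumental and mix identically so the pair stays aligned
# (file names and the one-semitone shift are illustrative).
for name in ("instrumental", "mix"):
    y, sr = librosa.load(f"{name}.wav", sr=44100, mono=True)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=-1)
    sf.write(f"{name}_shift_down1.wav", shifted, sr)
```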