mirror of https://github.com/Anjok07/ultimatevocalremovergui.git

Commit b2cf94fcdb ("Add files via upload"), parent 98349dcd50
models/READMEFIRST.txt (new file, +74 lines)

Vocal Remover Installation Instructions

**These instructions are for Windows systems only!**

The application was made with Tkinter for cross-platform compatibility, so it should work on Windows, Mac, and Linux systems. I've only personally tested it on Windows 10 & Ubuntu Linux.

Prerequisites for Running the Vocal Remover GUI

1. Download & install Python 3.7 from https://www.python.org/downloads/ *Make sure to check the box that says "Add Python 3.7 to PATH"*
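If you want to confirm the installer actually put Python on your PATH, you can open a new Command Prompt and run these standard version checks (not part of the original steps) -

python --version
pip --version

Both should report versions without errors; if they don't, re-run the installer with the PATH box checked.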
2. Once Python has installed, open the Windows Command Prompt and run the following installs -

- If you plan on doing conversions with your Nvidia GPU, please install the following -

pip install torch==1.5.0+cu101 torchvision==0.6.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html

- If you don't have a compatible Nvidia GPU and plan on using only the CPU version, please do not check the "GPU Conversion" option in the GUI and install the following -

pip install torch==1.5.0+cpu torchvision==0.6.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

- The rest need to be installed regardless! -

pip install Pillow
pip install tqdm==4.30.0
pip install librosa==0.6.3
pip install opencv-python
pip install numba==0.48.0
pip install SoundFile
pip install soundstretch
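If you prefer, the same set can be installed with a single command (equivalent to running each line above separately) -

pip install Pillow tqdm==4.30.0 librosa==0.6.3 opencv-python numba==0.48.0 SoundFile soundstretch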
3. If you have an Nvidia GPU, I strongly advise that you install the following Cuda driver so you can run conversions with it. You can download the driver here - https://developer.nvidia.com/cuda-10.1-download-archive-base?target_os=Windows&target_arch=x86_64
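Once the driver is installed, you can confirm the GPU build of torch from step 2 can actually see your card; this one-liner uses PyTorch's standard availability check and should print True -

python -c "import torch; print(torch.cuda.is_available())"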
4. For the ability to convert mp3, mp4, m4a, and flac files, you'll need ffmpeg installed and configured! I found a pretty good video on YouTube that explains the process. It's actually a Spleeter tutorial, but the link points to the relevant portion of the video, which covers the ffmpeg installation. Check it out here - https://www.youtube.com/watch?v=tgnuOSLPwMI&feature=youtu.be&t=307
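You can confirm ffmpeg is reachable from your PATH by running its standard version flag in a new Command Prompt -

ffmpeg -version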

Running the Vocal Remover Application GUI

1. Place this folder wherever you wish (I put mine in my documents folder) and create a desktop shortcut for the file labeled "VocalRemover.py"
2. Open the application
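If the shortcut gives you trouble, you should also be able to launch the GUI directly from a Command Prompt opened inside this folder (assuming the filename above) -

python VocalRemover.py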

Notes Regarding the GUI

- The application will automatically remember your "save to" path upon closing and reopening until you change it
- You can select as many files as you like. Multiple conversions are supported!
- Conversions on wav files should always work with no issue. However, you will need to install and configure ffmpeg in order to run conversions on mp3, mp4, m4a, and FLAC formats. If you select non-wav music files without having ffmpeg configured and attempt a conversion, the application will freeze and you will have to restart it.
- Only check the GPU box if you have the Cuda driver installed for your Nvidia GPU. Most Nvidia GPUs released prior to 2015, or with less than 4GB of V-RAM, might not be compatible.
- The dropdown model menu consists of the Multi-Genre Model I just finished (trained on 700 pairs), a stacked model (a model trained on converted data), & the stock model the AI originally came with (for comparison). I added the option to add your own model as well if you've trained one. Alternatively, you can simply add a model to the models directory and restart the application, as it will automatically show up there.
- The SR, HOP LENGTH, and WINDOW SIZE parameters are set to the defaults. Those were the parameters used in training, so changing them may result in poor conversion performance unless the model is compatible with the changes. These are essentially advanced settings, so I recommend you leave them as is unless you know exactly what you're doing.
- The Post-Process option is a development option. Keep it unchecked for most conversions, unless you have a model that is compatible with it.
- The "Save Mask PNG" option allows you to save a copy of the spectrogram as a PNG.
- The Stacked Model is meant to clean up vocal residue left over in the form of vocal pinches and static.
- The "Stack Passes" option should only be used with the Stacked Model. It lets you set the number of times you want a track to run through the model, and the number of passes needed will vary greatly by track. Most tracks won't require more than 5 passes; if you do 5 or more passes on a track, you risk quality degradation. When doing stack passes, the first and last "vocal" track will give you an idea of how much static was removed.
- Conversion times will greatly depend on your hardware. This application will NOT be friendly to older or budget hardware. Please proceed with caution! Pay attention to your PC and make sure it doesn't overheat.

Other Notes

- Since there wasn't much demand for training, I went ahead and moved the training instructions to a readme file included within this package.
- I will be releasing new models over the course of the next few months.
- Running conversions on this application is a computationally intensive process! Please note, I do not take any responsibility for damaged hardware as a result of this application! It's highly recommended that you not use it on older or budget hardware. USE AT YOUR OWN RISK!

Enjoy!

Copyright © Anjok

models/Training ReadMe.txt (new file, +55 lines)

*Important Notes On Training*

Training models on this AI is very easy compared to Spleeter. However, there are a few things that need to be taken into consideration before going forward with training if you want the best results!

1. Training can take a very long time: anywhere from a day, to several days, to a week straight, depending on how many pairs you're using and your PC specs.
2. The better your PC specifications are, the quicker training will be. Also, CPU training is a lot slower than GPU training.
3. After much research, training using your GPU is recommended. The more V-RAM you have, the bigger you can make your batch size.
4. If you choose to train using your GPU, you will need to do research on your GPU first to see if it can be used to train. For example, if you have a high-performance Nvidia graphics card, you'll need to install compatible Cuda drivers for it to work properly. In many cases this might not actually be worth it due to V-RAM limitations, unless you have a badass GPU with at least 6GB of V-RAM, or you choose to order a GPU cluster from AWS or another cloud provider (that's an expensive route).
5. The dataset must consist of pairs of official instrumentals & official mixes.
6. The pairs MUST be perfectly aligned! Also, the spectrograms must match up evenly, and the pairs should have the same volume; otherwise training will not be effective and you'll have to start over.
7. I've found in my own experience that not every official studio instrumental will align with its official mix counterpart. If it doesn't align, it's NOT usable! I can't stress this enough. A good example of a pair that wouldn't align for me was "Bring Me The Horizon - Avalanche (Official Instrumental)" & its official mix with vocals: the timing was slightly different between the 2 tracks, which made them impossible to align. A good example of a pair that did align was the instrumental and mix for "Brandon Flowers - Swallow It". For those of you who don't know how to align tracks, it's the exact same process people use to extract vocals from an instrumental and a mix (vocal extraction isn't necessary here, but knowing whether the tracks can be aligned with this classic method, even with a lot of artifacts, makes for a good litmus test of whether the pair is a good candidate for training). There are tons of tutorials on YouTube that show how to perfectly align tracks. Also, you can bypass some of that work by making an instrumental and mix out of multitrack stems.
8. I advise against using instrumentals with background vocals, as well as TV tracks, as they can undermine the efficiency of the training sessions and the model.
9. From my experience, you can use tracks of any quality, as long as both tracks in a pair are the exact same quality (natively). For example, if you have a lossless wav mix but the instrumental is only a 128kbps mp3, you'll need to convert the lossless mix down to 128kbps so the spectrograms can match up. Don't convert a file up from 128kbps to anything higher like 320kbps, as that won't help at all and will actually make things worse. If you have an instrumental that's 320kbps and a lossless version of it doesn't exist, you're going to have to convert the lossless mix wav file down to 320kbps (see the example command after this list). With that being said, using high-quality tracks does make training slightly more efficient, but honestly, 320kbps versus a lossless wav file won't make much of a difference at all: the frequencies that differ between a 320kbps file and a lossless wav file don't contain enough audible vocal data to make an impact. You can use Spek or iZotope to view the spectrograms and make sure they match up. However, I suggest not using ANY pair below 128kbps. You're better off keeping the dataset between 128-320kbps.
10. When you start training you'll see it go through what are called "epochs". 100 epochs is the standard, but more can be set if you feel it's necessary. However, if you train for more than 100 epochs you run the risk of either "overfitting" the model (which renders it unable to make accurate predictions) or stagnating the model (basically wasting training time). You need to keep a close eye on the "training loss" & "validation loss" numbers. The goal is to get those numbers as low as you can. However, the "training loss" number should always be slightly (and I mean slightly) lower than the "validation loss". If the "validation loss" number is ever significantly higher than the "training loss", that means you're overfitting the model. If the "training loss" & "validation loss" numbers are both high and stay high after 2 or 3 epochs, then you either need to give it more time to train, or your dataset is too small (you should be training with a bare minimum of 50-75 pairs).
11. A new model spawns after each epoch that achieves the best validation loss so far, so you can get a good idea as to whether or not it's getting better based on the data you're training it on. However, I recommend you not run conversions during the training process. Load the model onto an external device and test it on a separate PC if you must.
12. I highly recommend dedicating a PC to training. Close all applications and free up as much RAM as you can, even if you use your GPU.
13. Proceed with caution! I don't want anyone burning out their PC! A.I. training like this is a computationally intensive process, so make sure your PC is properly cooled and check its temperature every so often to keep it from overheating. In fact, this is exactly why I use non-GUI Ubuntu Linux to do all my training: that platform lets me allocate nearly all of my system resources to training, and training only.
14. Be patient! I feel like I have to stress this because training can take a VERY long time if you're using your CPU, or if you're training with a dataset of more than 200 pairs on a GPU. This is why you want to prepare a good dataset before starting. The model my friend and I created took nearly a week and a half straight to complete! This also means not running any resource-sucking applications in the background, like browsers and music players. Prepare to essentially surrender usage of that PC until the training has completed.
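For the bitrate matching described in note 9, a standard ffmpeg command can handle the down-conversion (the filenames here are just placeholders) -

ffmpeg -i mix_lossless.wav -b:a 320k mix_320.mp3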

Note: First and foremost, please read the *Important Notes On Training* above before starting any of this!

Training Instructions

1. Download the Vocal Remover package that includes the training script from here - https://github.com/tsurumeso/vocal-remover
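If you have git installed, cloning the repository is an alternative to downloading the zip from that page (same URL as above) -

git clone https://github.com/tsurumeso/vocal-remover.git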
2. Place your pairs accordingly in the following directories in the main vocal-remover folder

dataset/
  +- instrumentals/
  |    +- 01_foo_inst.wav
  |    +- 02_bar_inst.mp3
  |    +- ...
  +- mixtures/
       +- 01_foo_mix.wav
       +- 02_bar_mix.mp3
       +- ...
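Before training, you can quickly sanity-check that a pair lines up (see notes 6 & 7 above) by comparing durations with librosa, which the GUI install instructions already pull in; the filenames are the example ones from the tree above -

python -c "import librosa; print(librosa.get_duration(filename='dataset/instrumentals/01_foo_inst.wav'), librosa.get_duration(filename='dataset/mixtures/01_foo_mix.wav'))"

The two numbers should be essentially identical. Matching durations don't guarantee sample-accurate alignment, but mismatched ones rule a pair out.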
3. Run one of the following commands (the -g 0 flag runs training on your first GPU; omitting it should leave the training process on your CPU) -

~To finetune the baseline model, run the following -

python train.py -i dataset/instrumentals -m dataset/mixtures -P models/baseline.pth -M 0.5 -g 0

~To create a model from scratch, run the following -

python train.py -i dataset/instrumentals -m dataset/mixtures -M 0.5

NOTE: This is not guaranteed to work on graphics cards with 4GB of V-RAM or less! You might need to change your batch settings.
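If you do hit memory errors on a smaller card, the training script should expose a batch-size option; flag names vary, so list what's available with the script's built-in help -

python train.py --help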

4. Once you see the following output on the command prompt, it means the training process has begun:

# epoch 0
* inner epoch 0