I have multiple diagram images, all of which contain labels as alphanumeric characters instead of just the text label itself. I want my YOLO model to identify all the numbers &
What you're describing appears to be OCR (optical character recognition). One OCR engine I know of is Tesseract, although there is also this one from IBM and others.
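As a rough sketch of the OCR route, assuming the `pytesseract` wrapper and Pillow are installed along with the Tesseract binary (none of which the question mentions), word-level boxes can be read from `image_to_data`. The helper names and the `min_conf` threshold of 60 are illustrative choices, not recommendations.

```python
def boxes_from_tesseract_data(data, min_conf=60.0):
    """Keep non-empty words whose confidence is at least min_conf.

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=pytesseract.Output.DICT).
    Returns a list of (text, left, top, width, height) tuples.
    """
    boxes = []
    for text, conf, left, top, width, height in zip(
        data["text"], data["conf"], data["left"],
        data["top"], data["width"], data["height"],
    ):
        text = text.strip()
        # Tesseract reports conf = -1 for non-word rows (blocks, lines, ...).
        if text and float(conf) >= min_conf:
            boxes.append((text, left, top, width, height))
    return boxes


def ocr_boxes(image_path, min_conf=60.0):
    """Run Tesseract on an image file and return the filtered word boxes."""
    # Imported lazily so the filtering helper works without the OCR stack.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path), output_type=pytesseract.Output.DICT
    )
    return boxes_from_tesseract_data(data, min_conf)
```

Calling `ocr_boxes("diagram.png")` (a hypothetical file name) would then give one tuple per recognized word. Short isolated labels on drawings are a hard case for general-purpose OCR, so the threshold will probably need tuning.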
As YOLO was originally trained for a very different task, using it to localize text will likely require retraining it from scratch. One could use existing packages (adapted to your specific setting) to produce the ground truth, although it is worth remembering that the model will generally be at most as good as its ground truth. Or, perhaps more easily, generate synthetic data for training (i.e. add text at positions you choose to existing drawings, then train the model to localize it).
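The synthetic-data idea can be sketched as follows: pick a position for each label, draw it onto an existing drawing, and record the box in the label format used by Darknet-style YOLO implementations (class index, then box center and size, all normalized by the image size). The function names, the fixed per-character size and the single class 0 are assumptions for illustration. The rendering step is left to whatever imaging library you use, and the real box should come from that library's text measurement.

```python
import random


def to_yolo_line(class_id, left, top, width, height, img_w, img_h):
    """Convert a pixel box to a 'class xc yc w h' line with normalized values."""
    xc = (left + width / 2) / img_w
    yc = (top + height / 2) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {width / img_w:.6f} {height / img_h:.6f}"


def random_label_boxes(texts, img_w, img_h, char_w=8, char_h=12, seed=None):
    """Choose a random in-bounds position for each text string.

    char_w and char_h are a crude stand-in for real font metrics.
    Returns a list of (text, left, top, width, height) tuples.
    """
    rng = random.Random(seed)
    placed = []
    for text in texts:
        width, height = char_w * len(text), char_h
        if width > img_w or height > img_h:
            raise ValueError(f"label {text!r} does not fit in the image")
        left = rng.randint(0, img_w - width)
        top = rng.randint(0, img_h - height)
        placed.append((text, left, top, width, height))
    return placed


# Example: three labels on a 640x480 drawing, all in class 0 ("text").
boxes = random_label_boxes(["R12", "C3", "A7"], 640, 480, seed=0)
label_lines = [to_yolo_line(0, l, t, w, h, 640, 480) for _, l, t, w, h in boxes]
```

This sketch does not prevent labels from overlapping each other or the drawing's existing content. Whether that matters depends on how closely the synthetic images need to resemble the real ones.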
Alternatively, if all of your target images are structured similarly to the one above, one could create ground truth using classic CV heuristics, as you did above, to separate/segment out symbols, followed by classification using a CNN trained on MNIST or a similar dataset to determine whether a given blob contains a symbol.
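One such heuristic, sketched here as an assumption since the exact steps used in the question are not shown, is connected-component labeling on a binarized image: each blob of foreground pixels becomes a candidate symbol, and its bounding box can be cropped and passed to the classifier. The version below is pure Python for clarity; `find_blobs` and `min_area` are illustrative names, and in practice a library routine such as OpenCV's `cv2.connectedComponentsWithStats` would do the same job much faster.

```python
from collections import deque


def find_blobs(binary, min_area=2):
    """Return bounding boxes (left, top, width, height) of 4-connected blobs.

    `binary` is a non-empty list of rows with 1 for foreground (ink) and
    0 for background. Blobs with fewer than min_area pixels are dropped.
    """
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            area = 0
            r0 = r1 = r
            c0 = c1 = c
            while queue:
                y, x = queue.popleft()
                area += 1
                r0, r1 = min(r0, y), max(r1, y)
                c0, c1 = min(c0, x), max(c1, x)
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if area >= min_area:
                boxes.append((c0, r0, c1 - c0 + 1, r1 - r0 + 1))
    return boxes


# Tiny example: two blobs and one isolated noise pixel.
image = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
]
blobs = find_blobs(image)
```

Characters made of several disconnected strokes (the dot of an "i", for example) come out as separate blobs, so a merging step based on proximity may be needed before classification.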
If you do opt for YOLO, there are existing implementations in Python, e.g. I had some experience with this one. It should be fairly straightforward to set up training with your own ground truth.
Finally, if using YOLO or a CNN is not a goal in itself but only a means to a solution, any of the above "ground truth" approaches could be used directly as the solution rather than for training a model.
Hope I understood your question correctly.