Docker and Nvidia GPU monitoring
We have seen a major shift in how models are created and used (cough, ChatGPT, cough). Nowadays, models are much larger and require far more computational power, and this trend is expected to continue. Companies with the most computational power will be able to create and deploy AI-powered applications and services quickly and efficiently, giving them a competitive advantage. As AI becomes increasingly prevalent and integrated into our lives, the companies with the most computational resources will likely be the ones that dominate the AI landscape. I highly recommend reading this post on the cost of training ML models. Spoiler alert: it might cost over $1 billion by 2030.
Back to the purpose of this blog: to make the most of your GPUs, you need to understand how they are actually being used over time. Knowing this can make all the difference when it comes to getting optimal performance. Data scientists often pair Nvidia GPUs with Docker to power their work, and Docker is the go-to tool for training deep learning models because it helps you avoid the dreaded “dependency hell.”
Both technologies provide monitoring and observability features, albeit separately. For example, you can find official and non-official exporters for Docker that expose container metrics to Prometheus for viewing in a Grafana dashboard. The same goes for Nvidia.
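To make that concrete, here is a minimal sketch of the two commonly used exporters: cAdvisor (a Google-maintained, unofficial exporter for container metrics) and dcgm-exporter (Nvidia's official GPU exporter). The image tags and mounts below are illustrative placeholders; check each project's documentation for the full, current run commands.

```bash
# cAdvisor: per-container CPU, memory, and network metrics on port 8080.
# Replace <version> with a release tag; some metrics need extra mounts/--privileged.
docker run -d --name cadvisor -p 8080:8080 \
  -v /:/rootfs:ro -v /var/run:/var/run:ro \
  -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:<version>

# dcgm-exporter: per-GPU utilization, memory, and power metrics on port 9400.
# Replace <tag> with a release tag from NVIDIA NGC.
docker run -d --name dcgm-exporter --gpus all -p 9400:9400 \
  nvcr.io/nvidia/k8s/dcgm-exporter:<tag>
```

Point a Prometheus scrape job at ports 8080 and 9400 and both sets of metrics become available to Grafana.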
If you are using Docker and an Nvidia GPU on a small scale, you might not need those fancy exporters. You can run `docker stats` to see container metrics, and for your Nvidia GPU you can run `nvidia-smi` to see GPU utilization and other metrics.
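For reference, these are the kinds of quick checks that looks like; the formatting flags are optional niceties:

```bash
# Per-container CPU and memory usage, printed once instead of streaming.
# Note that docker stats says nothing about GPU usage.
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

# Per-GPU utilization and memory, refreshed every 2 seconds.
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
  --format=csv -l 2
```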
This post is about knowing exactly how much of a GPU each container is utilizing, and which GPU it is using. To my knowledge, there are currently no plugins from Docker or Nvidia that connect these two pieces, which makes it difficult to observe the utilization of training infrastructure. Therefore, it is important to develop your own methods for understanding and observing how your training infrastructure (Docker and the GPUs) is used, as sketched below. Moreover, MLOps frameworks may not provide all the features and capabilities needed to effectively deploy, manage, and monitor machine learning models.
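One way to stitch the two sides together, sketched below, is to ask `nvidia-smi` for the PIDs of GPU compute processes and then look up each PID's cgroup to find the owning container. This is a rough illustration rather than the repo's exact implementation, and it assumes a cgroup v1-style layout where the 64-character container ID appears in `/proc/<pid>/cgroup`:

```bash
#!/usr/bin/env bash
# Sketch: map each GPU compute process to the Docker container that owns it.
# On cgroup v2 hosts the cgroup path layout differs and the grep needs adjusting.

# List compute processes: GPU UUID, host PID, used GPU memory (MiB).
nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader,nounits |
while IFS=', ' read -r gpu_uuid pid used_mem; do
    # Pull the container ID out of the process's cgroup entry, if there is one.
    cid=$(grep -oE '[0-9a-f]{64}' "/proc/${pid}/cgroup" 2>/dev/null | head -n1)
    if [ -n "$cid" ]; then
        name=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null | sed 's|^/||')
    else
        name="(host process)"
    fi
    echo "GPU ${gpu_uuid}  PID ${pid}  ${used_mem} MiB  container: ${name:-unknown}"
done
```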
The repo solves the problem mentioned earlier: knowing which container utilizes which GPU. By exporting the results with Prometheus, you can easily view container utilization in a Grafana dashboard. Alternatively, you can run the bash scripts individually for a quick look.
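As an illustration of the Prometheus side, one common pattern (not necessarily the repo's exact approach) is to write the container-to-GPU mapping in the Prometheus text format to a file picked up by node_exporter's textfile collector. The metric name, labels, and directory below are placeholders:

```bash
#!/usr/bin/env bash
# Sketch: expose the mapping as a Prometheus gauge via node_exporter's textfile collector.

TEXTFILE_DIR="/var/lib/node_exporter/textfile_collector"   # node_exporter --collector.textfile.directory
OUT="${TEXTFILE_DIR}/container_gpu.prom"

{
    echo '# HELP container_gpu_memory_used_mib GPU memory used per container process.'
    echo '# TYPE container_gpu_memory_used_mib gauge'
    nvidia-smi --query-compute-apps=gpu_uuid,pid,used_memory --format=csv,noheader,nounits |
    while IFS=', ' read -r gpu_uuid pid used_mem; do
        cid=$(grep -oE '[0-9a-f]{64}' "/proc/${pid}/cgroup" 2>/dev/null | head -n1)
        name=$(docker inspect --format '{{.Name}}' "$cid" 2>/dev/null | sed 's|^/||')
        echo "container_gpu_memory_used_mib{gpu=\"${gpu_uuid}\",container=\"${name:-unknown}\"} ${used_mem}"
    done
} > "${OUT}.tmp" && mv "${OUT}.tmp" "$OUT"   # swap files so Prometheus never scrapes a partial write
```

Run it on a schedule (cron or a systemd timer), and Grafana can then chart per-container GPU usage alongside the regular container and GPU metrics.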