Introduction to Machine Learning Frameworks
CUDA
NVIDIA CUDA Toolkit Release Notes
Linux Installation Guide
Nvidia Driver Tuning
Installing and Configuring Nvidia on Debian
Local System
~$ nvidia-debugdump -l
Installing CUDA
- Follow the official guide to choose a suitable version to install.
Table 1. CUDA Toolkit and Compatible Driver Versions
CUDA Toolkit | Linux x86_64 Driver Version | Windows x86_64 Driver Version |
---|---|---|
CUDA 10.0.130 | >= 410.48 | >= 411.31 |
CUDA 9.2 (9.2.148 Update 1) | >= 396.37 | >= 398.26 |
CUDA 9.2 (9.2.88) | >= 396.26 | >= 397.44 |
CUDA 9.1 (9.1.85) | >= 390.46 | >= 391.29 |
CUDA 9.0 (9.0.76) | >= 384.81 | >= 385.54 |
CUDA 8.0 (8.0.61 GA2) | >= 375.26 | >= 376.51 |
CUDA 8.0 (8.0.44) | >= 367.48 | >= 369.30 |
CUDA 7.5 (7.5.16) | >= 352.31 | >= 353.66 |
CUDA 7.0 (7.0.28) | >= 346.46 | >= 347.62 |
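- Download an official Nvidia driver that meets the minimum required version: NVIDIA-Linux-x86_64-410.78.run.
- Remove the driver previously installed on the system.
~$ dpkg -l | grep "nvidia" | awk '{print $2}' | xargs sudo dpkg --purge
- Download the latest CUDA toolkit, cuda_10.0.130_410.48_linux.run, or install cuda-repo-ubuntu1804-10-0-local-10.0.130-410.48_1.0-1_amd64.deb.
- Installation order: install the graphics driver first, then the NVIDIA CUDA Toolkit. If you install from cuda_10.0.130_410.48_linux.run, pay attention to the interactive prompts during installation.
- Verify the installation.
~$ nvidia-smi
- Install the matching version of the cuDNN library, cudnn-10.0-linux-x64-v7.4.1.5.tgz.
- Install the matching version of NCCL.
~$ tar xvf cudnn-10.0-linux-x64-v7.4.1.5.tgz -C /usr/local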
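As a rough illustration of how Table 1 is used, the sketch below encodes the Linux column of the table and checks whether an installed driver is new enough for a given toolkit. The names `parse_version` and `driver_is_sufficient` and the dictionary layout are hypothetical; only the version numbers come from the table.

```python
# Minimum Linux x86_64 driver versions, copied from Table 1 above.
MIN_LINUX_DRIVER = {
    "10.0.130": "410.48",
    "9.2.148": "396.37",
    "9.2.88": "396.26",
    "9.1.85": "390.46",
    "9.0.76": "384.81",
    "8.0.61": "375.26",
    "8.0.44": "367.48",
    "7.5.16": "352.31",
    "7.0.28": "346.46",
}


def parse_version(text):
    """Turn '410.48' into (410, 48) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def driver_is_sufficient(cuda_version, driver_version):
    """Return True if driver_version meets the table's minimum for cuda_version."""
    minimum = MIN_LINUX_DRIVER[cuda_version]
    return parse_version(driver_version) >= parse_version(minimum)


if __name__ == "__main__":
    # The 410.78 driver mentioned above, checked against CUDA 10.0.130.
    print(driver_is_sufficient("10.0.130", "410.78"))
```

Comparing tuples of integers avoids the pitfall of comparing version strings character by character.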
Installing the Nvidia-tesla Driver on Debian Buster
~$ apt-get install nvidia-tesla-460-kernel-dkms nvidia-tesla-460-driver libnvidia-tesla-460-cuda1 nvidia-xconfig nvidia-tesla-460-smi
Error While Installing the dkms Driver
/var/lib/dkms/nvidia-tesla-460/460.91.03/build/common/inc/nv-misc.h:20:12: fatal error: stddef.h: No such file or directory
- The error above occurs because the headers under /usr/src/<linux-5.17-SRC>/include/linux are not included. Below is a combined driver patch; the patch file must be added to /usr/src/nvidia-tesla-460-460.91.03/dkms.conf.
~$ cat nvidia-tesla-460-460.91.03/patches/nvidia-tesla-460-linux-5.17-combind.patch
Installing PyCuda
- Pip install error with PyCUDA
~$ export PATH=/usr/local/cuda-10.0/bin:$PATH
~$ pip install pycuda
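The export line above matters because building PyCUDA needs the nvcc compiler to be reachable on PATH. A minimal sketch of a pre-install check follows; `find_nvcc` is a hypothetical helper, and `shutil.which` is from the standard library.

```python
import os
import shutil


def find_nvcc(cuda_bin="/usr/local/cuda-10.0/bin", path=None):
    """Return the full path to nvcc, searching cuda_bin before the given PATH."""
    if path is None:
        path = os.environ.get("PATH", "")
    # Skip empty entries so the current directory is never searched by accident.
    search_path = os.pathsep.join(p for p in (cuda_bin, path) if p)
    return shutil.which("nvcc", path=search_path)


if __name__ == "__main__":
    nvcc = find_nvcc()
    if nvcc is None:
        print("nvcc not found; pip install pycuda is likely to fail")
    else:
        print("nvcc found at", nvcc)
```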
Nvidia Docker
Installation
- Following the documentation linked above, install as follows.
# If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo apt-get purge -y nvidia-docker
# Add the package repositories
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd # Important if Docker was already running: SIGHUP makes dockerd reload its configuration.
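The `distribution=$(. /etc/os-release;echo $ID$VERSION_ID)` line above concatenates two fields of /etc/os-release to select the repository list. A minimal Python sketch of the same derivation follows; the helper names are hypothetical and the parser only handles simple KEY=VALUE lines.

```python
def parse_os_release(text):
    """Parse simple KEY=VALUE lines from os-release content into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def repo_list_url(os_release_text):
    """Build the nvidia-docker.list URL the same way the shell snippet does."""
    fields = parse_os_release(os_release_text)
    distribution = fields.get("ID", "") + fields.get("VERSION_ID", "")
    return "https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.list" % distribution


if __name__ == "__main__":
    sample = 'ID=ubuntu\nVERSION_ID="18.04"\n'
    print(repo_list_url(sample))
```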
Installing TensorFlow GPU with Docker
~$ docker pull tensorflow/tensorflow:latest-gpu \
Nvidia GPU CLOUD Containers
Installing and Running the Caffe2 Container
~$ docker run --runtime=nvidia -it caffe2ai/caffe2:latest python -m caffe2.python.operator_test.relu_op_test
~$ docker pull nvcr.io/nvidia/caffe2:18.08-py3
# Run the test.
~$ nvidia-docker run --runtime=nvidia -it nvcr.io/nvidia/caffe2:18.08-py3 python -m caffe2.python.operator_test.relu_op_test
- The example below runs jupyter notebook inside Docker and maps port 8888 in the container to port 9999 on the host, so it can be reached from a browser on the host at http://127.0.0.1:9999/.
- --rm removes the container once it exits.
- -it runs the container in interactive mode.
- -v maps a host directory into the container. Here the host directory /data/AI-DIR/TensorFlow/jupyter-notebook is mapped to /data/jupyter in the container.
~$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /data/AI-DIR/TensorFlow/jupyter-notebook:/data/jupyter -it -p 9999:8888 nvcr.io/nvidia/caffe2:18.08-py3 sh -c "jupyter notebook --no-browser --allow-root --ip 0.0.0.0 /data/jupyter"
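To make the port and volume mappings concrete, the sketch below assembles a comparable `nvidia-docker run` argument list from its parts. It only builds the list and does not start a container; `build_run_command` and its parameter names are hypothetical, and the resource-limit flags are left out for brevity.

```python
def build_run_command(image, host_port, container_port, host_dir, container_dir, command):
    """Assemble an nvidia-docker run argument list with port and volume mappings."""
    return [
        "nvidia-docker", "run",
        "--rm",                                       # remove the container when it exits
        "-it",                                        # interactive mode with a terminal
        "-v", "%s:%s" % (host_dir, container_dir),    # host directory -> container directory
        "-p", "%d:%d" % (host_port, container_port),  # host port -> container port
        image,
        "sh", "-c", command,
    ]


if __name__ == "__main__":
    argv = build_run_command(
        image="nvcr.io/nvidia/caffe2:18.08-py3",
        host_port=9999,
        container_port=8888,
        host_dir="/data/AI-DIR/TensorFlow/jupyter-notebook",
        container_dir="/data/jupyter",
        command="jupyter notebook --no-browser --allow-root --ip 0.0.0.0 /data/jupyter",
    )
    print(" ".join(argv))
```

On a host where nvidia-docker is installed, the resulting list could be handed to `subprocess.run(argv)`.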
Installing PyTorch
~$ docker pull nvcr.io/nvidia/pytorch:18.11-py3
Installing PyTorch from Source
~$ pyenv activate py3dev # Enter the Python 3.6 virtual environment via pyenv.
- If numpy fails with libmkl_rt.so: cannot open shared object file: No such file or directory, install libmkl_rt and then reinstall numpy.
- caffe2 now lives in the PyTorch source tree. If importing the modules below raises **ModuleNotFoundError: No module named 'past'**, install the dependency first with pip install future.
import matplotlib.pyplot as plt
- The warning net_drawer will not run correctly. Please install the correct dependencies. appears because pydot is not installed. Install it with pip install pydot.
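Both problems above come down to an importable module being absent. The sketch below reports which pip packages are still needed; the module-to-package mapping (`past` is provided by `future`, `pydot` by `pydot`) follows the notes above, while `missing_packages` is a hypothetical helper.

```python
import importlib.util

# Module name -> pip package that provides it, per the notes above.
REQUIRED = {
    "past": "future",
    "pydot": "pydot",
}


def missing_packages(required=REQUIRED):
    """Return the pip packages whose modules cannot be found, sorted by name."""
    missing = []
    for module_name, package in required.items():
        # find_spec returns None for a top-level module that is not installed.
        if importlib.util.find_spec(module_name) is None:
            missing.append(package)
    return sorted(missing)


if __name__ == "__main__":
    for package in missing_packages():
        print("pip install", package)
```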
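Building TensorFlow from Source (with CUDA 10 Support)
Installing Bazel
~$ echo 'deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8' | sudo tee /etc/apt/sources.list.d/bazel.list
- Because of the Great Firewall, the installation above can be very slow; you can instead download an installer script directly from https://github.com/bazelbuild/bazel/releases. At the moment apt-get install bazel installs the latest version, 0.20.0, but the current TensorFlow 1.12.0 can only be built with bazel 0.19.2.
$ bazel version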
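Since the build depends on one specific Bazel release, it can help to check the installed version before starting. The sketch below parses text shaped like `bazel version` output; the `Build label:` line format is an assumption based on typical Bazel output, and the helper names are hypothetical.

```python
import re

REQUIRED_BAZEL = "0.19.2"  # version the notes above say TensorFlow 1.12.0 builds with


def parse_build_label(version_output):
    """Extract the version from a 'Build label: X.Y.Z' line, or None if absent."""
    match = re.search(r"^Build label:\s*(\S+)", version_output, re.MULTILINE)
    return match.group(1) if match else None


def bazel_matches(version_output, required=REQUIRED_BAZEL):
    """Return True if the reported build label equals the required version."""
    return parse_build_label(version_output) == required


if __name__ == "__main__":
    sample = "Build label: 0.20.0\n"  # illustrative input, not captured output
    print(parse_build_label(sample), bazel_matches(sample))
```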
Downloading the TensorFlow Source
~$ export PATH=/usr/local/cuda/bin:$PATH
- If the error failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error shown below appears, first run apt-get install nvidia-modprobe in a terminal, then reboot the system.
In [1]: import tensorflow as tf
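Running the TensorBoard Visualization Front End
import tensorflow as tf
- After running the sample snippet above, open a terminal and run tensorboard --logdir='./graph' --port=6006. Once its web server is up, the visualization UI can be opened in a browser.
TensorFlow Usage Notes
Reading and Writing TFRecord
import tensorflow as tf
# Define helper functions that convert values to feature types.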
- Read the records back from the TFRecord file using the same format.
# Create a reader and a filename queue for the TFRecord file (TensorFlow 1.x queue API).
reader = tf.TFRecordReader()
filename_queue = tf.train.string_input_producer(['./mnist/output.tfrecords'])
_, serialized_example = reader.read(filename_queue)
# Parse one example using the same feature layout that was used for writing.
features = tf.parse_single_example(
    serialized_example,
    features={
        'pixels': tf.FixedLenFeature([], tf.int64),
        'label': tf.FixedLenFeature([], tf.int64),
        'image_raw': tf.FixedLenFeature([], tf.string),
    })
images = tf.decode_raw(features['image_raw'], tf.uint8)
labels = tf.cast(features['label'], tf.int32)
pixels = tf.cast(features['pixels'], tf.int32)
with tf.Session() as sess:
    # Start the queue-runner threads that feed the filename queue.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(10):
        image, label, pixel = sess.run([images, labels, pixels])
    coord.request_stop()
    coord.join(threads)
Reading a Raw Image
import matplotlib.pyplot as plt
- Resizing the image
Method value | Resize algorithm |
---|---|
0 | Bilinear interpolation |
1 | Nearest neighbor interpolation |
2 | Bicubic interpolation |
3 | Area interpolation |
with tf.Session() as sess:
Cropping and Padding an Image
with tf.Session() as sess:
    # Crop (if the image is larger) or zero-pad (if it is smaller) around the center.
    cropped = tf.image.resize_image_with_crop_or_pad(img_data, 300, 300)
    padded = tf.image.resize_image_with_crop_or_pad(img_data, 520, 520)
    plt.imshow(cropped.eval())
    plt.show()
    plt.imshow(padded.eval())
    plt.show()
Cropping the Central 50% Region
with tf.Session() as sess:
    central_cropped = tf.image.central_crop(img_data, 0.5)
    plt.imshow(central_cropped.eval())
    plt.show()
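To see what a central crop of 50% means in pixel terms, the sketch below computes the crop window with plain arithmetic on a small nested list. It illustrates the idea under a simple rounding assumption and is not TensorFlow's actual implementation; `central_crop_bounds` and `central_crop` are hypothetical helpers.

```python
def central_crop_bounds(size, fraction):
    """Return (start, length) of a centered window covering `fraction` of `size`."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    start = int((size - size * fraction) / 2)  # pixels trimmed from the leading edge
    length = size - 2 * start                  # the same amount is trimmed from the trailing edge
    return start, length


def central_crop(image, fraction):
    """Crop a 2-D list of pixels to its central region."""
    top, height = central_crop_bounds(len(image), fraction)
    left, width = central_crop_bounds(len(image[0]), fraction)
    return [row[left:left + width] for row in image[top:top + height]]


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(central_crop(image, 0.5))  # prints [[5, 6], [9, 10]]
```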
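Installing and Using Keras
- Link:
- Keras provides a high-level Python API that helps you quickly build and train your own deep learning models. Its backend is TensorFlow or Theano. It is minimalist, and its modular approach makes building and running neural networks lightweight.
In [1]: import keras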
Using TensorFlow backend.
In [2]: keras.__version__
Out[2]: '2.2.4'
In [3]: !cat /home/lcy/.keras/keras.json
{
"floatx": "float32",
"epsilon": 1e-07,
"backend": "tensorflow",
"image_data_format": "channels_last"
}
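The backend reported in the session above is read from keras.json. A minimal sketch that loads such a file and returns the backend entry, using only the standard library; the default path mirrors the ~/.keras/keras.json location seen above, and `read_keras_backend` is a hypothetical helper.

```python
import json
import os


def read_keras_backend(config_path=None):
    """Return the 'backend' entry of a keras.json file, or None if it is not set."""
    if config_path is None:
        config_path = os.path.join(os.path.expanduser("~"), ".keras", "keras.json")
    with open(config_path) as handle:
        config = json.load(handle)
    return config.get("backend")
```

Calling `read_keras_backend()` with no argument reads the current user's file; passing a path makes the function easy to try on a copy.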
Keras MNIST Handwritten Digit Test
import numpy as np
Errors
Importing the tkinter Module
- The tkinter module cannot be imported, and fixing this can be somewhat tedious. The error is as follows:
In [4]: import matplotlib.pyplot as plt
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-4-a0d2faabd9e9> in <module>()
----> 1 import matplotlib.pyplot as plt
[...]
ModuleNotFoundError: No module named '_tkinter'
- The fix is as follows. Python must be rebuilt after tk-dev is installed so that the _tkinter extension gets compiled:
~$ apt-get install tk-dev
~$ pyenv uninstall 3.6.6
~$ pyenv install 3.6.6
~$ pyenv virtualenv py3dev
~$ pyenv activate py3dev
~$ python -m tkinter # Test the module.
Importing the ggplot Module
- Importing the ggplot package with from ggplot import * fails with the following error:
~/.pyenv/versions/3.6.6/envs/py3dev/lib/python3.6/site-packages/ggplot/stats/smoothers.py in <module>
- Fix: edit the file .../site-packages/ggplot/stats/smoothers.py, change the original from pandas.lib import Timestamp to from pandas import Timestamp, and save.
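The manual edit above can also be scripted. The sketch below applies the same one-line replacement to a file's text; `patch_source` and `patch_file` are hypothetical helpers, and the path to smoothers.py has to be supplied by the caller because it depends on the local environment.

```python
OLD_IMPORT = "from pandas.lib import Timestamp"
NEW_IMPORT = "from pandas import Timestamp"


def patch_source(text):
    """Replace the outdated pandas import and report whether anything changed."""
    patched = text.replace(OLD_IMPORT, NEW_IMPORT)
    return patched, patched != text


def patch_file(path):
    """Apply the fix to a file in place; returns True if the file was modified."""
    with open(path) as handle:
        original = handle.read()
    patched, changed = patch_source(original)
    if changed:
        with open(path, "w") as handle:
            handle.write(patched)
    return changed
```

Running `patch_file` a second time returns False, so the script is safe to repeat.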
Installing the Kaggle API
- Kaggle API
- Register an account on www.kaggle.com and open the account management page. The API section has two buttons, Create New API Token and Expire API Token. Clicking Create New API Token makes the browser download a file named kaggle.json and shows a toast: Ensure kaggle.json is in the location ~/.kaggle/kaggle.json to use the API.
~$ pip install kaggle
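The toast message above says where the token must live. A minimal sketch that copies a downloaded kaggle.json into place follows; `install_kaggle_token` is a hypothetical helper, and restricting the file to the owner is an added precaution assumed here because the file holds an API key.

```python
import os
import shutil


def install_kaggle_token(downloaded_path, home=None):
    """Copy kaggle.json into <home>/.kaggle and restrict it to the owner."""
    if home is None:
        home = os.path.expanduser("~")
    target_dir = os.path.join(home, ".kaggle")
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, "kaggle.json")
    shutil.copyfile(downloaded_path, target)
    os.chmod(target, 0o600)  # assumption: owner-only access for the API key
    return target
```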
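Downloading Data
- Go to https://www.kaggle.com/competitions and open a specific competition. At the bottom of the page there is a dialog you must accept, I Understand and Accept; otherwise the data for that competition cannot be downloaded.
# Download the data to a specified directory.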
Using FFmpeg with CUDA Hardware Encoding and Decoding
- Install necessary packages.
~$ sudo apt-get install nvidia-cuda-toolkit nvidia-cuda-toolkit-gcc yasm cmake libtool \
- For a list of supported GPUs, refer to https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new. For example:
- GeForce MX350 and GeForce GT1030 do not have any encoder.
- GeForce GT1030 only has MPEG-1, MPEG-2, VC-1, H.264 and H.265 (4:2:0) decoders.
- FFmpeg with NVIDIA GPU acceleration is supported on all Linux platforms.
- To compile FFmpeg on Linux, do the following:
~$ git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git # or mirror https://github.com/FFmpeg/nv-codec-headers
- Clone FFmpeg's public Git repository.
~$ git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg/
- If CUDA is installed in /usr/local/cuda, you need to append the following to the configure command.
~$ ./configure ....
Runtime Error
~$ ffmpeg -hwaccel cuda -hwaccel_output_format cuda -f v4l2 -i /dev/video0 -c:a copy -c:v h264_nvenc -b:v 5M output.mp4 -y -loglevel debug
- Reinstall nv-codec-headers from a branch that matches the installed version.
~$ dpkg -l | grep "cuda"
- Then reinstall ffmpeg.
In file included from libavutil/hwcontext_cuda.c:27:
Installing VA-API Support
- nvidia-vaapi-driver
- On Debian 12 (bookworm), nvidia-vaapi-driver can be installed directly.
~$ sudo apt-get install libnvcuvid1 libgstreamer-plugins-bad1.0-dev \
- To compile FFmpeg on Linux, do the following:
~$ git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
~$ git clone https://github.com/elFarto/nvidia-vaapi-driver
- Run
LIBGL_DEBUG=verbose
- Detailed error output from vainfo
~$ NVD_LOG=1 vainfo
- The error above appears after resuming from hibernate; the nvidia_uvm kernel module needs to be reloaded.
sudo rmmod nvidia_uvm
- Create a systemd service that reapplies the settings and reloads the module after hibernate.
~$ cat /etc/pm/sleep.d/after-hibernate.sh
systemd service
~$ cat /etc/systemd/system/rfh.service
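Before and after reloading nvidia_uvm as described above, it can be useful to confirm whether the module is actually loaded. The sketch below checks text in the format of /proc/modules, where the first column is the module name; the helper names and the sample line are hypothetical.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        parts = line.split()
        if parts:
            names.add(parts[0])  # the first column is the module name
    return names


def is_loaded(module, proc_modules_text):
    """Return True if `module` appears in the given /proc/modules-style text."""
    return module in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    # Illustrative sample line, not captured from a real system.
    sample = "nvidia_uvm 1234 0 - Live 0x0000000000000000\n"
    print(is_loaded("nvidia_uvm", sample))
```

On a Linux host, the text can come from `open("/proc/modules").read()`.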
Using GStreamer
Nvidia Hardware accelerated video Encoding/Decoding (nvcodec) — GStreamer
- The vainfo output above is from an NVIDIA Corporation GP108 [GeForce GT 1030] card. It supports only the decoders listed above and no hardware encoding at all. Next, look at what gstreamer supports.
export LIBVA_DRIVER_NAME=nvidia
- As shown above, only the listed decoders are supported, and nvidia-vaapi-driver does not yet support vaapipostproc: VA-API video postprocessing, so vaapisink cannot yet be used for hardware-decoded playback.
GST_DEBUG=nvdec*:6,nvenc*:6 gst-inspect-1.0 nvdec
~$ gst-inspect-1.0 nvcodec
Plugin Details:
Name nvcodec
Description GStreamer NVCODEC plugin
Filename /lib/x86_64-linux-gnu/gstreamer-1.0/libgstnvcodec.so
Version 1.22.0
License LGPL
Source module gst-plugins-bad
Documentation https://gstreamer.freedesktop.org/documentation/nvcodec/
Source release date 2023-01-23
Binary package GStreamer Bad Plugins (Debian)
Origin URL https://tracker.debian.org/pkg/gst-plugins-bad1.0
cudaconvert: CUDA colorspace converter
cudaconvertscale: CUDA colorspace converter and scaler
cudadownload: CUDA downloader
cudascale: CUDA video scaler
cudaupload: CUDA uploader
nvh264dec: NVDEC h264 Video Decoder
nvh264sldec: NVDEC H.264 Stateless Decoder
nvh265dec: NVDEC h265 Video Decoder
nvh265sldec: NVDEC H.265 Stateless Decoder
nvjpegdec: NVDEC jpeg Video Decoder
nvmpeg2videodec: NVDEC mpeg2video Video Decoder
nvmpeg4videodec: NVDEC mpeg4video Video Decoder
nvmpegvideodec: NVDEC mpegvideo Video Decoder
nvvp9dec: NVDEC vp9 Video Decoder
nvvp9sldec: NVDEC VP9 Stateless Decoder
Testing Hardware-Decoded Playback of an h264 File
~$ sudo apt-get install gstreamer1.0-plugins-base-apps
- Decoding with NVDEC H.264 Stateless Decoder:
export LIBVA_DRIVER_NAME=nvidia
- With the libav h264 decoder, CPU usage is about 10% higher than above.
~$ gst-launch-1.0 filesrc location=test4.mp4 ! parsebin ! avdec_h264 ! videoconvert ! xvimagesink
- Use glimagesink to test speed; fpsdisplaysink shows the current frame rate.
~$ sudo apt-get install gstreamer1.0-gl
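The element list printed by gst-inspect-1.0 nvcodec earlier in this section can be filtered programmatically, for example to see which decoders a machine offers. The sketch below parses `name: description` lines from captured output; the function names are hypothetical, and the parsing rule (single-token name followed by a colon and a space) is an assumption based on the output shown above.

```python
def list_elements(inspect_output):
    """Map element name -> description from 'name: description' lines."""
    elements = {}
    for line in inspect_output.splitlines():
        name, sep, description = line.strip().partition(": ")
        # Element names are single tokens; header lines such as
        # 'Plugin Details:' have no description after the colon.
        if sep and name and " " not in name:
            elements[name] = description.strip()
    return elements


def decoders(inspect_output):
    """Return the element names whose description mentions 'Decoder'."""
    return sorted(
        name
        for name, description in list_elements(inspect_output).items()
        if "Decoder" in description
    )
```

The captured text could come from `subprocess.run(["gst-inspect-1.0", "nvcodec"], capture_output=True, text=True).stdout` on a host with GStreamer installed.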
Error Log Analysis
- The error below occurs because /usr/local/include/va already exists on my machine, and its version is probably older and lacks these structs. The /usr/include/va directory, by contrast, was created by the system libva-dev package, and its structs meet the requirements. meson likely finds /usr/local/include/va first, so #include <va/va.h> resolves there and /usr/include/va/va.h is ignored.
~$ cc -Invidia_drv_video.so.p -I. -I.. -I../nvidia-include -I/usr/local/include -I/usr/include -I/usr/include/libdrm -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/x86_64-linux-gnu -fvisibility=hidden -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -std=gnu11 -g -Wno-missing-field-initializers -Wno-unused-parameter -Werror=format -Werror=init-self -Werror=int-conversion -Werror=missing-declarations -Werror=missing-prototypes -Werror=pointer-arith -Werror=undef -Werror=vla -Wsuggest-attribute=format -Wwrite-strings -fPIC -pthread -MD -MQ nvidia_drv_video.so.p/src_h264.c.o -MF nvidia_drv_video.so.p/src_h264.c.o.d -o nvidia_drv_video.so.p/src_h264.c.o -c ../src/h264.c^C
- The following error is likewise caused by a local /usr/local/include/EGL directory.
FAILED: nvidia_drv_video.so.p/src_export-buf.c.o
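Both failures above are cases of header shadowing: the compiler searches the -I directories in order and stops at the first match. The sketch below mimics that lookup so you can see which copy of a header wins; `resolve_include` is a hypothetical helper and only models the simple first-match rule.

```python
import os


def resolve_include(header, include_dirs):
    """Return the first directory in include_dirs that contains header, else None."""
    for directory in include_dirs:
        if os.path.isfile(os.path.join(directory, header)):
            return directory
    return None


if __name__ == "__main__":
    # Same relative order as the -I flags in the compile command above.
    print(resolve_include("va/va.h", ["/usr/local/include", "/usr/include"]))
```

If the result is /usr/local/include, the stale copy there shadows the one installed by libva-dev.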
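Nvidia-drm Errors
~$ dmesg
- The error above is caused by the setting options nvidia-drm modeset=1. Make sure this setting does not appear in the files below, or in any file under /etc/modprobe.d or /usr/lib/modprobe.d.
~$ grep --include=*.conf -rnw /usr/lib/ -e "nvidia-drm"
- The error above may be caused by an incompatibility with system libraries, or by a reinstall that failed to overwrite old files. Perform the following three steps:
- sudo apt-get remove -y policykit-1; dpkg --purge policykit-1;
- Manually delete /etc/policykit-1, and carefully confirm that the commands above removed everything.
- sudo apt-get install policykit-1 -y;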
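For the modeset check described earlier in this section, the sketch below scans modprobe configuration directories for active `options nvidia-drm ... modeset=1` lines. The function name and regular expression are hypothetical; the pattern accepts both `nvidia-drm` and `nvidia_drm`, since modprobe treats the two spellings as the same module, and it skips commented-out lines.

```python
import os
import re

# modprobe treats '-' and '_' in module names as equivalent.
MODESET_RE = re.compile(r"^\s*options\s+nvidia[-_]drm\b.*\bmodeset=1\b")


def find_modeset_lines(conf_dirs):
    """Yield (path, line_number, line) for each active modeset=1 option line."""
    for conf_dir in conf_dirs:
        if not os.path.isdir(conf_dir):
            continue
        for name in sorted(os.listdir(conf_dir)):
            if not name.endswith(".conf"):
                continue
            path = os.path.join(conf_dir, name)
            with open(path, errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if MODESET_RE.search(line):
                        yield path, number, line.rstrip("\n")


if __name__ == "__main__":
    for hit in find_modeset_lines(["/etc/modprobe.d", "/usr/lib/modprobe.d"]):
        print("%s:%d: %s" % hit)
```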