
Setting Up a Machine Learning (ML) Environment

Introduction to Machine Learning Frameworks

CUDA

NVIDIA CUDA Toolkit Release Notes
Linux Installation Guide

Local System

~$ nvidia-debugdump -l
Found 1 NVIDIA devices
Device ID: 0
Device name: Quadro P600 (*PrimaryCard)
GPU internal ID: 0422018092726

~$ cat /etc/*release
PRETTY_NAME="Debian GNU/Linux 9 (stretch)"
NAME="Debian GNU/Linux"
VERSION_ID="9"
VERSION="9 (stretch)"
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

~$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/6/lto-wrapper
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 6.3.0-18+deb9u1' --with-bugurl=file:///usr/share/doc/gcc-6/README.Bugs --enable-languages=c,ada,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-6 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-6-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-6-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-6-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)

Installing CUDA

  • Choose a suitable version to install according to the official guide.

Table 1. CUDA Toolkit and Compatible Driver Versions

CUDA Toolkit | Linux x86_64 Driver Version | Windows x86_64 Driver Version
CUDA 10.0.130 | >= 410.48 | >= 411.31
CUDA 9.2 (9.2.148 Update 1) | >= 396.37 | >= 398.26
CUDA 9.2 (9.2.88) | >= 396.26 | >= 397.44
CUDA 9.1 (9.1.85) | >= 390.46 | >= 391.29
CUDA 9.0 (9.0.76) | >= 384.81 | >= 385.54
CUDA 8.0 (8.0.61 GA2) | >= 375.26 | >= 376.51
CUDA 8.0 (8.0.44) | >= 367.48 | >= 369.30
CUDA 7.5 (7.5.16) | >= 352.31 | >= 353.66
CUDA 7.0 (7.0.28) | >= 346.46 | >= 347.62
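The constraints in Table 1 are simple per-toolkit driver minimums, so they can be encoded as a small lookup. This is an illustrative sketch (the dict and helper are not part of any NVIDIA tooling; data copied from a few rows of the table above):

```python
# Minimum Linux x86_64 driver per CUDA toolkit, copied from Table 1.
CUDA_MIN_DRIVER = {
    "10.0.130": "410.48",
    "9.2.148": "396.37",
    "9.0.76": "384.81",
    "8.0.61": "375.26",
}

def compatible_toolkits(driver):
    """Return the CUDA toolkits an installed driver version satisfies."""
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return [cuda for cuda, min_drv in CUDA_MIN_DRIVER.items()
            if as_tuple(driver) >= as_tuple(min_drv)]

print(compatible_toolkits("410.48"))  # every toolkit in the dict
print(compatible_toolkits("396.37"))  # everything except CUDA 10.0
```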
~$ dpkg -l | grep "nvidia" | awk '{print $2}' | xargs sudo dpkg --purge
  • Test the installation
~$ nvidia-smi
Fri Nov 23 11:00:29 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48 Driver Version: 410.48 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P600 Off | 00000000:01:00.0 On | N/A |
| 34% 31C P8 N/A / N/A | 401MiB / 1999MiB | 4% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 898 G /usr/lib/xorg/Xorg 156MiB |
| 0 13787 G ...uest-channel-token=15869920746181936845 96MiB |
| 0 16550 G ...-token=D890EF91A7BB8E03F6D8D7795CC12E48 145MiB |
+-----------------------------------------------------------------------------+

  • Install the matching version of cuDNN (the archive below is the cuDNN tarball for CUDA 10.0).
~$ tar xvf cudnn-10.0-linux-x64-v7.4.1.5.tgz -C /usr/local

Installing the Nvidia-tesla Driver on Debian Buster

~$ apt-get install nvidia-tesla-460-kernel-dkms nvidia-tesla-460-driver libnvidia-tesla-460-cuda1 nvidia-xconfig nvidia-tesla-460-smi
~$ nvidia-smi
Sat Mar 27 21:01:55 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GT 1030 On | 00000000:05:00.0 On | N/A |
| N/A 38C P5 N/A / 30W | 449MiB / 2000MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3365 G /usr/lib/xorg/Xorg 326MiB |
| 0 N/A N/A 10847 G ...AAAAAAAAA= --shared-files 22MiB |
| 0 N/A N/A 15542 G ...chael/firefox/firefox-bin 0MiB |
| 0 N/A N/A 16429 G ...AAAAAAAA== --shared-files 97MiB |
+-----------------------------------------------------------------------------+

~$ sudo nvidia-xconfig # Creates /etc/X11/xorg.conf; editing it by hand can leave the GUI unable to start. Alternatively, use vdpauinfo.

DKMS Driver Build Errors

/var/lib/dkms/nvidia-tesla-460/460.91.03/build/common/inc/nv-misc.h:20:12: fatal error: stddef.h: No such file or directory
   20 | #include <stddef.h> // NULL
      |            ^~~~~~~~~~

  • The error above occurs because the header files under /usr/src/<linux-5.17-SRC>/include/linux are not on the include path. Below is a combined driver patch set; the patch file must be registered in /usr/src/nvidia-tesla-460-460.91.03/dkms.conf.
~$ cat nvidia-tesla-460-460.91.03/patches/nvidia-tesla-460-linux-5.17-combind.patch
diff -u a/Kbuild b/Kbuild
--- a/Kbuild 2021-07-02 14:04:57.000000000 +0800
+++ a/Kbuild 2022-05-15 12:38:09.968486119 +0800
@@ -68,7 +68,7 @@

EXTRA_CFLAGS += -I$(src)/common/inc
EXTRA_CFLAGS += -I$(src)
-EXTRA_CFLAGS += -Wall -MD $(DEFINES) $(INCLUDES) -Wno-cast-qual -Wno-error -Wno-format-extra-args
+EXTRA_CFLAGS += -Wall -MD $(DEFINES) $(INCLUDES) -Wno-cast-qual -Wno-error -Wno-format-extra-args -I./include/linux
EXTRA_CFLAGS += -D__KERNEL__ -DMODULE -DNVRM -DNV_VERSION_STRING=\"460.91.03\" -Wno-unused-function -Wuninitialized -fno-strict-aliasing -mno-red-zone -mcmodel=kernel -DNV_UVM_ENABLE
EXTRA_CFLAGS += $(call cc-option,-Werror=undef,)
EXTRA_CFLAGS += -DNV_SPECTRE_V2=$(NV_SPECTRE_V2)

diff -ruN a/nvidia-uvm/uvm_linux.h b/nvidia-uvm/uvm_linux.h
--- a/nvidia-uvm/uvm_linux.h 2021-07-02 14:07:31.000000000 +0800
+++ b/nvidia-uvm/uvm_linux.h 2021-09-04 00:24:32.426673346 +0800
@@ -485,7 +485,7 @@
#elif (NV_WAIT_ON_BIT_LOCK_ARGUMENT_COUNT == 4)
static __sched int uvm_bit_wait(void *word)
{
- if (signal_pending_state(current->state, current))
+ if (signal_pending_state(current->__state, current))
return 1;
schedule();
return 0;

diff -u nvidia-tesla-460-460.91.03{,.old}/nvidia-drm/nvidia-drm-format.c
--- nvidia-tesla-460-460.91.03/nvidia-drm/nvidia-drm-format.c 2021-07-02 14:07:31.000000000 +0800
+++ nvidia-tesla-460-460.91.03.old/nvidia-drm/nvidia-drm-format.c 2022-05-15 15:17:23.498152286 +0800
@@ -29,6 +29,7 @@
#endif
#include <linux/kernel.h>

+#include "nvidia-uvm/uvm_linux.h"
#include "nvidia-drm-format.h"
#include "nvidia-drm-os-interface.h"

diff -u nvidia-tesla-460-460.91.03/common/inc/nv-procfs.h nvidia-tesla-460-460.91.03.old/common/inc/nv-procfs.h
--- nvidia-tesla-460-460.91.03/common/inc/nv-procfs.h 2021-07-02 14:07:32.000000000 +0800
+++ nvidia-tesla-460-460.91.03.old/common/inc/nv-procfs.h 2022-05-15 15:52:20.475063183 +0800
@@ -11,6 +11,11 @@
#define _NV_PROCFS_H

#include "conftest.h"
+#include <linux/version.h>
+#if (LINUX_VERSION_CODE > KERNEL_VERSION(5,17,0))
+#define NV_PDE_DATA_PRESENT
+#define PDE_DATA(inode) pde_data(inode)
+#endif

#ifdef CONFIG_PROC_FS
#include <linux/proc_fs.h>
diff --git a/common/inc/nv-time.h b/common/inc/nv-time.h
index dc80806..cc343a5 100644
--- a/common/inc/nv-time.h
+++ b/common/inc/nv-time.h
@@ -23,6 +23,7 @@
#ifndef __NV_TIME_H__
#define __NV_TIME_H__

+#include <linux/version.h>
#include "conftest.h"
#include <linux/sched.h>
#include <linux/delay.h>
@@ -205,7 +206,12 @@ static inline NV_STATUS nv_sleep_ms(unsigned int ms)
// the requested timeout has expired, loop until less
// than a jiffie of the desired delay remains.
//
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(5, 14, 0))
current->state = TASK_INTERRUPTIBLE;
+#else
+ // Rel. commit "sched: Change task_struct::state" (Peter Zijlstra, Jun 11 2021)
+ WRITE_ONCE(current->__state, TASK_INTERRUPTIBLE);
+#endif
do
{
schedule_timeout(jiffies);

diff --git a/nvidia-drm/nvidia-drm-drv.c b/nvidia-drm/nvidia-drm-drv.c
index 84d4479..99ea552 100644
--- a/nvidia-drm/nvidia-drm-drv.c
+++ b/nvidia-drm/nvidia-drm-drv.c
@@ -20,6 +20,7 @@
* DEALINGS IN THE SOFTWARE.
*/

+#include <linux/version.h>
#include "nvidia-drm-conftest.h" /* NV_DRM_AVAILABLE and NV_DRM_DRM_GEM_H_PRESENT */

#include "nvidia-drm-priv.h"
@@ -903,9 +904,12 @@ static void nv_drm_register_drm_device(const nv_gpu_info_t *gpu_info)

dev->dev_private = nv_dev;
nv_dev->dev = dev;
+#if (LINUX_VERSION_CODE < KERNEL_VERSION(5, 14, 0))
+ // Rel. commit "drm: Remove pdev field from struct drm_device" (Thomas Zimmermann, 3 May 2021)
if (device->bus == &pci_bus_type) {
dev->pdev = to_pci_dev(device);
}
+#endif

/* Register DRM device to DRM sub-system */
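To have DKMS apply the patch automatically on every kernel build, it can be declared in dkms.conf. A sketch of the relevant lines (the index and the optional kernel-version regex follow standard dkms.conf conventions; the exact entry may differ for your package):

```shell
# /usr/src/nvidia-tesla-460-460.91.03/dkms.conf (excerpt)
# DKMS looks for patch files in the "patches/" subdirectory of the source tree.
PATCH[0]="nvidia-tesla-460-linux-5.17-combind.patch"
# Optionally apply the patch only to matching kernel versions:
PATCH_MATCH[0]="^5\.1[7-9]"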

Installing PyCuda

Installing Nvidia Docker

nvidia-docker

  • Install as follows, per the documentation linked above.
    # If you have nvidia-docker 1.0 installed: we need to remove it and all existing GPU containers
    docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
    sudo apt-get purge -y nvidia-docker

    # Add the package repositories
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
    sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update

    # Install nvidia-docker2 and reload the Docker daemon configuration
    sudo apt-get install -y nvidia-docker2
    sudo pkill -SIGHUP dockerd # Important if Docker was already installed: this reloads the running daemon's configuration.

Installing TensorFlow GPU with Docker

~$ docker pull tensorflow/tensorflow:latest-gpu
~$ nvidia-docker run -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
2018-11-23 03:45:33.118841: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-23 03:45:33.196995: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node,so returning NUMA node zero
2018-11-23 03:45:33.197568: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Quadro P600 major: 6 minor: 1 memoryClockRate(GHz): 1.5565
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.57GiB
[...]

Nvidia GPU Cloud (NGC) Containers

Install and run the Caffe2 container:

~$ docker run --runtime=nvidia -it caffe2ai/caffe2:latest python -m caffe2.python.operator_test.relu_op_test
Trying example: test_relu(self=<__main__.TestRelu testMethod=test_relu>, X=array([[[-0.42894608],
[-0.65820682],
[ 0.39978197],
[...]

  • Install from Nvidia GPU Cloud (NGC)

    ~$ docker pull nvcr.io/nvidia/caffe2:18.08-py3
    # Run the test.
    ~$ nvidia-docker run --runtime=nvidia -it nvcr.io/nvidia/caffe2:18.08-py3 python -m caffe2.python.operator_test.relu_op_test
  • The example below runs a Jupyter notebook inside Docker, mapping the container's port 8888 to port 9999 on the host; the notebook can then be reached from a browser on the host at http://127.0.0.1:9999/.

  • --rm removes the container when it exits

  • -it runs in interactive mode with a TTY

  • -v maps a host directory into the container. In the command above, the host directory /data/AI-DIR/TensorFlow/jupyter-notebook is mapped to /data/jupyter inside the container.

~$ nvidia-docker run --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /data/AI-DIR/TensorFlow/jupyter-notebook:/data/jupyter -it -p 9999:8888 nvcr.io/nvidia/caffe2:18.08-py3 sh -c "jupyter notebook --no-browser --allow-root --ip 0.0.0.0 /data/jupyter"

============
== Caffe2 ==
============

NVIDIA Release 18.08 (build 599137)

Container image Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
[...]

Installing PyTorch

~$ docker pull nvcr.io/nvidia/pytorch:18.11-py3

Installing PyTorch from Source

~$ pyenv activate py3dev  # Enter the Python 3.6 virtualenv via pyenv.
~$ pip install numpy pyyaml mkl mkl-include setuptools cmake cffi typing # Install the dependencies in the py3dev environment.
~$ export PATH=/usr/local/cuda-10.0/bin:$PATH
~$ export CUDA=1
~$ pip install pycuda
~$ git clone --recursive https://github.com/pytorch/pytorch
~$ cd pytorch
~$ python setup.py install
  • If numpy fails with libmkl_rt.so: cannot open shared object file: No such file or directory, install libmkl_rt and then reinstall numpy.
  • Caffe2 has now been merged into the PyTorch source tree. If importing the modules below raises **ModuleNotFoundError: No module named 'past'**, first install the dependency with pip install future.
import matplotlib.pyplot as plt
import numpy as np
import os
import shutil
import caffe2.python.predictor.predictor_exporter as pe
from caffe2.python import core,model_helper,net_drawer,workspace,visualize,brew
ModuleNotFoundError: No module named 'past'
  • The warning net_drawer will not run correctly. Please install the correct dependencies. means pydot is missing; install it with pip install pydot.

Installing TensorFlow from Source (with CUDA 10 Support)

Installing Bazel

~$ echo 'deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8' | sudo tee /etc/apt/sources.list.d/bazel.list
~$ curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
~$ sudo apt-get update
~$ sudo apt-get install bazel
  • Because of the Great Firewall, the installation above may be very slow; an installer script can be downloaded directly from https://github.com/bazelbuild/bazel/releases instead. Also note that apt-get install bazel currently installs the latest 0.20.0, but TensorFlow 1.12.0 only builds with bazel 0.19.2.
$ bazel version
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
Build label: 0.19.2
Build target: bazel-out/k8-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Mon Nov 19 16:25:09 2018 (1542644709)
Build timestamp: 1542644709
Build timestamp as int: 1542644709

Downloading the TensorFlow Source

~$ export PATH=/usr/local/cuda/bin:$PATH
~$ export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
~$ git clone https://github.com/tensorflow/tensorflow.git
~$ pyenv activate py3dev
~$ pip install wheel
~$ ./configure
WARNING: --batch mode is deprecated. Please instead explicitly shut down your Bazel server using the command "bazel shutdown".
You have bazel 0.19.2 installed.
Please specify the location of python. [Default is fullpath/.pyenv/versions/py3dev/bin/python]:

Found possible Python library paths:
/fullpath/.pyenv/versions/py3dev/lib/python3.6/site-packages
Please input the desired Python library path to use. Default is [/fullpath/.pyenv/versions/py3dev/lib/python3.6/site-packages]
Do you wish to build TensorFlow with XLA JIT support? [Y/n]: Y
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: N
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with ROCm support? [y/N]: N
No ROCm support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Please specify the CUDA SDK version you want to use. [Leave empty to default to CUDA 9.0]: 10.0
Please specify the location where CUDA 10.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7]:
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:

Do you wish to build TensorFlow with TensorRT support? [y/N]: N
No TensorRT support will be enabled for TensorFlow.

Please specify the locally installed NCCL version you want to use. [Default is to use https://github.com/nvidia/nccl]:

Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,7.0]:

Do you want to use clang as CUDA compiler? [y/N]: N
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:
Do you wish to build TensorFlow with MPI support? [y/N]: N
No MPI support will be enabled for TensorFlow.

Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native -Wno-sign-compare]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: y
Please specify the home path of the Android NDK to use. [Default is /fullpath/Android/Sdk/ndk-bundle]:
Please specify the home path of the Android SDK to use. [Default is /fullpath/Android/Sdk]:
Please specify the Android SDK API level to use. [Available levels: ['13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28']] [Default is 28]:

Please specify an Android build tools version to use. [Available versions: ['21.1.2', '23.0.3', '24.0.3', '25.0.0', '25.0.2', '25.0.3', '26.0.2', '27.0.0', '27.0.3', '28.0.0-rc2', '28.0.2', '28.0.3']] [Default is 28.0.3]:

Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See .bazelrc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
--config=gdr # Build with GDR support.
--config=verbs # Build with libverbs support.
--config=ngraph # Build with Intel nGraph support.
--config=dynamic_kernels # (Experimental) Build kernels into separate shared objects.
Preconfigured Bazel build configs to DISABLE default on features:
--config=noaws # Disable AWS S3 filesystem support.
--config=nogcp # Disable GCP support.
--config=nohdfs # Disable HDFS support.
--config=noignite # Disable Apacha Ignite support.
--config=nokafka # Disable Apache Kafka support.
--config=nonccl # Disable NVIDIA NCCL support.
Configuration finished

~$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

~$ ./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg # Build the Python pip package.
Wed Dec 5 10:54:27 CST 2018 : === Preparing sources in dir: /tmp/tmp.OZdwkuc2YO
~/github/tensorflow ~/github/tensorflow
[...]

~$ pip install /tmp/tensorflow_pkg/tensorflow-1.12.0rc0-cp36-cp36m-linux_x86_64.whl # Install the wheel.
~$ LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda-10.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH jupyter notebook # Test TensorFlow from a notebook.
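The jupyter invocation above prepends the CUDA libraries to LD_LIBRARY_PATH before launching. A tiny, illustrative helper for checking that the loader path includes them (hypothetical, not part of any CUDA tooling):

```python
import os

# Hypothetical sanity check: is a CUDA lib64 directory on the loader path?
def has_cuda_libs(ld_library_path):
    return any(p.startswith("/usr/local/cuda") and p.endswith("lib64")
               for p in ld_library_path.split(":"))

print(has_cuda_libs(os.environ.get("LD_LIBRARY_PATH", "")))
```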
  • If the error failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error below appears, run apt-get install nvidia-modprobe in a terminal and reboot the system.
In [1]: import tensorflow as tf
In [2]: tf.test.is_built_with_cuda()
Out[2]: True
In [3]: tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
2018-12-05 12:03:06.128401: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2018-12-05 12:03:06.128442: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:161] retrieving CUDA diagnostic information for host: debian
2018-12-05 12:03:06.128448: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:168] hostname: debian
2018-12-05 12:03:06.128470: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:192] libcuda reported version is: 410.48.0
2018-12-05 12:03:06.128488: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:196] kernel reported version is: 410.48.0
2018-12-05 12:03:06.128493: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version seems to match DSO: 410.48.0
Out[3]: False

In [4]: tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
Out[4]: False

# After rebooting, the GPU works normally.

In [1]: import tensorflow as tf

In [2]: tf.Session().list_devices()
2018-12-05 15:02:22.981018: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-12-05 15:02:22.982813: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55d255632230 executing computations on platform CUDA. Devices:
2018-12-05 15:02:22.982835: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): Quadro P600, Compute Capability 6.1
2018-12-05 15:02:22.983889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1431] Found device 0 with properties:
name: Quadro P600 major: 6 minor: 1 memoryClockRate(GHz): 1.5565
pciBusID: 0000:01:00.0
totalMemory: 1.95GiB freeMemory: 1.74GiB
2018-12-05 15:02:22.983931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Adding visible gpu devices: 0
2018-12-05 15:02:22.986678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-05 15:02:22.986711: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-05 15:02:22.986726: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-05 15:02:22.986953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1113] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1560 MB memory) -> physical GPU (device: 0, name: Quadro P600, pci bus id: 0000:01:00.0, compute capability: 6.1)
Out[2]:
[_DeviceAttributes(/job:localhost/replica:0/task:0/device:CPU:0, CPU, 268435456, 4411150611837152607),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_CPU:0, XLA_CPU, 17179869184, 8331037032149977949),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:XLA_GPU:0, XLA_GPU, 17179869184, 1279689307458374322),
_DeviceAttributes(/job:localhost/replica:0/task:0/device:GPU:0, GPU, 1636106240, 7170667474598106347)]

In [3]: tf.test.is_gpu_available(cuda_only=False,min_cuda_compute_capability=None)
2018-12-05 15:05:52.037618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Adding visible gpu devices: 0
2018-12-05 15:05:52.037647: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-12-05 15:05:52.037652: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2018-12-05 15:05:52.037656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2018-12-05 15:05:52.037737: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1113] Created TensorFlow device (/device:GPU:0 with 1560 MB memory) -> physical GPU (device: 0, name: Quadro P600, pci bus id: 0000:01:00.0, compute capability: 6.1)
Out[3]: True

Running the TensorBoard Visualization Front End

import tensorflow as tf
input1 = tf.constant([1.0,2.0,3.0],name='input1')
input2 = tf.constant([2.0,3.0,4.0],name='input2')
output = tf.add_n([input1,input2],name='add')
with tf.Session() as sess:
    writer = tf.summary.FileWriter(graph=sess.graph,logdir='./graph')
    sess.run(output)
  • Run the code snippet above, then run tensorboard --logdir='./graph' --port=6006 in a terminal; once its web server is up, the visualization front end can be reached from a browser.

TensorFlow Usage Notes

Reading and Writing TFRecords

import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
import numpy as np
# Helper functions that wrap values as tf.train.Feature.
def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _bytes_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# ./mnist/data holds the MNIST data downloaded from the web.
mnist = input_data.read_data_sets('./mnist/data',dtype=tf.uint8,one_hot=True)
images = mnist.train.images
labels = mnist.train.labels

pixels = images.shape[1]
num_examples = mnist.train.num_examples

filename = './mnist/output.tfrecords'

# Convert one record into tf.train.Example format.
def _make_example(pixels, label, image):
    image_raw = image.tostring()
    example = tf.train.Example(features=tf.train.Features(feature={
        'pixels': _int64_feature(pixels),
        'label': _int64_feature(np.argmax(label)),
        'image_raw': _bytes_feature(image_raw)
    }))
    return example

with tf.python_io.TFRecordWriter(filename) as writer:
    for index in range(num_examples):
        example = _make_example(pixels,labels[index],images[index])
        writer.write(example.SerializeToString())
print('TFRecord test file written')
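Because the data is loaded with one_hot=True, each label arrives as a 10-element indicator vector, and the np.argmax call in _make_example recovers the digit before it is stored as an int64 feature:

```python
import numpy as np

# A one-hot MNIST label for the digit 3; argmax recovers the class index.
label = np.array([0, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=np.uint8)
print(int(np.argmax(label)))  # 3
```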
  • Read the TFRecord file back with the same feature schema.
    reader = tf.TFRecordReader()
    filename_queue = tf.train.string_input_producer(['./mnist/output.tfrecords'])
    _,serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            'pixels': tf.FixedLenFeature([],tf.int64),
            'label': tf.FixedLenFeature([],tf.int64),
            'image_raw': tf.FixedLenFeature([],tf.string),
        })

    images = tf.decode_raw(features['image_raw'],tf.uint8)
    labels = tf.cast(features['label'],tf.int32)
    pixels = tf.cast(features['pixels'],tf.int32)


    with tf.Session() as sess:
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(coord=coord)
        for i in range(10):
            image,label,pixel = sess.run([images,labels,pixels])
        coord.request_stop()
        coord.join(threads)

Reading Raw Images

import matplotlib.pyplot as plt

image_raw_data = tf.gfile.FastGFile('../img3.png','rb').read()
with tf.Session() as sess:
    img_data = tf.image.decode_png(image_raw_data)
    # Print the decoded three-dimensional matrix.
    print(img_data.eval())
    plt.imshow(img_data.eval())
    plt.show()
    img_data.set_shape([420,420,3])
    print(img_data.get_shape())

  • Resizing images

    | Method value | Resizing algorithm |
    | 0 | Bilinear interpolation |
    | 1 | Nearest neighbor interpolation |
    | 2 | Bicubic interpolation |
    | 3 | Area interpolation |
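To make the table concrete, nearest-neighbor interpolation (method 1) can be sketched in plain NumPy. This toy helper (not TensorFlow's implementation) shows how each output pixel simply copies the nearest source pixel, with no blending between neighbors:

```python
import numpy as np

# Toy nearest-neighbor resize: map each output coordinate back to the
# closest source coordinate and copy that pixel.
def resize_nearest(img, new_h, new_w):
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows[:, None], cols]

img = np.arange(16, dtype=np.uint8).reshape(4, 4)
print(resize_nearest(img, 2, 2).tolist())  # [[0, 2], [8, 10]]
```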

with tf.Session() as sess:
    # If integer data in the 0-255 range is fed directly to resize_images,
    # the output will be real numbers in 0-255, which is inconvenient for
    # later processing. The book recommends converting the image to real
    # numbers in the 0-1 range before resizing.
    image_float = tf.image.convert_image_dtype(img_data,tf.float32)
    resized = tf.image.resize_images(image_float,[400,400],method=0)
    plt.imshow(resized.eval())
    plt.show()
  • Cropping and padding images

    with tf.Session() as sess:
        cropped = tf.image.resize_image_with_crop_or_pad(img_data,300,300)
        padded = tf.image.resize_image_with_crop_or_pad(img_data,520,520)
        plt.imshow(cropped.eval())
        plt.show()
        plt.imshow(padded.eval())
        plt.show()
  • Crop the central 50% region

    with tf.Session() as sess:
        central_cropped = tf.image.central_crop(img_data, 0.5)
        plt.imshow(central_cropped.eval())
        plt.show()
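central_crop keeps only the middle fraction of the image. A plain-NumPy sketch of the idea (a toy stand-in for the TensorFlow op, assuming an H×W image):

```python
import numpy as np

# Toy central crop: trim equal borders so the kept region is `fraction`
# of each spatial dimension.
def central_crop(img, fraction):
    h, w = img.shape[:2]
    dh = int(h * (1 - fraction) / 2)
    dw = int(w * (1 - fraction) / 2)
    return img[dh:h - dh, dw:w - dw]

img = np.zeros((8, 8), dtype=np.uint8)
print(central_crop(img, 0.5).shape)  # (4, 4)
```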

Installing and Using Keras

  • Keras provides high-level Python APIs that help you quickly build and train your own deep learning models, with TensorFlow or Theano as the backend. It is minimal, and its modular approach makes building and running neural networks lightweight.
    In [1]: import keras
    Using TensorFlow backend.

    In [2]: keras.__version__
    Out[2]: '2.2.4'

    In [3]: !cat /home/lcy/.keras/keras.json
    {
        "floatx": "float32",
        "epsilon": 1e-07,
        "backend": "tensorflow",
        "image_data_format": "channels_last"
    }

Keras MNIST Handwritten Digit Test

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import keras
from keras.models import Sequential,load_model
from keras.layers import Dense,Dropout,Conv2D,Flatten,MaxPooling2D,Activation
from keras.datasets.mnist import load_data
import os
# 清除GPU的会话数据
keras.backend.clear_session()
(X_train,Y_train),(x_test,y_test) = load_data()
X_train = X_train.reshape(X_train.shape[0],28,28,1)
x_test = x_test.reshape(x_test.shape[0],28,28,1)
input_shape=(28,28,1)
X_train = X_train.astype('float32')
x_test = x_test.astype('float32')
X_train /= 255
x_test /= 255
print('x_train shape:',X_train.shape)
print('Number of images in x_train',X_train.shape[0])
print('Number of images in x_test',x_test.shape[0])

# 卷积网络
model = Sequential()
model.add(Conv2D(28,kernel_size=(3,3),input_shape=input_shape))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.02))
model.add(Dense(10))
model.add(Activation('softmax'))

# 编译多分类函数分类器
model.compile(optimizer='adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])
model.fit(x=X_train,y=Y_train,epochs=10)

# 保存训练模型到HDF5文件里
history = model.fit(X_train,Y_train,batch_size=128,epochs=20,verbose=2,validation_data=(x_test,y_test))
save_dir = './results/'
mode_name='keras_mnist.h5'
mode_path = os.path.join(save_dir,mode_name)
model.save(mode_path)
print('Saved trained model at %s' % mode_path)

# 打印它的训练图形
fig = plt.figure()
plt.subplot(2,1,1)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='lower right')

plt.subplot(2,1,2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper right')

plt.tight_layout()
fig

# 使用一部分测试图片验证模型
mnist_model = load_model('./results/keras_mnist.h5')
loss_and_metrics = mnist_model.evaluate(x_test,y_test,verbose=2)
print('Test Loss',loss_and_metrics[0])
print('Test Accuracy',loss_and_metrics[1])
predicted_classes =mnist_model.predict_classes(x_test)
corrent_indices = np.nonzero(predicted_classes == y_test)[0]
incorrent_indices = np.nonzero(predicted_classes != y_test)[0]
print()
print(len(corrent_indices),' classifed correctly')
print(len(incorrent_indices),' classified incorrectly')

plt.rcParams['figure.figsize'] = (7,14)
figure_evaluation = plt.figure()

# 打印9个正确预测的图片
for i ,correct in enumerate(corrent_indices[:9]):
plt.subplot(6,3,i+1)
plt.imshow(x_test[correct].reshape(28,28),cmap='gray',interpolation='none')
plt.title('Predicted: {},Trutch: {}'.format(predicted_classes[correct],y_test[correct]))
plt.xticks([])
plt.yticks([])
# 打印9个错误预测的图片
for i ,correct in enumerate(incorrent_indices[:9]):
plt.subplot(6,3,i+1)
plt.imshow(x_test[correct].reshape(28,28),cmap='gray',interpolation='none')
plt.title('Predicted: {},Truth: {}'.format(predicted_classes[correct],y_test[correct]))
plt.xticks([])
plt.yticks([])
figure_evaluation


# 读入自己的手动生成的图片做测试

from PIL import Image
from keras_preprocessing.image import img_to_array
from keras_applications import imagenet_utils
data_dir = '/data/AI-DIR/TensorFlow/mnist/test-data/'
list_dir = os.listdir(data_dir)
print(len(list_dir))
image_height = 28
image_width = 28
channels = 1
img_data = np.ndarray(shape=(len(list_dir),image_height,image_width,channels))
label_data = np.zeros(len(list_dir),dtype='uint8')
i = 0
for file in list_dir:
    # Read each image in the directory and convert it to grayscale.
    png = Image.open(os.path.join(data_dir, file), 'r').convert('L')
    # Invert: white background -> 0, strokes -> 255 (MNIST style).
    gray = png.point(lambda x: 0 if x == 255 else 255)
    image = img_to_array(gray)
    img_data[i] = image
    label_data[i] = int(file[4])
    # print('png file: ', file)
    i += 1

print('test_data len',len(img_data))
print('test_data shape',img_data.shape)
print('test label ',label_data)
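The point() call in the loop above binarizes and inverts the pixels so the white background becomes 0 and the strokes become 255, matching MNIST's white-on-black convention. The same transform as a NumPy sketch on a toy array:

```python
import numpy as np

# point(lambda x: 0 if x == 255 else 255) expressed with NumPy:
# pure white (255) -> 0, everything else -> 255.
gray = np.array([[255,  30],
                 [200, 255]], dtype=np.uint8)
inverted = np.where(gray == 255, 0, 255).astype(np.uint8)
print(inverted.tolist())  # [[0, 255], [255, 0]]
```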

loss_and_metrics = mnist_model.evaluate(img_data,label_data,verbose=2)
print('Test Loss',loss_and_metrics[0])
print('Test Accuracy',loss_and_metrics[1])

predicted_classes = mnist_model.predict_classes(img_data)
corrent_indices = np.nonzero(predicted_classes == label_data)[0]
incorrent_indices = np.nonzero(predicted_classes != label_data)[0]
print()
print(len(corrent_indices), ' classified correctly')
print(len(incorrent_indices), ' classified incorrectly')
plt.rcParams['figure.figsize'] = (7,14)
figure_evaluation = plt.figure()

# Show 9 incorrectly predicted images
for i, incorrect in enumerate(incorrent_indices[:9]):
    plt.subplot(6, 3, i + 1)
    plt.imshow(img_data[incorrect].reshape(28, 28), cmap='gray', interpolation='none')
    plt.title('Predicted: {}, Truth: {}'.format(predicted_classes[incorrect], label_data[incorrect]))
    plt.xticks([])
    plt.yticks([])

figure_evaluation

Errors

Importing the tkinter module

  • Importing the tkinter module fails, and installing it can be a bit of a hassle. The error looks like this:

    In [4]: import matplotlib.pyplot as plt
    ---------------------------------------------------------------------------
    ModuleNotFoundError Traceback (most recent call last)
    <ipython-input-4-a0d2faabd9e9> in <module>()
    ----> 1 import matplotlib.pyplot as plt
    [...]
    ModuleNotFoundError: No module named '_tkinter'
  • The fix is as follows:
    ~$ apt-get install tk-dev
    ~$ pyenv uninstall 3.6.6
    ~$ pyenv install 3.6.6
    ~$ pyenv virtualenv py3dev
    ~$ pyenv activate py3dev
 
~$ python -m tkinter # Test that the module loads.

Importing the ggplot module

  • Importing the ggplot package with the from ggplot import * statement fails as follows:
~/.pyenv/versions/3.6.6/envs/py3dev/lib/python3.6/site-packages/ggplot/stats/smoothers.py in <module>
2 unicode_literals)
3 import numpy as np
----> 4 from pandas.lib import Timestamp
5 import pandas as pd
6 import statsmodels.api as sm

ImportError: cannot import name 'Timestamp'

  • Fix: edit the file .../site-packages/ggplot/stats/smoothers.py, change the original from pandas.lib import Timestamp to from pandas import Timestamp, and save.
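The same one-line patch can be applied programmatically; a minimal sketch, demonstrated on a scratch file here (in practice, point path at the real smoothers.py):

```python
import os
import tempfile

# Create a scratch stand-in for smoothers.py; replace `path` with the real
# .../site-packages/ggplot/stats/smoothers.py to patch it in place.
fd, path = tempfile.mkstemp(suffix=".py")
with os.fdopen(fd, "w") as f:
    f.write("from pandas.lib import Timestamp\n")

with open(path) as f:
    src = f.read()
with open(path, "w") as f:
    f.write(src.replace("from pandas.lib import Timestamp",
                        "from pandas import Timestamp"))

with open(path) as f:
    print(f.read().strip())  # from pandas import Timestamp
os.remove(path)
```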

Installing the Kaggle API

  • Kaggle API
  • Register an account on www.kaggle.com and open the account settings page. Under the API section there are two buttons, Create New API Token and Expire API Token. Clicking Create New API Token makes the browser download a file named kaggle.json, and a toast appears: Ensure kaggle.json is in the location ~/.kaggle/kaggle.json to use the API.
~$ pip install kaggle
~$ mkdir ~/.kaggle
~$ mv ~/Download/kaggle.json ~/.kaggle/
~$ chmod 600 ~/.kaggle/kaggle.json
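A small helper to sanity-check the token setup before calling the API (the function name is mine, not part of the kaggle package):

```python
import os
import stat

def kaggle_token_ok(path="~/.kaggle/kaggle.json"):
    """Return True if the token file exists with owner-only (0600) permissions."""
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        return False
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600

print(kaggle_token_ok())
```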

Downloading data

  • Go to the https://www.kaggle.com/competitions page and open a specific competition. At the bottom of the page there is a dialog you must accept, I Understand and Accept; otherwise you cannot download that competition's data.
# Download the data into a given directory.
~$ kaggle competitions download -c traveling-santa-2018-prime-paths -p /fullpath/Traveling-Santa-2018-Prime-Paths/
~$ kaggle competitions list
ref deadline category reward teamCount userHasEntered
--------------------------------------------- ------------------- --------------- --------- --------- --------------
digit-recognizer 2030-01-01 00:00:00 Getting Started Knowledge 2708 True
titanic 2030-01-01 00:00:00 Getting Started Knowledge 10578 False
house-prices-advanced-regression-techniques 2030-01-01 00:00:00 Getting Started Knowledge 4519 False
imagenet-object-localization-challenge 2029-12-31 07:00:00 Research Knowledge 30 False
competitive-data-science-predict-future-sales 2019-12-31 23:59:00 Playground Kudos 1869 False
histopathologic-cancer-detection 2019-03-31 23:59:00 Playground Knowledge 140 False
humpback-whale-identification 2019-02-28 23:59:00 Featured $25,000 144 False
elo-merchant-category-recommendation 2019-02-26 23:59:00 Featured $50,000 630 False
ga-customer-revenue-prediction 2019-02-15 23:59:00 Featured $45,000 1104 False
quora-insincere-questions-classification 2019-02-05 23:59:00 Featured $25,000 1666 False
pubg-finish-placement-prediction 2019-01-30 23:59:00 Playground Swag 857 False
human-protein-atlas-image-classification 2019-01-10 23:59:00 Featured $37,000 1388 False
traveling-santa-2018-prime-paths 2019-01-10 23:59:00 Featured $25,000 958 True
[...]

Using FFmpeg with CUDA hardware encoding/decoding

  • Install necessary packages.
~$ sudo apt-get install nvidia-cuda-toolkit nvidia-cuda-toolkit-gcc  yasm cmake libtool \
libc6 libc6-dev unzip wget libnuma1 libnuma-dev libnvidia-encode1
~$ git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git  # or the mirror https://github.com/FFmpeg/nv-codec-headers
~$ cd nv-codec-headers && sudo make install
  • Clone FFmpeg’s public GIT repository.
~$ git clone https://git.ffmpeg.org/ffmpeg.git ffmpeg/

~$ cd ffmpeg
~$ ./configure --enable-nonfree --enable-cuda-nvcc --enable-libnpp \
--enable-libmp3lame --enable-v4l2-m2m --enable-vdpau --enable-vaapi \
--enable-libdrm --enable-libx264 --enable-libvpx --enable-libwebp \
--enable-libv4l2 --enable-libopus --enable-libopencore-amrnb --enable-libopencore-amrwb \
--enable-librtmp --enable-gpl --enable-version3 --enable-libvorbis \
--disable-doc --disable-htmlpages --disable-manpages --disable-podpages \
--disable-txtpages --enable-shared

# Alternative configuration (AMF instead of CUDA NVCC, shared libraries only):
./configure --enable-nonfree --enable-amf --enable-libnpp \
--enable-libmp3lame --enable-v4l2-m2m --enable-vdpau --enable-vaapi \
--enable-libdrm --enable-libx264 --enable-libvpx --enable-libwebp \
--enable-libv4l2 --enable-libopus --enable-libopencore-amrnb --enable-libopencore-amrwb \
--enable-librtmp --enable-gpl --enable-version3 --enable-libvorbis \
--disable-doc --disable-htmlpages --disable-manpages --disable-podpages \
--disable-txtpages --disable-static --enable-shared

~$ make -j$(nproc) && sudo make install


~$ LD_LIBRARY_PATH=/usr/local/lib ffmpeg -hwaccels
ffmpeg version N-110065-g30cea1d39b Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 10 (Debian 10.2.1-6)
configuration: --enable-nonfree --enable-cuda-nvcc --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --disable-static --enable-shared
libavutil 58. 5.100 / 58. 5.100
libavcodec 60. 6.101 / 60. 6.101
libavformat 60. 4.100 / 60. 4.100
libavdevice 60. 2.100 / 60. 2.100
libavfilter 9. 4.100 / 9. 4.100
libswscale 7. 2.100 / 7. 2.100
libswresample 4. 11.100 / 4. 11.100
Hardware acceleration methods:
vdpau
cuda
vaapi

  • If your CUDA is installed under /usr/local/cuda, you need to append the following to the configure command:
~$ ./configure ....
--extra-cflags=-I/usr/local/cuda/include \
--extra-ldflags=-L/usr/local/cuda/lib64

Runtime errors

~$ ffmpeg  -hwaccel cuda -hwaccel_output_format cuda -f v4l2  -i /dev/video0  -c:a copy -c:v h264_nvenc -b:v 5M output.mp4 -y -loglevel debug
[h264_nvenc @ 0x55aacc4caf00] Driver does not support the required nvenc API version. Required: 12.0 Found: 11.1
[h264_nvenc @ 0x55aacc4caf00] The minimum required Nvidia driver for nvenc is 520.56.06 or newer
[h264_nvenc @ 0x55aacc4caf00] Nvenc unloaded

  • Reinstall nv-codec-headers, checking out a branch/tag that matches the NVENC API version your driver supports.
~$ dpkg -l | grep "cuda"
ii cuda-keyring 1.0-1 all GPG keyring for the CUDA repository
ii libcuda1:amd64 470.161.03-1 amd64 NVIDIA CUDA Driver Library
ii libcudart11.0:amd64 11.2.152~11.2.2-3+deb11u3 amd64 NVIDIA CUDA Runtime Library
ii nvidia-cuda-dev:amd64 11.2.2-3+deb11u3 amd64 NVIDIA CUDA development files
ii nvidia-cuda-toolkit 11.2.2-3+deb11u3 amd64 NVIDIA CUDA development toolkit


~$ cd nv-codec-headers
~$ git tag
n10.0.26.0
n10.0.26.1
n10.0.26.2
n11.0.10.0
n11.0.10.1
n11.0.10.2
n11.1.5.0
n11.1.5.1
n11.1.5.2
n12.0.16.0
[...]
~$ git checkout n11.1.5.2
~$ sudo make install
  • Then rebuild and reinstall FFmpeg. If the checked-out headers are older than what the FFmpeg sources expect, the build fails like this:
In file included from libavutil/hwcontext_cuda.c:27:
libavutil/hwcontext_cuda.c: In function ‘cuda_context_init’:
libavutil/hwcontext_cuda.c:365:28: error: ‘CudaFunctions’ has no member named ‘cuCtxGetCurrent’; did you mean ‘cuCtxPopCurrent’?
365 | ret = CHECK_CU(cu->cuCtxGetCurrent(&hwctx->cuda_ctx));
| ^~~~~~~~~~~~~~~
libavutil/cuda_check.h:65:114: note: in definition of macro ‘FF_CUDA_CHECK_DL’
65 | #define FF_CUDA_CHECK_DL(avclass, cudl, x) ff_cuda_check(avclass, cudl->cuGetErrorName, cudl->cuGetErrorString, (x), #x)
| ^
libavutil/hwcontext_cuda.c:365:15: note: in expansion of macro ‘CHECK_CU’
365 | ret = CHECK_CU(cu->cuCtxGetCurrent(&hwctx->cuda_ctx));
| ^~~~~~~~
make: *** [ffbuild/common.mak:81: libavutil/hwcontext_cuda.o] Error 1
make: *** Waiting for unfinished jobs....
CC libavutil/hwcontext_vaapi.o
In file included from /usr/include/CL/cl.h:20,
from libavutil/hwcontext_opencl.h:25,
from libavutil/hwcontext_opencl.c:30:
/usr/include/CL/cl_version.h:22:9: note: ‘#pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 300 (OpenCL 3.0)’
22 | #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 300 (OpenCL 3.0)")
| ^~~~~~~
STRIP libswscale/x86/output.o

Installing VA-API support

~$ sudo apt-get install libnvcuvid1  libgstreamer-plugins-bad1.0-dev \
meson gstreamer1.0-plugins-bad libva-dev -y
  • To compile FFmpeg on Linux, do the following:
~$ git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
~$ cd nv-codec-headers && sudo make install
~$ git clone https://github.com/elFarto/nvidia-vaapi-driver
~$ cd nvidia-vaapi-driver && meson setup build
~$ meson install -c build
  • Run it:
export LIBGL_DEBUG=verbose
export LIBVA_DRIVER_NAME=nvidia
~$ vainfo
libva info: VA-API version 1.17.0
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
libva info: Found init function __vaDriverInit_1_0
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: VA-API NVDEC driver [egl backend]
vainfo: Supported profile and entrypoints
VAProfileMPEG2Simple : VAEntrypointVLD
VAProfileMPEG2Main : VAEntrypointVLD
VAProfileVC1Simple : VAEntrypointVLD
VAProfileVC1Main : VAEntrypointVLD
VAProfileVC1Advanced : VAEntrypointVLD
VAProfileH264Main : VAEntrypointVLD
VAProfileH264High : VAEntrypointVLD
VAProfileH264ConstrainedBaseline: VAEntrypointVLD
VAProfileHEVCMain : VAEntrypointVLD
VAProfileVP9Profile0 : VAEntrypointVLD
VAProfileHEVCMain10 : VAEntrypointVLD
VAProfileHEVCMain12 : VAEntrypointVLD
VAProfileVP9Profile2 : VAEntrypointVLD
  • Detailed vainfo error output:
~$ NVD_LOG=1 vainfo
libva info: VA-API version 1.17.0
libva info: User environment variable requested driver 'nvidia'
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so
214878.067671198 [2010296-2010296] ../src/vabackend.c: 108 init CUDA ERROR 'unknown error' (999)

libva info: Found init function __vaDriverInit_1_0
214878.067694762 [2010296-2010296] ../src/vabackend.c:1872 __vaDriverInit_1_0 Initialising NVIDIA VA-API Driver: 0x55a1fedadf50 10
214878.067698198 [2010296-2010296] ../src/vabackend.c:1894 __vaDriverInit_1_0 Now have 0 (0 max) instances
214878.067700042 [2010296-2010296] ../src/vabackend.c:1916 __vaDriverInit_1_0 Selecting EGL backend
214878.071761148 [2010296-2010296] ../src/export-buf.c: 150 findGPUIndexFromFd Defaulting to CUDA GPU ID 0. Use NVD_GPU to select a specific CUDA GPU
214878.071770746 [2010296-2010296] ../src/export-buf.c: 163 findGPUIndexFromFd Looking for GPU index: 0
214878.073034516 [2010296-2010296] ../src/export-buf.c: 175 findGPUIndexFromFd Found 3 EGL devices
214878.074061472 [2010296-2010296] ../src/export-buf.c: 229 findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 0
214878.074069096 [2010296-2010296] ../src/export-buf.c: 229 findGPUIndexFromFd No EGL_CUDA_DEVICE_NV support for EGLDevice 1
214878.074074135 [2010296-2010296] ../src/export-buf.c: 232 findGPUIndexFromFd No DRM device file for EGLDevice 2
214878.074076840 [2010296-2010296] ../src/export-buf.c: 235 findGPUIndexFromFd No match found, falling back to default device
214878.075083408 [2010296-2010296] ../src/export-buf.c: 289 egl_initExporter Driver supports 16-bit surfaces
214878.075096823 [2010296-2010296] ../src/vabackend.c:1948 __vaDriverInit_1_0 CUDA ERROR 'initialization error' (3)

214878.075100831 [2010296-2010296] ../src/export-buf.c: 65 egl_releaseExporter Releasing exporter, 0 outstanding frames
214878.075109497 [2010296-2010296] ../src/export-buf.c: 82 egl_releaseExporter Done releasing frames
libva error: /usr/lib/x86_64-linux-gnu/dri/nvidia_drv_video.so init failed
libva info: va_openDriver() returns 1
vaInitialize failed with error code 1 (operation failed),exit
  • The error above appears after hibernation; the nvidia_uvm kernel module needs to be reloaded:
sudo rmmod nvidia_uvm
sudo modprobe nvidia_uvm
  • Create a systemd service to handle the settings and module reload after hibernation:
~$ cat /etc/pm/sleep.d/after-hibernate.sh
#!/bin/bash
# on bookworm will get vaInitialize failed with error code 1 after hibernate.
rmmod nvidia_uvm
modprobe nvidia_uvm

exit 0

  • systemd service
~$ cat /etc/systemd/system/rfh.service
[Unit]
Description=Run script after hibernate recovery
#After=suspend.target
After=hibernate.target
#After=hybrid-sleep.target
[Service]
ExecStart=/etc/pm/sleep.d/after-hibernate.sh
[Install]
#WantedBy=suspend.target
WantedBy=hibernate.target
#WantedBy=hybrid-sleep.target

~$ systemctl enable rfh

Using GStreamer

export LIBVA_DRIVER_NAME=nvidia
export GST_VAAPI_ALL_DRIVERS=1

~$ gst-inspect-1.0 vaapi
Plugin Details:
Name vaapi
Description VA-API based elements
Filename /lib/x86_64-linux-gnu/gstreamer-1.0/libgstvaapi.so
Version 1.22.0
License LGPL
Source module gstreamer-vaapi
Documentation https://gstreamer.freedesktop.org/documentation/vaapi/
Source release date 2023-01-23
Binary package gstreamer-vaapi
Origin URL https://tracker.debian.org/pkg/gstreamer-vaapi

vaapidecodebin: VA-API Decode Bin
vaapih264dec: VA-API H264 decoder
vaapih265dec: VA-API H265 decoder
vaapimpeg2dec: VA-API MPEG2 decoder
vaapisink: VA-API sink
vaapivc1dec: VA-API VC1 decoder
vaapivp9dec: VA-API VP9 decoder

7 features:
+-- 7 elements

  • As shown above, only the decoders listed are supported; nvidia-vaapi-driver also does not yet support vaapipostproc: VA-API video postprocessing, so hardware-decoded playback through vaapisink is not possible yet.

    GST_DEBUG=nvdec*:6,nvenc*:6 gst-inspect-1.0 nvdec
    ~$ gst-inspect-1.0 nvcodec
    Plugin Details:
    Name nvcodec
    Description GStreamer NVCODEC plugin
    Filename /lib/x86_64-linux-gnu/gstreamer-1.0/libgstnvcodec.so
    Version 1.22.0
    License LGPL
    Source module gst-plugins-bad
    Documentation https://gstreamer.freedesktop.org/documentation/nvcodec/
    Source release date 2023-01-23
    Binary package GStreamer Bad Plugins (Debian)
    Origin URL https://tracker.debian.org/pkg/gst-plugins-bad1.0

    cudaconvert: CUDA colorspace converter
    cudaconvertscale: CUDA colorspace converter and scaler
    cudadownload: CUDA downloader
    cudascale: CUDA video scaler
    cudaupload: CUDA uploader
    nvh264dec: NVDEC h264 Video Decoder
    nvh264sldec: NVDEC H.264 Stateless Decoder
    nvh265dec: NVDEC h265 Video Decoder
    nvh265sldec: NVDEC H.265 Stateless Decoder
    nvjpegdec: NVDEC jpeg Video Decoder
    nvmpeg2videodec: NVDEC mpeg2video Video Decoder
    nvmpeg4videodec: NVDEC mpeg4video Video Decoder
    nvmpegvideodec: NVDEC mpegvideo Video Decoder
    nvvp9dec: NVDEC vp9 Video Decoder
    nvvp9sldec: NVDEC VP9 Stateless Decoder
  • Test hardware-decoded playback of an H.264 file

~$ sudo apt-get install gstreamer1.0-plugins-base-apps

~$ gst-discoverer-1.0 test.x264.AAC5.1.mp4
Done discovering test.x264.AAC5.1.mp4
Missing plugins
(gstreamer|1.0|gst-discoverer-1.0|GStreamer element vaapipostproc|element-vaapipostproc)

Properties:
Duration: 1:39:44.736000000
Seekable: yes
Live: no
container #0: Quicktime
video #1: H.264 (High Profile)
Stream ID: 1a5271d9ce1c168fa86e3c3727d54d189e469c400ed56a5836c778c4ddd01ac6/001
Width: 1920
Height: 1036
Depth: 24
Frame rate: 24000/1001
Pixel aspect ratio: 1/1
Interlaced: false
Bitrate: 2249690
Max bitrate: 31250000
audio #2: MPEG-4 AAC
Stream ID: 1a5271d9ce1c168fa86e3c3727d54d189e469c400ed56a5836c778c4ddd01ac6/002
Language: <unknown>
Channels: 6 (front-left, front-right, front-center, lfe1, rear-left, rear-right)
Sample rate: 48000
Depth: 32
Bitrate: 384000
Max bitrate: 384000

  • Decoding with the NVDEC H.264 Stateless Decoder:
export LIBVA_DRIVER_NAME=nvidia
export GST_VAAPI_ALL_DRIVERS=1
export GST_VAAPI_DRM_DEVICE=/dev/dri/renderD128

~$ gst-launch-1.0 filesrc location=test.x264.AAC5.1.mp4 ! parsebin ! nvh264sldec ! videoconvert ! xvimagesink

~$ nvidia-smi
Sun May 7 22:40:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:07:00.0 On | N/A |
| N/A 49C P0 N/A / 30W | 726MiB / 2048MiB | 18% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 11453 G /usr/lib/xorg/Xorg 294MiB |
| 0 N/A N/A 12305 G ...e/michael/firefox/firefox 268MiB |
| 0 N/A N/A 93886 G ...AAAAAAAAA= --shared-files 1MiB |
| 0 N/A N/A 95739 G ...RendererForSitePerProcess 74MiB |
| 0 N/A N/A 119708 C gst-launch-1.0 82MiB |
+-----------------------------------------------------------------------------+

  • With the libav H.264 software decoder, CPU usage is roughly 10% higher than with the pipeline above.
~$ gst-launch-1.0 filesrc location=test4.mp4 ! parsebin ! avdec_h264 ! videoconvert ! xvimagesink
  • Use glimagesink to test speed; fpsdisplaysink shows the current frame rate.
~$ sudo apt-get install gstreamer1.0-gl
~$ GST_VAAPI_DRM_DEVICE=/dev/dri/renderD128 gst-launch-1.0 filesrc location=test4.mp4 ! parsebin ! nvh264sldec ! videoconvert ! fpsdisplaysink video-sink=glimagesink sync=false

Error log analysis

  • The following error occurred because /usr/local/include/va already existed on my machine, and that copy is older and lacks the needed structs, while the /usr/include/va directory created by the system's libva-dev package contains the correct ones. meson apparently finds /usr/local/include/va first, so #include <va/va.h> resolves there and /usr/include/va/va.h is ignored.
~$ cc -Invidia_drv_video.so.p -I. -I.. -I../nvidia-include  -I/usr/local/include -I/usr/include -I/usr/include/libdrm -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/x86_64-linux-gnu -fvisibility=hidden -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -std=gnu11 -g -Wno-missing-field-initializers -Wno-unused-parameter -Werror=format -Werror=init-self -Werror=int-conversion -Werror=missing-declarations -Werror=missing-prototypes -Werror=pointer-arith -Werror=undef -Werror=vla -Wsuggest-attribute=format -Wwrite-strings -fPIC -pthread -MD -MQ nvidia_drv_video.so.p/src_h264.c.o -MF nvidia_drv_video.so.p/src_h264.c.o.d -o nvidia_drv_video.so.p/src_h264.c.o -c ../src/h264.c
In file included from ../src/h264.c:1:
../src/vabackend.h:123:77: error: unknown type name ‘VADRMPRIMESurfaceDescriptor’
123 | bool (*fillExportDescriptor)(struct _NVDriver *drv, NVSurface *surface, VADRMPRIMESurfaceDescriptor *desc);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~
../src/h264.c:133:1: warning: ‘retain’ attribute directive ignored [-Wattributes]
133 | const DECLARE_CODEC(h264Codec) = {
| ^~~~~
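The root cause is the compiler's -I search order: the first directory containing the header wins, so a stale copy under /usr/local/include shadows the correct one under /usr/include. A toy model of that lookup (paths and the find_header helper are illustrative, not a real API):

```python
def find_header(name, include_dirs, filesystem):
    """Return the first include_dirs entry that contains `name`, mimicking -I order."""
    for d in include_dirs:
        if name in filesystem.get(d, ()):
            return d + "/" + name
    return None

fs = {
    "/usr/local/include": {"va/va.h"},  # stale copy, missing the new structs
    "/usr/include": {"va/va.h"},        # libva-dev's correct copy
}
# The stale copy wins because /usr/local/include is searched first.
print(find_header("va/va.h", ["/usr/local/include", "/usr/include"], fs))
```

Removing or renaming the stale /usr/local/include/va (and likewise /usr/local/include/EGL) lets the compiler fall through to the system headers.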

  • The following error is likewise caused by a stale local /usr/local/include/EGL.
FAILED: nvidia_drv_video.so.p/src_export-buf.c.o
cc -Invidia_drv_video.so.p -I. -I.. -I../nvidia-include -I/usr/local/include -I/usr/include/libdrm -I/usr/include/gstreamer-1.0 -I/usr/include/glib-2.0 -I/usr/lib/x86_64-linux-gnu/glib-2.0/include -I/usr/include/x86_64-linux-gnu -fvisibility=hidden -fdiagnostics-color=always -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -std=c11 -g -Wno-missing-field-initializers -Wno-unused-parameter -Werror=format -Werror=incompatible-pointer-types -Werror=init-self -Werror=int-conversion -Werror=missing-declarations -Werror=missing-prototypes -Werror=pointer-arith -Werror=undef -Werror=vla -Wsuggest-attribute=format -Wwrite-strings -fPIC -pthread -MD -MQ nvidia_drv_video.so.p/src_export-buf.c.o -MF nvidia_drv_video.so.p/src_export-buf.c.o.d -o nvidia_drv_video.so.p/src_export-buf.c.o -c ../src/export-buf.c
../src/export-buf.c: In function ‘egl_initExporter’:
../src/export-buf.c:242:5: error: unknown type name ‘PFNEGLQUERYDMABUFFORMATSEXTPROC’; did you mean ‘PFNEGLQUERYOUTPUTPORTATTRIBEXTPROC’?
242 | PFNEGLQUERYDMABUFFORMATSEXTPROC eglQueryDmaBufFormatsEXT = (PFNEGLQUERYDMABUFFORMATSEXTPROC) eglGetProcAddress("eglQueryDmaBufFormatsEXT");
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| PFNEGLQUERYOUTPUTPORTATTRIBEXTPROC
../src/export-buf.c:242:65: error: ‘PFNEGLQUERYDMABUFFORMATSEXTPROC’ undeclared (first use in this function); did you mean ‘PFNEGLQUERYOUTPUTPORTATTRIBEXTPROC’?
242 | PFNEGLQUERYDMABUFFORMATSEXTPROC eglQueryDmaBufFormatsEXT = (PFNEGLQUERYDMABUFFORMATSEXTPROC) eglGetProcAddress("eglQueryDmaBufFormatsEXT");
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| PFNEGLQUERYOUTPUTPORTATTRIBEXTPROC
../src/export-buf.c:242:65: note: each undeclared identifier is reported only once for each function it appears in
../src/export-buf.c:242:98: error: expected ‘,’ or ‘;’ before ‘eglGetProcAddress’
242 | PFNEGLQUERYDMABUFFORMATSEXTPROC eglQueryDmaBufFormatsEXT = (PFNEGLQUERYDMABUFFORMATSEXTPROC) eglGetProcAddress("eglQueryDmaBufFormatsEXT");
| ^~~~~~~~~~~~~~~~~
../src/export-buf.c:265:9: error: called object ‘eglQueryDmaBufFormatsEXT’ is not a function or function pointer
265 | if (eglQueryDmaBufFormatsEXT(drv->eglDisplay, 64, formats, &formatCount)) {

  • nvidia-drm errors
~$ dmesg
[...]
[38810.269044] [drm:drm_new_set_master [drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership
[41522.270711] [drm:drm_new_set_master [drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership
[42735.271307] [drm:drm_new_set_master [drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership
[44347.269266] [drm:drm_new_set_master [drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000700] Failed to grab modeset ownership

  • The error above is caused by setting options nvidia-drm modeset=1. Make sure that none of the files below, nor any file under the /etc/modprobe.d or /usr/lib/modprobe.d directories, contains this setting:
~$ grep --include=*.conf -rnw /usr/lib/  -e "nvidia-drm"
/usr/lib/modprobe.d/nvidia-installer-disable-nouveau.conf

~$ sudo grep --include=*.conf -rnw /etc/ -e "nvidia-drm"
/etc/nvidia/current/nvidia-modprobe.conf:2:options nvidia-drm modset=1
/etc/nvidia/current/nvidia-modprobe.conf:5:install nvidia-drm modprobe nvidia-modeset ; modprobe -i nvidia-current-drm $CMDLINE_OPTS
/etc/nvidia/current/nvidia-modprobe.conf:13:remove nvidia modprobe -r -i nvidia-drm nvidia-modeset nvidia-peermem nvidia-uvm nvidia
/etc/nvidia/current/nvidia-modprobe.conf:15:remove nvidia-modeset modprobe -r -i nvidia-drm nvidia-modeset
/etc/nvidia/current/nvidia-load.conf:1:nvidia-drm
/etc/nvidia/current/nvidia-drm-outputclass.conf:3:# nvidia-drm.ko kernel module. Please note that this only works on Linux kernels
/etc/nvidia/current/nvidia-drm-outputclass.conf:4:# version 3.9 or higher with CONFIG_DRM enabled, and only if the nvidia-drm.ko
/etc/nvidia/current/nvidia-drm-outputclass.conf:9: MatchDriver "nvidia-drm"

  • polkitd segfault
  • interpreting-segfault-messages: https://stackoverflow.com/questions/2549214/interpreting-segfault-messages/2549363#2549363

~$ dmesg
polkitd[99838]: segfault at 8 ip 0000564f56f95736 sp 00007ffe8b5fa800 error 4 in polkitd[564f56f91000+e000] likely on CPU 1 (core 1, socket 0)
  • When the error above appears, it may be a compatibility problem with system libraries, or a reinstall may have failed to overwrite the old files. Run the following steps:
    • sudo apt-get remove -y policykit-1
    • sudo dpkg --purge policykit-1
    • Manually delete /etc/policykit-1, and double-check that the commands above removed everything cleanly.
    • sudo apt-get install -y policykit-1

Thanks for your support

  • WeChat QR code: