
Kubernetes Storage Based on CephFS

Volumes

  • Kubernetes redefines volumes and provides a rich, powerful set of features. By function they fall into three categories: local volumes, network volumes, and information volumes.
  • The volume types Kubernetes supports (the list gains and loses entries between releases):
    • Local volumes:
      • EmptyDir
      • HostPath
    • Network volumes:
      • NFS
      • iSCSI
      • GlusterFS
      • RBD
      • Flocker
      • GCE Persistent Disk
      • AWS Elastic Block Store
      • azureDisk
      • CephFS
      • FC (Fibre Channel)
      • Persistent Volume Claim
    • Information volumes:
      • Git Repo (deprecated)
      • Secret
      • Downward API

Local Volumes

HostPath

  • Most Pods should be unaware of the node they run on, so they should not access any files on the node's filesystem. However, certain system-level Pods (usually managed by a DaemonSet) genuinely need to read files on the node, and in test environments HostPath can stand in for a real PV. A HostPath volume points to a specific file or directory on the node's filesystem; Pods running on the same node that use the same path in a HostPath volume see the same files. To use HostPath provisioning in a cluster, start kube-controller-manager with the --enable-hostpath-provisioner flag.
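  • A minimal sketch of such a HostPath volume, assuming a hypothetical pod name (hostpath-demo) and host path (/var/log) chosen only for illustration:

    ~$ cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: hostpath-demo        # hypothetical name, not from the cluster below
    spec:
      containers:
        - name: reader
          image: busybox
          command: ["sleep", "3600"]
          volumeMounts:
            - name: nodelog
              mountPath: /host-log
              readOnly: true
      volumes:
        - name: nodelog
          hostPath:
            path: /var/log       # a directory that already exists on the node
            type: Directory
    EOF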

Ceph Cluster

Overview

  • Ceph is an open-source project that provides a software-defined (SDS), unified storage solution. It is highly scalable, with capacity that can grow to the exabyte level. Ceph's technical strengths fall into four areas: cluster reliability, cluster scalability, data safety, and a unified interface.

Storage Backends

  • Backend object storage comes in two flavors, FileStore and BlueStore:

  • FileStore:

    • FileStore is the legacy approach to storing objects in Ceph. It relies on a standard file system (normally XFS) in combination with a key/value database (traditionally LevelDB, now RocksDB) for some metadata.
    • FileStore is well-tested and widely used in production. However, it suffers from many performance deficiencies due to its overall design and its reliance on a traditional file system for object data storage.
    • Although FileStore is capable of functioning on most POSIX-compatible file systems (including btrfs and ext4), we recommend that only the XFS file system be used with Ceph. Both btrfs and ext4 have known bugs and deficiencies and their use may lead to data loss. By default, all Ceph provisioning tools use XFS.
  • BlueStore:

    • Key BlueStore features include:
      • Direct management of storage devices. BlueStore consumes raw block devices or partitions. This avoids intervening layers of abstraction (such as local file systems like XFS) that can limit performance or add complexity.
      • Metadata management with RocksDB. RocksDB’s key/value database is embedded in order to manage internal metadata, including the mapping of object names to block locations on disk.
      • Full data and metadata checksumming. By default, all data and metadata written to BlueStore is protected by one or more checksums. No data or metadata is read from disk or returned to the user without being verified.
      • Inline compression. Data can be optionally compressed before being written to disk.
      • Multi-device metadata tiering. BlueStore allows its internal journal (write-ahead log) to be written to a separate, high-speed device (like an SSD, NVMe, or NVDIMM) for increased performance. If a significant amount of faster storage is available, internal metadata can be stored on the faster device.
      • Efficient copy-on-write. RBD and CephFS snapshots rely on a copy-on-write clone mechanism that is implemented efficiently in BlueStore. This results in efficient I/O both for regular snapshots and for erasure-coded pools (which rely on cloning to implement efficient two-phase commits).
    • The following device configurations are supported:
      • A block device, a block.wal, and a block.db device
      • A block device and a block.wal device
      • A block device and a block.db device
      • A single block device
    • The block device itself can be one of three things:
      • a whole disk
      • a disk partition
      • a logical volume (an LVM LV)
  • Notes:

    1. A whole disk cannot be used as block.db or block.wal; doing so fails with: blkid could not detect a PARTUUID for device;
    2. If a whole disk or a partition is used as block, ceph-volume creates an LV on it. If a partition is used as block.db or block.wal, the partition is used directly and no LV is created.
  • BlueFS divides BlueStore's storage space into three tiers (a ceph-volume sketch follows this list):

    • Slow space: holds the object data itself; it can be provided by ordinary high-capacity spinning disks and is managed by BlueStore directly.
    • Fast (DB) space: stores the metadata BlueStore generates internally; it can be provided by an ordinary SSD and needs less capacity than the slow space.
    • Ultra-fast (WAL) space: mainly stores the .log files RocksDB generates internally; it can be backed by an SSD or a lower-latency device such as NVRAM. Its capacity requirement is comparable to the DB space, and it is likewise managed directly by BlueFS.
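  • As a hedged sketch of that three-tier layout (the device names /dev/sdb, /dev/nvme0n1p1 and /dev/nvme0n1p2 are assumptions for illustration only), an OSD could be created with ceph-volume like this:

    # Object data on the slow HDD, RocksDB metadata (block.db) and WAL (block.wal) on faster partitions.
    ~$ sudo ceph-volume lvm create --bluestore --data /dev/sdb \
          --block.db /dev/nvme0n1p1 --block.wal /dev/nvme0n1p2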

Ceph Components

  • Ceph OSD (Object Storage Device): stores data and handles data replication, recovery, backfill and rebalancing, and reports related information such as OSD heartbeats to the Ceph Monitors. A Ceph storage cluster needs at least two Ceph OSDs to reach the active+clean health state while keeping valid copies of the data (two copies by default here; adjustable). Note: every disk or partition can become an OSD.
  • Ceph Monitor: the cluster monitor; it maintains the overall health state of the cluster and provides consistent decisions, holding the Monitor map, OSD map, PG (Placement Group) map and CRUSH map.
  • Ceph MDS (Ceph Metadata Server): stores the metadata of the Ceph file system. Note: neither Ceph block storage nor Ceph object storage needs an MDS. The MDS is what gives POSIX file-system users basic commands such as ls and find. An MDS is only needed when creating a CephFS, and at the time of writing the author considered CephFS still some distance from production use.

Ceph Features

RADOS

  • RADOS is self-healing and provides reliable, automatic, intelligent distributed storage. Its heart is the CRUSH (Controlled Replication Under Scalable Hashing) algorithm.

Ceph File System

  • CephFS is a distributed file system implemented on top of RADOS. It introduces the MDS (Metadata Server), which mainly serves metadata for POSIX file-system compatibility. It is usually mounted like a regular file system.
  • Ceph file system
    (Figure: CephFS architecture)

Ceph Block Device

  • RBD (RADOS Block Device) sits on top of librados: librbd creates a block device that can be attached to a VM via QEMU/KVM and used like a traditional block device. OpenStack, CloudStack and others currently use this approach to provide block devices to VMs; snapshots and COW (Copy On Write) are also supported.

  • Ceph block device
    (Figure: RBD architecture)

Ceph Object Gateway

  • RADOSGW sits on top of librados and provides a gateway speaking the popular RESTful protocols, compatible with the AWS S3 and Swift APIs. As object storage it can back cloud-drive style applications, HLS streaming media applications, and so on.
    (Figure: Ceph object gateway architecture)

  • Overall architecture
    (Figure: Ceph component stack)

Installing via Ceph/ceph-ansible

  • Notes on the release branches:
    • stable-3.0 supports Ceph jewel and luminous. This branch requires Ansible 2.4.
    • stable-3.1 supports Ceph luminous and mimic. This branch requires Ansible 2.4.
    • stable-3.2 supports Ceph luminous and mimic. This branch requires Ansible 2.6.
    • stable-4.0 supports Ceph nautilus. This branch requires Ansible 2.8.
    • master supports Ceph@master. This branch requires Ansible 2.8.
~$ git clone https://github.com/ceph/ceph-ansible
~$ git -C ceph-ansible checkout v3.2.9
~$ pip install -r ceph-ansible/requirements.txt

Installing via Ceph/ceph-deploy

Quick install (apt)

  • ceph-deploy is the older deployment method and the process is somewhat involved. In testing, ceph-deploy could not be installed via apt; installing it with pip install ceph-deploy worked.
  • The experiments below use VirtualBox VMs. Create one Linux VM, install Debian 9, and give it two NICs: a NAT interface (10.0.2.0/24) for downloading packages from the internet and a Vboxnet1 interface (192.168.99.0/24) for cluster traffic. Install the usual tools, then clone four new VMs and change each one's hostname and IP address. Ansible is used below to operate on these VMs in bulk. The resulting VM layout is as follows:

(Figure: VirtualBox VM / cluster topology)

  • On each of the node VMs above, install the time-server packages (apt-get install ntp ntpdate ntp-doc) and set up passwordless SSH public-key login. Ansible can be used to do this in bulk; a hedged sketch follows the next note.
  • Note: /etc/hosts on every node must be consistent with the hostnames ceph-deploy operates on, otherwise ceph-deploy mon create-initial fails and cannot proceed.
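  • A hedged sketch of that bulk install (the inventory file hosts and the group name all are assumptions carried over from the surrounding examples):

    # Install the time-server packages on every node via Ansible's apt module.
    ~$ ansible -i hosts all -b -m apt -a "name=ntp,ntpdate,ntp-doc state=present update_cache=yes"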

Cleaning up old nodes

~$ ceph-deploy purge {ceph-node} [{ceph-node}]
~$ ceph-deploy purgedata {ceph-node} [{ceph-node}]
~$ ceph-deploy forgetkeys
~$ rm ceph.*

Installing nodes

~$ ceph-deploy new --help
usage: ceph-deploy new [-h] [--no-ssh-copykey] [--fsid FSID]
[--cluster-network CLUSTER_NETWORK]
[--public-network PUBLIC_NETWORK]
MON [MON ...]

Start deploying a new cluster, and write a CLUSTER.conf and keyring for it.

positional arguments:
MON initial monitor hostname, fqdn, or hostname:fqdn pair

optional arguments:
-h, --help show this help message and exit
--no-ssh-copykey do not attempt to copy SSH keys
--fsid FSID provide an alternate FSID for ceph.conf generation
--cluster-network CLUSTER_NETWORK
specify the (internal) cluster network
--public-network PUBLIC_NETWORK
specify the public network for a cluster

~$ ceph-deploy new node1 node2 node3 --public-network 192.168.99.0/24

# The command below SSHes into each node and installs the Ceph packages, roughly equivalent to: apt install ceph ceph-base ceph-common ceph-mds ceph-mon ceph-osd radosgw
# Because ceph-deploy 2.0.x is used, the release must be pinned to luminous (v12) or newer, otherwise it installs its default release (Mimic). The latest release at the time of writing is Nautilus (v14.0.2).
~$ ceph-deploy install --release luminous node1 node2 node3
# These two options can be added to speed up the install: --repo-url http://mirrors.ustc.edu.cn/ceph/debian-luminous --gpg-url http://mirrors.ustc.edu.cn/ceph/keys/release.asc
# After ceph-deploy new, ceph.conf and ceph.mon.keyring are created in the current directory.
  • For Pool, PG (Placement Groups) and CRUSH configuration see the official documentation; the PG tuning below can be cross-checked with PgCalc.
# --> ceph.conf
[global]

# By default, Ceph makes 3 replicas of objects. If you want to make four
# copies of an object the default value--a primary copy and three replica
# copies--reset the default values as shown in 'osd pool default size'.
# If you want to allow Ceph to write a lesser number of copies in a degraded
# state, set 'osd pool default min size' to a number less than the
# 'osd pool default size' value.

osd pool default size = 3 # Write an object 3 times.
osd pool default min size = 2 # Allow writing two copies in a degraded state.

# Ensure you have a realistic number of placement groups. We recommend
# approximately 100 per OSD. E.g., total number of OSDs multiplied by 100
# divided by the number of replicas (i.e., osd pool default size). So for
# 10 OSDs and osd pool default size = 4, we'd recommend approximately
# (100 * 10) / 4 = 250.

osd pool default pg num = 250
osd pool default pgp num = 250
  • Initialize the monitors:

    ~$ ceph-deploy mon create node1 node2 node3
    ~$ ceph-deploy gatherkeys node1 node2 node3
  • Note: if the error below appears, it may be because free disk space on the node is below 5%. The full details are in /var/log/ceph/ceph-mon.DB001.log.

    [DB001][INFO  ] Running command: sudo ceph --cluster=ceph --admin-daemon /var/run/ceph/ceph-mon.DB001.asok mon_status
    [DB001][ERROR ] b'admin_socket: exception getting command descriptions: [Errno 2] No such file or directory'
  • The following files are created in the current directory:

    • {cluster-name}.client.admin.keyring
    • {cluster-name}.mon.keyring
    • {cluster-name}.bootstrap-osd.keyring
    • {cluster-name}.bootstrap-mds.keyring
    • {cluster-name}.bootstrap-rgw.keyring
    • {cluster-name}.bootstrap-mgr.keyring
  • Distribute the Ceph configuration and keys to the cluster nodes:

    $ ceph-deploy admin node1 node2 node3
    [ceph_deploy.conf][DEBUG ] found configuration file at: /home/lcy/.cephdeploy.conf
    [ceph_deploy.cli][INFO ] Invoked (2.0.1): /home/lcy/.pyenv/versions/py3dev/bin/ceph-deploy admin node1 node2 node3
    [ceph_deploy.cli][INFO ] ceph-deploy options:
    [ceph_deploy.cli][INFO ] verbose : False
    [ceph_deploy.cli][INFO ] quiet : False
    [ceph_deploy.cli][INFO ] username : None
    [ceph_deploy.cli][INFO ] overwrite_conf : False
    [ceph_deploy.cli][INFO ] ceph_conf : None
    [ceph_deploy.cli][INFO ] cluster : ceph
    [ceph_deploy.cli][INFO ] client : ['node1', 'node2', 'node3']
    [ceph_deploy.cli][INFO ] cd_conf : <ceph_deploy.conf.cephdeploy.Conf object at 0x7fef98f77390>
    [ceph_deploy.cli][INFO ] default_release : False
    [ceph_deploy.cli][INFO ] func : <function admin at 0x7fef99bdd6a8>
    [ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node1
    [node1][DEBUG ] connection detected need for sudo
    [node1][DEBUG ] connected to host: node1
    [ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node2
    [node2][DEBUG ] connection detected need for sudo
    [node2][DEBUG ] connected to host: node2
    [ceph_deploy.admin][DEBUG ] Pushing admin keys and conf to node3
    [node3][DEBUG ] connection detected need for sudo
    [node3][DEBUG ] connected to host: node3
  • Check the cluster status, either by logging in and running the command as root or via Ansible:

    ~$ ansible -i ../hosts node1 -b -m command -a "ceph -s"
    192.168.99.101 | CHANGED | rc=0 >>
    cluster:
    id: 0bf150da-b691-4382-bf3d-600e90c19fba
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: no daemons active
    osd: 0 osds: 0 up, 0 in

    data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0B
    usage: 0B used, 0B / 0B avail
    pgs:

Deploying the Ceph Manager

  • Reference documentation
    ~$ ceph-deploy mgr create node1 node2 node3
    [...]
    ~$ ansible -i ../hosts node1 -b -m command -a "ceph -s"
    192.168.99.101 | CHANGED | rc=0 >>
    cluster:
    id: 0bf150da-b691-4382-bf3d-600e90c19fba
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: node1(active), standbys: node3, node2
    osd: 0 osds: 0 up, 0 in

    data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0B
    usage: 0B used, 0B / 0B avail
    pgs:

Deploying Ceph OSDs

  • ceph-volume
  • Since Ceph Luminous 12.2.2, ceph-disk is deprecated and replaced by ceph-volume.
~$ ceph-deploy osd -h
usage: ceph-deploy osd [-h] {list,create} ...

Create OSDs from a data disk on a remote host:

ceph-deploy osd create {node} --data /path/to/device

For bluestore, optional devices can be used::

ceph-deploy osd create {node} --data /path/to/data --block-db /path/to/db-device
ceph-deploy osd create {node} --data /path/to/data --block-wal /path/to/wal-device
ceph-deploy osd create {node} --data /path/to/data --block-db /path/to/db-device --block-wal /path/to/wal-device

For filestore, the journal must be specified, as well as the objectstore::

ceph-deploy osd create {node} --filestore --data /path/to/data --journal /path/to/journal

For data devices, it can be an existing logical volume in the format of:
vg/lv, or a device. For other OSD components like wal, db, and journal, it
can be logical volume (in vg/lv format) or it must be a GPT partition.

positional arguments:
{list,create}
list List OSD info from remote host(s)
create Create new Ceph OSD daemon by preparing and activating a
device

optional arguments:
-h, --help show this help message and exit

~$ ceph-deploy osd create -h
usage: ceph-deploy osd create [-h] [--data DATA] [--journal JOURNAL]
[--zap-disk] [--fs-type FS_TYPE] [--dmcrypt]
[--dmcrypt-key-dir KEYDIR] [--filestore]
[--bluestore] [--block-db BLOCK_DB]
[--block-wal BLOCK_WAL] [--debug]
[HOST]

positional arguments:
HOST Remote host to connect

optional arguments:
-h, --help show this help message and exit
--data DATA The OSD data logical volume (vg/lv) or absolute path
to device
--journal JOURNAL Logical Volume (vg/lv) or path to GPT partition
--zap-disk DEPRECATED - cannot zap when creating an OSD
--fs-type FS_TYPE filesystem to use to format DEVICE (xfs, btrfs)
--dmcrypt use dm-crypt on DEVICE
--dmcrypt-key-dir KEYDIR
directory where dm-crypt keys are stored
--filestore filestore objectstore
--bluestore bluestore objectstore
--block-db BLOCK_DB bluestore block.db path
--block-wal BLOCK_WAL
bluestore block.wal path
--debug Enable debug mode on remote ceph-volume calls
  • A disk was added to node1 (and likewise to the other nodes) to serve as an OSD disk; below, each whole disk is turned into a block device.
~$ ceph-deploy osd create node1 --data /dev/vdb
~$ ceph-deploy osd create node2 --data /dev/vdb
~$ ceph-deploy osd create node3 --data /dev/vdb
  • Error case: if the disk still carries old LVM metadata, it has to be cleaned off manually first, otherwise the error below appears. Inspect with lvdisplay and remove with lvremove --force; likewise use vgdisplay and vgremove --force to clear the old LVM information.
[DB001][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy.osd][ERROR ] Failed to execute command: /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/vdb
[ceph_deploy][ERROR ] GenericError: Failed to create 1 OSDs
  • Clear the old LVM information:

    ~$ ansible -i hosts all  -b -m shell  -a "lvdisplay | awk 'NR==2 {print \$3}'| xargs  lvremove --force ;  vgdisplay | awk 'NR==2 {print \$3}' | xargs  vgremove"
  • Check the OSD status:

    ~$ ansible -i ../hosts node1 -b -m command -a "ceph osd stat"
    3 osds: 3 up, 3 in

    ~$ ansible -i ../hosts node1 -b -m command -a "ceph df"
    GLOBAL:
    SIZE AVAIL RAW USED %RAW USED
    180GiB 177GiB 3.02GiB 1.68
    POOLS:
    NAME ID USED %USED MAX AVAIL OBJECTS
    hdd 1 375B 0 84.0GiB 8
    cephfs_data 2 0B 0 84.0GiB 0
    cephfs_metadata 3 2.19KiB 0 84.0GiB 21
  • View the OSD tree:

    $ ansible -i hosts node1  -b -m command -a "ceph osd tree"

    node1 | CHANGED | rc=0 >>
    ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
    -1 0.17578 root default
    -3 0.05859 host node1
    0 hdd 0.05859 osd.0 up 1.00000 1.00000
    -7 0.05859 host node2
    2 hdd 0.05859 osd.2 up 1.00000 1.00000
    -5 0.05859 host node3
    1 hdd 0.05859 osd.1 up 1.00000 1.00000
  • Check the overall cluster status:

    $ ansible -i ../hosts node1 -b -m command -a "ceph -s"
    192.168.99.101 | CHANGED | rc=0 >>
    cluster:
    id: 0bf150da-b691-4382-bf3d-600e90c19fba
    health: HEALTH_OK

    services:
    mon: 3 daemons, quorum node1,node2,node3
    mgr: node1(active), standbys: node3, node2
    osd: 1 osds: 1 up, 1 in

    data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0B
    usage: 1.00GiB used, 9.00GiB / 10.0GiB avail # reflects the OSD just created on node1:/dev/sdb.
    pgs:
    # Pipes require the shell module; the command module would error out here.
    ~$ ansible -i ../hosts node1 -b -m shell -a "mount | grep ceph"
    tmpfs on /var/lib/ceph/osd/ceph-0 type tmpfs (rw,relatime)

    ~$ ansible -i ../hosts node1 -b -m command -a "ls -l /var/lib/ceph/"
    192.168.99.101 | CHANGED | rc=0 >>
    total 44
    drwxr-xr-x 2 ceph ceph 4096 Apr 11 08:44 bootstrap-mds
    drwxr-xr-x 2 ceph ceph 4096 May 9 21:33 bootstrap-mgr
    drwxr-xr-x 2 ceph ceph 4096 May 9 22:27 bootstrap-osd
    drwxr-xr-x 2 ceph ceph 4096 Apr 11 08:44 bootstrap-rbd
    drwxr-xr-x 2 ceph ceph 4096 Apr 11 08:44 bootstrap-rgw
    drwxr-xr-x 2 ceph ceph 4096 Apr 11 08:44 mds
    drwxr-xr-x 3 ceph ceph 4096 May 9 21:33 mgr
    drwxr-xr-x 3 ceph ceph 4096 May 9 21:22 mon
    drwxr-xr-x 3 ceph ceph 4096 May 9 22:27 osd
    drwxr-xr-x 2 ceph ceph 4096 Apr 11 08:44 radosgw
    drwxr-xr-x 2 ceph ceph 4096 May 9 21:22 tmp

    ~$ ansible -i ../hosts node1 -b -m command -a "ls -l /var/lib/ceph/osd/ceph-0"
    192.168.99.101 | CHANGED | rc=0 >>
    total 48
    -rw-r--r-- 1 ceph ceph 393 May 9 22:27 activate.monmap
    lrwxrwxrwx 1 ceph ceph 93 May 9 22:27 block -> /dev/ceph-195012d6-0c8a-45bf-964c-3ac15f2cd024/osd-block-261c9455-fbc4-4eba-9783-5fba4290048d
    -rw-r--r-- 1 ceph ceph 2 May 9 22:27 bluefs
    -rw-r--r-- 1 ceph ceph 37 May 9 22:27 ceph_fsid
    -rw-r--r-- 1 ceph ceph 37 May 9 22:27 fsid
    -rw------- 1 ceph ceph 55 May 9 22:27 keyring
    -rw-r--r-- 1 ceph ceph 8 May 9 22:27 kv_backend
    -rw-r--r-- 1 ceph ceph 21 May 9 22:27 magic
    -rw-r--r-- 1 ceph ceph 4 May 9 22:27 mkfs_done
    -rw-r--r-- 1 ceph ceph 41 May 9 22:27 osd_key
    -rw-r--r-- 1 ceph ceph 6 May 9 22:27 ready
    -rw-r--r-- 1 ceph ceph 10 May 9 22:27 type
    -rw-r--r-- 1 ceph ceph 2 May 9 22:27 whoami
  • View Ceph's configuration options:

    ~$ ansible -i ../hosts node1 -b -m command -a "ceph --show-config"
    name = client.admin
    cluster = ceph
    debug_none = 0/5
    debug_lockdep = 0/1
    [....]
  • View the LVM details:

    ~$ ansible -i ../hosts node1 -b -m command -a "pvdisplay"
    192.168.99.101 | CHANGED | rc=0 >>
    --- Physical volume ---
    PV Name /dev/sdb
    VG Name ceph-195012d6-0c8a-45bf-964c-3ac15f2cd024
    PV Size 10.00 GiB / not usable 4.00 MiB
    Allocatable yes (but full)
    PE Size 4.00 MiB
    Total PE 2559
    Free PE 0
    Allocated PE 2559
    PV UUID Qd6kSs-Ivbp-3APy-21Tv-XQgx-EhBn-XfioVa

    ~$ ansible -i ../hosts node1 -b -m command -a "vgdisplay"
    192.168.99.101 | CHANGED | rc=0 >>
    --- Volume group ---
    VG Name ceph-195012d6-0c8a-45bf-964c-3ac15f2cd024
    System ID
    Format lvm2
    Metadata Areas 1
    Metadata Sequence No 17
    VG Access read/write
    VG Status resizable
    MAX LV 0
    Cur LV 1
    Open LV 1
    Max PV 0
    Cur PV 1
    Act PV 1
    VG Size 10.00 GiB
    PE Size 4.00 MiB
    Total PE 2559
    Alloc PE / Size 2559 / 10.00 GiB
    Free PE / Size 0 / 0
    VG UUID XiVkQ6-aUPv-3BRw-Gj1N-jdG4-HRxf-hCS3Mg

    ~$ ansible -i ../hosts node1 -b -m command -a "lvdisplay"
    192.168.99.101 | CHANGED | rc=0 >>
    --- Logical volume ---
    LV Path /dev/ceph-195012d6-0c8a-45bf-964c-3ac15f2cd024/osd-block-261c9455-fbc4-4eba-9783-5fba4290048d
    LV Name osd-block-261c9455-fbc4-4eba-9783-5fba4290048d
    VG Name ceph-195012d6-0c8a-45bf-964c-3ac15f2cd024
    LV UUID F9dF0S-qwb7-LJC0-vld2-TF6g-nP8q-9ncsdI
    LV Write Access read/write
    LV Creation host, time node1, 2019-05-09 22:27:44 -0400
    LV Status available
    # open 4
    LV Size 10.00 GiB
    Current LE 2559
    Segments 1
    Allocation inherit
    Read ahead sectors auto
    - currently set to 256
    Block device 253:0
  • A quick note on how the LVM pieces relate: an LV is built on top of a VG, and a VG is built on top of PVs. A hedged sketch of that chain follows.
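  • A minimal sketch of that PV -> VG -> LV chain (the device /dev/sdc and the names demo-vg/demo-lv are assumptions for illustration, not part of the cluster above):

    ~$ sudo pvcreate /dev/sdc                        # turn the raw device into a physical volume
    ~$ sudo vgcreate demo-vg /dev/sdc                # build a volume group on top of that PV
    ~$ sudo lvcreate -n demo-lv -l 100%FREE demo-vg  # carve a logical volume out of the VG
    ~$ sudo lvs demo-vg                              # the LV sits in the VG, which sits on the PV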

  • Next, node2 is shut down so a 20 GB disk can be added to it, in order to test the other BlueStore layouts.

    ~$ ansible -i ../hosts node1 -b -m command -a "ceph -s"
    192.168.99.101 | CHANGED | rc=0 >>
    cluster:
    id: 0bf150da-b691-4382-bf3d-600e90c19fba
    health: HEALTH_WARN
    1/3 mons down, quorum node1,node3 # warns that one node is shut down.

    services:
    mon: 3 daemons, quorum node1,node3, out of quorum: node2
    mgr: node1(active), standbys: node3
    osd: 1 osds: 1 up, 1 in

    data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0B
    usage: 1.00GiB used, 9.00GiB / 10.0GiB avail
    pgs:

Parted (GPT partitioning)

  • Partitioning with fdisk (MBR) causes errors, so parted (GPT) is used below.

    root@node2:~# parted /dev/sdb
    GNU Parted 3.2
    Using /dev/sdb
    Welcome to GNU Parted! Type 'help' to view a list of commands.
    (parted) mklabel gpt
    (parted) print
    Model: ATA VBOX HARDDISK (scsi)
    Disk /dev/sdb: 21.5GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags:

    Number Start End Size File system Name Flags

    (parted) mkpart parimary 0 10G
    Warning: The resulting partition is not properly aligned for best performance.
    Ignore/Cancel?
    Ignore/Cancel? Ignore
    (parted) print
    Model: ATA VBOX HARDDISK (scsi)
    Disk /dev/sdb: 21.5GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags:

    Number Start End Size File system Name Flags
    1 17.4kB 10.0GB 10000MB parimary

    (parted) mkpart parimary 10G 21.5G
    (parted) p
    Model: ATA VBOX HARDDISK (scsi)
    Disk /dev/sdb: 21.5GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags:

    Number Start End Size File system Name Flags
    1 17.4kB 10.0GB 10000MB parimary
    2 10.0GB 21.5GB 11.5GB parimary
    (parted) q

    root@node2:~# partx /dev/sdb
    NR START END SECTORS SIZE NAME UUID
    1 34 19531250 19531217 9.3G parimary a8c625b7-ebf2-4ceb-a9fd-5371dde59b35
    2 19531776 41940991 22409216 10.7G parimary 6463703f-c1f3-4ad7-8870-ed634db64131

    root@node2:~# lsblk /dev/sdb
    NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
    sdb 8:16 0 20G 0 disk
    ├─sdb1 8:17 0 9.3G 0 part
    └─sdb2 8:18 0 10.7G 0 part
  • Add osd.1 (block plus block.db):

~$ ceph-deploy osd create node2 --data /dev/sdb2 --block-db /dev/sdb1
[...]
[node2][INFO ] Running command: sudo /usr/sbin/ceph-volume --cluster ceph lvm create --bluestore --data /dev/sdb2 --block.db /dev/sdb1
[node2][DEBUG ] Running command: /usr/bin/ceph-authtool --gen-print-key
[node2][INFO ] checking OSD status...
[node2][INFO ] Running command: sudo /usr/bin/ceph --cluster=ceph osd stat --format=json
[ceph_deploy.osd][DEBUG ] Host node2 is now ready for osd use.

# Check the cluster status.
~$ ansible -i ../hosts node1 -b -m command -a "ceph -s"
192.168.99.101 | CHANGED | rc=0 >>
cluster:
id: 0bf150da-b691-4382-bf3d-600e90c19fba
health: HEALTH_OK

services:
mon: 3 daemons, quorum node1,node2,node3
mgr: node1(active), standbys: node3, node2
osd: 2 osds: 2 up, 2 in

data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0B
usage: 2.00GiB used, 18.7GiB / 20.7GiB avail
pgs:
# Inspect ceph-1 on node2.
~$ ansible -i ../hosts node2 -b -m command -a "ls -l /var/lib/ceph/osd/ceph-1"
192.168.99.102 | CHANGED | rc=0 >>
total 48
-rw-r--r-- 1 ceph ceph 393 May 9 23:53 activate.monmap
lrwxrwxrwx 1 ceph ceph 93 May 9 23:53 block -> /dev/ceph-98f53d51-8e74-4ca3-8b7a-87570c01733e/osd-block-f572ef53-805e-48ff-b936-da520e46be6b
lrwxrwxrwx 1 ceph ceph 9 May 9 23:53 block.db -> /dev/sdb1
-rw-r--r-- 1 ceph ceph 2 May 9 23:53 bluefs
-rw-r--r-- 1 ceph ceph 37 May 9 23:53 ceph_fsid
-rw-r--r-- 1 ceph ceph 37 May 9 23:53 fsid
-rw------- 1 ceph ceph 55 May 9 23:53 keyring
-rw-r--r-- 1 ceph ceph 8 May 9 23:53 kv_backend
-rw-r--r-- 1 ceph ceph 21 May 9 23:53 magic
-rw-r--r-- 1 ceph ceph 4 May 9 23:53 mkfs_done
-rw-r--r-- 1 ceph ceph 41 May 9 23:53 osd_key
-rw-r--r-- 1 ceph ceph 6 May 9 23:53 ready
-rw-r--r-- 1 ceph ceph 10 May 9 23:53 type
-rw-r--r-- 1 ceph ceph 2 May 9 23:53 whoami

Creating the MDS servers

~$ ceph-deploy mds create  FE001 DIG001
# Check the status
~$ ansible -i ../hosts node1 -b -m command -a "ceph mds stat"

~$ ansible -i ../hosts node1 -b -m command -a "ceph osd pool create cephfs_data 64 64"
pool 'cephfs_data' created

~$ ansible -i ../hosts node1 -b -m command -a "ceph osd pool create cephfs_metadata 64 64"
pool 'cephfs_metadata' created

# Create the file system
~$ ansible -i ../hosts node1 -b -m command -a "ceph fs new cephfs cephfs_metadata cephfs_data"
new fs with metadata pool 3 and data pool 2

~$ ansible -i ../hosts node1 -b -m command -a "ceph mds stat"
cephfs-1/1/1 up {0=DIG001=up:active}, 1 up:standby
~$ ansible -i ../hosts node1 -b -m command -a "ceph fs ls"
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]

~$ ansible -i ../hosts node1 -b -m command -a "ceph fs status"
cephfs - 0 clients
======
+------+--------+--------+---------------+-------+-------+
| Rank | State | MDS | Activity | dns | inos |
+------+--------+--------+---------------+-------+-------+
| 0 | active | DIG001 | Reqs: 0 /s | 10 | 12 |
+------+--------+--------+---------------+-------+-------+
+-----------------+----------+-------+-------+
| Pool | type | used | avail |
+-----------------+----------+-------+-------+
| cephfs_metadata | metadata | 2246 | 83.9G |
| cephfs_data | data | 0 | 83.9G |
+-----------------+----------+-------+-------+

+-------------+
| Standby MDS |
+-------------+
| FE001 |
+-------------+
MDS version: ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)

# View the OSD metadata.
~$ sudo ceph osd metadata osd.2
  • Mounting the file system: you can use the key from /etc/ceph/ceph.client.admin.keyring, or create a dedicated user and key as shown below.

    ~$ sudo ceph auth get-or-create client.cephfs mon 'allow r' mds 'allow rw' osd 'allow rw pool=cephfs-data, allow rw pool=cephfs-metadata'
    ~$ sudo ceph auth get client.cephfs
    exported keyring for client.cephfs
    [client.cephfs]
    key = AQDAwhldGXL3GhAAGsHu3XYUIwzS6z0SOcLMFA==
    caps mds = "allow rw"
    caps mon = "allow r"
    caps osd = "allow rw pool=cephfs-data, allow rw pool=cephfs-metadata"

    ~$ sudo mount.ceph node1:6789:/ /data -o name=cephfs,secret=AQDAwhldGXL3GhAAGsHu3XYUIwzS6z0SOcLMFA==
  • Mounting this way exposes the key on the shell command line, which is not secure. Save the Base64 key field (AQDAwhldGXL3GhAAGsHu3XYUIwzS6z0SOcLMFA==) into a file instead and restrict it with chmod 400.

    ~$ sudo mount.ceph node1:6789:/ /data -o name=cephfs,secretfile=/etc/ceph/cephfs.secret

    # Add an automatic mount entry to fstab
    ~$ echo "mon1:6789,mon2:6789,mon3:6789:/ /cephfs ceph name=cephfs,secretfile=/etc/ceph/cephfs.secret,_netdev,noatime 0 0" | sudo tee -a /etc/fstab
[2019-07-01 14:20:45,567][ceph_volume.process][INFO ] stdout ceph.block_device=/dev/ceph-bd417a6a-cef6-4ff5-828a-5b68ec8843f0/osd-block-dcde5f54-c555-41ee-8c20-586f1069bcb7,ceph.block_uuid=wHZD0b-lU7P-vYFg-XOBI-zknV-Q181-0xKtt3,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=d7f63adc-33d1-4ae9-9ba7-ae401950d965,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=dcde5f54-c555-41ee-8c20-586f1069bcb7,ceph.osd_id=1,ceph.type=block,ceph.vdo=0";"/dev/ceph-bd417a6a-cef6-4ff5-828a-5b68ec8843f0/osd-block-dcde5f54-c555-41ee-8c20-586f1069bcb7";"osd-block-dcde5f54-c555-41ee-8c20-586f1069bcb7";"ceph-bd417a6a-cef6-4ff5-828a-5b68ec8843f0";"wHZD0b-lU7P-vYFg-XOBI-zknV-Q181-0xKtt3";"60.00g
[2019-07-01 14:20:45,567][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 59, in newfunc
return f(*a, **kw)
File "/usr/lib/python2.7/dist-packages/ceph_volume/main.py", line 148, in main
terminal.dispatch(self.mapper, subcommand_args)
File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, in dispatch
instance.main()
File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/main.py", line 40, in main
terminal.dispatch(self.mapper, self.argv)
File "/usr/lib/python2.7/dist-packages/ceph_volume/terminal.py", line 182, in dispatch
instance.main()
File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/trigger.py", line 70, in main
Activate(['--auto-detect-objectstore', osd_id, osd_uuid]).main()
File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 339, in main
self.activate(args)
File "/usr/lib/python2.7/dist-packages/ceph_volume/decorators.py", line 16, in is_root
return func(*a, **kw)
File "/usr/lib/python2.7/dist-packages/ceph_volume/devices/lvm/activate.py", line 249, in activate
raise RuntimeError('could not find osd.%s with fsid %s' % (osd_id, osd_fsid))
RuntimeError: could not find osd.1 with fsid 3aeba7b7-f539-4b6a-afac-fc9fd62b90fa
~$ sudo lvs -o lv_tags
LV Tags
ceph.block_device=/dev/ceph-9c0a0bae-d6db-498a-bf20-fe4cd8bdb3a9/osd-block-5c5a950b-8b36-4935-be8c-b59c24073874,ceph.block_uuid=yY970H-ztZ4-VtfA-2L9d-k3cF-Zi44-0i8MB1,ceph.cephx_lockbox_secret=,ceph.cluster_fsid=d7f63adc-33d1-4ae9-9ba7-ae401950d965,ceph.cluster_name=ceph,ceph.crush_device_class=None,ceph.encrypted=0,ceph.osd_fsid=5c5a950b-8b36-4935-be8c-b59c24073874,ceph.osd_id=2,ceph.type=block,ceph.vdo=0
  • RGW can be installed on servers outside the Ceph cluster; that requires the ceph-radosgw package, e.g. ceph-deploy install --rgw <rgw-node> [<rgw-node>...]. For convenience, RGW is installed directly on node3 and node4 below.

  • Add another mon node:

    ~$ ceph-deploy mon add node4
# node3 was already given the admin config earlier.
~$ ceph-deploy admin node4
~$ ceph-deploy rgw create node3 node4
[...]
~$ ansible -i ../hosts node3 -b -m shell -a "ps -ef | grep rgw"
192.168.99.103 | CHANGED | rc=0 >>
ceph 4272 1 0 02:05 ? 00:00:00 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node3 --setuser ceph --setgroup ceph
root 5040 5039 0 02:06 pts/0 00:00:00 /bin/sh -c ps -ef | grep rgw
root 5042 5040 0 02:06 pts/0 00:00:00 grep rgw

~$ ansible -i ../hosts node4 -b -m shell -a "ps -ef | grep rgw"
192.168.99.104 | CHANGED | rc=0 >>
ceph 3411 1 0 02:05 ? 00:00:00 /usr/bin/radosgw -f --cluster ceph --name client.rgw.node4 --setuser ceph --setgroup ceph
root 4211 4210 0 02:07 pts/0 00:00:00 /bin/sh -c ps -ef | grep rgw
root 4213 4211 0 02:07 pts/0 00:00:00 grep rgw

# Test HTTP access
~$ curl node3:7480
<?xml version="1.0" encoding="UTF-8"?><ListAllMyBucketsResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/"><Owner><ID>anonymous</ID><DisplayName></DisplayName></Owner><Buckets></Buckets></ListAllMyBucketsResult>
  • Change RGW's default port by adding the following to ceph.conf:
[client.rgw.node4]
# rgw_frontends = "civetweb port=80"
rgw_frontends = civetweb port=80+443s ssl_certificate=/etc/ceph/private/keyandcert.pem
  • Push the configuration file from the current directory to the given node:
    ~$ ceph-deploy --overwrite-conf config push node4
    ~$ ansible -i hosts node4 -b -m systemd -a "name=ceph-radosgw@rgw.node4 state=restarted daemon_reload=yes"

civetweb configuration

  • If the cluster was not deployed with ceph-deploy, RGW has to be added manually with the following steps:

    # Create the keyring
    ~$ sudo ceph-authtool --create-keyring /etc/ceph/ceph.client.radosgw.keyring
    # Generate a key
    ~$ sudo ceph-authtool /etc/ceph/ceph.client.radosgw.keyring -n client.rgw.node3 --gen-key
    # Set the capabilities
    ~$ sudo ceph-authtool -n client.rgw.node3 --cap osd 'allow rwx' --cap mon 'allow rwx' /etc/ceph/ceph.client.radosgw.keyring
    # Import the keyring into the cluster
    ~$ sudo ceph -k /etc/ceph/ceph.client.admin.keyring auth add client.rgw.node3 -i /etc/ceph/ceph.client.radosgw.keyring
    ~$ cat /etc/ceph/ceph.conf
    [...]
    [client.rgw.node3]
    rgw_frontends = civetweb port=80
    host=node3
    rgw_s3_auth_use_keystone=false
    keyring=/etc/ceph/ceph.client.radosgw.keyring
    log file=/var/log/ceph/client.radosgw.gateway.log
  • Since this cluster was deployed with ceph-deploy, it is enough to export the corresponding key into /etc/ceph/ceph.client.radosgw.keyring:

    ~$ sudo ceph auth get client.rgw.node3
    exported keyring for client.rgw.node3
    # Copy the entry below into /etc/ceph/ceph.client.radosgw.keyring
    [client.rgw.node3]
    key = AQC8FNVcl07ALRAAfhr+APpuKW/VvknEzD7hpg==
    caps mon = "allow rw"
    caps osd = "allow rwx"
  • Test access:

    ~$ sudo radosgw --cluster ceph --name client.rgw.node3 --setuser ceph --setgroup ceph -d --debug_ms 1 --keyring /etc/ceph/ceph.client.radosgw.keyring
  • If it starts up cleanly, restart its service with systemctl restart ceph-radosgw@rgw.node3; if the service misbehaves, inspect it with journalctl -u ceph-radosgw@rgw.node3.

Client Access

Creating an S3 user

$ ansible -i ../hosts node4 -b -m command -a "radosgw-admin user create --uid=\"lcy\" --display-name=\"admin user test\""
192.168.99.104 | CHANGED | rc=0 >>
{
"user_id": "lcy",
"display_name": "admin user test",
"email": "",
"suspended": 0,
"max_buckets": 1000,
"auid": 0,
"subusers": [],
"keys": [
{
"user": "lcy",
"access_key": "74I2DQ89N5EL1OGCCSCV", # s3cmd必须提供
"secret_key": "ePz9ONOrZS4BB8RN44KBYxCzRA0UNz8Kyu5kXzvE" # s3cmd必须提供
}
],
"swift_keys": [],
"caps": [],
"op_mask": "read, write, delete",
"default_placement": "",
"placement_tags": [],
"bucket_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"user_quota": {
"enabled": false,
"check_on_raw": false,
"max_size": -1,
"max_size_kb": 0,
"max_objects": -1
},
"temp_url_keys": [],
"type": "rgw"
}
~$ ansible -i ../hosts node4 -b -m command -a "radosgw-admin user list"
192.168.99.104 | CHANGED | rc=0 >>
[
"testuser",
"lcy"
]

Creating a Swift user

  • A Swift user is created as a subuser, so the parent user (here: lcy) must exist first, as shown below:
    ~$ ansible -i ../hosts node4 -b -m command -a "radosgw-admin subuser create --uid=lcy --subuser=lcy:swift --access=full"
    192.168.99.104 | CHANGED | rc=0 >>
    {
    "user_id": "lcy",
    "display_name": "admin user test",
    "email": "",
    "suspended": 0,
    "max_buckets": 1000,
    "auid": 0,
    "subusers": [
    {
    "id": "lcy:swift",
    "permissions": "full-control"
    }
    ],
    "keys": [
    {
    "user": "lcy",
    "access_key": "74I2DQ89N5EL1OGCCSCV",
    "secret_key": "ePz9ONOrZS4BB8RN44KBYxCzRA0UNz8Kyu5kXzvE"
    }
    ],
    "swift_keys": [
    {
    "user": "lcy:swift",
    "secret_key": "bw2zByEnhZMzpSvrb9tYi5rjOT8mK69SkuuWFN8j"
    }
    ],
    "caps": [],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "placement_tags": [],
    "bucket_quota": {
    "enabled": false,
    "check_on_raw": false,
    "max_size": -1,
    "max_size_kb": 0,
    "max_objects": -1
    },
    "user_quota": {
    "enabled": false,
    "check_on_raw": false,
    "max_size": -1,
    "max_size_kb": 0,
    "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw"
    }

Testing with the Python client libraries

~$ pip install boto python-swiftclient
~$ ipython
In [1]: access_key = '74I2DQ89N5EL1OGCCSCV'
In [2]: secret_key = 'ePz9ONOrZS4BB8RN44KBYxCzRA0UNz8Kyu5kXzvE'
In [3]: import boto.s3.connection
In [4]: conn = boto.connect_s3(aws_access_key_id=access_key,aws_secret_access_key=secret_key,host='192.168.99.103',port=7480,is_secure=False,calling_format=boto.s3.connection.OrdinaryCallingFormat())
In [5]: bkt = conn.create_bucket('ooo-bucket')
In [6]: for bkt in conn.get_all_buckets():
...: print("{name} {created}".format(name=bkt.name,created=bkt.creation_date))
...: # bucket created and listed successfully.
ooo-bucket 2019-05-10T07:08:26.456Z
# Test with the Swift client.
~$ swift -A http://node4:7480/auth/1.0 -U lcy:swift -K 'bw2zByEnhZMzpSvrb9tYi5rjOT8mK69SkuuWFN8j' list
ooo-bucket

Testing with s3cmd

  • Before using s3cmd you would normally run s3cmd --configure to set the parameters interactively. Here that is skipped and the necessary connection parameters are written directly. Either RGW node (node3 or node4) can be used for the test.

    ~$ sudo apt install s3cmd
    ~$ cat <<EOF > ~/.s3cfg
    [default]
    access_key = 74I2DQ89N5EL1OGCCSCV
    host_base = node3:7480
    host_bucket = node3:7480/%(bucket)
    secret_key = ePz9ONOrZS4BB8RN44KBYxCzRA0UNz8Kyu5kXzvE
    cloudfront_host = node3:7480
    use_https = False
    bucket_location = US
    EOF

    # List all buckets
    ~$ s3cmd ls
    2019-05-10 07:08 s3://ooo-bucket
    # Create a bucket
    ~$ s3cmd mb s3://sql
    # Upload a file into the bucket
    ~$ s3cmd put ~/wxdb-20190422-1638.sql s3://sql
    upload: '/home/lcy/wxdb-20190422-1638.sql' -> 's3://sql/wxdb-20190422-1638.sql' [1 of 1]
    197980 of 197980 100% in 1s 104.33 kB/s done
    # List the files in the bucket
    ~$ s3cmd ls s3://sql
    2019-05-10 08:12 197980 s3://sql/wxdb-20190422-1638.sql
    # Download a file from the bucket to the local machine.
    ~$ s3cmd get s3://sql/wxdb-20190422-1638.sql
    download: 's3://sql/wxdb-20190422-1638.sql' -> './wxdb-20190422-1638.sql' [1 of 1]
    197980 of 197980 100% in 0s 57.23 MB/s done
  • Check the cluster utilization statistics:

    ~$ ansible -i ../hosts node2 -b -m command -a "ceph df"
    192.168.99.102 | CHANGED | rc=0 >>
    GLOBAL:
    SIZE AVAIL RAW USED %RAW USED
    20.7GiB 18.7GiB 2.01GiB 9.72
    POOLS:
    NAME ID USED %USED MAX AVAIL OBJECTS
    .rgw.root 1 2.08KiB 0 5.83GiB 6
    default.rgw.control 2 0B 0 5.83GiB 8
    default.rgw.meta 3 2.13KiB 0 5.83GiB 12
    default.rgw.log 4 0B 0 5.83GiB 207
    default.rgw.buckets.index 5 0B 0 5.83GiB 3
    default.rgw.buckets.data 6 193KiB 0 5.83GiB 1

Mounting with s3fs-fuse

  • s3fs-fuse
  • Reference on using AWS S3 file storage with Django
  • The original plan was to mount a Ceph S3 bucket directly into Docker as a volume, but that has not worked so far. Below, the bucket is mounted on the host instead and then passed into Docker with -v.
    ~$ sudo apt install s3fs fuse

    # This can also be placed in /etc/passwd-s3fs
    ~$ echo ACCESS_KEY_ID:SECRET_ACCESS_KEY > ${HOME}/.passwd-s3fs && chmod 600 ${HOME}/.passwd-s3fs
    ~$ s3cmd ls
    2019-05-10 08:10 s3://iso
    2019-05-16 03:50 s3://media # mounted as a directory below.
    2019-05-10 07:08 s3://ooo-bucket
    2019-05-16 06:44 s3://public
    2019-05-10 08:10 s3://sql

    # Note: Ceph S3 requires the use_path_request_style option, since it is not native AWS.
    ~$ s3fs media /data/s3fs -o allow_other,umask=022,use_path_request_style,url=http://node3

    ~$ df -h | grep s3fs
    s3fs 256T 0 256T 0% /data/s3fs

    ~$ grep s3fs /etc/mtab
    s3fs /data/s3fs fuse.s3fs rw,nosuid,nodev,relatime,user_id=1000,group_id=120,allow_other 0 0

    # If mounted without umask, the default is 0000 and nothing is accessible.
    ~$ ls -l /data/s3fs/
    total 9397
    drwxr-xr-x 1 root root 0 Jan 1 1970 hls
    -rwxr-xr-x 1 root root 3100721 May 16 14:24 video.mp4
  • For debugging, start it with -o dbglevel=info -f -o curldbg; for the other features, see its GitHub page and its help output.
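  • A hedged sketch of such a debug run (the bucket media, mount point /data/s3fs and URL http://node3 are carried over from the example above):

    # Run s3fs in the foreground with verbose library and curl logging.
    ~$ s3fs media /data/s3fs -o allow_other,umask=022,use_path_request_style,url=http://node3 \
          -o dbglevel=info -f -o curldbg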

Warnings and errors

  • Reference documentation

    $ ansible -i ../hosts node1 -b -m command -a "ceph -s"
    192.168.99.101 | CHANGED | rc=0 >>
    cluster:
    id: 0bf150da-b691-4382-bf3d-600e90c19fba
    health: HEALTH_WARN
    Degraded data redundancy: 237/711 objects degraded (33.333%), 27 pgs degraded, 48 pgs undersized

    services:
    mon: 4 daemons, quorum node1,node2,node3,node4
    mgr: node1(active), standbys: node2, node3
    osd: 2 osds: 2 up, 2 in
    rgw: 2 daemons active

    data:
    pools: 6 pools, 48 pgs
    objects: 237 objects, 198KiB
    usage: 2.01GiB used, 18.7GiB / 20.7GiB avail
    pgs: 237/711 objects degraded (33.333%)
    27 active+undersized+degraded
    21 active+undersized
  • Given the warning above about degraded PGs, the first attempt was to restart the OSD service on the nodes with systemctl restart ceph-osd.target and check again.

  • On closer inspection, the cause is that the replica count is 3 while only two OSDs exist, which is why the degraded warning appears. Either lower the replica count to 2 (see the sketch after the output below) or add another OSD node.

  • Below, just as on node2, a 20 GB disk is added, split into two partitions, and used to create an OSD in the (block, block.wal) layout.

~$ ceph-deploy osd create node3 --data /dev/sdb2 --block-wal /dev/sdb1

~$ ansible -i ../hosts node1 -b -m command -a "ceph osd tree"
192.168.99.101 | CHANGED | rc=0 >>
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.03058 root default
-3 0.00980 host node1
0 hdd 0.00980 osd.0 up 1.00000 1.00000
-5 0.01039 host node2
1 hdd 0.01039 osd.1 up 1.00000 1.00000
-7 0.01039 host node3
2 hdd 0.01039 osd.2 up 1.00000 1.00000

# Check again: the status is back to normal.
~$ ansible -i ../hosts node1 -b -m command -a "ceph health"
192.168.99.101 | CHANGED | rc=0 >>
HEALTH_OK
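  • The other fix mentioned above, lowering the replica count, would look roughly like this (a hedged sketch; the pool name kube is an assumption, apply it to whichever pools report degraded PGs):

    # Keep only 2 replicas and allow writes with a single healthy copy.
    ~$ sudo ceph osd pool set kube size 2
    ~$ sudo ceph osd pool set kube min_size 1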
  • Ceph: HEALTH_WARN clock skew detected
# Enable and start ntp on every node at boot.
~$ ansible -i hosts all -b -m systemd -a "name=ntp enabled=yes state=started"
  • Handling the application not enabled on 1 pool(s) warning
~$ sudo ceph health detail
HEALTH_WARN application not enabled on 1 pool(s)
POOL_APP_NOT_ENABLED application not enabled on 1 pool(s)
application not enabled on pool 'kube'
use 'ceph osd pool application enable <pool-name> <app-name>', where <app-name> is 'cephfs', 'rbd', 'rgw', or freeform for custom applications.
~$ sudo ceph osd pool application enable kube rbd
enabled application 'rbd' on pool 'kube'
  • After installation, the listening services on each node are as follows:
    $ ansible -i hosts node -b -m command -a "netstat -tnlp"
    192.168.99.102 | CHANGED | rc=0 >>
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
    tcp 0 0 192.168.99.102:6789 0.0.0.0:* LISTEN 476/ceph-mon
    tcp 0 0 192.168.99.102:6800 0.0.0.0:* LISTEN 875/ceph-osd
    tcp 0 0 192.168.99.102:6801 0.0.0.0:* LISTEN 875/ceph-osd
    tcp 0 0 192.168.99.102:6802 0.0.0.0:* LISTEN 875/ceph-osd
    tcp 0 0 192.168.99.102:6803 0.0.0.0:* LISTEN 875/ceph-osd
    tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 529/sshd
    tcp6 0 0 :::22 :::* LISTEN 529/sshd

    192.168.99.101 | CHANGED | rc=0 >>
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
    tcp 0 0 192.168.99.101:6789 0.0.0.0:* LISTEN 480/ceph-mon
    tcp 0 0 192.168.99.101:6800 0.0.0.0:* LISTEN 1015/ceph-osd
    tcp 0 0 192.168.99.101:6801 0.0.0.0:* LISTEN 1015/ceph-osd
    tcp 0 0 192.168.99.101:6802 0.0.0.0:* LISTEN 1015/ceph-osd
    tcp 0 0 192.168.99.101:6803 0.0.0.0:* LISTEN 1015/ceph-osd
    tcp 0 0 192.168.99.101:6804 0.0.0.0:* LISTEN 476/ceph-mgr
    tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 537/sshd
    tcp6 0 0 :::22 :::* LISTEN 537/sshd

    192.168.99.103 | CHANGED | rc=0 >>
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
    tcp 0 0 192.168.99.103:6789 0.0.0.0:* LISTEN 479/ceph-mon
    tcp 0 0 192.168.99.103:6800 0.0.0.0:* LISTEN 965/ceph-osd
    tcp 0 0 192.168.99.103:6801 0.0.0.0:* LISTEN 965/ceph-osd
    tcp 0 0 192.168.99.103:6802 0.0.0.0:* LISTEN 965/ceph-osd
    tcp 0 0 192.168.99.103:6803 0.0.0.0:* LISTEN 965/ceph-osd
    tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 527/sshd
    tcp 0 0 0.0.0.0:7480 0.0.0.0:* LISTEN 480/radosgw
    tcp6 0 0 :::22 :::* LISTEN 527/sshd

    192.168.99.104 | CHANGED | rc=0 >>
    Active Internet connections (only servers)
    Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
    tcp 0 0 192.168.99.104:6789 0.0.0.0:* LISTEN 445/ceph-mon
    tcp 0 0 0.0.0.0:80 0.0.0.0:* LISTEN 447/radosgw
    tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 515/sshd
    tcp6 0 0 :::22 :::* LISTEN 515/sshd

Kubernetes Integration

Creating RBD images

  • RBD operations have to be run on the servers directly; ceph-deploy does not provide an interface for them. Ansible can be used to run them remotely.

    # For how to calculate a pool's PG count, see https://ceph.com/pgcalc/
    ~$ sudo ceph osd pool create kube 64 64
    pool 'kube' created

    # Set the pool's replica count
    ~$ sudo ceph osd pool set kube size 2

    ~$ sudo ceph osd lspools
    1 .rgw.root,2 default.rgw.control,3 default.rgw.meta,4 default.rgw.log,5 default.rgw.buckets.index,6 default.rgw.buckets.data,7 volumes,8 kube,

    ~$ sudo rbd create kube/cephimage2 --size 40960
    ~$ sudo rbd list kube
    cephimage2

    ~$ sudo rbd info kube/cephimage2
    rbd image 'cephimage2':
    size 40GiB in 10240 objects
    order 22 (4MiB objects)
    block_name_prefix: rbd_data.519a06b8b4567
    format: 2
    #
    features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
    flags:
    create_timestamp: Mon May 13 01:44:35 2019

    ~$ sudo rbd create kube/cephimage1 --size 10240

    # Grow cephimage1 from its original 10G to 20G
    ~$ sudo rbd resize kube/cephimage1 --size 20480

    ~$ sudo rbd create kube/cephimage3 --size 4096 --image-feature layering
  • By default, newly created RBD images have the layering, exclusive-lock, object-map, fast-diff and deep-flatten features enabled. Older Linux kernels do not support all of them (typically only layering), so with an old kernel the Pod creation fails with the error below; a hedged workaround sketch follows the error output.

MountVolume.WaitForAttach failed for volume "ceph-rbd-pv" : rbd: map failed exit status 6, rbd output: rbd: sysfs write failed RBD image feature set mismatch. Try disabling features unsupported by the kernel with "rbd feature disable". In some cases useful info is found in syslog - try "dmesg | tail". rbd: map failed: (6) No such device or address

~# dmesg
[1355258.253726] rbd: image foo: image uses unsupported features: 0x38
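  • A hedged sketch of the workaround: either create the image with only the layering feature (as done for cephimage3 above), or strip the unsupported features from an existing image (the image name kube/cephimage1 is carried over from the commands above):

    # Disable the features an older kernel cannot handle, keeping only layering.
    ~$ sudo rbd feature disable kube/cephimage1 deep-flatten fast-diff object-map exclusive-lock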
  • Create a Pod in the cluster:

    ~$ git clone https://github.com/kubernetes/examples.git
    ~$ cd examples/staging/volumes/rbd/
    ~$ tree
    .
    ├── rbd-with-secret.yaml
    ├── rbd.yaml
    ├── README.md
    └── secret
    └── ceph-secret.yaml
  • Edit rbd-with-secret.yaml so that it reads as follows:

    apiVersion: v1
    kind: Pod
    metadata:
      name: rbd2
    spec:
      containers:
        - image: busybox
          command: ["sleep", "60000"]
          name: rbd-rw
          volumeMounts:
            - name: rbdpd
              mountPath: /mnt/rbd
      volumes:
        - name: rbdpd
          rbd:
            monitors:
              - '192.168.99.101:6789'
              - '192.168.99.102:6789'
              - '192.168.99.103:6789'
              - '192.168.99.104:6789'
            pool: kube
            image: cephimage3
            fsType: ext4
            readOnly: false
            user: admin
            secretRef:
              name: ceph-secret
  • Edit ceph-secret, making sure to replace the key field in the file:

    ~$ ansible -i hosts node1 -b -m command -a "cat /etc/ceph/ceph.client.admin.keyring" | grep key | awk '{printf "%s",$NF}' | base64
    QVFESDB0UmNFSStwR3hBQUJ4aW1ZT1VXRWVTckdzSStpZklCOWc9PQ==
    ~$ cat secret/ceph-secret.yaml
    apiVersion: v1
    kind: Secret
    metadata:
      name: ceph-secret
    type: "kubernetes.io/rbd"
    data:
      key: QVFESDB0UmNFSStwR3hBQUJ4aW1ZT1VXRWVTckdzSStpZklCOWc9PQ== # taken from the command output above.
  • Create the Pod and the Secret:

    ~$ kubectl create -f secret/ceph-secret.yaml
    ~$ kubectl create -f rbd-with-secret.yaml

    ~$ kubectl get pods -o wide
    NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
    rbd2 1/1 Running 0 60m 10.244.1.2 node2 <none> <none>
    ~$ kubectl get secret
    NAME TYPE DATA AGE
    ceph-secret kubernetes.io/rbd 1 17h

    # Now it can be used just like a local disk.
    ~$ kubectl exec -it rbd2 -- df -h | grep -e "rbd0" -e "secret"
    /dev/rbd0 3.9G 16.0M 3.8G 0% /mnt/rbd
    tmpfs 498.2M 12.0K 498.2M 0% /var/run/secrets/kubernetes.io/serviceaccount
  • Create and test a PV and a PVC backed by RBD (a sketch of a Pod consuming the claim follows the output):

~$ cat rbd-pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ceph-rbd-pv
spec:
  capacity:
    storage: 4Gi
  accessModes:
    - ReadWriteOnce
  rbd:
    monitors:
      - '192.168.99.101:6789'
      - '192.168.99.102:6789'
      - '192.168.99.103:6789'
      - '192.168.99.104:6789'
    pool: kube
    image: cephimage1
    fsType: ext4
    readOnly: false
    user: admin
    secretRef:
      name: ceph-secret
  persistentVolumeReclaimPolicy: Recycle

~$ cat rbd-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ceph-rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi

~$ kubectl create -f rbd-pv.yaml
~$ kubectl create -f rbd-pvc.yaml
~$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
ceph-rbd-pv 4Gi RWO Recycle Bound default/ceph-rbd-pvc 17h
~$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
ceph-rbd-pvc Bound ceph-rbd-pv 4Gi RWO 17h
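  • A minimal sketch of a Pod that consumes the claim above (the pod name rbd-pvc-test and the busybox image are assumptions for illustration):

    ~$ cat <<EOF | kubectl create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: rbd-pvc-test            # hypothetical name
    spec:
      containers:
        - name: app
          image: busybox
          command: ["sleep", "60000"]
          volumeMounts:
            - name: data
              mountPath: /mnt/rbd
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: ceph-rbd-pvc  # binds to the PVC created above
    EOF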

The Ceph-Ansible Installation Approach

About the new Ethernet interface naming scheme

enp0s10:
  en  --> ethernet
  p0  --> bus number (0)
  s10 --> slot number (10)
  • If you are not comfortable with the new naming scheme, there are three ways to switch back to the old style:

  • You basically have three options:

    1. You disable the assignment of fixed names, so that the unpredictable kernel names are used again. For this, simply mask udev’s .link file for the default policy: ln -s /dev/null /etc/systemd/network/99-default.link
    2. You create your own manual naming scheme, for example by naming your interfaces “internet0”, “dmz0” or “lan0”. For that create your own .link files in /etc/systemd/network/, that choose an explicit name or a better naming scheme for one, some, or all of your interfaces. See systemd.link(5) for more information.
    3. You pass the net.ifnames=0 on the kernel command line
  • List the VMs:

    ~$ VBoxManage list vms
    "k8s-master" {7bfb1ca4-3ccc-4a1a-8548-7759424df181}
    "k8s-node1" {4c29c029-4f93-4463-b83d-4ae9e728e9df}
    "k8s-node2" {87a2196c-cf3c-472a-9ffa-f5b8c3e09009}
    "k8s-node3" {af9e34cf-a7c9-45d8-ad15-f37d409bcdac}
    "k8s-node4" {1f46e865-01c1-4a81-a947-cc267c744756}

    # Start the VMs above with VBoxHeadless so no window appears.
    ~$ VBoxHeadless --startvm k8s-master
  • The steps below follow the official site to install ceph-deploy, but apt cannot find a package named ceph-deploy.

    ~$ wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
    # Replace {ceph-stable-release} with a stable Ceph release name (e.g. cuttlefish, dumpling, emperor, firefly, nautilus, ...)
    ~$ echo deb http://download.ceph.com/debian-{ceph-stable-release}/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list
    ~$ sudo apt-get update && sudo apt-get install ceph-deploy
  • Below, the installation is converted into an Ansible playbook. Ceph also maintains ceph-ansible on GitHub, which the author has not used; it has close to 1k stars.

---
- name: Install base software
  hosts: all
  become: yes
  # user: root  # root could be used directly here, but once remote root login is disabled, sudo is required.
  tasks:
    # Reference: https://docs.ansible.com/ansible/latest/modules/command_module.html#command-module
    - name: Read the distribution codename
      command: lsb_release -sc
      register: result

    - name: Install the Ceph release key
      apt_key:
        url: https://download.ceph.com/keys/release.asc
        state: present

    # Reference: https://docs.ansible.com/ansible/latest/modules/apt_repository_module.html?highlight=add%20apt%20repository
    - name: ceph-deploy
      apt_repository:
        repo: deb http://download.ceph.com/debian-nautilus {{ result.stdout }} main
        state: present
        filename: ceph

    # Reference: https://docs.ansible.com/ansible/latest/modules/apt_key_module.html?highlight=apt%20key
    - name: Add the docker-ce public key
      apt_key:
        url: https://download.docker.com/linux/debian/gpg
        state: present

    # Reference: https://docs.ansible.com/ansible/latest/modules/apt_repository_module.html?highlight=add%20apt%20repository
    - name: docker-ce
      apt_repository:
        repo: deb [arch=amd64] https://download.docker.com/linux/debian {{ result.stdout }} stable
        state: present
        filename: docker-ce

    # Reference: https://docs.ansible.com/ansible/latest/modules/apt_module.html
    - name: Update the cache and install packages
      apt:
        name:
          ['ntp', 'ntpdate', 'ntp-doc', 'docker-ce', 'bridge-utils', 'ipvsadm']
        allow_unauthenticated: yes
        update_cache: yes

    # Reference: https://docs.ansible.com/ansible/latest/modules/lineinfile_module.html?highlight=sudoers
    # To use Ansible's sysctl module instead, see https://docs.ansible.com/ansible/latest/modules/sysctl_module.html?highlight=sysctl
    - name: Write the sysctl settings
      lineinfile:
        path: /etc/sysctl.d/80-k8s.conf
        create: yes
        line: '{{ item }}'
      with_items:
        - 'net.bridge.bridge-nf-call-ip6tables = 1'
        - 'net.bridge.bridge-nf-call-iptables = 1'
        - 'net.bridge.bridge-nf-call-arptables = 1'
        - 'net.ipv4.ip_forward = 1'

    - name: Reload sysctl
      command: sysctl --system

    - block:
        # Naming scheme references: https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
        # and https://major.io/2015/08/21/understanding-systemds-predictable-network-device-names/
        - name: Use the old-style NIC naming
          file:
            src: /dev/null
            dest: /etc/systemd/network/99-default.link
            state: link

        # The .link mask above did not seem to work on Debian; fall back to changing the kernel parameters below.
        - name: Update the kernel parameters
          lineinfile:
            path: /etc/default/grub
            regexp: '^GRUB_CMDLINE_LINUX='
            line: 'GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"'

        - name: Regenerate grub.cfg
          command: grub-mkconfig -o /boot/grub/grub.cfg

Installing the Kubernetes Master

~$ sudo kubeadm init --image-repository registry.cn-hangzhou.aliyuncs.com/google_containers  --kubernetes-version v1.14.1 --pod-network-cidr=10.244.0.0/16 --ignore-preflight-errors=NumCPU --apiserver-advertise-address=192.168.99.100
[...]
Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user: # Note: follow these steps in order; install the network add-on before joining the other nodes.

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.99.100:6443 --token ejtj7f.oth6on2k6y0qcj2k \
--discovery-token-ca-cert-hash sha256:d162721230250668a4296aca699867126314a9ecd2418f9c70110b6b02bd01de

# Continue by installing the network add-on.
~$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/a70459be0084506e4ec919aa1c114638878db11b/Documentation/kube-flannel.yml
  • By default the cluster runs kube-proxy in iptables mode; switching to ipvs mode is manual. Open the ConfigMap with kubectl -n kube-system edit cm kube-proxy, change mode: "" to mode: "ipvs", then delete the old pods with kubectl -n kube-system delete pod kube-proxy-xxx so new ones are created (a hedged one-liner follows).
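  • A hedged one-liner for recreating every kube-proxy pod at once (it relies on the standard k8s-app=kube-proxy label that kubeadm applies to the DaemonSet):

    # Delete all kube-proxy pods; the DaemonSet recreates them with the new mode.
    ~$ kubectl -n kube-system delete pod -l k8s-app=kube-proxy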
~$ kubectl -n kube-system logs kube-proxy-t27xd
I0514 06:33:30.681150 1 server_others.go:177] Using ipvs Proxier. ---> now using ipvs mode.
W0514 06:33:30.738710 1 proxier.go:381] IPVS scheduler not specified, use rr by default
I0514 06:33:30.747818 1 server.go:555] Version: v1.14.1
[...]

# View the ipvs table.
~$ sudo ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 172.17.0.1:32047 rr
TCP 192.168.99.100:32047 rr
TCP 10.0.2.15:32047 rr
TCP 10.96.0.1:443 rr
-> 192.168.99.100:6443 Masq 1 3 0
[...]
  • Use Ansible to join the nodes to the k8s cluster in bulk:

    ~$ cat hosts
    [master]
    192.168.99.100
    [node1]
    192.168.99.101
    [node2]
    192.168.99.102
    [node3]
    192.168.99.103
    [node4]
    192.168.99.104
    [node]
    192.168.99.101
    192.168.99.102
    192.168.99.103
    192.168.99.104

    ~$ ansible -i hosts node -b -m command -a "kubeadm join 192.168.99.100:6443 --token ejtj7f.oth6on2k6y0qcj2k --discovery-token-ca-cert-hash sha256:d162721230250668a4296aca699867126314a9ecd2418f9c70110b6b02bd01de"
  • Check the status from the master node:

    ~$ kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    k8s-master NotReady master 15h v1.14.1
    node1 NotReady <none> 15h v1.14.1
    node2 NotReady <none> 15h v1.14.1
    node3 NotReady <none> 15h v1.14.1
    node4 NotReady <none> 15h v1.14.1

    # Why are all the nodes in the NotReady state?
    ~$ kubectl get pods -n kube-system
    NAME READY STATUS RESTARTS AGE
    coredns-d5947d4b-kfhlp 0/1 Pending 0 15h
    coredns-d5947d4b-sq95j 0/1 Pending 0 15h
    etcd-k8s-master 1/1 Running 2 15h
    kube-apiserver-k8s-master 1/1 Running 2 15h
    kube-controller-manager-k8s-master 1/1 Running 2 15h
    kube-proxy-25vgp 1/1 Running 2 15h
    kube-proxy-75xjc 1/1 Running 1 15h
    kube-proxy-bvdh6 1/1 Running 1 15h
    kube-proxy-lzp8m 1/1 Running 1 15h
    kube-proxy-wnmwk 1/1 Running 1 15h
    kube-scheduler-k8s-master 1/1 Running 2 15h

    # Why is coredns Pending?
    ~$ kubectl describe pod coredns -n kube-system
    [...]
    Events:
    Type Reason Age From Message
    ---- ------ ---- ---- -------
    Warning FailedScheduling 10m (x49 over 81m) default-scheduler 0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate.
    Warning FailedScheduling 75s (x4 over 5m21s) default-scheduler 0/5 nodes are available: 5 node(s) had taints that the pod didn't tolerate.

    # Check the system journal
    ~$ sudo journalctl -u kubelet
    # It turns out the network add-on had not been installed.
    ~$ kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/a70459be0084506e4ec919aa1c114638878db11b/Documentation/kube-flannel.yml

Building with Rook

  • Ceph and Rook integration
    (Figure: Ceph/Rook integration)
  • Rook architecture
    (Figure: Rook architecture)

Installing Ceph

~$ git clone https://github.com/rook/rook
~$ cd rook/cluster/examples/kubernetes/ceph/
~$ kubectl create -f common.yaml
~$ kubectl create -f operator.yaml
~$ kubectl create -f cluster.yaml
~$ kubectl -n rook-ceph get pods
NAME READY STATUS RESTARTS AGE
rook-ceph-agent-f7ln5 1/1 Running 0 5m36s
rook-ceph-agent-fzztf 1/1 Running 0 5m36s
rook-ceph-agent-mgqk6 1/1 Running 0 5m36s
rook-ceph-agent-qdbmh 1/1 Running 0 5m36s
rook-ceph-agent-twsvp 1/1 Running 0 5m36s
rook-ceph-operator-775cf575c5-8k44f 1/1 Running 1 6m30s
rook-discover-d4btd 1/1 Running 0 5m36s
rook-discover-fbq9w 1/1 Running 0 5m36s
rook-discover-gcksv 1/1 Running 0 5m36s
rook-discover-hnbdj 1/1 Running 0 5m36s
rook-discover-j5x5h 1/1 Running 0 5m36s

Tearing Down Rook

~$ cat remove-nodes-rooks-containers.sh
for i in `seq 1 4`; do
for n in `ansible -i hosts node$i -b -m command -a "docker ps -a" | awk 'NR>2 {print $1}'`;do
#ansible -i hosts node$i -b -m command -a "docker stop $n ; docker rm $n";
ansible -i hosts node$i -b -m command -a "docker rm $n";
done;
done

~$ cat remove-rook-cluster-data.sh
ansible -i hosts all -b -m file -a "path=/var/lib/rook state=absent"
ansible -i hosts all -b -m file -a "path=/etc/kubernetes state=absent"
ansible -i hosts all -b -m file -a "path=/var/lib/kubelet state=absent"

Errors

~$ kubectl -n rook-ceph get pod
NAME READY STATUS RESTARTS AGE
rook-ceph-mon-a-f799d9cf6-xrg8f 0/1 Init:CrashLoopBackOff 6 8m46s
rook-ceph-mon-d-5dd7b4d56f-wwg8n 0/1 Init:CrashLoopBackOff 6 7m1s
rook-ceph-mon-f-7977bd98c9-9b6h4 0/1 Init:CrashLoopBackOff 5 5m19s

~$ kubectl -n rook-ceph describe pod rook-ceph-mon-a
[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m15s default-scheduler Successfully assigned rook-ceph/rook-ceph-mon-a-f799d9cf6-xrg8f to k8s-master
Normal Pulled 7m20s (x5 over 9m4s) kubelet, k8s-master Container image "rook/ceph:v0.9.3" already present on machine
Normal Created 7m19s (x5 over 9m2s) kubelet, k8s-master Created container config-init
Normal Started 7m18s (x5 over 8m59s) kubelet, k8s-master Started container config-init
Warning BackOff 3m52s (x26 over 8m52s) kubelet, k8s-master Back-off restarting failed container

~$ kubectl -n rook-ceph describe pod rook-ceph-mon-d
[...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m2s default-scheduler Successfully assigned rook-ceph/rook-ceph-mon-d-5dd7b4d56f-wwg8n to node1
Normal Pulled 6m15s (x5 over 7m45s) kubelet, node1 Container image "rook/ceph:v0.9.3" already present on machine
Normal Created 6m15s (x5 over 7m45s) kubelet, node1 Created container config-init
Normal Started 6m14s (x5 over 7m45s) kubelet, node1 Started container config-init
Warning BackOff 2m41s (x26 over 7m43s) kubelet, node1 Back-off restarting failed container
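  • The describe output only shows that the config-init init container keeps failing; a reasonable next step (a hedged sketch, reusing the pod name from the listing above) is to read that container's log directly:

    # Inspect the failing init container's own log for the actual error.
    ~$ kubectl -n rook-ceph logs rook-ceph-mon-a-f799d9cf6-xrg8f -c config-init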
