Full state report example
[root@overcloud-controller-1 heat-admin]# ceph health detail
HEALTH_ERR 1 full osd(s); 2 near full osd(s); too many PGs per OSD (556 > max 300); full flag(s) set
osd.6 is full at 95%
osd.3 is near full at 86%
osd.8 is near full at 86%
too many PGs per OSD (556 > max 300)
full flag(s) set
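Before changing anything, it can help to confirm which OSDs are actually the most utilized. A minimal check, assuming a Ceph release that provides the "ceph osd df" command (output omitted here; the exact columns vary by release):

[root@overcloud-controller-1 heat-admin]# ceph osd df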
First, you can try to reduce the weight of the full OSD:
[root@overcloud-controller-1 heat-admin]# ceph osd reweight 6 0.95
reweighted osd.6 to 0.95 (f333)
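As an alternative to reweighting a single OSD by hand, Ceph also provides "ceph osd reweight-by-utilization", which lowers the weight of all OSDs above a utilization threshold (120% of the average by default). A hedged sketch, with an assumed threshold of 110; verify the effect in a test environment before using it on a loaded cluster:

[root@overcloud-controller-1 heat-admin]# ceph osd reweight-by-utilization 110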
Note that more OSDs may enter the near full state, and even though the full flag is cleared, the cluster might not reach the desired state:
[root@overcloud-controller-1 heat-admin]# ceph health detail
HEALTH_WARN 4 pgs backfill_wait; 2 pgs backfilling; 6 pgs stuck unclean; recovery 18233/2728843 objects misplaced (0.668%); 4 near full osd(s); too many PGs per OSD (556 > max 300)
pg 7.bc is stuck unclean for 8526.609344, current state active+remapped+backfilling, last acting [3,7,6]
pg 6.ab is stuck unclean for 19932.940385, current state active+remapped+wait_backfill, last acting [8,5,6]
pg 6.9a is stuck unclean for 20638.188059, current state active+remapped+wait_backfill, last acting [3,2,6]
pg 7.12 is stuck unclean for 17999.803348, current state active+remapped+backfilling, last acting [4,2,6]
pg 6.85 is stuck unclean for 20918.188965, current state active+remapped+wait_backfill, last acting [6,2,8]
pg 7.85 is stuck unclean for 20746.597906, current state active+remapped+wait_backfill, last acting [7,6,4]
pg 7.85 is active+remapped+wait_backfill, acting [7,6,4]
pg 6.85 is active+remapped+wait_backfill, acting [6,2,8]
pg 7.12 is active+remapped+backfilling, acting [4,2,6]
pg 6.9a is active+remapped+wait_backfill, acting [3,2,6]
pg 6.ab is active+remapped+wait_backfill, acting [8,5,6]
pg 7.bc is active+remapped+backfilling, acting [3,7,6]
recovery 18233/2728843 objects misplaced (0.668%)
osd.1 is near full at 85%
osd.3 is near full at 86%
osd.6 is near full at 91%
osd.8 is near full at 86%
too many PGs per OSD (556 > max 300)
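The backfill triggered by the reweight can be followed until the misplaced object count drops to zero, either with a one-shot status or with the live event stream:

[root@overcloud-controller-1 heat-admin]# ceph -s
[root@overcloud-controller-1 heat-admin]# ceph -w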
Check the replicated size of each pool:
[root@overcloud-controller-1 heat-admin]# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 33 flags hashpspool stripe_width 0
pool 2 'images' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 230 flags hashpspool stripe_width 0
pool 3 'manila_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 35 flags hashpspool stripe_width 0
pool 4 'manila_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 36 flags hashpspool stripe_width 0
pool 5 'metrics' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 37 flags hashpspool stripe_width 0
pool 6 'vms' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 229 flags hashpspool stripe_width 0
pool 7 'volumes' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 140 flags hashpspool stripe_width 0
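If only a single pool is of interest, the same value can be queried directly; for example, for the volumes pool:

[root@overcloud-controller-1 heat-admin]# ceph osd pool get volumes size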
The next step is to reduce the number of replicas on the "volumes" pool (the most utilized pool) from 3 to 2.
[root@overcloud-controller-1 heat-admin]# ceph osd pool set volumes size 2
set pool 7 size to 2
[root@overcloud-controller-1 heat-admin]# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 33 flags hashpspool stripe_width 0
pool 2 'images' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 230 flags hashpspool stripe_width 0
pool 3 'manila_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 35 flags hashpspool stripe_width 0
pool 4 'manila_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 36 flags hashpspool stripe_width 0
pool 5 'metrics' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 37 flags hashpspool stripe_width 0
pool 6 'vms' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 229 flags hashpspool stripe_width 0
pool 7 'volumes' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 254 flags hashpspool stripe_width 0
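Keep in mind that with size 2 and min_size 2 the pool stops serving I/O for a placement group as soon as one of its two copies becomes unavailable, so this is a trade-off to accept only as a temporary measure. It is worth checking the current min_size of the pool before relying on the reduced replication:

[root@overcloud-controller-1 heat-admin]# ceph osd pool get volumes min_size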
The utilization of the volumes pool has dropped from 92% to 55%, which is sufficient to buy time to restore the service and carry out cleanup actions, together with the option of adding extra OSDs.
[root@overcloud-controller-1 heat-admin]# ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    11172G     3990G     7182G        64.29
POOLS:
    NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
    rbd                 0      0         0         1200G         0
    backups             1      0         0         1200G         0
    images              2      380G      24.06     1200G         113193
    manila_data         3      0         0         1200G         0
    manila_metadata     4      0         0         1200G         0
    metrics             5      599M      0.05      1200G         142613
    vms                 6      500G      29.42     1200G         129025
    volumes             7      2288G     55.95     1801G         588571
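Finally, re-check the cluster health to confirm that the full flag has cleared and that only the remaining warnings (near full OSDs, too many PGs per OSD) are left to address:

[root@overcloud-controller-1 heat-admin]# ceph health detail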