Full state report example
[root@overcloud-controller-1 heat-admin]# ceph health detail
HEALTH_ERR 1 full osd(s); 2 near full osd(s); too many PGs per OSD (556 > max 300); full flag(s) set
osd.6 is full at 95%
osd.3 is near full at 86%
osd.8 is near full at 86%
too many PGs per OSD (556 > max 300)
full flag(s) set
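Before changing anything, it can help to confirm which OSDs are actually the most utilized. A minimal check, assuming a Ceph release that provides the "ceph osd df" command (output omitted here; the exact columns vary by release):

[root@overcloud-controller-1 heat-admin]# ceph osd df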
First, you can try to reduce the weight of the full OSD:
[root@overcloud-controller-1 heat-admin]# ceph osd reweight 6 0.95
reweighted osd.6 to 0.95 (f333)
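As an alternative to reweighting a single OSD by hand, Ceph also provides "ceph osd reweight-by-utilization", which lowers the weight of all OSDs above a utilization threshold (120% of the average by default). A hedged sketch, with an assumed threshold of 110; verify the effect in a test environment before using it on a loaded cluster:

[root@overcloud-controller-1 heat-admin]# ceph osd reweight-by-utilization 110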
Note that more OSDs may enter the near full state, and even though the full flag is cleared, the cluster might not reach the desired state:
[root@overcloud-controller-1 heat-admin]# ceph health detail
HEALTH_WARN 4 pgs backfill_wait; 2 pgs backfilling; 6 pgs stuck unclean; recovery 18233/2728843 objects misplaced (0.668%); 4 near full osd(s); too many PGs per OSD (556 > max 300)
pg 7.bc is stuck unclean for 8526.609344, current state active+remapped+backfilling, last acting [3,7,6]
pg 6.ab is stuck unclean for 19932.940385, current state active+remapped+wait_backfill, last acting [8,5,6]
pg 6.9a is stuck unclean for 20638.188059, current state active+remapped+wait_backfill, last acting [3,2,6]
pg 7.12 is stuck unclean for 17999.803348, current state active+remapped+backfilling, last acting [4,2,6]
pg 6.85 is stuck unclean for 20918.188965, current state active+remapped+wait_backfill, last acting [6,2,8]
pg 7.85 is stuck unclean for 20746.597906, current state active+remapped+wait_backfill, last acting [7,6,4]
pg 7.85 is active+remapped+wait_backfill, acting [7,6,4]
pg 6.85 is active+remapped+wait_backfill, acting [6,2,8]
pg 7.12 is active+remapped+backfilling, acting [4,2,6]
pg 6.9a is active+remapped+wait_backfill, acting [3,2,6]
pg 6.ab is active+remapped+wait_backfill, acting [8,5,6]
pg 7.bc is active+remapped+backfilling, acting [3,7,6]
recovery 18233/2728843 objects misplaced (0.668%)
osd.1 is near full at 85%
osd.3 is near full at 86%
osd.6 is near full at 91%
osd.8 is near full at 86%
too many PGs per OSD (556 > max 300)
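The backfill triggered by the reweight can be followed until the misplaced object count drops to zero, either with a one-shot status or with the live event stream:

[root@overcloud-controller-1 heat-admin]# ceph -s
[root@overcloud-controller-1 heat-admin]# ceph -w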
Check the replicated size of each pool:
[root@overcloud-controller-1 heat-admin]# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 33 flags hashpspool stripe_width 0
pool 2 'images' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 230 flags hashpspool stripe_width 0
pool 3 'manila_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 35 flags hashpspool stripe_width 0
pool 4 'manila_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 36 flags hashpspool stripe_width 0
pool 5 'metrics' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 37 flags hashpspool stripe_width 0
pool 6 'vms' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 229 flags hashpspool stripe_width 0
pool 7 'volumes' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 140 flags hashpspool stripe_width 0
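If only a single pool is of interest, the same value can be queried directly; for example, for the volumes pool:

[root@overcloud-controller-1 heat-admin]# ceph osd pool get volumes size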
The next step is to reduce the number of replicas on the "volumes" pool (the most utilized pool) from 3 to 2.
[root@overcloud-controller-1 heat-admin]# ceph osd pool set volumes size 2
set pool 7 size to 2
[root@overcloud-controller-1 heat-admin]# ceph osd dump | grep 'replicated size'
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'backups' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 33 flags hashpspool stripe_width 0
pool 2 'images' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 230 flags hashpspool stripe_width 0
pool 3 'manila_data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 35 flags hashpspool stripe_width 0
pool 4 'manila_metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 36 flags hashpspool stripe_width 0
pool 5 'metrics' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 37 flags hashpspool stripe_width 0
pool 6 'vms' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 229 flags hashpspool stripe_width 0
pool 7 'volumes' replicated size 2 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 254 flags hashpspool stripe_width 0
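Keep in mind that with size 2 and min_size 2 the pool stops serving I/O for a placement group as soon as one of its two copies becomes unavailable, so this is a trade-off to accept only as a temporary measure. It is worth checking the current min_size of the pool before relying on the reduced replication:

[root@overcloud-controller-1 heat-admin]# ceph osd pool get volumes min_size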
The utilization of the volumes pool has dropped from 92% to 55%, which is sufficient to buy time to restore the service and carry out cleanup actions, together with the option of adding extra OSDs.
[root@overcloud-controller-1 heat-admin]# ceph df
GLOBAL:
    SIZE       AVAIL     RAW USED     %RAW USED
    11172G     3990G     7182G        64.29
POOLS:
    NAME                ID     USED      %USED     MAX AVAIL     OBJECTS
    rbd                 0      0         0         1200G         0
    backups             1      0         0         1200G         0
    images              2      380G      24.06     1200G         113193
    manila_data         3      0         0         1200G         0
    manila_metadata     4      0         0         1200G         0
    metrics             5      599M      0.05      1200G         142613
    vms                 6      500G      29.42     1200G         129025
    volumes             7      2288G     55.95     1801G         588571
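Finally, re-check the cluster health to confirm that the full flag has cleared and that only the remaining warnings (near full OSDs, too many PGs per OSD) are left to address:

[root@overcloud-controller-1 heat-admin]# ceph health detail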