Marc's Public Blog - Linux Btrfs Blog Posts

 

I've been using btrfs since 2012, and while as of 2014 it's far from done, it has come a long way in that time. I thought I'd post some tips and scripts from things I've learned through my own use and from sharing with others, hence this page.
More generally, you'll find Btrfs documentation on the btrfs wiki.


  2014/03/23 Btrfs Raid5 Status
 

2014-03-23 00:00 in Btrfs, Linux

How to use Btrfs raid5/6

Since I couldn't find good documentation of where Btrfs raid5/raid6 stands, I ran some tests and, with some help from mailing list members, can write this page now.

This is as of kernel 3.14 with btrfs-tools 3.12. If you are using a kernel, and especially tools, older than that, there is a good chance things will work less well.
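If you're not sure what you're running, checking is quick (a trivial sketch; output will vary per system):

uname -r          # running kernel; you want 3.14 or newer
btrfs --version   # btrfs-progs version; you want 3.12 or newer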

Btrfs raid5/6 in a nutshell

It is important to know that raid5/raid6 is more experimental than btrfs itself. Do not use this for production systems; if you do and things break, you were warned 🙂

If you're coming from the mdadm raid5 world, here's what you need to know:

  • Btrfs is still experimental, but raid5/6 is experimental even within btrfs (in other words, quite unfinished).
  • As of 3.14, it works if everything goes right, but the error handling is still lacking. Unexpected conditions are likely to cause unexpected failures. Buyer beware 🙂
  • scrub cannot fix issues with raid5/6 yet. This means that if you have any checksum problem, your filesystem will be in a bad state.
  • btrfs does not yet seem to know that a drive you removed from an array and plugged back in later is out of date. It will auto-add the out-of-date drive back to the array, and that will likely cause data loss by hiding files you had but the old drive didn't have. This means you should wipe a drive cleanly before you put it back into an array it used to be part of (see the wiping sketch after this list). See https://bugzilla.kernel.org/show_bug.cgi?id=72811
  • btrfs does not deal well with a drive that is present but not working. It does not know how to kick it from the array, nor can it be removed with btrfs device delete, because that requires reading from the drive that isn't working. As a result, btrfs will try to write to the bad drive forever. The solution is to unmount the array, remount it with the bad drive missing (the drive must not be visible to btrfs, or it'll get auto-added back), and then rebuild onto a new drive, or rebalance/shrink the array to be one drive smaller (this is explained below).
  • You can add and remove drives from an array and rebalance to grow/shrink it without unmounting it. Note that this is slow, since it forces a rewrite of all data blocks; it takes about 3 hours per 100GB (or 30 hours per terabyte) with 10 drives on a dual-core machine.
  • If you are missing a drive, btrfs will refuse to mount the array and will give an obscure error unless you mount with -o degraded.
  • btrfs has no special rebuild procedure. Rebuilding is done by rebalancing the array. You could actually rebalance a degraded array onto a smaller array by rebalancing without adding a drive, or you can add a drive and rebalance onto it; either forces a read/rewrite of all data blocks, which restripes them nicely.
  • btrfs replace does not work, but you can easily do btrfs device add followed by btrfs device delete of the other drive, and this accomplishes the same thing.
  • btrfs device add will not trigger an auto rebalance. You could choose not to rebalance existing data and only have new data striped properly.
  • btrfs device delete will force all data from the deleted drive to be rebalanced and the command completes when the drive has been freed up.
  • The magic command to delete a drive from an array when that drive is missing from the system is btrfs device delete missing . (the trailing "." being the mountpoint).
  • btrfs doesn't clearly tell you that your array is in degraded mode (run btrfs fi show, and it'll show a missing drive as well as how much of your data is still on it). This does mean you can have an array that is half degraded: half the files are striped over the current drives because they were written after the drive was removed, or were rewritten by a rebalance that hasn't finished, while the other half of your data is in degraded mode.
  • You can see this by looking at the amount of data on each drive: anything on drive 11 is properly striped 10 ways, while anything still on drive 3 is in degraded mode:
    polgara:~# btrfs fi show
    Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
            Total devices 11 FS bytes used 564.54GiB
            devid    1 size 465.76GiB used 63.14GiB path /dev/dm-0
            devid    2 size 465.76GiB used 63.14GiB path /dev/dm-1
            devid    3 size 465.75GiB used 30.00GiB path   <- this device is missing
            devid    4 size 465.76GiB used 63.14GiB path /dev/dm-2
            devid    5 size 465.76GiB used 63.14GiB path /dev/dm-3
            devid    6 size 465.76GiB used 63.14GiB path /dev/dm-4
            devid    7 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdi1
            devid    8 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdj1
            devid    9 size 465.76GiB used 63.14GiB path /dev/dm-7
            devid    10 size 465.76GiB used 63.14GiB path /dev/dm-8
            devid    11 size 465.76GiB used 33.14GiB path /dev/mapper/crypt_sde1 <- this device was added
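
    As promised in the list above, here's a minimal wiping sketch for a drive that used to be part of the array (the device name is a placeholder; double-check it before running, since this destroys the filesystem signatures on that drive):

    # remove all filesystem signatures so btrfs won't auto-add the stale drive back
    wipefs -a /dev/mapper/crypt_sdX1
    # or, more bluntly, zero out the start of the device
    dd if=/dev/zero of=/dev/mapper/crypt_sdX1 bs=1M count=10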

    Create a raid5 array

    polgara:/dev/disk/by-id# mkfs.btrfs -f -d raid5 -m raid5 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1
    

    WARNING! - Btrfs v3.12 IS EXPERIMENTAL
    WARNING! - see http://btrfs.wiki.kernel.org before using

    Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
    Turning ON incompat feature 'raid56': raid56 extended format
    adding device /dev/mapper/crypt_sdd1 id 2
    adding device /dev/mapper/crypt_sdf1 id 3
    adding device /dev/mapper/crypt_sdg1 id 4
    adding device /dev/mapper/crypt_sdh1 id 5
    adding device /dev/mapper/crypt_sdi1 id 6
    adding device /dev/mapper/crypt_sdj1 id 7
    adding device /dev/mapper/crypt_sdk1 id 8
    adding device /dev/mapper/crypt_sdl1 id 9
    fs created label backupcopy on /dev/mapper/crypt_sdb1
            nodesize 16384 leafsize 16384 sectorsize 4096 size 4.09TiB

    polgara:/dev/disk/by-id# mount -L backupcopy /mnt/btrfs_backupcopy
    polgara:/mnt/btrfs_backupcopy# df -h .
    Filesystem              Size  Used Avail Use% Mounted on
    /dev/mapper/crypt_sdb1  4.1T  3.0M  4.1T   1% /mnt/btrfs_backupcopy

    As another example, you could use -d raid5 -m raid1 to have metadata be raid1 while data is raid5. This specific combination isn't actually that useful; I'm just giving it as an example.
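
    Spelled out, that would look something like this (an untested sketch, reusing the device list from above):

    # raid5 for data, raid1 for metadata
    mkfs.btrfs -f -d raid5 -m raid1 -L backupcopy /dev/mapper/crypt_sd[bdfghijkl]1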

    Replacing a drive that hasn't failed yet on a running raid5 array

    btrfs replace does not work:

    polgara:/mnt/btrfs_backupcopy# btrfs replace start -r /dev/mapper/crypt_sem1 /dev/mapper/crypt_sdm1  .
    Mar 23 14:56:06 polgara kernel: [53501.511493] BTRFS warning (device dm-9): dev_replace cannot yet handle RAID5/RAID6

    No big deal, this can be done in 2 steps:

  • Add the new drive
polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sdm1 .
polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 32.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
        devid    4 size 465.76GiB used 32.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 32.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 32.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 32.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
        devid    11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1
        devid    12 size 465.75GiB used 0.00 path /dev/mapper/crypt_sdm1
  • btrfs device delete the drive you want to remove. This neatly triggers a rebalance that happens to use the new drive you just added:
polgara:/mnt/btrfs_backupcopy# btrfs device delete /dev/mapper/crypt_sde1 .
Mar 23 11:13:31 polgara kernel: [40145.908207] BTRFS info (device dm-9): relocating block group 945203314688 flags 129
Mar 23 14:51:51 polgara kernel: [53245.955444] BTRFS info (device dm-9): found 5576 extents
Mar 23 14:51:57 polgara kernel: [53251.874925] BTRFS info (device dm-9): found 5576 extents
polgara:/mnt/btrfs_backupcopy# 

Note that this is slow: 3.5 hours for just 115GB of data. It could take days for a terabyte array.
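If you want to keep an eye on the removal while it runs, something as simple as this works (a sketch; adjust the interval to taste). The 'used' figure of the drive being deleted shrinks toward zero as data migrates off it:

watch -n 60 btrfs fi show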

 

polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 10 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 13.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdd1
        devid    4 size 465.76GiB used 13.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 13.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 13.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 13.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 13.14GiB path /dev/mapper/crypt_sdl1
        devid    12 size 465.75GiB used 13.14GiB path /dev/mapper/crypt_sdm1

There we go: I'm back to 10 devices. It's almost as good as btrfs replace; it simply took 2 steps.

Replacing a missing drive on a running raid5 array

A normal mount will not work:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime LABEL=backupcopy /mnt/btrfs_backupcopy
mount: wrong fs type, bad option, bad superblock on /dev/mapper/crypt_sdj1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
Mar 21 22:29:45 polgara kernel: [ 2288.285068] BTRFS info (device dm-8): disk space caching is enabled
Mar 21 22:29:45 polgara kernel: [ 2288.285369] BTRFS: failed to read the system array on dm-8
Mar 21 22:29:45 polgara kernel: [ 2288.316067] BTRFS: open_ctree failed

So we do a mount with -o degraded:

polgara:~# mount -v -t btrfs -o compress=zlib,space_cache,noatime,degraded LABEL=backupcopy /mnt/btrfs_backupcopy
/dev/mapper/crypt_sdj1 on /mnt/btrfs_backupcopy type btrfs (rw,noatime,compress=zlib,space_cache,degraded)
Mar 21 22:29:51 polgara kernel: [ 2295.042421] BTRFS: device label backupcopy devid 8 transid 3446 /dev/mapper/crypt_sdj1
Mar 21 22:29:51 polgara kernel: [ 2295.065951] BTRFS info (device dm-8): allowing degraded mounts
Mar 21 22:29:51 polgara kernel: [ 2295.065955] BTRFS info (device dm-8): disk space caching is enabled
Mar 21 22:30:32 polgara kernel: [ 2336.189000] BTRFS: device label backupcopy devid 3 transid 8 /dev/dm-9
Mar 21 22:30:32 polgara kernel: [ 2336.203175] BTRFS: device label backupcopy devid 3 transid 8 /dev/dm-9

Then we add the new drive:

polgara:/mnt/btrfs_backupcopy# btrfs device add -f /dev/mapper/crypt_sde1 .
polgara:/mnt/btrfs_backupcopy# df .
/dev/dm-0       5.1T  565G  4.0T  13% /mnt/btrfs_backupcopy   <- bad: it should be 4.5T, but I get space for 11 drives

https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22balance.22_do.3F says:
"On a filesystem with damaged replication (e.g. a RAID-1 FS with a dead and removed disk), it will force the FS to rebuild the missing copy of the data on one of the currently active devices, restoring the RAID-1 capability of the filesystem."

See also: https://btrfs.wiki.kernel.org/index.php/Balance_Filters

If we have written data since the drive was removed, or if we are recovering from an unfinished balance, a filter on devid=3 tells balance to only rewrite data and metadata that have a chunk on missing device #3. (This is a good way to finish the balance in multiple passes if you have to reboot in between, or if the filesystem deadlocks during the balance, which unfortunately is still common as of kernel 3.14.)

 

polgara:/mnt/btrfs_backupcopy# btrfs balance start -ddevid=3 -mdevid=3 -v .
Mar 22 13:15:55 polgara kernel: [20275.690827] BTRFS info (device dm-9): relocating block group 941277446144 flags 130
Mar 22 13:15:56 polgara kernel: [20276.604760] BTRFS info (device dm-9): relocating block group 940069486592 flags 132
Mar 22 13:19:27 polgara kernel: [20487.196844] BTRFS info (device dm-9): found 52417 extents
Mar 22 13:19:28 polgara kernel: [20488.056749] BTRFS info (device dm-9): relocating block group 938861527040 flags 132
Mar 22 13:22:41 polgara kernel: [20681.588762] BTRFS info (device dm-9): found 70146 extents
Mar 22 13:22:42 polgara kernel: [20682.380957] BTRFS info (device dm-9): relocating block group 937653567488 flags 132
Mar 22 13:26:12 polgara kernel: [20892.816204] BTRFS info (device dm-9): found 71497 extents
Mar 22 13:26:14 polgara kernel: [20894.819258] BTRFS info (device dm-9): relocating block group 927989891072 flags 129

As balancing happens, data is taken out of devid 3 (the missing one) and added to devid 11 (the one that was added):

polgara:~# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 564.54GiB
        devid    1 size 465.76GiB used 63.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 63.14GiB path /dev/dm-1
        devid    3 size 465.75GiB used 30.00GiB path   <- this device is missing
        devid    4 size 465.76GiB used 63.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 63.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 63.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 63.14GiB path /dev/mapper/crypt_sdj1
        devid    9 size 465.76GiB used 63.14GiB path /dev/dm-7
        devid    10 size 465.76GiB used 63.14GiB path /dev/dm-8
        devid    11 size 465.76GiB used 33.14GiB path /dev/mapper/crypt_sde1 <- this device was added

You can see status with:

polgara:/mnt/btrfs_backupcopy# while :
> do
> btrfs balance status .
> sleep 60
> done
1 out of about 72 chunks balanced (2 considered),  99% left
2 out of about 72 chunks balanced (3 considered),  97% left
3 out of about 72 chunks balanced (4 considered),  96% left

At the end (and this can take hours to days), you get:

polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 11 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 32.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
        devid    3 size 465.75GiB used 0.00 path  <----  drive is freed up now.
        devid    4 size 465.76GiB used 32.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 32.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 32.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 32.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
        devid    11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1
Btrfs v3.12

But the array still shows 11 drives with one missing, and will not mount without -o degraded.
You drop the missing drive record with:

polgara:/mnt/btrfs_backupcopy# btrfs device delete missing .
polgara:/mnt/btrfs_backupcopy# btrfs fi show
Label: backupcopy  uuid: eed9b55c-1d5a-40bf-a032-1be6980648e1
        Total devices 10 FS bytes used 114.35GiB
        devid    1 size 465.76GiB used 32.14GiB path /dev/dm-0
        devid    2 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdd1
        devid    4 size 465.76GiB used 32.14GiB path /dev/dm-2
        devid    5 size 465.76GiB used 32.14GiB path /dev/dm-3
        devid    6 size 465.76GiB used 32.14GiB path /dev/dm-4
        devid    7 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdi1
        devid    8 size 465.76GiB used 32.14GiB path /dev/dm-6
        devid    9 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdk1
        devid    10 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sdl1
        devid    11 size 465.76GiB used 32.14GiB path /dev/mapper/crypt_sde1

And there we go, we're back in business!

From the above, you've also learned how to grow a raid5 array (add a drive, run a balance) or shrink it by one drive (just run btrfs device delete, and the automatic balance will restripe your entire array across n-1 drives). Both recipes are recapped below.
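To recap, here's what grow and shrink look like in one place (a sketch; the device and mount paths are placeholders, not from the sessions above):

# grow: add a drive, then restripe all existing data across the new width
btrfs device add /dev/mapper/crypt_sdX1 /mnt/btrfs_backupcopy
btrfs balance start /mnt/btrfs_backupcopy

# shrink: remove a drive; btrfs migrates its data off before the command returns
btrfs device delete /dev/mapper/crypt_sdX1 /mnt/btrfs_backupcopy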



