Tuning Xen for Performance

Tuning your Xen installation: recommended settings

Storage options

There are several choices for storage, but the IO performance inside the guest depends greatly on which one is used:

  • LVM: this is probably the simplest way to obtain good storage IO performance on Linux without much hassle.
  • ZFS ZVOLS: this is a more advanced configuration. ZFS has features such as ARC, L2ARC and ZIL that can provide much better performance than plain LVM volumes if properly configured and tuned. Please note that, due to ZFS memory requirements, the Dom0/Driver domain should be given at least 4GB of RAM in this case (or even more in order to increase performance).
  • iSCSI: the default toolstack in Xen supports using iSCSI disks as storage backends for guests. The performance of iSCSI depends greatly on the capacity of the server and the network components, but if configured properly it should provide performance similar to LVM or ZFS.
  • Files: using files as backends for guest storage is not recommended for performance reasons, but it has several benefits in terms of features, such as being able to use the raw, qcow, qcow2 or vhd formats to store guest disks.

See Storage_options for more details.

Memory

If the host has more memory than a typical laptop/desktop system, then do not rely on dom0 ballooning. Instead, fix the dom0 memory at somewhere between 1 and 4GB, for example by adding dom0_mem=1024M to the Xen command line.

1GB is enough for a pretty large host; more will be needed if you expect your users to use advanced storage types such as ZFS or distributed filesystems.

Dedicating a fixed amount of memory to dom0 is good for two reasons:

  • First, the (dom0) Linux kernel calculates various network related parameters based on the amount of memory available at boot time.
  • Second, Linux needs memory to store the memory metadata (per page info structures), and this allocation is also based on the amount of memory available at boot time.

Now, if you boot up the system with dom0 having all the memory visible to it, and then balloon down dom0 memory every time you start up a new guest, dom0 ends up with only a small fraction of its original (boot time) memory. This means the calculated parameters are no longer correct, and you end up wasting a lot of memory on metadata for memory you no longer have. Ballooning down a busy dom0 might also have bad side effects.
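
On a host booted through GRUB 2, the Xen command line is usually set in /etc/default/grub. The fragment below is a sketch under that assumption (the text above only names the dom0_mem option). The max: suffix is an addition to the value given above: it also caps the amount of memory the dom0 kernel sizes its structures for, which addresses the metadata problem just described.

```shell
# /etc/default/grub (read as shell variables when the GRUB config is generated)
# Fix dom0 at 1GB. The "max:" part is an addition to the option named in the
# text; it stops the dom0 kernel from sizing its metadata for more than 1GB.
GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=1024M,max:1024M"

# Afterwards, regenerate the GRUB configuration (for example with update-grub
# on Debian-style systems) and reboot for the setting to take effect.
```

With a fixed dom0 size it is also common to stop xl from ballooning dom0 automatically, for example with autoballoon="off" in /etc/xen/xl.conf; check the xl.conf manual page of your Xen version for the accepted values.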

Dom0 vCPUs

By default Dom0 gets as many vCPUs as there are CPUs on the physical host. This might be a good idea if your host only has 4 CPUs, but as systems get bigger there is no reason to assign that many vCPUs to Dom0, and reducing the number to something sensible can improve performance. The right number of vCPUs for Dom0 depends greatly on the host workload. For example, running HVM domains without stubdomains means that you can end up with a lot of Qemu instances in Dom0 that could be using quite some CPU, so in this case you should make sure that Dom0 has enough vCPUs assigned. In general you should not assign fewer than 4 vCPUs to Dom0, and you should then watch the load in Dom0 to make sure it can sustain the workload with the current assignment, or your guests will start to suffer performance degradation.

Another interesting approach is pinning Dom0 vCPUs to physical CPUs. This can be done by adding dom0_vcpus_pin to the Xen command line. Once Dom0 has booted you can see which CPUs the vCPUs have been pinned to and exclude other domains from running on those CPUs. This example uses the following command line options, "dom0_max_vcpus=4 dom0_vcpus_pin":

# xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                             0     0    0   -b-      28.8  0
Domain-0                             0     1    1   -b-      22.0  1
Domain-0                             0     2    2   r--      22.0  2
Domain-0                             0     3    3   -b-      22.2  3

Now we have to prevent other domains from using those CPUs (0 to 3), so that Dom0 vCPUs never have to be scheduled out. This is done by adding the following to the guest configuration file:

cpus="all,^0-3"

This will make the domain use all available CPUs except the ones currently pinned to Dom0:

# xl vcpu-list
Name                                ID  VCPU   CPU State   Time(s) CPU Affinity
Domain-0                             0     0    0   -b-      30.4  0
Domain-0                             0     1    1   r--      24.2  1
Domain-0                             0     2    2   -b-      23.4  2
Domain-0                             0     3    3   -b-      22.8  3
guest                                2     0    7   -b-       0.4  4-7
guest                                2     1    4   -b-       1.4  4-7
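
The exclusion list can also be derived from the xl vcpu-list output rather than typed by hand. The following is a minimal sketch, assuming the single-column "CPU Affinity" output format shown above and that each Dom0 vCPU is pinned to exactly one CPU. dom0_exclusion is a hypothetical helper name, and the several separate ^N exclusions it emits are assumed to be treated by xl like the single ^0-3 range used above.

```shell
# Build a guest cpus= line that excludes every pCPU listed in the
# "CPU Affinity" column (the last field) of the Domain-0 rows.
# Reads `xl vcpu-list` output on standard input.
dom0_exclusion() {
    awk '$1 == "Domain-0" { printf ",^%s", $NF }
         END { print "" }' |
        sed 's/^/cpus="all/; s/$/"/'
}
```

It would be used as `xl vcpu-list Domain-0 | dom0_exclusion`, and the printed line pasted into the guest configuration file. On Xen 4.5 and later the last column holds the soft affinity instead, so the field selection would need adjusting.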

Another option, for those who don't want to pin Dom0 to specific CPUs, is to increase the relative weight of Dom0 so that it gets scheduled more often than unprivileged domains. By default all guests in Xen (including Dom0) have a weight of 256. This might be a problem if all domains rely on Dom0 for IO, since Dom0 can easily become a bottleneck.

# xl sched-credit
Cpupool Pool-0: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
Domain-0                             0    256    0

An easy solution to this is to increase the weight of Dom0, while leaving the other domains with the default weight:

# xl sched-credit -d 0 -w 512
# xl sched-credit
Cpupool Pool-0: tslice=30ms ratelimit=1000us
Name                                ID Weight  Cap
Domain-0                             0    512    0
guest                                3    256    0

In this case Dom0 will get twice as much CPU time as a normal guest. See Credit_Scheduler for more information.

Tuning your Xen installation: advanced settings

HAP vs. shadow

HAP stands for hardware assisted paging and requires a CPU feature called EPT by Intel and RVI by AMD. It is used to manage the guest's MMU. The alternative is shadow paging, which is managed completely in software by Xen. With HAP, TLB misses are expensive, so workloads with very random memory access patterns pay a higher cost. With shadow paging, page table updates are expensive. HAP is enabled by default (and it is the recommended setting) but can be disabled by passing hap=0 in the VM config file.

PV vs PV on HVM

Linux, NetBSD, FreeBSD and Solaris can run as PV or PV on HVM guests. Memory intensive workloads that involve the continuous creation and destruction of page tables can perform better when run in a PV on HVM guest. Examples are kernbench and sql-bench. On the other hand, memory workloads that run on a quasi-static set of page tables run better in a PV guest. An example of this kind of workload is specjbb. See Xen_Linux_PV_on_HVM_drivers#Performance_Tradeoffs for more details. A basic PV guest config file looks like the following:

bootloader = "/usr/bin/pygrub"
memory = 1024
name = "linux"
vif = [ "bridge=xenbr0" ]
disk = [ "/root/images/debian_squeeze_amd64_standard.raw,raw,xvda,w" ]
root = "/dev/xvda1"

You can also specify a kernel and ramdisk path in the dom0 filesystem directly in the VM config file, to be used for the guest:

kernel = "/boot/vmlinuz"
ramdisk = "/boot/initrd"
memory = 1024
name = "linux"
vif = [ "bridge=xenbr0" ]
disk = [ "/images/debian_squeeze_amd64_standard.raw,raw,xvda,w" ]
root = "/dev/xvda1"

See this page for instructions on how to install a Debian PV guest.

HVM guests run in a fully emulated environment that looks like a normal PC from the inside. As a consequence an HVM config file is a bit different and cannot specify a kernel and a ramdisk. On the other hand it is possible to perform an HVM installation from an emulated cdrom, using the iso of your preferred distro. It is also possible to pxeboot the VM. See the following very basic example:

builder="hvm"
memory=1024
name = "linuxhvm"
vif = [ "type=ioemu, bridge=xenbr0" ]
disk = [ "/images/debian_squeeze_amd64_standard.raw,raw,hda,w", "/images/debian-6.0.5-amd64-netinst.iso,raw,hdc:cdrom,r" ]
serial="pty"
boot = "dc"

See this page for a more detailed example PV on HVM config file.

vCPU Pinning for guests

You can dedicate a physical cpu to a particular virtual cpu or a set of virtual cpus. If you have enough physical cpus for all your guests, including dom0, you can make sure that the scheduler won't get in your way. Even if you don't have enough physical cpus for everybody, you can still use this technique to ensure that a particular guest always has cpu time.

xl vcpu-pin Domain-name1 0 0
xl vcpu-pin Domain-name1 1 1

These two commands pin vcpu 0 and 1 of Domain-name1 to physical cpu 0 and 1. However, they do not prevent other vcpus from running on pcpu 0 and pcpu 1: you need to plan in advance and pin the vcpus of all your guests so that they won't run on pcpu 0 and 1. For example:

xl vcpu-pin Domain-name2 all 2-6

This command forces all the vcpus of Domain-name2 to run only on physical cpus 2 to 6, leaving pcpu 0 and 1 to Domain-name1. You can also add the following line to the config file of the VM to automatically pin the vcpus to a set of pcpus at boot time:

cpus="2-6"
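
For guests with more than a couple of vcpus, the one-to-one xl vcpu-pin commands shown earlier can be generated instead of typed. This is a minimal sketch: pin_commands is a hypothetical helper that only prints the commands, so they can be reviewed before anything is changed.

```shell
# Print the xl vcpu-pin commands that pin vCPU i of a domain to pCPU first+i.
# $1: domain name, $2: number of vCPUs, $3: first pCPU to use.
pin_commands() {
    dom=$1
    count=$2
    first=$3
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "xl vcpu-pin $dom $i $((first + i))"
        i=$((i + 1))
    done
}
```

`pin_commands Domain-name1 2 0` prints the two commands used for Domain-name1 above. Piping the output to sh on the host applies them.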

Pinning can have unexpected negative effects just as often as beneficial ones. Before using pinning in a production situation, test it on your workload to prove it will be beneficial for you. If you are unsure, the default decision should be to not pin.

vCPU Soft Affinity for guests

Starting from Xen 4.5, each vcpu has:

  • a hard affinity, also known as pinning (see the above paragraph). This is the list of pcpus where a vcpu is allowed to run;
  • a soft affinity, introduced in Xen 4.5. This is the list of pcpus where a vcpu prefers to run.

This helps in all the situations where it is preferable for some or all of the vcpus of a VM to execute on a given set of the host's pcpus, while still allowing them to run somewhere else if all the pcpus in the preferred set are busy. A typical use case is NUMA machines, where the soft affinity for the vcpus of the VM should be set equal to the pcpus of the NUMA node where the VM has been placed (xl and libxl will do this automatically, if not instructed otherwise).

To control soft affinity, the command is still xl vcpu-pin. Starting from Xen 4.5, soft affinity is listed next to hard affinity:

xl vcpu-list 1
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
debian.guest.osstest                 1     0   12   -b-       5.2  all / all
debian.guest.osstest                 1     1   14   -b-       3.3  all / all

For altering it, use the 4th parameter of xl vcpu-pin. It is possible to set the soft affinity without changing vcpu pinning by using "-" as the 3rd param:

xl vcpu-pin 1 all - 16-18
xl vcpu-list 1
Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
debian.guest.osstest                 1     0   19   -b-       5.3  all / 16-18
debian.guest.osstest                 1     1   18   -b-       3.3  all / 16-18

TCP Small Queue

Since Linux 3.19, the kernel uses TCP Small Queues in a way that performs less than optimally on Xen. You can switch back to the older Single Flow Throughput behaviour and improve network performance by increasing tcp_limit_output_bytes:

echo 1048576 > /proc/sys/net/ipv4/tcp_limit_output_bytes
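
The echo above does not survive a reboot, so it helps to be able to check the value that is actually in effect. A minimal sketch follows; tsq_check is a hypothetical helper, and its optional second argument exists only so that it can be pointed at a different file for testing.

```shell
# Report whether tcp_limit_output_bytes is at least the wanted value.
# $1: wanted limit in bytes
# $2: file to read (defaults to the real sysctl file under /proc)
tsq_check() {
    want=$1
    file=${2:-/proc/sys/net/ipv4/tcp_limit_output_bytes}
    cur=$(cat "$file") || return 2
    if [ "$cur" -ge "$want" ]; then
        echo "ok: $cur"
    else
        echo "low: $cur (want >= $want)"
        return 1
    fi
}
```

To make the setting persistent, the line `net.ipv4.tcp_limit_output_bytes = 1048576` can be placed in a file under /etc/sysctl.d/ (the file name is your choice) on distributions that read that directory at boot.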

NUMA

A NUMA machine is typically a multi-socket machine built in such a way that processors have their own local memory. A group of processors connected to the same memory controller is usually called a node. Accessing memory on remote nodes is always possible, but it is usually much slower than accessing local memory. Since VMs are usually small (both in number of vcpus and amount of memory) it should be possible to avoid remote memory access altogether. Both XenD and xl (starting from Xen 4.2) try to make that happen automatically by default. This means they will allocate the vcpus and memory of your VMs taking the NUMA topology of the underlying host into account, if no vcpu pinning or cpupools are specified (see right below). Check out this article for some more details.

However, if you want to manually control which node(s) the vcpus and the memory of a VM come from, the following mechanisms are available:

  • vcpu pinning: if you use the cpus setting in the VM config file (as described in the previous chapter) to assign all the vcpus of a VM to the pcpus of a single NUMA node, all the memory of the VM will be allocated locally to that node too: no remote memory access will occur (this is available in xl starting from Xen 4.2). To figure out which physical cpus belong to which NUMA node, you can use the following command:
xl info -n
  • cpupools: using the command xl cpupool-numa-split (see here) you can split your physical cpus and memory into pools according to the NUMA topology. You'll end up with one cpupool per NUMA node: use xl cpupool-list to see the available cpupools. Then you can assign each VM to a different cpupool adding to the VM config file:
pool="Pool-node0"
  • NUMA Aware Scheduling: for achieving the best possible locality, while the VM is running. Starting from Xen 4.5, NUMA aware scheduling is implemented by means of scheduling soft affinity.
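
When pinning by hand, the xl info -n output mentioned in the first item has to be turned into one list of pcpus per node. The sketch below assumes the output contains a cpu_topology section with one row per cpu in the form "cpu: core socket node", followed by a numa_info section. That layout is an assumption to verify against your Xen version, and numa_cpus is a hypothetical helper name.

```shell
# Group physical CPUs by NUMA node.
# Reads `xl info -n` output on standard input and prints one line per node,
# for example "node0: 0,1,2,3".
numa_cpus() {
    awk '/^cpu_topology/ { in_topo = 1; next }
         /^numa_info/    { in_topo = 0 }
         in_topo && $1 ~ /^[0-9]+:$/ {
             cpu = $1
             sub(/:/, "", cpu)          # "3:" -> "3"
             node = $4 + 0              # fourth column is the node number
             if (node in list)
                 list[node] = list[node] "," cpu
             else
                 list[node] = cpu
             if (node > max)
                 max = node
         }
         END {
             for (n = 0; n <= max; n++)
                 if (n in list)
                     printf "node%d: %s\n", n, list[n]
         }'
}
```

Run as `xl info -n | numa_cpus`. The list printed for a node can then be used as the value of the cpus setting for a VM that should stay on that node.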

To find out more about NUMA within this Wiki, check out the various pages from the proper category: Category:NUMA

Performance bugs

During Xen 4.5 development we tried to determine whether the reported performance issue with Hyperthreading in the Credit1 scheduler was a regression or had always existed. The thread [Virt overehead with HT [was: Re: Xen 4.5 development update ]] contains more details.

The brief summary is that Xen Credit1 has a 7.9% performance drop when using SMT with the kernbench workload. That is something that will be looked at in the future.

We originally thought it was a regression, but it is inherent in the way credit1 is implemented and has been there since the introduction of credit1.

The workload makes a difference. This result is for kernbench, which stresses a multitude of paths; other workloads will work just fine.

Please note that in the best case HT gives a 30% boost. Accounting for the 7.9% drop, you can still get roughly a 20% speedup.

Please note also that earlier versions of the Linux kernel did not have PV-aware spinlocks, which would contribute to this. With Linux 3.11 when booting as PVHVM with CONFIG_PARAVIRT_SPINLOCKS enabled, and with Linux 2.6.32 when booting as PV (also with CONFIG_PARAVIRT_SPINLOCKS enabled), the lock contention that led to abysmal performance under CPU oversubscription has been ameliorated.
