Linux NAT Optimization Handbook

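Maximum number of connections:

When it comes down to it, there are only two parameters. It is best to read the original text, in case I have misunderstood it.

This one is the more detailed reference: http://www.wallfire.org/misc/netfilter_conntrack_perf.txt

SMP (interrupt optimization in SMP environments):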

http://www.cs.uwaterloo.ca/~brecht/servers/apic/SMP-affinity.txt

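--
Others report good results after tuning this way. The NAT machines I currently manage do not carry that much traffic or that many connections, so I have not verified this myself.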

There are two parameters we can play with:
- the maximum number of allowed conntrack entries, which will be called
CONNTRACK_MAX in this document
- the size of the hash table storing the lists of conntrack entries, which
will be called HASHSIZE (see below for a description of the structure)

CONNTRACK_MAX is the maximum number of “sessions” (connection tracking entries)
that can be handled simultaneously by netfilter in kernel memory.

A conntrack entry is stored in a node of a linked list, and there are several
lists, each list being an element in a hash table.  So each hash table entry
(also called a bucket) contains a linked list of conntrack entries.
To access a conntrack entry corresponding to a packet, the kernel has to:
- compute a hash value according to some defined characteristics of the packet.
This is a constant time operation.
This hash value will then be used as an index in the hash table, where a
list of conntrack entries is stored.
- iterate over the linked list of conntrack entries to find the matching one.
This is a more costly operation, depending on the size of the list (and on
the position of the wanted conntrack entry in the list).

The hash table contains HASHSIZE linked lists.  When the limit is reached
(the total number of conntrack entries being stored has reached CONNTRACK_MAX),
each list will contain ideally (in the optimal case) about
CONNTRACK_MAX/HASHSIZE entries.

The hash table occupies a fixed amount of non-swappable kernel memory,
whether you have any connections or not.  But the maximum number of conntrack
entries determines how many conntrack entries can be stored (globally into the
linked lists), i.e. how much kernel memory they will be able to occupy at most.

This document will now give you hints about how to choose optimal values for
HASHSIZE and CONNTRACK_MAX, in order to get the best out of the netfilter
conntracking/NAT system.

Default values of CONNTRACK_MAX and HASHSIZE
============================================

By default, both CONNTRACK_MAX and HASHSIZE get average values for
“reasonable” use, computed automatically according to the amount of
available RAM.

Default value of CONNTRACK_MAX
------------------------------

On i386 architecture, CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 =
RAMSIZE (in MegaBytes) * 64.
So for example, a 32-bit PC with 512MB of RAM can handle 512*1024^2/16384 =
512*64 = 32768 simultaneous netfilter connections by default.

But the real formula is:
CONNTRACK_MAX = RAMSIZE (in bytes) / 16384 / (x / 32)
where x is the number of bits in a pointer (for example, 32 or 64 bits)

Please note that:
- the default CONNTRACK_MAX value will never be lower than 128
- for systems with more than 1GB of RAM, the default CONNTRACK_MAX value is
limited to 65536 (but can of course be set higher manually).
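
The default described above can be worked out with a small shell function. This is only a sketch of the formula as stated in this document, not the kernel's actual code; the function name default_conntrack_max and its arguments (RAM in megabytes, pointer width in bits) are made up for illustration.

```sh
# default_conntrack_max RAM_MB [POINTER_BITS]
# Hypothetical helper: applies the formula from the text,
#   RAMSIZE (bytes) / 16384 / (x / 32), i.e. RAM_MB * 64 * 32 / x
default_conntrack_max() {
    ram_mb=$1
    bits=${2:-32}                      # pointer size in bits, 32 by default
    max=$(( ram_mb * 64 * 32 / bits ))
    # Systems with more than 1GB of RAM are limited to 65536 by default.
    if [ "$ram_mb" -gt 1024 ] && [ "$max" -gt 65536 ]; then
        max=65536
    fi
    # The default never goes below 128.
    if [ "$max" -lt 128 ]; then
        max=128
    fi
    echo "$max"
}

default_conntrack_max 512       # prints 32768 (the 512MB example above)
default_conntrack_max 512 64    # prints 16384 (same RAM, 64-bit pointers)
```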

Default value of HASHSIZE
-------------------------

By default, CONNTRACK_MAX = HASHSIZE * 8.  This means that there is an average
of 8 conntrack entries per linked list (in the optimal case, and when
CONNTRACK_MAX is reached), each linked list being a hash table entry
(a bucket).

On i386 architecture, HASHSIZE = CONNTRACK_MAX / 8 =
RAMSIZE (in bytes) / 131072 = RAMSIZE (in MegaBytes) * 8.
So for example, a 32-bit PC with 512MB of RAM can store 512*1024^2/128/1024 =
512*8 = 4096 buckets (linked lists).

But the real formula is:
HASHSIZE = CONNTRACK_MAX / 8 = RAMSIZE (in bytes) / 131072 / (x / 32)
where x is the number of bits in a pointer (for example, 32 or 64 bits)

Please note that:
- the default HASHSIZE value will never be lower than 16
- for systems with more than 1GB of RAM, the default HASHSIZE value is limited
to 8192 (but can of course be set higher manually).
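
The same kind of sketch works for the default HASHSIZE. Again this only restates the formula and limits given in this document; the name default_hashsize and its arguments are assumptions made for the example.

```sh
# default_hashsize RAM_MB [POINTER_BITS]
# Hypothetical helper: RAMSIZE (bytes) / 131072 / (x / 32),
# i.e. RAM_MB * 8 * 32 / x, with the limits listed in the text.
default_hashsize() {
    ram_mb=$1
    bits=${2:-32}
    size=$(( ram_mb * 8 * 32 / bits ))
    # Systems with more than 1GB of RAM are limited to 8192 buckets by default.
    if [ "$ram_mb" -gt 1024 ] && [ "$size" -gt 8192 ]; then
        size=8192
    fi
    # The default never goes below 16 buckets.
    if [ "$size" -lt 16 ]; then
        size=16
    fi
    echo "$size"
}

default_hashsize 512    # prints 4096 (the 512MB example above)
```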

Reading CONNTRACK_MAX and HASHSIZE
==================================

Current CONNTRACK_MAX value can be read at runtime, via the /proc filesystem.

Before Linux kernel version 2.4.23, use:
# cat /proc/sys/net/ipv4/ip_conntrack_max

As of Linux kernel version 2.4.23, use:
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_max
(old /proc/sys/net/ipv4/ip_conntrack_max is then deprecated!)

Current HASHSIZE is always available (for every kernel version) in syslog
messages, as the number of buckets (which is HASHSIZE) is printed there at
ip_conntrack initialization.
As of Linux kernel version 2.4.24, current HASHSIZE value can be read at
runtime with:
# cat /proc/sys/net/ipv4/netfilter/ip_conntrack_buckets
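
Because the location of ip_conntrack_max depends on the kernel version, a script can simply try the newer path first and fall back to the older one. The sketch below assumes only the two paths named in this document; the function name and the optional PROC_ROOT argument (useful for trying it against a scratch directory) are hypothetical.

```sh
# find_conntrack_max_file [PROC_ROOT]
# Prints the first readable ip_conntrack_max file, preferring the
# location used as of kernel 2.4.23. Returns 1 if neither exists.
find_conntrack_max_file() {
    root=${1:-/proc}
    for f in \
        "$root/sys/net/ipv4/netfilter/ip_conntrack_max" \
        "$root/sys/net/ipv4/ip_conntrack_max"
    do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Typical use (prints nothing if conntrack is not loaded):
#   f=$(find_conntrack_max_file) && cat "$f"
```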

Modifying CONNTRACK_MAX and HASHSIZE
====================================

Default CONNTRACK_MAX and HASHSIZE values are reasonable for a typical host,
but you may increase them on heavily loaded firewalling-only systems.
So CONNTRACK_MAX and HASHSIZE values can be changed manually if needed.

While accessing a bucket is a constant time operation (hence the interest
of having a hash of lists), keep in mind that the kernel has to iterate over
a linked list to find a conntrack entry.  So the average size of a linked
list (CONNTRACK_MAX/HASHSIZE in the optimal case when the limit is reached)
must not be too big.  This ratio is set to 8 by default (when values are
computed automatically).
On systems with enough memory and where performance really matters, you can
consider aiming for an average of one conntrack entry per hash bucket,
which means HASHSIZE = CONNTRACK_MAX.
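
To see what a given pair of values means for lookups, the average list length at the limit can be computed directly. This is a trivial sketch; avg_bucket_len is a made-up name, and it rounds up so that a partly filled extra entry still counts.

```sh
# avg_bucket_len CONNTRACK_MAX HASHSIZE
# Average number of entries per bucket when the table is full
# (optimal distribution assumed), rounded up.
avg_bucket_len() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

avg_bucket_len 32768 4096        # prints 8 (the default ratio)
avg_bucket_len 1048576 1048576   # prints 1 (HASHSIZE = CONNTRACK_MAX)
```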

Setting CONNTRACK_MAX
---------------------

Conntrack entries are stored in linked lists, so the maximum number of
conntrack entries (CONNTRACK_MAX) can be easily configured dynamically.

Before Linux kernel version 2.4.23, use:
# echo $CONNTRACK_MAX > /proc/sys/net/ipv4/ip_conntrack_max

As of Linux kernel version 2.4.23, use:
# echo $CONNTRACK_MAX > /proc/sys/net/ipv4/netfilter/ip_conntrack_max

where $CONNTRACK_MAX is an integer.
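
Since the value must be an integer, a wrapper can check it before writing to /proc. The following is a minimal sketch; set_conntrack_max is a hypothetical name, and the target file is passed explicitly so that the right path for the running kernel can be chosen by the caller.

```sh
# set_conntrack_max VALUE FILE
# Writes VALUE to FILE only if VALUE consists of digits and FILE is writable.
set_conntrack_max() {
    value=$1
    file=$2
    case $value in
        ''|*[!0-9]*)
            echo "not a non-negative integer: $value" >&2
            return 1
            ;;
    esac
    if [ ! -w "$file" ]; then
        echo "cannot write to $file" >&2
        return 1
    fi
    echo "$value" > "$file"
}

# Example, as root, on a kernel 2.4.23 or later:
#   set_conntrack_max 65536 /proc/sys/net/ipv4/netfilter/ip_conntrack_max
```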

Setting HASHSIZE
----------------

For mathematical reasons, hash tables have static sizes.  So HASHSIZE must be
determined before the hash table is created and begins to be filled.

Before Linux kernel version 2.4.21, a prime number should be chosen for the hash
size, ensuring that the hash table will be efficiently populated. Odd
non-prime numbers or even numbers are strongly discouraged, as the hash
distribution will be sub-optimal.

Since Linux kernel version 2.4.21 (and for 2.6 kernels as well), conntrack
uses the jenkins2b hash algorithm, which works with any size, although a power
of 2 works best.

If netfilter conntrack is statically compiled in the kernel, the hash table
size can only be set at compile time.

But if netfilter conntrack is compiled as a module, the hash table size can
be set at module insertion, with the following command:
# modprobe ip_conntrack hashsize=$HASHSIZE

where $HASHSIZE is an integer.
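
As a power of 2 works best on 2.4.21 and later, it can be worth checking the chosen value before loading the module. This sketch uses the usual bit trick (a power of 2 has exactly one bit set); is_power_of_two is a made-up helper, and the example only prints the modprobe command instead of running it.

```sh
# is_power_of_two N
# Succeeds if N is a positive power of 2: n & (n - 1) clears the lowest
# set bit, so the result is 0 only when a single bit was set.
is_power_of_two() {
    n=$1
    [ "$n" -gt 0 ] && [ $(( n & (n - 1) )) -eq 0 ]
}

HASHSIZE=1048576
if is_power_of_two "$HASHSIZE"; then
    echo "modprobe ip_conntrack hashsize=$HASHSIZE"
else
    echo "warning: $HASHSIZE is not a power of 2" >&2
fi
```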

Ideal case: firewalling-only machine
------------------------------------

In the ideal case, you have a machine _just_ doing packet filtering and NAT
(i.e. almost no userspace running, at least none that would have a growing
memory consumption like proxies, …).

The size of kernel memory used by netfilter connection tracking is:
size_of_mem_used_by_conntrack (in bytes) =
CONNTRACK_MAX * sizeof(struct ip_conntrack) +
HASHSIZE * sizeof(struct list_head)
where:
- sizeof(struct ip_conntrack) can vary considerably, depending on architecture,
kernel version and compile-time configuration. To know its size, see the
kernel log message at ip_conntrack initialization time.
sizeof(struct ip_conntrack) is around 300 bytes on i386 for 2.6.5, but
heavy development around 2.6.10 made it vary between 192 and 352 bytes!
- sizeof(struct list_head) = 2 * size_of_a_pointer
On i386, size_of_a_pointer is 4 bytes.

So, on i386, kernel 2.6.5, size_of_mem_used_by_conntrack is around
CONNTRACK_MAX * 300 + HASHSIZE * 8 (bytes).

If we take HASHSIZE = CONNTRACK_MAX (if we have most of the memory dedicated
to firewalling, see “Modifying CONNTRACK_MAX and HASHSIZE” section above),
size_of_mem_used_by_conntrack would be around CONNTRACK_MAX * 308 bytes
on i386 systems, kernel 2.6.5.
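
The memory formula is easy to evaluate for a given configuration. The sketch below defaults to the i386 / kernel 2.6.5 figures quoted above (about 300 bytes per entry, 4-byte pointers); both are assumptions that should be replaced with the values for the actual kernel, and conntrack_mem_bytes is a hypothetical name.

```sh
# conntrack_mem_bytes CONNTRACK_MAX HASHSIZE [ENTRY_BYTES] [POINTER_BYTES]
# CONNTRACK_MAX * sizeof(struct ip_conntrack)
#   + HASHSIZE * sizeof(struct list_head), where list_head is two pointers.
conntrack_mem_bytes() {
    echo $(( $1 * ${3:-300} + $2 * 2 * ${4:-4} ))
}

conntrack_mem_bytes 32768 4096         # prints 9863168   (defaults for 512MB)
conntrack_mem_bytes 1048576 1048576    # prints 322961408 (308 bytes per entry)
```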

Now suppose you put 512MB of RAM (a decent amount of memory considering today’s
memory prices) into the firewalling-only box, and use all but 128MB for
conntrack, which should really be big enough for a firewall in console mode,
for example.
Then you could set both CONNTRACK_MAX and HASHSIZE approximately to:
(512 - 128) * 1024^2 / 308 =~ 1307315 (instead of 32768 for CONNTRACK_MAX,
and 4096 for HASHSIZE by default).
As of Linux 2.4.21 (and Linux 2.6), the hash algorithm works best with
“power of 2” sizes (a prime number was recommended before).

So here we can set CONNTRACK_MAX and HASHSIZE to 1048576 (2^20), for example.

This way, you can store about 32 times more conntrack entries than the
default, and get better performance for conntrack entry access.
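
The sizing above can be reproduced step by step: divide the memory budget by the per-entry cost, then round down to a power of 2. This is a sketch under the same assumptions as the text (308 bytes per entry with HASHSIZE = CONNTRACK_MAX on i386, kernel 2.6.5); both function names are invented for the example.

```sh
# max_entries_for_budget BUDGET_MB [BYTES_PER_ENTRY]
# How many entries fit into BUDGET_MB megabytes.
max_entries_for_budget() {
    echo $(( $1 * 1024 * 1024 / ${2:-308} ))
}

# floor_power_of_two N
# Largest power of 2 that is <= N (prints 1 for N < 2).
floor_power_of_two() {
    n=$1
    p=1
    while [ $(( p * 2 )) -le "$n" ]; do
        p=$(( p * 2 ))
    done
    echo "$p"
}

entries=$(max_entries_for_budget $(( 512 - 128 )))
echo "$entries"                  # prints 1307315
floor_power_of_two "$entries"    # prints 1048576 (2^20)
```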

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Last changes on Apr 22, 2004

Revision history:
0.5 Added further notice about the varying length of the conntrack structure.
0.4 Since Linux 2.4.21, hash algorithm is happy with all sizes, not only
prime ones.  However, power of 2 is best.
0.3 Various small clarifications.
0.2 Information about Linux kernel versions and corresponding /proc entries.
(/proc/sys/net/ipv4/netfilter/ip_conntrack_{max,buckets}).
0.1 Initial writing, largely based on my discussions with Harald Welte
(netfilter maintainer) on the netfilter-devel mailing-list.  Many thanks
to him!

------------------------------------------------------------------------

SMP IRQ Affinity

Background:  

Whenever a piece of hardware, such as disk controller or ethernet card,
needs attention from the CPU, it throws an interrupt.  The interrupt tells
the CPU that something has happened and that the CPU should drop what
it's doing to handle the event.  In order to prevent multiple devices from
sending the same interrupts, the IRQ system was established where each device
in a computer system is assigned its own special IRQ so that its interrupts
are unique.

Starting with the 2.4 kernel, Linux has gained the ability to assign certain
IRQs to specific processors (or groups of processors).  This is known
as SMP IRQ affinity, and it allows you to control how your system will respond
to various hardware events.  It allows you to restrict or repartition
the work load that your server must do so that it can do its job more
efficiently.

Obviously, in order for this to work, you will need a system that has more
than one processor (SMP).  You will also need to be running a 2.4 or higher
kernel.

Some brief and very bare information on SMP IRQ affinity is provided in
the kernel source tree of the 2.4 kernel in the file:

    /usr/src/linux-2.4/Documentation/IRQ-affinity.txt

How to use it:

SMP affinity is controlled by manipulating files in the /proc/irq/ directory.
In /proc/irq/ are directories that correspond to the IRQs present on your
system (not all IRQs may be available). In each of these directories is
the "smp_affinity" file, and this is where we will work our magic.

The first order of business is to figure out what IRQ a device is using.
This information is available in the /proc/interrupts file.  Here's a sample:

 [root@archimedes /proc]# cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
   0:    4865302    5084964    4917705    5017077    IO-APIC-edge  timer
   1:        132        108        159        113    IO-APIC-edge  keyboard
   2:          0          0          0          0          XT-PIC  cascade
   8:          0          1          0          0    IO-APIC-edge  rtc
  10:          0          0          0          0   IO-APIC-level  usb-ohci
  14:          0          0          1          1    IO-APIC-edge  ide0
  24:      87298      86066      86012      86626   IO-APIC-level  aic7xxx
  31:      93707     106211     107988      93329   IO-APIC-level  eth0
 NMI:          0          0          0          0
 LOC:   19883500   19883555   19883441   19883424
 ERR:          0
 MIS:          0

As you can see, this is a 4 processor machine.  The first column (unlabelled)
lists the IRQs used on the system.  The rows with letters (i.e., "NMI", "LOC")
are parts of other drivers used on the system and aren't really accessible
to us, so we'll just ignore them.

The second through fifth columns (labelled CPU0-CPU3) show the number of times
the corresponding processor has handled an interrupt from that particular IRQ.
For example, all of the CPUs have handled roughly the same number of interrupts
for IRQ 24 (around 86,000 with CPU0 handling a little over 87,000).

The sixth column lists whether or not the device driver associated with the
interrupt supports IO-APIC (see /usr/src/linux/Documentation/i386/IO-APIC.txt
for more information).  The only reason to look at this value is that
SMP affinity will only work for IO-APIC enabled device drivers.  For
example, we will not be able to change the affinity for the "cascade"
driver (IRQ 2) because it doesn't support IO-APIC.

Finally, the seventh and last column lists the driver or device that is
associated with the interrupt.  In the above example, our ethernet card
(eth0) is using IRQ 31, and our SCSI controller (aic7xxx) is using IRQ 24.

The first and last columns are really the only ones we're interested in here.
For the rest of this example, I'm going to assume that we want to adjust
the SMP affinity for the SCSI controller (IRQ 24).
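
Looking up the IRQ by hand works fine, but it can also be scripted by matching the last column of /proc/interrupts. This is a rough sketch: irq_for_device is a made-up name, and it assumes the device name is the last whitespace-separated field, as in the sample listing above.

```sh
# irq_for_device NAME [FILE]
# Prints the IRQ number of every line whose last field equals NAME.
# FILE defaults to /proc/interrupts.
irq_for_device() {
    awk -v dev="$1" '
        $NF == dev {
            sub(/:$/, "", $1)   # "24:" becomes "24"
            print $1
        }' "${2:-/proc/interrupts}"
}

# On the sample machine above:
#   irq_for_device aic7xxx    would print 24
#   irq_for_device eth0       would print 31
```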

Now that we've got the IRQ, we can change the processor affinity.  To
do this, we'll go into the /proc/irq/24/ directory, and see what the
affinity is currently set to:

 [root@archimedes Documentation]# cat /proc/irq/24/smp_affinity
 ffffffff

This is a bitmask that represents which processors any interrupts on IRQ
24 should be routed to.  Each field in the bit mask corresponds to a processor.
The number held in the "smp_affinity" file is presented in hexadecimal format,
so in order to manipulate it properly we will need to convert our bit patterns
from binary to hex before setting them in the proc file.

Each of the "f"s above represents a group of 4 CPUs, with the rightmost
group being the least significant.  For the purposes of our discussion,
we're going to limit ourselves to only the first 4 CPUs (although we can
address up to 32).

In short, this means you only have to worry about the rightmost "f" and you
can assume everything else is a "0" (i.e., our bitmask is "0000000f").

"f" is the hexadecimal representation for the decimal number 15 (fifteen)
and the binary pattern of "1111".  Each of the places in the binary pattern
corresponds to a CPU in the server, which means we can use the following
chart to represent the CPU bit patterns:

            Binary       Hex
    CPU 0    0001         1
    CPU 1    0010         2
    CPU 2    0100         4
    CPU 3    1000         8

By combining these bit patterns (basically, just adding the Hex values), we
can address more than one processor at a time.   For example, if I wanted
to talk to both CPU0 and CPU2 at the same time, the result is:

            Binary       Hex
    CPU 0    0001         1
  + CPU 2    0100         4
    -----------------------
    both     0101         5

If I want to address all four of the processors at once, then the result is:

            Binary       Hex
    CPU 0    0001         1
    CPU 1    0010         2
    CPU 2    0100         4
  + CPU 3    1000         8
    -----------------------
    all      1111         f

(Remember that we use the letters "a" through "f" to represent the numbers
 "10" to "15" in hex notation).
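
The addition in the charts above is really a bitwise OR of one bit per CPU, which the shell can do directly. A small sketch, with cpus_to_mask as an invented helper name:

```sh
# cpus_to_mask CPU...
# Sets bit N for every CPU number N given and prints the mask in hex.
cpus_to_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

cpus_to_mask 0 2        # prints 5
cpus_to_mask 0 1 2 3    # prints f
```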

Given that, we now know that if we have a four processor system, we can
assign any of 15 different CPU combinations to an IRQ (it would be 16, but
it isn't legal to assign an IRQ affinity of "0" to any IRQ... if you try,
Linux will just ignore your attempt).

So.  Now we get to the fun part.  Remember in our /proc/interrupts listing
above that all four of our CPUs had handled close to the same number of
interrupts for our SCSI card?  We now have the tools needed to limit managing
the SCSI card to just one processor and leave the other three free to
concentrate on doing other tasks.   Let's assume that we want to dedicate
our first CPU (CPU0) to handling the SCSI controller interrupts.  To do this,
we would simply run the following command:

 [root@archimedes /proc]# echo 1 > /proc/irq/24/smp_affinity
 [root@archimedes /proc]# cat /proc/irq/24/smp_affinity
 00000001
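
The value read back from smp_affinity can be decoded into a list of CPU numbers by testing one bit at a time. This sketch assumes a single hexadecimal word as shown in this example; mask_to_cpus is a hypothetical helper.

```sh
# mask_to_cpus HEXMASK
# Prints the CPU numbers whose bits are set in HEXMASK, separated by spaces.
mask_to_cpus() {
    mask=$(( 0x$1 ))
    cpu=0
    out=
    while [ "$mask" -ne 0 ]; do
        if [ $(( mask & 1 )) -eq 1 ]; then
            out="$out${out:+ }$cpu"   # add a space only between entries
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus 00000001    # prints 0
mask_to_cpus 5           # prints 0 2
```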

Now, let's test it out and see what happens:

 [root@archimedes /proc]# cd /tmp/
 [root@archimedes /tmp]# tar -zcf test.tgz /usr/src/linux-2.4.2
 tar: Removing leading `/' from member names
 [root@archimedes /tmp]# tar -zxf test.tgz && rm -rf usr/
 [root@archimedes /tmp]# tar -zxf test.tgz && rm -rf usr/
 [root@archimedes /tmp]# tar -zxf test.tgz && rm -rf usr/
 [root@archimedes /tmp]# tar -zxf test.tgz && rm -rf usr/
 [root@archimedes /tmp]# tar -zxf test.tgz && rm -rf usr/
 [root@archimedes /tmp]# cat /proc/interrupts | grep 24:
  24:      99719      86067      86012      86627   IO-APIC-level  aic7xxx

Compare that to the previous run without having the IRQ bound to CPU0:

  24:      87298      86066      86012      86626   IO-APIC-level  aic7xxx

All of the interrupts from the disk controller are now handled exclusively
by the first CPU (CPU0), which means that our other 3 processors are free
to do other stuff now.
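
The effect is easier to see as per-CPU differences between the two samples. The sketch below subtracts the counters of one /proc/interrupts line from a later sample of the same line; irq_delta is a made-up name, and the lines are passed as plain strings.

```sh
# irq_delta BEFORE_LINE AFTER_LINE
# Prints, for each CPU column, how many interrupts were handled
# between the two samples of the same IRQ line.
irq_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= NF; i++) before[i] = $i; next }
        {
            out = ""
            # Counter columns are purely numeric; stop at the first other field.
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                out = out (out == "" ? "" : " ") ($i - before[i])
            print out
        }'
}

irq_delta "24: 87298 86066 86012 86626 IO-APIC-level aic7xxx" \
          "24: 99719 86067 86012 86627 IO-APIC-level aic7xxx"
# prints: 12421 1 0 1
```

With the numbers from this test, CPU0 handled 12421 new interrupts while the other three CPUs handled at most one each.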

Finally, it should be pointed out that if you decide you no longer want
SMP affinity and would rather have the system revert to the old setup,
then you can simply do:

 [root@archimedes /tmp]# cat /proc/irq/prof_cpu_mask >/proc/irq/24/smp_affinity

This will reset the "smp_affinity" file to use all "f"s, and will return to
the load sharing arrangement that we saw earlier.

What can I use it for?

- "balance" out multiple NICs in a multi-processor machine.  By tying a single
  NIC to a single CPU, you should be able to scale the amount of traffic
  your server can handle nicely.

- database servers (or servers with lots of disk storage) that also have
  heavy network loads can dedicate a CPU to their disk controller and assign
  another to deal with the NIC to help improve response times.
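
For the multiple-NIC case, one simple policy is to hand out CPUs to IRQs in round-robin order. The sketch below only prints the commands it would run, so the plan can be reviewed first; plan_irq_affinity is an invented name, and IRQ 32 in the example is a hypothetical second NIC.

```sh
# plan_irq_affinity NCPUS IRQ...
# Prints one "echo MASK > /proc/irq/IRQ/smp_affinity" command per IRQ,
# assigning CPU 0, 1, ..., NCPUS-1 in turn and then wrapping around.
plan_irq_affinity() {
    ncpus=$1
    shift
    i=0
    for irq in "$@"; do
        printf 'echo %x > /proc/irq/%s/smp_affinity\n' \
            $(( 1 << (i % ncpus) )) "$irq"
        i=$(( i + 1 ))
    done
}

plan_irq_affinity 4 31 32
# prints:
#   echo 1 > /proc/irq/31/smp_affinity
#   echo 2 > /proc/irq/32/smp_affinity
```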

Can I do this with processes?

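At the time this was written, no.  Later kernels added this ability: the
sched_setaffinity(2) system call (Linux 2.5.8 and later) and the taskset
utility can bind a process to a set of CPUs.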
