Linux developer tips for achieving maximum performance and scalability, some are
linux-specific and some are more general programming tips.
Most of these were painfully sorted through while I worked for Hostway Inc. as a
systems develeper, fixing their problematic courier-based mail cluster. In some
cases I have a link to web pages I created while I was actively working on the
mail system, these should provide some more proof / information of the issues I
bring up below.
- Don't use stdio
But if you do:
- Use __fsetlocking(stream, FSETLOCKING_BYCALLER) when you don't need the
libc implicit file object locking.
- Use _unlocked() versions of stdio functions when available, even if you
set the locking mode to FSETLOCKING_BYCALLER, this avoids the check of the
locking mode at each call, but it's not available for all stdio functions
usually so still set the locking mode.
- Don't use fgetc/getc/getchar, none have been fast macros for a long time
now, I think libc5 was the last time.
Additional information including graphs:
stdio locking
getc is bad
- Use multiple threads instead of multiple processes whenever possible
- When using threads don't manipulate the signal mask (sigprocmask()) frequently
There are some libc & pthread functions which do this behind the scenes so be
careful.
Particularly, syslog() does in some libc6 versions to ignore SIGPIPE (they
fix this in redhat at some point, check it yourself, debian sarge still had it
broken when I last checked).
Manipulating the cancellation state also appears to manipulate the signal
mask and should be avoided.
The reason this must be avoided is this function appears to manipulate the
mask per-thread, when you have thousands of threads, a single call explodes
into an O(N) where N is the number of threads in the process, this is a huge
problem when the number of threads gets high and can generally be avoided
pretty easily.
Additional information including graphs:
sigprocmask/syslog() and threads don't mix
- Avoid concurrent cross-directory renames on the same filesystem
One of the 'features' of rename() is atomicity. In linux, as of 2.6.11.8, at
the vfs layer, in order to achieve atomicity with a cross-directory rename, a
per-filesystem lock is held during the rename operation.
So if you have a parallel program which you have put alot of effort into
reducing contention and maximizing parallelism doing cross-directory renames
on the same filesystem regularly, you are shooting yourself in the foot. This
effectively serializes all your parallel execution contexts at the rename.
One situation where this is very likely to happen is in mail software using
the Maildir format. The process of moving mail from new/ to cur/ involves
cross-directory rename() calls. If you are storing your mail over NFS it
exacerbates the situation because the rename operation takes significantly
longer to perform than on say a local ext2 volume. This means more wall-time
passes while in the critical section (while the per-filesystem rename lock is
held).
Ways to work around this problem is to either eliminate the rename() calls, or
add more filesystems. For example, if you had a filesystem per Maildir the
problem disappears unless you have concurrent access on the same Maildir. I
don't expect anyone to actually try this but it illustrates the problem and
solution well. Dividing your mailstore into many smaller filesystems can help
dramatically in this context.
I expect this problem to get solved eventually, it doesnt make sense to me
really, because the rename() function in the vfs layer acquires the locks
of the relevant directories after acquiring the per-filesystem lock. I would
imagine after successfully acquiring the other locks the per-filesystem lock
could be released while the rename() operation is executed, then the directory
locks released in the reverse order of acquisition. It looks to my non-kernel
-programmer eyes that the per-filesystem lock is needed to make the
acquisition of the other directory locks atomic. But once they are held the
big lock should get released, no?
If this big lock could be released during the operation, I suspect it would
have a tremendous impact on most of the Maildir-based mail servers out in
the wild.
- Don't use select(), use poll() or the non-portable epoll interface
Select() is less efficient than poll(), and also has a limit on the highest
file descriptor value it can operate on, this highest file descriptor value
is defined in FD_SETSIZE. Often programmers don't realize, this is not only
a size limitation on the number of file descriptors select() can work with,
it is a limit on the highest file descriptor *VALUE* it can work with. When
you start working with large scales, with todays hardware it is trivial to
have enough concurrent connections to exceed this value. The result is
usually a crash, but really unpredictable. In my experience, you get a
segfault more often than not.
I discovered a bug in OpenSSL that was exactly this problem, and was causing
Apache2 to segfault when configured with SSL and enough virtual hosts to
cause the next fd to be higher than FD_SETSIZE (because of the log files
opened per vhost). It proved to be quite an elusive bug due to the complexity
and size of the Apache2 codebase and the fact that the bug was not actually
in the Apache2 code but in an external library pulled in by the SSL feature.
The bug report & patch is documented here:
select() bug report to openssl-dev mailing list
- Avoid repeated conversions of string to integer and vice versa
Often times I see code that repeatedly prints the string version of a value
stored as an integer that hasnt changed, and possibly won't change ever,
converting it every time.
Depending on where this happens, it can become a serious pointless waste of
CPU cycles, most often I see this in logging functions. If you happen to have
an integer that rarely changes, and it is converted to string form frequently
for whatever reason, consider creating a character array for storing the
string version of the value persistently. Then when you go to use the value
in string form, simply check if it is changed since you last created the
string version. If it hasnt changed, use the string version as-is. If it
has, then you have to recreate the string version.
It seems some programmers don't realize what is involved in converting an
integer into a string (e.g. sprintf(buf, "%i", foo);), it is not cheap. In
its simplest form it's a loop of divides and modulos by the base, storing the
characters in the destination buffer in reverse order, which usually have to
be copied again into their final destination now that the length is known.
Anyone who has done programming with an eye towards optimization knows divides
and multiplies are some of the most expensive integer operations you can do.
Once you understand how expensive it is to create that string version, you can
see how storing a 'last converted' copy of the integer version and a
character array to store the string version is not such a big deal. This is
all you need to be able to detect if the value is different from the last time
it was converted, and convert only when necessary.
The same is true for the opposite direction, from string to integer, don't do
it more often than necessary. It's slightly more efficient than the other
direction but still silly to do excessively.
If you are curious to see in more detail how the conversions are implemented,
look at the glibc source. The integer-to-string conversion is done in
"stdio-common/_itoa.c" in the version I have sitting here (glibc-2.2)
*** An excellent example of this in practice is the construct_time() usage
in the Shttp module found here:
http://serverkit.org/modules/contrib/shttp/
This function constructs an rfc822-compatible date+time string compliant
with the HTTP/1.1 RFC for the Date: and Last-Modified: response headers.
For comparison purposes, other http servers do it quite differently, and
it should be fairly obvious which is the most efficient:
Shttp 0.0.3:
char *smhd[] = {"00","01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50", "51","52","53","54","55","56","57","58","59"};
char *months[] = {"Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"};
char *days[] = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
char *years[] = {"1900", "1901", "1902", "1903", "1904", "1905", "1906", "1907", "1908", "1909", "1910", "1911", "1912", "1913", "1914", "1915", "1916", "1917", "1918", "1919", "1920", "1921", "1922", "1923", "1924", "1925", "1926", "1927", "1928", "1929", "1930", "1931", "1932", "1933", "1934", "1935", "1936", "1937", "1938", "1939", "1940", "1941", "1942", "1943", "1944", "1945", "1946", "1947", "1948", "1949", "1950", "1951", "1952", "1953", "1954", "1955", "1956", "1957", "1958", "1959", "1960", "1961", "1962", "1963", "1964", "1965", "1966", "1967", "1968", "1969", "1970", "1971", "1972", "1973", "1974", "1975", "1976", "1977", "1978", "1979", "1980", "1981", "1982", "1983", "1984", "1985", "1986", "1987", "1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023", "2024", "2025", "2026", "2027", "2028", "2029", "2030", "2031", "2032", "2033", "2034", "2035", "2036", "2037", "2038", "2039", "2040", "2041", "2042", "2043", "2044", "2045", "2046", "2047", "2048", "2049", "2050", "2051", "2052", "2053", "2054", "2055", "2056", "2057", "2058", "2059", "2060", "2061", "2062", "2063", "2064", "2065", "2066", "2067", "2068", "2069", "2070", "2071", "2072", "2073", "2074", "2075", "2076", "2077", "2078", "2079", "2080", "2081", "2082", "2083", "2084", "2085", "2086", "2087", "2088", "2089", "2090", "2091", "2092", "2093", "2094", "2095", "2096", "2097", "2098", "2099", "2100", "2101", "2102", "2103", "2104"};
static void construct_time(char *where, time_t *time)
{
struct tm *t = gmtime(time);
where[0] = days[t->tm_wday][0];
where[1] = days[t->tm_wday][1];
where[2] = days[t->tm_wday][2];
where[3] = ',';
where[4] = ' ';
where[5] = smhd[t->tm_mday][0];
where[6] = smhd[t->tm_mday][1];
where[7] = ' ';
where[8] = months[t->tm_mon][0];
where[9] = months[t->tm_mon][1];
where[10] = months[t->tm_mon][2];
where[11] = ' ';
if(t->tm_year > 203) {
t->tm_year = 203; /* avoid overflow, I don't care if shttp gives wrong years nearly 100 years from now. */
}
where[12] = years[t->tm_year][0];
where[13] = years[t->tm_year][1];
where[14] = years[t->tm_year][2];
where[15] = years[t->tm_year][3];
where[16] = ' ';
where[17] = smhd[t->tm_hour][0];
where[18] = smhd[t->tm_hour][1];
where[19] = ':';
where[20] = smhd[t->tm_min][0];
where[21] = smhd[t->tm_min][1];
where[22] = ':';
where[23] = smhd[t->tm_sec][0];
where[24] = smhd[t->tm_sec][1];
where[25] = ' ';
where[26] = 'G';
where[27] = 'M';
where[28] = 'T';
}
Boa 0.94.14rc21:
void rfc822_time_buf(char *buf, time_t s)
{
struct tm *t;
char *p;
unsigned int a;
if (!s) {
t = gmtime(¤t_time);
} else
t = gmtime(&s);
p = buf + 28;
/* p points to the last char in the buf */
p -= 3;
/* p points to where the ' ' will go */
memcpy(p--, " GMT", 4);
a = t->tm_sec;
*p-- = '0' + a % 10;
*p-- = '0' + a / 10;
*p-- = ':';
a = t->tm_min;
*p-- = '0' + a % 10;
*p-- = '0' + a / 10;
*p-- = ':';
a = t->tm_hour;
*p-- = '0' + a % 10;
*p-- = '0' + a / 10;
*p-- = ' ';
a = 1900 + t->tm_year;
while (a) {
*p-- = '0' + a % 10;
a /= 10;
}
/* p points to an unused spot to where the space will go */
p -= 3;
/* p points to where the first char of the month will go */
memcpy(p--, month_tab + 4 * (t->tm_mon), 4);
*p-- = ' ';
a = t->tm_mday;
*p-- = '0' + a % 10;
*p-- = '0' + a / 10;
*p-- = ' ';
p -= 3;
memcpy(p, day_tab + t->tm_wday * 4, 4);
}
Thttpd-2.23beta1:
const char* rfc1123fmt = "%a, %d %b %Y %H:%M:%S GMT";
(void) strftime( nowbuf, sizeof(nowbuf), rfc1123fmt, gmtime( &now ) );
Apache 2.2.3:
APR_DECLARE_DATA const char apr_month_snames[12][4] =
{
"Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
};
APR_DECLARE_DATA const char apr_day_snames[7][4] =
{
"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
};
apr_status_t apr_rfc822_date(char *date_str, apr_time_t t)
{
apr_time_exp_t xt;
const char *s;
int real_year;
apr_time_exp_gmt(&xt, t);
/* example: "Sat, 08 Jan 2000 18:31:41 GMT" */
/* 12345678901234567890123456789 */
s = &apr_day_snames[xt.tm_wday][0];
*date_str++ = *s++;
*date_str++ = *s++;
*date_str++ = *s++;
*date_str++ = ',';
*date_str++ = ' ';
*date_str++ = xt.tm_mday / 10 + '0';
*date_str++ = xt.tm_mday % 10 + '0';
*date_str++ = ' ';
s = &apr_month_snames[xt.tm_mon][0];
*date_str++ = *s++;
*date_str++ = *s++;
*date_str++ = *s++;
*date_str++ = ' ';
real_year = 1900 + xt.tm_year;
/* This routine isn't y10k ready. */
*date_str++ = real_year / 1000 + '0';
*date_str++ = real_year % 1000 / 100 + '0';
*date_str++ = real_year % 100 / 10 + '0';
*date_str++ = real_year % 10 + '0';
*date_str++ = ' ';
*date_str++ = xt.tm_hour / 10 + '0';
*date_str++ = xt.tm_hour % 10 + '0';
*date_str++ = ':';
*date_str++ = xt.tm_min / 10 + '0';
*date_str++ = xt.tm_min % 10 + '0';
*date_str++ = ':';
*date_str++ = xt.tm_sec / 10 + '0';
*date_str++ = xt.tm_sec % 10 + '0';
*date_str++ = ' ';
*date_str++ = 'G';
*date_str++ = 'M';
*date_str++ = 'T';
*date_str++ = 0;
return APR_SUCCESS;
}
Interesting variety, yes? Which do you think is the fastest?
Another interesting thing to look at is how these various http servers
call this routine. This routine is for constructing the Date: and
Last-Modified: strings that essentially go out with each and every
response. One important thing to note is that the precision of this
time string is only seconds, it does not show micro or nano seconds...
Yet, many servers assemble this string anew for every Date: response
because it presents the current time. Shttp uses a dedicated timer
thread that wakes up around 5 times a second and updates a global string
version of the current time using the construct_time() function above.
Some even call the time() system call before constructing the string
version for every response. So, lets say you are serving 10,000
requests per second... some of tehse servers would be constructing the
_exact same_ rfc822 date string ~10,000 times per second. Look at some
of those functions, full of modulos and divides. Hell, one even calls
memcpy just to move 4 bytes a few times.
Get the picture?
- Avoid using libcs conversion functions in code entered frequently
(G)Libc provides a robust set of conversions that make writing programs that
work very easy. However, once you get past the prototype stage it is
worthwhile to pull out the libc-provided conversions and streamline the code
to cut down the function call overhead and do the conversions in-place. Often
the libc conversions do far more than is necessary for what you need, and if
you are using the conversions via va_arg libc functions that use format
strings (sscanf, fscanf, fprintf, etc...), you are also incurring the overhead
of parsing your format string every call.
Daniel J. Bernstein (author of qmail, daemontools, http://cr.yp.to) has
written some conversion functions he uses in the qmail source which is freely
available that can be used in place of glibcs more general conversion
functions for common conversions with minor code modifications for a decent
speedup. Look in the qmail source directory at listings "fmt*".
- Obtain, build, and learn to use instrumentation for profiling
- Use the kernels builtin profiling, it's incredibly lightweight and very
helpful, see the readprofile(1) man page for more information.
- Use the files in /proc, there are tools out there for gathering data from
here that you can use, or build your own. These files contain alot of
helpful data, you just have to collect it and visualize it, preferably at
high resolution.
- Do not assume the compiler will optimize out what appear to you as obvious
compiler optimizations.
Example #1:
[code]
char buf[BUFSIZ];
if(read(fd, buf, sizeof(buf)) < 0) { /* read a null terminated string into buf */
/* handle read error */
return -1;
}
for(i = 0; i < strlen(buf); i++) {
/* read from buf, but don't modify buf (say.. attribute=value pair) */
}
[/code]
In this example, strlen() is called on a buffer to determine the strings
length, every iteration of the loop. This happens even if you never write
to the buffer in your loop. Simply moving the strlen(buf) to before the for
loop or at least outside of the test, would make a potentiallly dramatic
difference.
Example #2:
[code]
char buf[BUFSIZ];
int len;
if(read(fd, buf, sizeof(buf)) < 0) { /* read a null terminated string into buf */
/* handle read error */
return -1;
}
for(i = 0, len = strlen(buf); i < len; i++) {
/* read from buf, but don't modify buf (say.. attribute=value pair) */
}
[/code]
In example #2, we have insured strlen(buf) is called at most once, it's a big
improvement over the last version, such a subtle change makes a tremendous
difference. However, it's still not quite right.
In most cases, we have the length of the data in the buffer already on hand,
that length should just be kept around and used rather than calculated by
exhaustively scanning the buffer for a terminating byte.
Example #3:
[code]
char buf[BUFSIZ];
int len;
len = read(fd, buf, sizeof(buf)); /* read a null terminated string into buf */
if(len < 0) {
/* handle read error */
return -1;
}
for(i = 0; i < len; i++) {
/* read from buf, but don't modify buf (say.. attribute=value pair) */
}
[/code]
In example #3, we lose the strlen altogether, and only access bytes in the
buffer that read() populated, the loop might need to be altered slightly
to deal with encountering a null byte which the strlen() may have prevented
you from hitting, but thats just a minor change it shouldnt have any
significant impact on the efficiency of your loop, but it does illustrate how
it's not a simple drop-in replacement, you must understand what the meat of
the loop is doing.
Also, if inside your loop you start modifying the buffer contents, maybe
you do need to continuously recalculate the string length with strlen() in
the test, but it's very unlikely, and if you are modifying it you should have
knowledge of the changes when they are made and be able to update a length
variable at that time rather than forcing recalculation at every test, so it
basically never makes sense... rethink your algorithm.
There are cases where gcc will optimize it out, but they are few and far
between, it's basically when the string is a constant last I checked. It
does optimize string function calls into inlined ones which can hide these
inefficient codes from prying eyes using ltrace... it depends on the
optimization options used, you really have to look at the code, and should
look at the produced assembly output, if you have your doubts. It is best
to explicitly do it the efficient way rather than rely on chance of the
compiler doing the right thing.
I will update this page as I recall and discover more things and get time to document them, so bookmark it or something and check back.