Linux developer tips for achieving maximum performance and scalability, some are
linux-specific and some are more general programming tips.  

Most of these were painfully sorted through while I worked for Hostway Inc. as a
systems develeper, fixing their problematic courier-based mail cluster.  In some
cases I have a link to web pages I created while I was actively working on the
mail system, these should provide some more proof / information of the issues I
bring up below.


- Don't use stdio

  But if you do:
    - Use __fsetlocking(stream, FSETLOCKING_BYCALLER) when you don't need the
      libc implicit file object locking.  
    - Use _unlocked() versions of stdio functions when available, even if you
      set the locking mode to FSETLOCKING_BYCALLER, this avoids the check of the
      locking mode at each call, but it's not available for all stdio functions
      usually so still set the locking mode.
    - Don't use fgetc/getc/getchar, none have been fast macros for a long time
      now, I think libc5 was the last time.

   Additional information including graphs:
   stdio locking
   getc is bad


- Use multiple threads instead of multiple processes whenever possible


- When using threads don't manipulate the signal mask (sigprocmask()) frequently

  There are some libc & pthread functions which do this behind the scenes so be
  careful.
  
  Particularly, syslog() does in some libc6 versions to ignore SIGPIPE (they
  fix this in redhat at some point, check it yourself, debian sarge still had it
  broken when I last checked).

  Manipulating the cancellation state also appears to manipulate the signal
  mask and should be avoided.

  The reason this must be avoided is this function appears to manipulate the
  mask per-thread, when you have thousands of threads, a single call explodes
  into an O(N) where N is the number of threads in the process, this is a huge
  problem when the number of threads gets high and can generally be avoided
  pretty easily.

  Additional information including graphs:
  sigprocmask/syslog() and threads don't mix


- Avoid concurrent cross-directory renames on the same filesystem

  One of the 'features' of rename() is atomicity.  In linux, as of 2.6.11.8, at
  the vfs layer, in order to achieve atomicity with a cross-directory rename, a
  per-filesystem lock is held during the rename operation.
  
  So if you have a parallel program which you have put alot of effort into
  reducing contention and maximizing parallelism doing cross-directory renames
  on the same filesystem regularly, you are shooting yourself in the foot.  This
  effectively serializes all your parallel execution contexts at the rename.

  One situation where this is very likely to happen is in mail software using 
  the Maildir format.  The process of moving mail from new/ to cur/ involves
  cross-directory rename() calls.  If you are storing your mail over NFS it
  exacerbates the situation because the rename operation takes significantly
  longer to perform than on say a local ext2 volume.  This means more wall-time
  passes while in the critical section (while the per-filesystem rename lock is
  held).

  Ways to work around this problem is to either eliminate the rename() calls, or
  add more filesystems.  For example, if you had a filesystem per Maildir the
  problem disappears unless you have concurrent access on the same Maildir.  I
  don't expect anyone to actually try this but it illustrates the problem and
  solution well.  Dividing your mailstore into many smaller filesystems can help
  dramatically in this context.

  I expect this problem to get solved eventually, it doesnt make sense to me
  really, because the rename() function in the vfs layer acquires the locks
  of the relevant directories after acquiring the per-filesystem lock.  I would
  imagine after successfully acquiring the other locks the per-filesystem lock
  could be released while the rename() operation is executed, then the directory
  locks released in the reverse order of acquisition.  It looks to my non-kernel
  -programmer eyes that the per-filesystem lock is needed to make the
  acquisition of the other directory locks atomic.  But once they are held the
  big lock should get released, no?

  If this big lock could be released during the operation, I suspect it would
  have a tremendous impact on most of the Maildir-based mail servers out in
  the wild.


- Don't use select(), use poll() or the non-portable epoll interface

  Select() is less efficient than poll(), and also has a limit on the highest
  file descriptor value it can operate on, this highest file descriptor value
  is defined in FD_SETSIZE.  Often programmers don't realize, this is not only
  a size limitation on the number of file descriptors select() can work with,
  it is a limit on the highest file descriptor *VALUE* it can work with.  When
  you start working with large scales, with todays hardware it is trivial to
  have enough concurrent connections to exceed this value.  The result is
  usually a crash, but really unpredictable.  In my experience, you get a
  segfault more often than not.

  I discovered a bug in OpenSSL that was exactly this problem, and was causing
  Apache2 to segfault when configured with SSL and enough virtual hosts to
  cause the next fd to be higher than FD_SETSIZE (because of the log files
  opened per vhost).  It proved to be quite an elusive bug due to the complexity
  and size of the Apache2 codebase and the fact that the bug was not actually
  in the Apache2 code but in an external library pulled in by the SSL feature.

  The bug report & patch is documented here:
  select() bug report to openssl-dev mailing list


- Avoid repeated conversions of string to integer and vice versa

  Often times I see code that repeatedly prints the string version of a value
  stored as an integer that hasnt changed, and possibly won't change ever,
  converting it every time.

  Depending on where this happens, it can become a serious pointless waste of
  CPU cycles, most often I see this in logging functions.  If you happen to have
  an integer that rarely changes, and it is converted to string form frequently
  for whatever reason, consider creating a character array for storing the
  string version of the value persistently.  Then when you go to use the value
  in string form, simply check if it is changed since you last created the
  string version.  If it hasnt changed, use the string version as-is.  If it
  has, then you have to recreate the string version.

  It seems some programmers don't realize what is involved in converting an
  integer into a string (e.g. sprintf(buf, "%i", foo);), it is not cheap.  In
  its simplest form it's a loop of divides and modulos by the base, storing the
  characters in the destination buffer in reverse order, which usually have to
  be copied again into their final destination now that the length is known.
  Anyone who has done programming with an eye towards optimization knows divides
  and multiplies are some of the most expensive integer operations you can do.

  Once you understand how expensive it is to create that string version, you can
  see how storing a 'last converted' copy of the integer version and a
  character array to store the string version is not such a big deal.  This is
  all you need to be able to detect if the value is different from the last time
  it was converted, and convert only when necessary.

  The same is true for the opposite direction, from string to integer, don't do
  it more often than necessary.  It's slightly more efficient than the other
  direction but still silly to do excessively.

  If you are curious to see in more detail how the conversions are implemented,
  look at the glibc source.  The integer-to-string conversion is done in
  "stdio-common/_itoa.c" in the version I have sitting here (glibc-2.2)

  *** An excellent example of this in practice is the construct_time() usage
      in the Shttp module found here:

      http://serverkit.org/modules/contrib/shttp/


      This function constructs an rfc822-compatible date+time string compliant
      with the HTTP/1.1 RFC for the Date: and Last-Modified: response headers.

      For comparison purposes, other http servers do it quite differently, and
      it should be fairly obvious which is the most efficient:

       Shttp 0.0.3:

	char *smhd[] = {"00","01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28","29","30","31","32","33","34","35","36","37","38","39","40","41","42","43","44","45","46","47","48","49","50", "51","52","53","54","55","56","57","58","59"};
	char *months[] = {"Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"};
	char *days[] = {"Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"};
	char *years[] = {"1900", "1901", "1902", "1903", "1904", "1905", "1906", "1907", "1908", "1909", "1910", "1911", "1912", "1913", "1914", "1915", "1916", "1917", "1918", "1919", "1920", "1921", "1922", "1923", "1924", "1925", "1926", "1927", "1928", "1929", "1930", "1931", "1932", "1933", "1934", "1935", "1936", "1937", "1938", "1939", "1940", "1941", "1942", "1943", "1944", "1945", "1946", "1947", "1948", "1949", "1950", "1951", "1952", "1953", "1954", "1955", "1956", "1957", "1958", "1959", "1960", "1961", "1962", "1963", "1964", "1965", "1966", "1967", "1968", "1969", "1970", "1971", "1972", "1973", "1974", "1975", "1976", "1977", "1978", "1979", "1980", "1981", "1982", "1983", "1984", "1985", "1986", "1987", "1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999", "2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020", "2021", "2022", "2023", "2024", "2025", "2026", "2027", "2028", "2029", "2030", "2031", "2032", "2033", "2034", "2035", "2036", "2037", "2038", "2039", "2040", "2041", "2042", "2043", "2044", "2045", "2046", "2047", "2048", "2049", "2050", "2051", "2052", "2053", "2054", "2055", "2056", "2057", "2058", "2059", "2060", "2061", "2062", "2063", "2064", "2065", "2066", "2067", "2068", "2069", "2070", "2071", "2072", "2073", "2074", "2075", "2076", "2077", "2078", "2079", "2080", "2081", "2082", "2083", "2084", "2085", "2086", "2087", "2088", "2089", "2090", "2091", "2092", "2093", "2094", "2095", "2096", "2097", "2098", "2099", "2100", "2101", "2102", "2103", "2104"};

	static void construct_time(char *where, time_t *time)
	{
		struct tm       *t = gmtime(time);

		where[0] = days[t->tm_wday][0];
		where[1] = days[t->tm_wday][1];
		where[2] = days[t->tm_wday][2];
		where[3] = ',';
		where[4] = ' ';
		where[5] = smhd[t->tm_mday][0];
		where[6] = smhd[t->tm_mday][1];
		where[7] = ' ';
		where[8] = months[t->tm_mon][0];
		where[9] = months[t->tm_mon][1];
		where[10] = months[t->tm_mon][2];
		where[11] = ' ';
		if(t->tm_year > 203) {
			t->tm_year = 203; /* avoid overflow, I don't care if shttp gives wrong years nearly 100 years from now. */
		}
		where[12] = years[t->tm_year][0];
		where[13] = years[t->tm_year][1];
		where[14] = years[t->tm_year][2];
		where[15] = years[t->tm_year][3];
		where[16] = ' ';
		where[17] = smhd[t->tm_hour][0];
		where[18] = smhd[t->tm_hour][1];
		where[19] = ':';
		where[20] = smhd[t->tm_min][0];
		where[21] = smhd[t->tm_min][1];
		where[22] = ':';
		where[23] = smhd[t->tm_sec][0];
		where[24] = smhd[t->tm_sec][1];
		where[25] = ' ';
		where[26] = 'G';
		where[27] = 'M';
		where[28] = 'T';
	}


       Boa 0.94.14rc21:

	void rfc822_time_buf(char *buf, time_t s)
	{
	    struct tm *t;
	    char *p;
	    unsigned int a;

	    if (!s) {
		t = gmtime(¤t_time);
	    } else
		t = gmtime(&s);

	    p = buf + 28;
	    /* p points to the last char in the buf */

	    p -= 3;
	    /* p points to where the ' ' will go */
	    memcpy(p--, " GMT", 4);

	    a = t->tm_sec;
	    *p-- = '0' + a % 10;
	    *p-- = '0' + a / 10;
	    *p-- = ':';
	    a = t->tm_min;
	    *p-- = '0' + a % 10;
	    *p-- = '0' + a / 10;
	    *p-- = ':';
	    a = t->tm_hour;
	    *p-- = '0' + a % 10;
	    *p-- = '0' + a / 10;
	    *p-- = ' ';
	    a = 1900 + t->tm_year;
	    while (a) {
		*p-- = '0' + a % 10;
		a /= 10;
	    }
	    /* p points to an unused spot to where the space will go */
	    p -= 3;
	    /* p points to where the first char of the month will go */
	    memcpy(p--, month_tab + 4 * (t->tm_mon), 4);
	    *p-- = ' ';
	    a = t->tm_mday;
	    *p-- = '0' + a % 10;
	    *p-- = '0' + a / 10;
	    *p-- = ' ';
	    p -= 3;
	    memcpy(p, day_tab + t->tm_wday * 4, 4);
	}

   
       Thttpd-2.23beta1:

	const char* rfc1123fmt = "%a, %d %b %Y %H:%M:%S GMT";

	(void) strftime( nowbuf, sizeof(nowbuf), rfc1123fmt, gmtime( &now ) );


       Apache 2.2.3:

	APR_DECLARE_DATA const char apr_month_snames[12][4] =
	{
	    "Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
	};
	APR_DECLARE_DATA const char apr_day_snames[7][4] =
	{
	    "Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"
	};

	apr_status_t apr_rfc822_date(char *date_str, apr_time_t t)
	{
	    apr_time_exp_t xt;
	    const char *s;
	    int real_year;

	    apr_time_exp_gmt(&xt, t);

	    /* example: "Sat, 08 Jan 2000 18:31:41 GMT" */
	    /*           12345678901234567890123456789  */

	    s = &apr_day_snames[xt.tm_wday][0];
	    *date_str++ = *s++;
	    *date_str++ = *s++;
	    *date_str++ = *s++;
	    *date_str++ = ',';
	    *date_str++ = ' ';
	    *date_str++ = xt.tm_mday / 10 + '0';
	    *date_str++ = xt.tm_mday % 10 + '0';
	    *date_str++ = ' ';
	    s = &apr_month_snames[xt.tm_mon][0];
	    *date_str++ = *s++;
	    *date_str++ = *s++;
	    *date_str++ = *s++;
	    *date_str++ = ' ';
	    real_year = 1900 + xt.tm_year;
	    /* This routine isn't y10k ready. */
	    *date_str++ = real_year / 1000 + '0';
	    *date_str++ = real_year % 1000 / 100 + '0';
	    *date_str++ = real_year % 100 / 10 + '0';
	    *date_str++ = real_year % 10 + '0';
	    *date_str++ = ' ';
	    *date_str++ = xt.tm_hour / 10 + '0';
	    *date_str++ = xt.tm_hour % 10 + '0';
	    *date_str++ = ':';
	    *date_str++ = xt.tm_min / 10 + '0';
	    *date_str++ = xt.tm_min % 10 + '0';
	    *date_str++ = ':';
	    *date_str++ = xt.tm_sec / 10 + '0';
	    *date_str++ = xt.tm_sec % 10 + '0';
	    *date_str++ = ' ';
	    *date_str++ = 'G';
	    *date_str++ = 'M';
	    *date_str++ = 'T';
	    *date_str++ = 0;
	    return APR_SUCCESS;
	}


       Interesting variety, yes?   Which do you think is the fastest?

       Another interesting thing to look at is how these various http servers
       call this routine.  This routine is for constructing the Date: and
       Last-Modified: strings that essentially go out with each and every
       response.  One important thing to note is that the precision of this
       time string is only seconds, it does not show micro or nano seconds...

       Yet, many servers assemble this string anew for every Date: response
       because it presents the current time.  Shttp uses a dedicated timer
       thread that wakes up around 5 times a second and updates a global string
       version of the current time using the construct_time() function above.

       Some even call the time() system call before constructing the string
       version for every response.  So, lets say you are serving 10,000
       requests per second... some of tehse servers would be constructing the
       _exact same_ rfc822 date string ~10,000 times per second.  Look at some
       of those functions, full of modulos and divides.  Hell, one even calls
       memcpy just to move 4 bytes a few times.

       Get the picture?
	


- Avoid using libcs conversion functions in code entered frequently

  (G)Libc provides a robust set of conversions that make writing programs that
  work very easy.  However, once you get past the prototype stage it is
  worthwhile to pull out the libc-provided conversions and streamline the code
  to cut down the function call overhead and do the conversions in-place.  Often
  the libc conversions do far more than is necessary for what you need, and if
  you are using the conversions via va_arg libc functions that use format
  strings (sscanf, fscanf, fprintf, etc...), you are also incurring the overhead
  of parsing your format string every call.

  Daniel J. Bernstein (author of qmail, daemontools, http://cr.yp.to) has
  written some conversion functions he uses in the qmail source which is freely
  available that can be used in place of glibcs more general conversion
  functions for common conversions with minor code modifications for a decent
  speedup.  Look in the qmail source directory at listings "fmt*".


- Obtain, build, and learn to use instrumentation for profiling

  - Use the kernels builtin profiling, it's incredibly lightweight and very
    helpful, see the readprofile(1) man page for more information.
  - Use the files in /proc, there are tools out there for gathering data from
    here that you can use, or build your own.  These files contain alot of
    helpful data, you just have to collect it and visualize it, preferably at
    high resolution.


- Do not assume the compiler will optimize out what appear to you as obvious
  compiler optimizations.

  Example #1:

  [code]
  char	buf[BUFSIZ];

  if(read(fd, buf, sizeof(buf)) < 0) { /* read a null terminated string into buf */
  	/* handle read error */
	return -1;
  }

  for(i = 0; i < strlen(buf); i++) {
  	/* read from buf, but don't modify buf (say.. attribute=value pair) */
  }
  [/code]

  In this example, strlen() is called on a buffer to determine the strings
  length, every iteration of the loop.  This happens even if you never write
  to the buffer in your loop.  Simply moving the strlen(buf) to before the for
  loop or at least outside of the test, would make a potentiallly dramatic
  difference.

  Example #2:

  [code]
  char	buf[BUFSIZ];
  int	len;

  if(read(fd, buf, sizeof(buf)) < 0) { /* read a null terminated string into buf */
  	/* handle read error */
	return -1;
  }

  for(i = 0, len = strlen(buf); i < len; i++) {
  	/* read from buf, but don't modify buf (say.. attribute=value pair) */
  }
  [/code]

  In example #2, we have insured strlen(buf) is called at most once, it's a big
  improvement over the last version, such a subtle change makes a tremendous
  difference.  However, it's still not quite right.

  In most cases, we have the length of the data in the buffer already on hand,
  that length should just be kept around and used rather than calculated by
  exhaustively scanning the buffer for a terminating byte.

  Example #3:

  [code]
  char	buf[BUFSIZ];
  int	len;

  len = read(fd, buf, sizeof(buf)); /* read a null terminated string into buf */

  if(len < 0) {
  	/* handle read error */
	return -1;
  }

  for(i = 0; i < len; i++) {
  	/* read from buf, but don't modify buf (say.. attribute=value pair) */
  }
  [/code]

  In example #3, we lose the strlen altogether, and only access bytes in the
  buffer that read() populated, the loop might need to be altered slightly
  to deal with encountering a null byte which the strlen() may have prevented
  you from hitting, but thats just a minor change it shouldnt have any
  significant impact on the efficiency of your loop, but it does illustrate how
  it's not a simple drop-in replacement, you must understand what the meat of
  the loop is doing.

  Also, if inside your loop you start modifying the buffer contents, maybe
  you do need to continuously recalculate the string length with strlen() in
  the test, but it's very unlikely, and if you are modifying it you should have
  knowledge of the changes when they are made and be able to update a length
  variable at that time rather than forcing recalculation at every test, so it
  basically never makes sense... rethink your algorithm.

  There are cases where gcc will optimize it out, but they are few and far
  between, it's basically when the string is a constant last I checked.  It
  does optimize string function calls into inlined ones which can hide these
  inefficient codes from prying eyes using ltrace... it depends on the
  optimization options used, you really have to look at the code, and should
  look at the produced assembly output, if you have your doubts.  It is best
  to explicitly do it the efficient way rather than rely on chance of the
  compiler doing the right thing.

I will update this page as I recall and discover more things and get time to document them, so bookmark it or something and check back.


© 2006 - Vito Caputo