We're out of the starting blocks

So, let's begin by configuring a simple web server, creating a web page or two and build on our knowledge from there.

In it's simplest form, Apache is quite easy to configure. Freshly unwrapped out of the plastic, it is almost configured. Just a couple of tweaks and we're off.

The Listen directive

Directives are keywords that are used in the configuration file. The listen directive for example would tell the server what port it will be listening on.

Listen 80
                

Will listen on port 80, the default web server port. Directives are case insensitive, so "Listen" and "listen" will be the same directive.

The ServerAdmin directive

This is the email address of the administrator of this web server. Generally, web administrators use some generic name such as webmaster@QEDux.co.za, rather than their own name. I suggest you follow that convention. How do you get this email relayed to your personal email address? Create an alias on the mail server (which is covered in the configuration of the MTA).

In my case, I've set mine as follows:

ServerAdmin webmistress@QEDux.co.za
                

The DocumentRoot directive

Finally, to get us going, we need to verify our DocumentRoot directive. This dictates where out actual .html pages will reside. The DocumentRoot on my machine is:

DocumentRoot /var/www/html
                

Since this where the html files reside, you will need to put a file in this directory called index.html. The index.html file is attached to this course and is called index-first.html. Not only will you will need to copy it to your DocumentRoot directory but you will also be required to rename it to index.html. When a client enters the URL in their browser as follows:

http://www.QEDux.co.za/
                

they expect to get the index.html page in return. The web server in turn, appends the "/" to the end of the DocumentRoot in order to find the page in question. This, in essence is similar to a chroot in Linux. When you chroot, you "hide" the real path to the file system, presenting the user with what looks like the root directory. Here the same applies. You would not want your clients to be able to see that the real directory lives in /var/www/html, they should just see it a "/"

Starting Apache

Starting Apache can be done from the command line. However before you jump in boots and all, it might be wise to run your httpd.conf file through the syntax checker. Start by getting help on the httpd options:

httpd -?
                

Once you've scanned the options (and of course read the man page for httpd ;-), you can run the syntax checker using the switch:

httpd -t
                

which should tell you whether you've made any glaring errors in the httpd.conf file. If not, we're in business. Now run:

httpd
                

This will pause briefly on the command line and then return you to the prompt. Now, it is worth looking at the httpd process BEFORE looking at the web page (index.html) you copied to the DocumentRoot earlier. To do this type:

ps -ef|grep httpd
                

In my case, I see:

root      1983     1  2 22:36 ?        00:00:00 httpd
                

Now, using a text browser like links or lynx, type the following command:

lynx http://your-IP-address/index.html
                

Bingo, the page is served to you. If you don't have lynx or links (both text based browsers) then fire up your old trusty GUI browser and type in the address.

Now, we need to look back at the httpd processes running on your system after you've retrieved the web page. Again type:

ps -ef|grep httpd
                

Now, in addition to the one server owned by root, another stack of servers are running, all being owned (in my case) by Apache.

apache    1791  1788  0 10:44 ?        00:00:00 /usr/sbin/httpd
apache    1792  1788  0 10:44 ?        00:00:00 /usr/sbin/httpd
......
apache    1798  1788  0 10:44 ?        00:00:00 /usr/sbin/httpd
                

Why is this? Why are there many servers started when you purposely only started one?

Multi-processing Modules (MPM)

Apache has been designed to run using child processes and threading. In essence, these features make it possible for Apache to service incoming client requests in a timely and efficient manner. It is important to understand the difference between threading and child processes. While this is not essential to the basic configuration of Apache it will help you understand how Apache is servicing the client requests.

Apache uses two sets of modules to address these modes of operation, namely:

  1. the worker module for the threading and

  2. the prefork module for the handling of the child process

The diagram below is a schematic showing the worker module in action.

Figure 1.1. The Apache Worker Module

The Apache Worker Module

It works as follows:

The workers

The following description should be read in conjunction with the diagram above. The root user starts the initial process, in order that Apache can use a low numbered port (80). Thereafter, this root-owned process forks a number of child processes. Each child process starts a number of threads (ThreadsPerChild directive) to answer client requests. Child processes are started or killed in order to keep the number of threads within the boundaries specified in MinSpareThreads and MaxSpareThreads. Every child forks a single listener thread which will accept incoming requests and forward them to a worker thread to be answered. Other parameters used in the worker.c directive are:

MaxClients: This is the maximum number of clients that can be served by all processes and threads. Note that the maximum active child processes is obtained by:

MaxClients / ThreadsPerChild
                    

ServerLimit is the hard limit of the number of active child processes. This should always be greater than:

MaxClients / ThreadsPerChild
                    

Finally, ThreadLimit is the hard limit of the number of threads. This should also be larger than ThreadsPerChild

All this is bounded by the IfModule worker.c </IfModule> directives. On the whole, you probably will not need to modify these settings, but at least you know what they are for now.

The child processes

Handling requests using the prefork module is similar in many respects to the worker module, the main difference being that this is not a threading modules, meaning that threads do not answer the client http requests, the child processes themselves do. Again, root starts the initial httpd process in order to bind to a low-numbered port.

The startup process then starts a number of child processes (StartServers) to answer incoming requests. Since children in the prefork module do not use threads, they have to answer the requests themselves. Apache will regulate the number of servers. MinSpareServers is the minimum number of spare servers that will be running at all times, with the upper limit bounded by MaxSpareServers. The MaxClients will be the maximum number of client requests that may be answered at one time. Finally, the MaxRequestsPerChild is the number of requests that a single child process can handle. All these are configurable within the IfModuleprefork.c</IfModule> directives, however the defaults will probably suite you for the purpose of this course and probably for a server in the office too.

Http ports

By default, web servers will mostly listen on port 80. While this is not a prerequisite, the browsers almost without fail will try to connect to a web server on port 80. So, the two need to co-encide - naturally. Linux ports below 1024 are considered well-known and only root can bind to these ports from the operating systems level. Hence the need to start Apache as root, and thereafter switching to an Apache (or other) user. As of Apache 2, a port must be specified within the configuration file as defined by the Listen directive. By default, Apache will offer web pages on every IP address on the server on this port. In some instances, this may not be the desired operation. Thus the Listen directive can also take an IP address and a port number as follows:

Listen 10.0.0.2:1080
Listen 192.168.1.144:80
                    

Here the server contains two host addresses each answering http requests on different ports.

Modularity of Apache

As mentioned earlier, Apache can be used as a monolithic server (all modules required are compiled into the executable), or in a modular fashion using dynamic shared objects (DSO's). The latter is the preferred means of using Apache as the server has a smaller memory footprint. we'll discuss the DSO method of configuring and using Apache, and leave it up to you, the learner, to read the documentation if you wish to do anything different from this.

Modules are loaded into the core server using the mod_so module. In order to see what modules are compiled into Apache running on your machine, you can issue the command (on the command line):

httpd -l
                

On my machine, this command returns:

linux:/ # httpd -l
Compiled-in modules:
  core.c
  prefork.c
  http_core.c
  mod_so.c
                

Clearly, from our discussion earlier on worker versus prefork modules, my Apache will use the prefork directives. Also, in order for Apache to be able to load modules as required, it needs the mod_so compiled in too. If it was not compiled in, it would be unable to load itself, never mind any other modules!

The LoadModule directive will load the associated module when needed. The syntax for this directive is straight-forward.

LoadModule foo_module modules/foo_mod.so
                

This will load foo_module from the file in the modules directory called foo_mod.so.

Once this is configured, there's little else to do. There are however, configuration files that may be associated with each module. PHP for example has a configuration file to specify run-time configuration directives. Configuration files usually reside in the same (or similar) place as your httpd.conf file described earlier. On my system, these files reside in:

/etc/httpd/conf.d/
                

Included below is the php.conf file from this directory.

#
# PHP is an HTML-embedded scripting language which attempts to
# make it easy for developers to write dynamically generated
# webpages.
#

LoadModule php4_module modules/libphp4.so

#
# Cause the PHP interpreter handle files with a .php extension.
#
<Files *.php>
    SetOutputFilter PHP
    SetInputFilter PHP
    LimitRequestBody 524288
</Files>

#
# Add index.php to the list of files that will be served
# as directory indexes.
#
DirectoryIndex index.php
                

This configuration file is "appended" to the httpd.conf file when the mod_php.so is loaded. You may notice that the structure of this file looks very similar to the Apache configuration file - for good reason.

Documentation of the modules is generally shipped with the Apache manual, which will reside on the default web site on the web server. In my case, my default DocumentRoot is /var/www/html, so the manual for Apache resides in /var/www/manual, and the module manuals reside in /var/www/manual/mod/.

We'll cover the use of some of these modules during the course. For now though, accept that the modules can be used for a variety of purposes.

User and group information

As mentioned earlier, Apache will run as root on startup (if you want to use low-numbered ports). Once this is done though, the server will start child process as the user/group specified in the server configuration file. The User and Group directives specify this.

User <your Apache user here>
Group <your Apache group here>
                    
[Note] Note

Starting child processes as this user will have an effect on your CGI scripts running later. We'll cover CGI and a program called suexec later in the course, but it is worth remembering this fact.

ServerName

Often a server may not be named the same as the host you wish your users to refer to it as. For example, my server in my office is called ipenguini.QEDux.co.za, while I wish users wanting to contact my web site to use www.QEDux.co.za. Since these are in fact the same server (using the CNAME Internet Resource Record in the DNS), I would put my ServerName as www.QEDux.co.za.

DocumentRoot

This directive is essential to the running of Apache. It dictates where the contents on the web site will reside. Assume that users wish to contact my web site www.QEDux.co.za. They will type in the following URL:

http://www.QEDux.co.za/index.html
                    

Clearly, they could have left out the index.html, but I have put it in here for a reason. Now, presumably, your index.html file does not reside in the root directory. In my case it lives in:

/var/www/QEDux.co.za/html
                    

So my DocumentRoot directory is set to:

DocumentRoot /var/www/QEDux.co.za/html
                    

The DocumentRoot directive translates in the URL to the "Apache root directory /", hence the "/" on the end of the URL

http://www.QEDux.co.za/
                    

We wouldn't want users to type in:

http://www.QEDux.co.za/var/www/QEDux.co.za/html/index.html
                    

Apart from this presenting a security problem, it would confuse the users no end.

In sum then, the DocumentRoot directory specifies the point of departure for the index.html page and the rest of the web site. In essence, it "hides" the /var/www/QEDux.co.za/html, parading it as "/" to the user.

OK. With all this new-found knowledge, you're wanting to get cracking with your new web server, but wait. There's more. While we have defined the DocumentRoot and the ServerName, we still need to define what operations can be performed in directories within the DocumentRoot.

In order to do this, Apache has "container" blocks. These blocks are very XML-ish in nature. For example, there is a Directory container, that contains information about a particular directory. The Directory container is closed using a </Directory> container directive. Other containers include DirectoryMatch, Files, FilesMatch, etc. The general format of a container is:

<Directory>
	Some options and directives go here
	Some more options and directives
</Directory>
                    

By default, we need to specify a really secure set of directives for the root ("/") directory. [1]

<Directory />
    Options FollowSymLinks
    AllowOverride None
</Directory>
                    

Without going in to too much detail here (it will be covered later), we allow users to follow symbolic links from the root directory, but allow no other options to override the options in this container. Options can be overridden using a .htaccess file (described later) if the AllowOverride was set to "All".

Once we've set a very restrictive set of permission for the root directory, we set up the permission for the DocumentRoot directory.

<Directory "/var/www/QEDux.co.za/html">
    Options Indexes FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>
                    

Here, the options are a little more relaxed. First, we allow users to follow symbolic links. Also, if they happen upon a directory without an index.html or default.html file, they will obtain an index of the files in the directory - in a similar way you get an index of your directory if you type:

file:///var/log
                    

into your browser window. We still disallow any directives set in a .htaccess file from overriding the directives set in this container, and finally we set access to this directory. In this example, we will check the Allow list, and then the deny list of whether this user can gain access. This may be particularly helpful when restricting web sites to say, the help desk operators, or the call-centre people. In the example, we first check the allow list (which admits everyone) and then the deny list (which is absent here), and allow or deny people accordingly.

After setting your Directory container symbol (adjust to your DocumentRoot), you are ready to start your web server. In order to do this, you will probably want to create a default web page. I have included one for download with this course - index-first.html. Rename it to index.html and put it in your DocumentRoot directory. Point your browser at your machine and voila, you should see the web page.

This is an example of the directory container for the "cgi-bin" directory. Note how the directory is specified.

<Directory "/srv/www/cgi-bin">
        AllowOverride None
        Options +ExecCGI -Includes
        Order allow,deny
        Allow from all
</Directory>
                    
[Note] Note

In order to use your machine name rather than your IP address in the URL, you will have to modify your hosts file to contain the FQDN of your hosts. My host for example is defender.QEDux.co.za, so I put www.QEDux.co.za as an alias in my hosts file as follows:

192.168.1.24	defender.QEDux.co.za	www.QEDux.co.za
                            

Now that you have got the basic thing going, it's time to return to those options in the containers. Options are controlled using a "+" or a "-" in front of the option. No sign preceding an option is taken to mean a "+". In our preceding examples, the option FollowSymLinks was not preceded by a "+", but the "+" was implied. A subtlety here is that if the option is preceded by a "+", the option will be added to any previous options as laid out elsewhere. So the options may be:

  • All: Allow all options, all that is except MultiViews.

  • FollowSymLinks: Follow symbolic links from this directory, even if the target directory does not actually reside in the same DocumentRoot directory.

  • SymLinksIfOwnerMatch: Follow the symbolic links, but only if the owner of the symbolic links matches the owner of the directory in which the file currently resides.

  • ExecCGI: Allow execution of CGI scripts in this directory

  • Includes: Allow server side includes. SSI will be covered in more detail later in the course.

  • Indexes: If the directory in which the user finds themselves does not contain any index.html or default.html (as specified by the directive DirectoryIndex), then a listing (probably with pretty icons) will be displayed.

  • MultiViews: This is a little more complex. Web sites can be tailored in a number of ways. They could offer your website in Xhosa for example, and the users home language would be selected automatically depending on their browser preferences. My index page may be index.xh.html as well as index.en.html. Now when a native English speaker views my site (with their browser configured appropriately), they will see the English page, while when the native Xhosa speaker views the site (browser configured correctly again), they will see the Xhosa version. Allowing MultiViews makes this possible. Language support is not the only thing available in MultiViews. Text/html versus text only, jpg or gif versus png's can also be selected based upon priorities.

The example below, taken from the Apache manual shows that text/html is preferred over straight text and gif's and jpg's are preferred over all other image types.

Accept: text/html; q=1.0, text/*; q=0.8, image/gif; q=0.6,
        image/jpeg; q=0.6, image/*; q=0.5, */*; q=0.1
                    

Since MultiViews are a directory based directive, they need to be enabled using the .htaccess files in a directory in order to take effect (see a description of the .htaccess file below).

let's return now to the AllowOverride directive.

  • The AllowOverride directive is only allowed in a Directory container.

  • As mentioned earlier when discussing the "/" (root) directory, when this directive is set to "None" the .htaccess files are completely ignored.

  • When set to "All" the directives in the .htaccess files are allowed based upon their "Context".

    [Note] Note

    Context refers to the context (or containers) in which directives are allowed or denied. In some contexts, directives are not allowed due to possible breaches of security, or because they don't make much sense in the context

Other override directives include:

  • authentication directives (AuthConfig - see mod_auth for more information),

  • file information directives (FileInfo - see mod_mime for more information), directory indexing (Indexes - see mod_autoindex for more information),

  • limitation of access to the web site (Limit - see mod_access for more information) or,

  • any of the options described above - provided of course they make sense in the context of the container (Options)

Using these options in, say, a .htaccess file in some web directory will allow us to limit access to that directory to individuals in a certain subnet, or show the indexes in a particular manner, etc.

The final piece of our container pie above relates to access to this page or site. This all falls under the mod_access module. mod_access is responsible for controlling access to the page/site based upon one, or a number of criteria:

  • Clients hostname

  • Clients IP address and/or subnet

  • Environmental variables being set

The last of these is too complex for this course and will not be discussed now. The other two are relatively easy.

We may set an allow/deny directive as follows:

Allow from all
                    

which will naturally allow for access from all clients. Alternatively:

Allow from 172.16.1.0/24
                    

will allow only users on the network 172.16.1.0 to connect to the web site. Others examples:

Allow from QEDux.co.za
                    

Only allow clients to connect from the domain QEDux.co.za.

Allow from \
  172.16.1.1 ipenguini.QEDux.co.za 192.168.10.0/255.255.255.0
                    

Only allow clients on the 192.168.10.x network to connect, as well as the hosts ipenguini.QEDux.co.za and 172.16.1.1.

The deny rules work identically. The only outstanding thing is to determine the order in which these rules are applied. For that the Order directive is used - simply indicating the order in which the allow and deny directives are applied.

Assuming I have the directives:

Allow from \
  172.16.1.1 ipenguini.QEDux.co.za 192.168.10.0/255.255.255.0
Deny from all
Order Allow,Deny
                    

Once a match is made, no further checking is done. So, had I switched the Order directive to:

Order Deny,Allow
                    

Then even the poor souls that appear in the allow directive would not be allowed through as the deny is denying everyone at the outset.

Virtual hosts

Apache is able to serve a number of different web-sites from the same machine. The way this is achieved is through the use of virtual hosts in the configuration file. It should be noted that if even a single virtual host is described, all containers outside the VirtualHosts container will be ignored. Thus, if you set up even a single virtual host, then your original web site must fall within the VirtualHosts container too.

There are two types of virtual hosts:

  1. name based virtual hosts and,

  2. IP based virtual hosts.

IP based virtual hosts require separate IP addresses in order to run, while name-based ones run off the same IP address, but are called different things (perhaps gadgets.co.za and widgets.co.za). We will look at name-based ones in this course. For IP based ones, consult the Apache documentation.

The first directive required is the "NameVirtualHost". This directive is used by the hosts to be associated with a single IP address. Remember, in name-based virtual hosts, there may be multiple names on a single IP address. Within the VirtualHosts container, we can include all the directives we used without virtual hosts.

An example is included below:

#the widgets website
NameVirtualHost 192.168.0.211
<VirtualHost 192.168.0.211>
   ServerAdmin webmaster@widgets.co.za
   DocumentRoot /var/www/widgets.co.za/html
   ServerName www.QEDux.co.za
   ErrorLog logs/www.widgets.co.za-error_log
   CustomLog logs/www.widgets.co.za-access_log common

     <Directory "/var/www/widgets.co.za/html">
        Options Indexes +Includes FollowSymLinks
        AllowOverride None
        Order allow,deny
        Allow from all
     </Directory>

     Alias /icons/ "/var/www/widgets.co.za/icons/"
     <Directory "/var/www/widgets.co.za/icons">
        Options Indexes MultiViews
        AllowOverride None
        Order allow,deny
        Allow from all
     </Directory>

    ScriptAlias /cgi-bin/ "/var/www/widgets.co.za/ \
        cgi-bin/"
    #
    # "/var/www/cgi-bin" should be changed to
    # whatever your ScriptAliased
    # CGI directory exists, if you have that configured.
    #
    <Directory "/var/www/widgets.co.za/cgi-bin">
       AllowOverride None
       Options None
       Order allow,deny
       Allow from all
    </Directory>

</VirtualHost>

# The gadgets web site.
<VirtualHost 192.168.0.211>
   ServerAdmin webmaster@gadgets.co.za
   DocumentRoot /var/www/gadgets.co.za/html
   ServerName www.QEDux.co.za
   ErrorLog logs/www.gadgets.co.za-error_log
   CustomLog logs/www.gadgets.co.za-access_log common

     <Directory "/var/www/gadgets.co.za/html">
        Options Indexes +Includes FollowSymLinks
        AllowOverride None
        Order allow,deny
        Allow from all
     </Directory>

     Alias /icons/ "/var/www/gadgets.co.za/icons/"
     <Directory "/var/www/gadgets.co.za/icons">
        Options Indexes MultiViews
        AllowOverride None
        Order allow,deny
        Allow from all
     </Directory>

    ScriptAlias /cgi-bin/ "/var/www/gadgets.co.za/cgi-bin/"
    #
    # "/var/www/cgi-bin" should be changed to 
    # whatever your ScriptAliased
    # CGI directory exists, if you have that configured.
    #
    <Directory "/var/www/gadgets.co.za/cgi-bin">
       AllowOverride None
       Options None
       Order allow,deny
       Allow from all
    </Directory>

</VirtualHost>
                    

As you can see, the widgets and the gadgets web sites are hosted on the same web server. A client pointing their browser to widgets.co.za would be directed to the widgets web site, while the gadgets.co.za This is virtual hosting. It can get more complex than this, but for now, this is adequate. Obviously the more virtual hosts you require, the more VirtualHosts containers you will require too.

Common Gateway Interface (CGI)

CGI is handled in Apache using a program called suexec. This program is used to run applications from within Apache. Naturally, this is a fairly dangerous application to allow anyone access to use. A simple CGI script, with malicious intent can wreak havoc on you system. It is wise to keep tight control over the CGI that is allowed to run on your system. CGI is quite simple really. It is a piece of source code (such as perl, C, Java, etc.) that runs, and creates output that in HTML format. let's look at a simple CGI program using the shell:

#!/bin/bash
echo "Content-type: text/html\n\n"
echo "<HTML>\n"
echo "<HEAD><TITLE>My First CGI Script</TITLE></HEAD>\n"
echo "<BODY> \ 
    <H1>About my first automatically generated HTML code \
    </H1>\n"
echo "<HR/>\n"
echo "This is a really cool way to generate HTML<br/>"
print "</BODY></HTML>\n"
exit (0)
                    

Putting this in a file, changing it's mode, and then pointing your web browser to it will cause it to run. The output is sent back to the web browser. CGI gets pretty complex and this is by no means a CGI tutorial, but simple CGI can be very effective in delivering information to the user.

A directive "ScriptAlias" indicates where CGI scripts can be placed in the directory hierarchy. In the case of the example below, the scripts are placed in the /var/www/widgets.co.za/cgi-bin directory.

ScriptAlias /cgi-bin/ "/var/www/widgets.co.za/cgi-bin/"
<Directory "/var/www/widgets.co.za/cgi-bin">
   AllowOverride None
   Options None
   Order allow,deny
   Allow from all
</Directory>
                    

Scripts placed anywhere else will simply not be allowed to be run - for obvious reasons. While CGI is still in wide use, a number of new languages have come of age. Specifically Personal Home Page (PHP) has become a very popular web authoring tool.

There is so much configuration that can be done to Apache, and taking some time to consider these options certainly does not cover it in any significant detail, but it is a start.



[1] don't get confused between the DocumentRoot and the real root directory. Here, we are referring to the real file system root directory.