tl;dr: Why our repo rocks
- Serves .rpm files as immutable resources (Far Future Expires)
- Serves repomd.xml in a way that ensures repo updates go through the CDN uncached
- Proxy and CDN cache-able
If you want to deliver a bunch of custom packages for Red Hat and its clones, like CentOS, you will want to look into creating your own RPM repository.
What software should you use, and what is an efficient web stack for this? Let’s break down how we have set up our RPM repository with performance in mind.
RPM repository basics
Each RPM (YUM) repository consists mainly of these areas:
- The repodata directory: this holds the metadata about packages, e.g. a list of packages in the repository with their version information, so that client yum programs are able to search for packages in your repository without inspecting the .rpm files themselves. It is essentially index data about your packages, with the repomd.xml file being the main index file
- The RPMS directory: holds the actual .rpm files
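For illustration, a small repository on disk might look something like this (hypothetical package names, shortened hashes):

redhat/7/x86_64/
    repodata/
        repomd.xml
        0b1a…-primary.xml.gz
        4c2d…-filelists.xml.gz
        7e9f…-primary.sqlite.bz2
    RPMS/
        package-1.14.0.el7.x86_64.rpm
        otherpackage-2.3.1.el7.x86_64.rpm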
We won’t go into details on how to build .rpm files or create repodata. Let’s concentrate on how to host the repository you have already created.
Since none of the files in your RPM repository are dynamic in nature, NGINX is the best software to use for hosting it. In fact, NGINX was designed to be efficient at serving static files from the very beginning. So we go ahead and create a server block in the NGINX configuration, e.g.:
http {
    server {
        server_name repo.example.com;
        root /path/to/repo/files;
        autoindex on;
    }
}
Of course we would further do things like configuring an SSL certificate and such, but is there more to it? How can we make it really fast without mirrors across the globe? 🙂
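For reference, a minimal TLS-enabled server block could look like the following sketch; the certificate paths are assumptions and depend on how you obtain your certificate:

server {
    listen 443 ssl http2;
    server_name repo.example.com;
    root /path/to/repo/files;
    autoindex on;

    # hypothetical certificate locations (e.g. issued by Let's Encrypt)
    ssl_certificate     /etc/letsencrypt/live/repo.example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/repo.example.com/privkey.pem;
}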
RPM repository caching essentials
To understand the best caching policy for RPM repositories, we have to break things down into essential categories.
1. URL resources that never change their content
An example of those would be .rpm files. Once you’ve built your RPM package, its filename contains version information.
When you put it into your repository, the URL will look something like this:
https://repo.example.com/redhat/7/x86_64/RPMS/package-1.14.0.el7.x86_64.rpm
Are we ever going to see a different .rpm file than the one we initially put at this exact URL? No. This means the URL resource is immutable, and we can essentially cache it forever.
One edge case where we might end up with a different .rpm file at this URL is when we forgot to sign it: once it’s correctly signed, the actual .rpm file will be different. However, with automated build systems, this case is ruled out. And even if we don’t resort to automated signing, we can simply bump the release number.
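For example, bumping the Release tag in the package’s spec file (illustrative values) yields a brand new filename, and thus a brand new URL:

# before: produces package-1.14.0-1.el7.x86_64.rpm
Release: 1%{?dist}
# after: produces package-1.14.0-2.el7.x86_64.rpm
Release: 2%{?dist}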
The secondary RPM metadata is another example of URL resources whose content never changes. These files already bear version information in their names (a checksum hash), so they can be cached forever as well.
2. URL resources with changing content
Primarily, this is the repomd.xml file. It contains references to the secondary metadata indexes, and while its URL stays the same, its content is going to be different after we build yet another package. Thus, we should not cache it forever, and proxies in the wild (that we have no control of) should never cache it.
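For reference, repomd.xml points at the hash-named secondary metadata files, roughly like this (shortened and purely illustrative):

<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <revision>1510000000</revision>
  <data type="primary">
    <checksum type="sha256">0b1a…</checksum>
    <location href="repodata/0b1a…-primary.xml.gz"/>
  </data>
  <!-- filelists, other and sqlite entries follow the same pattern -->
</repomd>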
NGINX caching policy for RPM repositories
So if you want to deliver packages efficiently, you want all the clients to cache both your .rpm files and secondary metadata indexes forever. Here’s a simple configuration:
# Applicable to directory listings and repomd.xml: always fresh for end clients and shared proxies
add_header Cache-Control "no-cache, no-store, must-revalidate";

# These resources never change: RPMs and secondary metadata
location ~ \.(d?rpm|xml\.gz|sqlite\.bz2)$ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}

location ~ /repoview/.*\.html$ {
    # Files are not updated atomically by repoview, so to avoid SPDY errors:
    open_file_cache off;
}
The default no-cache policy will apply to directory listings that you see in your browser, as well as to the repomd.xml primary metadata index file.
So the primary metadata is always going to be pulled fresh, while caching proxies (or browsers) will happily cache downloaded .rpm files and reuse them, speeding things up tremendously.
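You can verify the resulting policy with a couple of requests, e.g. (output abbreviated, assuming the configuration above):

$ curl -sI https://repo.example.com/redhat/7/x86_64/RPMS/package-1.14.0.el7.x86_64.rpm | grep -i cache-control
Cache-Control: public, max-age=31536000, immutable

$ curl -sI https://repo.example.com/redhat/7/x86_64/repodata/repomd.xml | grep -i cache-control
Cache-Control: no-cache, no-store, must-revalidate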
Shared proxies you can control
Things get more interesting if you use Cloudflare or Varnish. Both are shared caching proxies that you have control of. Consequently, you can tell them to cache everything indefinitely, or for as long as they allow. This opens up the possibility of great things, like hosting the entire repository on the CDN: you can always purge the CDN cache after building a package.
A simple Cloudflare page rule allows for this:
URL: repo.example.com/*
Cache Level: Cache Everything
Edge Cache TTL: a month
The Cache Everything cache level instructs Cloudflare to cache .rpm and other repository files, in addition to the default static file types it caches.
With Edge Cache TTL we basically override our previously defined caching policy just for Cloudflare and have it cache (and deliver) the entire RPM repository on their CDN, worldwide.
Then we only need to instruct Cloudflare to clear its caches after building a package. A simple Python 2.7 script will suffice for the job.
purge-cloudflare.py
The prerequisite for the script on CentOS 7 is the Cloudflare Python module. You can install it via yum install python2-cloudflare. Our script’s logic is simple:
- Purge all files that do not have .rpm or secondary metadata file extensions
- Purge all directories
You would call this script after building a package and putting it into your RPM repository. This ensures that the metadata at Cloudflare is up to date, while keeping the cached RPMs intact.
#!/usr/bin/env python
import os
import argparse
from pprint import pprint
import CloudFlare

parser = argparse.ArgumentParser()
parser.add_argument("--subdir", help="clear out specific directory", nargs='?', const='', default='')
args = parser.parse_args()
pprint(args.subdir)

httpdocs = '/path/to/repo/files' + args.subdir
siteurl = 'https://repo.example.com' + args.subdir
zone_name = 'example.com'

# Collect the URLs to purge: everything except RPMs and secondary metadata,
# plus all directory listings
matches = []
exclude = ['.git', '.well-known']
for root, dirnames, filenames in os.walk(httpdocs, topdown=True):
    dirnames[:] = [d for d in dirnames if d not in exclude]
    for filename in filenames:
        if not filename.endswith(('.drpm', '.rpm', '.xml.gz', '.sqlite.bz2', '.index.html')):
            matches.append(os.path.join(root, filename).replace(httpdocs, siteurl))
    for dirname in dirnames:
        matches.append(os.path.join(root, dirname).replace(httpdocs, siteurl) + '/')

cf = CloudFlare.CloudFlare()

# grab the zone identifier
try:
    params = {'name': zone_name, 'per_page': 1}
    zone_info = cf.zones.get(params=params)
except CloudFlare.exceptions.CloudFlareAPIError as e:
    exit('/zones %d %s - api call failed' % (int(e), e))
except Exception as e:
    exit('/zones - %s - api call failed' % (e))

# purge the collected URLs from Cloudflare's cache
try:
    params = {'files': matches}
    r = cf.zones.purge_cache.post(zone_info[0]['id'], data=params)
except CloudFlare.exceptions.CloudFlareAPIError as e:
    exit('/zones %d %s - api call failed' % (int(e), e))
except Exception as e:
    exit('/zones - %s - api call failed' % (e))
You can clear a specific repository within the same domain by providing the path to it, e.g.:
/path/to/purge-cloudflare.py --subdir=/redhat/7
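Putting it all together, a typical post-build sequence might look like this (the paths and the createrepo_c invocation are assumptions for a typical setup):

# copy the freshly built package into the repository
cp ~/rpmbuild/RPMS/x86_64/package-1.15.0.el7.x86_64.rpm /path/to/repo/files/redhat/7/x86_64/RPMS/
# regenerate the repository metadata, reusing unchanged entries
createrepo_c --update /path/to/repo/files/redhat/7/x86_64/
# purge Cloudflare's cached metadata for that repository only
/path/to/purge-cloudflare.py --subdir=/redhat/7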
So what we have here is:
- An efficient caching mechanism for RPM files, both in shared proxies we have no control of and in the Cloudflare CDN
- A way to host the entire repository on the CDN edge, with the ability to purge when a new package is pushed
- A big “eat my shorts” to packagecloud.io and the like 🙂