Introduction

When implementing a database backend for portage, the versions of packages need to be directly comparable. An integer sized 64 or 128 bit is the way to go. This page describes a scheme to encode the original version strings into compacted integers.

Version Format

The format of version strings used in ebuilds can be categorized as:

version ::= <basic> <suffix>
basic   ::= <number> {'.' <number>} [char]
suffix  ::= ['_' <symbol> [number]] ['-r' <number>]
number  ::= an arbitary non-negative number
symbol  ::= one of "alpha", "beta", "rc", "pre", "p"
char    ::= 'a' to 'z'

NOTE: {'a'} means 'a' can appear none or arbitary times.

Comparison Rules

Basic version: Use the general comparison rules. For example: 1.0 < 1.1 < 1.1.1 < 1.2 < 1.2a < 1.3 < 2 < 2.0
Suffixes: _alpha < _beta < _pre < _rc < none < _p
Overall: If the basic versions are different, compare them. Or compare the suffixes. At last compare ebuild revisions.

Basics

We have to use a single integers to represent multiple integers plus some symbols while keep them comparable.

In this part we are focusing on the basic versions. Take 16 bit integers for explanation. Using 16 bit integer 2 and 12 have the form

0000000000000010
0000000000001100

and they are apparently comparable. What's more, even they appear in the middle of the integer:

0000001000000000
0001100000000000

Provided their LSB (least significant bit) are alligned, they are still comparable. So basically we can align the LSB of numbers to some fixed positions. If we use each 4 bit for a number we can get:

1.0      => 0001 0000 0000 0000
1.0.0    => 0001 0000 0000 0000
1.1      => 0001 0001 0000 0000
3.4.1.10 => 0011 0100 0001 1010

We put the numbers from left (most significant) to right (least significant) to give numbers the correct priority.

This method has a big limitation that the size of an integer and the number of the integers are all fixed. Neither we can't destinguish 1.0 and 1.0.0. To overcome these limitations, we can prepend a length marker before the number and use the following encoding:

Encoding	Range
`0xx`	0~3
`1xxxxx`	4~31

Losing one bit for a number, now we can encode more complex versions:

1.0         => 001 000               => 0010 0000 0000 0000
1.0.0       => 001 000 000           => 0010 0000 0000 0000
1.1         => 001 001               => 0010 0100 0000 0000
1.3         => 001 011               => 0010 1100 0000 0000
1.4         => 001 100100            => 0011 0010 0000 0000
2.6.20      => 010 100110 110100     => 0101 0011 0110 1000
3.0         => 001 000               => 0010 0000 0000 0000
3.1         => 001 001               => 0010 0100 0000 0000
3.2.3.3     => 011 010 011 011       => 0110 1001 1011 0000
3.2.3.3.1   => 011 010 011 011 001   => 0110 1001 1011 0010
3.4.1.10    => 011 100100 001 101010 => 0111 0010 0001 1010 10 (too long!)

You see previous version 3.4.1.10 fail using this method. This is a trade-off for flexibility.

The length markers can also be variable-size. The point here is to use Huffman coding. We just have to make sure:

The length markers are comparable. In other words, markers for shorter length must be less than markers for longer length, when aligned to MSB.
No length marker can be a prefix of another, or we can't distinguish them.

Specification

For an arbitary version string, we just encode each part of the version and put them to the encoded number from MSB to LSB one by one, using the following rules.

Firstly a general rule for numbers is given below.

Encoding	Meaning	Explanation
`01000xxxxx`	'a' ~ 'z'	For the single char at the last, can only appear once
`01x{3}` except `01000`	0 ~ 6	For commonly used small version numbers
`100x{6}`	7 ~ 70
`101x{12}`	71 ~ 4166
`110x{27}`	4167 ~ 134221894	For dates, 99999999 ~ 2^27
`1110x{40}`	134221895 ~ 1099645849670	For timestamps, 999999999999 ~ 2^40
`1111x{48}`	1099645849671 ~ 282574622560326	For some really large numbers

(x{n} means x repeat n times)

Here we apply two tricks:

we leave code 01000 for encoding the ending single char, so 01001 means 0 and then.
as 0~6 is already encoded as 01xxx, the code range 100000000 ~ 100000100 became illegal. To save the space, we use 100000000 to encode 7 immediately, which made the upper bound become 6 + 64 = 70 instead of 63. The same trick does to others.

We encode basic versions as a continuous sequence of variable size integers.

Now depends on the existance of suffixes (not including the ebuild revison '-r'), we append a sequence denoting the end of basic version:

Encoding	Meaning	Explanation
`000`	Suffix is "_alpha", "_beta", "_pre", "_rc"	These suffixes are less than nothing
`001`	No suffix, or '_p'	General case

Then if there are suffixes, we encode the suffixes and append the sequences according to this table:

Encoding	Meaning	Explanation
`0` + NUM	ebuild revision '-r'	NUM is a variable size integer
`10xxx`	Suffix without a number
`11xxx` + NUM	Suffix with a number	NUM is a variable size integer

We can encode the suffix as a 3-bit number, following their priority:

Encoding	Suffix
`000`	_alpha
`001`	_beta
`010`	_rc
`011`	_pre
`000`	_p

Here we reused 000 for '_p' because it's in different category (larger than nothing) as '_alpha' (less than nothing). As you can see 3-bit is efficient enough for further expanding.

Consideration

We can encode obvious date/timestamps using more compact numbers (think of 22b vs 27b for date). But the drawback is dates will be incomparable to ordinary numbers. When some numbers looks like dates but are actually not dates, this will cause severe issues.
There are still some packages whose version doesn't fit in a 64 bit integer (about ~40). Plus no database support native 128bit integers at this time. To workaround this issue we can either fork these packages and shorten their version number, or use a combination of two 64bit integers to form a 128bit integer. But the second method might be an ugly trick.
We could also leave these packages alone without encoding. When we need to do SQL query we just fetch all these packages and compare the versions using the original string-based comparison. Since there are not too many packages fail the 64 bit boundary, this will have little impact on the performance.

Script

The following is a perl script which does the counting of bits needed for each package. The numbers in subs can be changed to test adjusting the encoding.

The script is used for prototype only and isn't well tested. Please feel free to play with it and improve it.

#!/usr/bin/env perl

use 5.010;
use strict;

my @list = grep /^.+\.ebuild$/, `find /usr/portage/`;
my $max_bits = 0;
my $max_ebuild;
my $failed_count = 0;
my $orig;

# bits needed for an ordinary number
sub bits_for_number
{
	given ($_)
	{  
		when ($_ <= "6")                     { return 2 + 3;       }
		when ($_ <= "70")                    { return 3 + 6;       }
		when ($_ <= "4166")                  { return 3 + 12;      }
		when ($_ <= "134221894")             { return 3 + 27;      }
		when ($_ <= "1099645849670")         { return 4 + 40;      }
		when ($_ <= "282574622560326")       { return 4 + 48;      }
		default                              { die "ERROR: $orig"; }
	}
}

# bits needed for the single char immediately after basic version
sub bits_for_char
{
	return 10;
}

# bits needed to denote the end of basic version
sub bits_for_suffix_start
{
	return 3;
}

# bits needed for a suffix
sub bits_for_suffix
{
	/^(?<Suffix>[-_][a-z]+)(?<Num>\d*)$/;
	my $suffix = $+{Suffix};
	my $num = $+{Num};
	given ($suffix)
	{
		when (/-r/)      { return 1 + bits_for_number($num); }
		default          { return 5 + bits_for_number($num); }
	}
}

foreach (@list)
{
	chomp;
	$orig = $_;
	my $bits_needed = 0;
	
	/\/(?<Name>.+)\/\g{Name}-(?<Version>.+)\.ebuild$/;
	$+{Version} =~ /^(?<Basic>.+?)(?<Suffix>([-_].+)?)$/;
	
	my $basic_version = $+{Basic};
	my $suffix = $+{Suffix};
	
	my @basic_parts = $basic_version =~ /\d+/g;
	foreach (@basic_parts)
	{
		$bits_needed += bits_for_number($_);
	}
	if ($basic_version =~ /([a-z])$/)
	{
		$bits_needed += bits_for_char();
	}
	
	$bits_needed += bits_for_suffix_start();
	if ($suffix)
	{
		my @suffix_parts = $suffix =~ /[-_][a-z]+\d*/g;
		foreach (@suffix_parts)
		{
			$bits_needed += bits_for_suffix($_);
		}
	}
	
	if ($bits_needed > $max_bits)
	{
		$max_bits = $bits_needed;
		$max_ebuild = $orig;
	}
	if ($bits_needed > 64)
	{
		$failed_count++;
		say "$orig: $bits_needed";
	}
}

say;
say "Failed ebuild count: $failed_count";
say "Maximum: $max_ebuild: $max_bits";

Ebuild Version Encoding