Tuesday, May 15, 2007

Internals: Supporting binary output compression

I have been working on improving the CSV format’s binary encoding to reduce the volume of data being logged, which will reduce SD card writes and thus improve power performance. The following is an attempt to explain how compressed binary encoding can produce an 8x size improvement over text encoding.

Approach

The logger is designed to capture and output positioning data (i.e. longitude, latitude, altitude, timestamp, etc) at a regular rate. The native encoding of this data is absolute. For example, the absolute encoding of a latitude position requires 28 bits (a 1 bit sign, a 7 bit integral degree, and a 20 bit fractional degree).

However, the logger is typically moving at a velocity. This velocity has practical limits as it is unlikely to be faster than the fastest vehicle known to humans, and will typically be much less (e.g. walking, cycling or driving speeds). Each positioning data that is output will be at some offset to the proceeding data.

This could mean that while large numbers are required to express absolute positions, much lower numbers are required to express the relative offset between them. For example, consider an airplane that is moving at 939km/h in the latitude direction only, with a logger onboard that is generating positioning data every 1 second. With each latitude degree being about 111km, then the absolute latitude changes by 0.00235 at each interval. This relative offset can be encoded in about 12 bits, which is 50% more efficient than the 28 bits required for the absolute encoding.

That is the first decision: to use relative offsets.

If relative offsets are to be used, how should they be encoded? The simple approach is to store the offset value itself. A more complex, but ultimately more adaptive and optimal, approach is to use a statistical bit encoder (e.g. huffman, arithmetic, etc) and store the resulting bitstream. For now, the simple approach will suffice.

That’s the second decision: to store the actual relative offsets.

The next problem is in the handling large relative offsets (e.g. an extended interval with no coverage, say inside a moving vehicle, a road/rail tunnel, or a baggage compartment). In these cases, the relative offset soon becomes a large as an absolute position. There needs to be a limit to the size of the relative offset, which if exceeded, causes an absolute position to be used. These should be infrequent cases.

That’s the third decision: to allow both absolute positions and relative offsets.

The number of bits required to encode an absolute position is fixed, but should this also be the case for relative offsets? A variable size would require some form of length or terminal indicator to be used. This would be mandatory for the complex approach above as the statistical bit encoders produce variable length outputs, but can be chosen either way for the sample approach. For now, the simplest option of a fixed bit size will be used.

That’s the fourth decision: to use fixed (pre-determined) bit sizes for relative offsets.

This begs the next question: what should the fixed bit sizes be? One approach is to calculate the theoretical limits to a relative offset (e.g. the maximum speed of vehicles). Another approach is to use empirical observations, and 700MB of GPS sample data is available to help with that. By analysing the relative offsets, an optimal fixed bit size can be computed.

That’s the fifth decision: to empirically determine the fixed bit sizes for the relative offsets.

There’s one final implementation detail: how are absolute or relative entries distinguished in the output. There could be an absolute/relative indicator specified (a) for the entire entry (i.e. all fields in the entry have the same encoding), or (b) for individual fields (e.g. an absolute altitude, but a relative timestamp). Additionally, the indicator could be specified for (1) each entry/field, or (2) only when the current run changes (e.g. a relative indicator, then 50 altitudes relative encoded, then an absolute indicator, etc). For simplicity, the indicator is chosen to be a single bit (0 for absolute, 1 for relative), and specified on each field (b) at each time (1).

That’s the sixth decision: to use a one bit indicator on every field in every entry to indicate its encoding method.

Analysis

The fixed bit sizes for the relative offset encoding were derived from the 700MB of GPS sample data. The NMEA GPGGA sentences in the data were converted into binary format with absolute encoding, resulting in 2,083,953 entries. The relative offsets were computed and turned into a cumulative histogram.

The histogram showed the number of entries that could be encoded within a given range of values, i.e. a particular number of bits. For example, the altitude field had 1,394,639 fields of offset 0, 1,963,859 of offset 0-1, and 2,082,057 fields of offset 0-127. This means that 99.90902% of offsets can be encoded as relative offsets within 7 bits (0-127), and the remaining would need to be encoded as absolute positions.

From this, it is easy to calculate the average bit size. For instance, the altitude field is 17 bits with absolute encoding (a 1 bit absolute indicator, a 1 bit sign, and a 15 bit absolute height), or (2 + n) bits with relative encoding (a 1 bit relative indicator, a 1 bit sign, and an #bits bit relative height). With this, the total output (encoded) size of all altitudes is (#values[n] * (2 + n)) + ((#total - #values[n]) * 17), where #total is 2,083,953 and #values[n] is obtained from the histogram. Dividing the total output size by the total number of entries gives the average bit per entry. It’s just a matter of looking at what number of bits (n) results in the lowest average bit per entry.

When this analysis is carried out for all positional fields:
  • Longitude: since 99.9762% of integral parts of the degree are at offset 0, specify that relative encoding can only occur if the integral field is unchanged. The fractional part of the degree is optimally encoded in 9 bits (37.55%), rather than 8 bits (45.64%) or 10 bits (40.20%).
  • Latitude: since 99.9968% of integral parts of the degree are at offset 0, specify that relative encoding can only occur if the integral field is unchanged. The fractional part of the degree is optimally encoded in 8 bits (35.68%) rather than 7 bits (45.72%) or 9 bits (38.17%).
  • Altitude: The value is optimally encoded in 1 bit (22.39%) rather than 0 bits (40.95%) or 2 bits (24.50%) or 3 bits (29.77%).
  • Timestamp: The value is optimally encoded in 1 bit (20.46%) rather than 0 bits (99.99%) or 2 bits (20.64%) or 3 bits (24.21%).
Results

With all fields having absolute encoding, an entry requires 94 bits. If all fields have relative encoding, this drops to 25 bits, giving a best case compression factor of 26.60%. However, the sample data shows that the best case is not always achieved, and on average, an entry requires 29.50 bits, giving an average case compression factor of 31.38%. That is a factor of 3 reduction in size and possibly (a tenuous extrapolation) a factor of 3 increase in power performance.

In summary, consider logging at 1 second intervals in CSV format, storing longitude, latitude, altitude and a timestamp. With text encoding, approximately 31 bits are required per entry, which fills the 24K onboard file cache in 13.2 minutes. With the original, uncompressed, binary encoding, 90 bits are used, filling the cache in 36.4 minutes. With the compressed binary encoding, 29.50 bits are used, filling the cache in 111.1 minutes. The compressed binary encoding can give an 8x size improvement over text encoding, possibly resulting in an 8x power improvement.

No comments: