Avro Custom Logical Types

When to use logical types

The primary reason for using logical types is convenience, just like we prefer to operate on ZonedDateTime instead of long.

Keep in mind that logical types are built on top of standard Avro types. Looking at the serialized data (a byte array), we can’t tell whether it came from a record with logical types or from one without them.

In fact, serialized data is exactly the same, no matter which schema has been used. It’s absolutely fine to serialize a record without logical types and deserialize to a new record with logical types.
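To make this concrete, here is a minimal sketch in plain Java with no Avro dependency. It hand-rolls Avro's binary encoding of a string (a zigzag varint length followed by the UTF-8 bytes) and applies it to the same value written once as a plain string and once as a UUID converted to its string form. The class AvroStringSketch and its methods are hypothetical names used only for illustration; real code would go through Avro's own encoders.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustration only: a hand-written encoder, not Avro's API.
final class AvroStringSketch {

    // Avro "string": zigzag varint length, then the UTF-8 bytes.
    static byte[] encodeString(String value) {
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long length = utf8.length;
        long n = (length << 1) ^ (length >> 63); // zigzag encoding
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80)); // 7 bits per byte, high bit = "more follows"
            n >>>= 7;
        }
        out.write((int) n);
        out.write(utf8, 0, utf8.length);
        return out.toByteArray();
    }

    // Field declared as a plain "string": the text is written as-is.
    static byte[] writePlain(String text) {
        return encodeString(text);
    }

    // Field declared with logicalType "uuid": the value is turned into its string form first.
    static byte[] writeLogical(UUID uuid) {
        return encodeString(uuid.toString());
    }
}
```

For a canonical lowercase UUID such as d99246bf-2eea-49d0-ac95-48467d5c2024, both methods return the same 37 bytes: one length byte followed by the 36 characters.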

We can have two Avro schemas for the same data—one in the standard Avro way and the second with logical types. Adding aliases pointing to each other makes them compatible in the sense of org.apache.avro.SchemaCompatibility. We can smoothly switch between both schemas because of backward and forward compatibility.

Built-in logical types

Avro provides a few built-in logical types like Decimal, UUID, Date, and Time/Timestamp in some variants. 

Hint: the logical type Duration is mentioned in the documentation but is still not supported (as of version 1.11.3). Older Avro versions have more issues like this, where a logical type is documented but doesn’t work as expected.

Using predefined logical types is straightforward.

Let’s say our record has “standardUuid” field:

"fields": [
 {
   "name": "standardUuid",
   "type": "string"
 }
]

The disadvantage of such a definition is that we can pass any string to the field. We can change the definition to:

"fields": [
 {
   "name": "standardUuid",
   "type": {
     "type": "string",
     "logicalType": "uuid"
   }
 }
]

This allows values to be set only in accordance with the UUID specification.

The codebase is then changed from

.setStandardUuid("any string")

to:

.setStandardUuid(UUID.fromString("d99246bf-2eea-49d0-ac95-48467d5c2024"))
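The practical effect is that malformed input fails at the point where the UUID is created rather than ending up in the record. A minimal sketch of that check, using only java.util.UUID (the helper class UuidCheck is a hypothetical name, not part of Avro or of the generated code):

```java
import java.util.UUID;

final class UuidCheck {

    // Returns true when java.util.UUID accepts the text, false otherwise.
    static boolean isParsableUuid(String text) {
        try {
            UUID.fromString(text);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown for text that is not shaped like a UUID, e.g. "any string"
            return false;
        }
    }
}
```

Note that UUID.fromString can be lenient about the exact textual form, so this is a shape check rather than a strict validation of the canonical 36-character representation.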

Creating a custom logical type

Predefined logical types are sometimes insufficient, and we would like to have something more specialized.

To create and use custom logical types with the avro-maven-plugin, we need to create two Maven modules.

  1. The first module defines our custom type(s). For each logical type, we should also provide two more classes:
    1. Conversion—responsible for converting our custom class to/from standard Avro type (primitive or complex).
    2. LogicalTypeFactory—used by org.apache.avro.LogicalTypes to register our custom logical type.
  2. The second module contains the actual .avsc schemas with logicalType attribute(s). Avro classes are generated in this module.

The reason why we can’t put it all in just one module is that logical types must already be compiled at the time of generating Avro classes. However, Avro class generation happens before compilation. We need to compile the first module to allow class generation in the second module.

Example 1—CustomDuration

The full source code of the examples below can be found on GitHub.

According to the specification, there is a predefined logical type duration. The thing is…it does not exist (as of version 1.11.3). Let’s try to implement it! Our logical type could be relatively simple: 

public record CustomDuration(int months, int days, int millis)
    implements Comparable<CustomDuration> { … }

CustomDuration must implement Comparable because the equals() method of a generated record that contains our custom logical type as a field relies on it.

The LogicalTypeFactory implementation may look like:

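import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class CustomDurationLogicalTypeFactory
    implements LogicalTypes.LogicalTypeFactory {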
   @Override
   public LogicalType fromSchema(Schema schema) {
       return new LogicalType("custom-duration") {};
   }
   @Override
   public String getTypeName() {
       return "custom-duration";
   }
}

The third part is the implementation of the Conversion class:

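import org.apache.avro.Conversion;
import org.apache.avro.LogicalType;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericFixed;

public class CustomDurationConversion extends Conversion<CustomDuration> {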

   @Override
   public Class<CustomDuration> getConvertedType() {
       return CustomDuration.class;
   }

   @Override
   public String getLogicalTypeName() {
       return "custom-duration";
   }

   @Override
   public GenericFixed toFixed(CustomDuration customDuration,
           Schema schema, LogicalType type) {
       byte[] bytes = customDuration.serializeTo12Bytes();
       return new GenericData.Fixed(schema, bytes);
   }

   @Override
   public CustomDuration fromFixed(
           GenericFixed value, Schema schema, LogicalType type) {
       return CustomDuration.deserializeFrom12Bytes(value.bytes());
   }
}
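The conversion above relies on serializeTo12Bytes() and deserializeFrom12Bytes(), whose bodies are not shown here. A minimal sketch of how they might look is given below, assuming the layout the Avro specification describes for duration: three 4-byte little-endian integers for months, days, and milliseconds. The field-by-field compareTo() is also an assumption made for illustration, and Java's signed int is used even though the specification speaks of unsigned values. The actual implementation is in the GitHub repository mentioned earlier.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch only, under the assumptions stated above.
record CustomDuration(int months, int days, int millis)
        implements Comparable<CustomDuration> {

    // Assumed layout: months, days, millis as three little-endian 4-byte ints.
    byte[] serializeTo12Bytes() {
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putInt(months)
                .putInt(days)
                .putInt(millis)
                .array();
    }

    static CustomDuration deserializeFrom12Bytes(byte[] bytes) {
        if (bytes.length != 12) {
            throw new IllegalArgumentException("expected 12 bytes, got " + bytes.length);
        }
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
        return new CustomDuration(buffer.getInt(), buffer.getInt(), buffer.getInt());
    }

    // Assumed ordering: compare months, then days, then millis.
    @Override
    public int compareTo(CustomDuration other) {
        int byMonths = Integer.compare(months, other.months);
        if (byMonths != 0) {
            return byMonths;
        }
        int byDays = Integer.compare(days, other.days);
        return byDays != 0 ? byDays : Integer.compare(millis, other.millis);
    }
}
```

A round trip such as CustomDuration.deserializeFrom12Bytes(d.serializeTo12Bytes()) should return a record equal to d.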

We need to put these three classes in one Maven module (alternatively as a separate Maven micro-project). Let’s name it: 

<artifactId>custom-logical-types-defs</artifactId>

Now, we provide it as a dependency of the avro-maven-plugin used in the second module. In the plugin configuration, we must also explicitly list the customLogicalTypeFactory and the conversion. The full pom.xml can be found on GitHub.

<customLogicalTypeFactories>
  <customLogicalTypeFactory>
    packagename.CustomDurationLogicalTypeFactory
  </customLogicalTypeFactory>
</customLogicalTypeFactories>
<customConversions>
  <conversion>
    packagename.CustomDurationConversion
  </conversion>
</customConversions>

We’re ready to define the logicalType in an Avro schema. The type of a record field can be defined as follows:

{
  "name": "TwelveBytes",
  "type": "fixed",
  "size": 12,
  "logicalType": "custom-duration"
}

TwelveBytes is the name of the generated Avro class, and it must be provided when using the “fixed” type.

By adding just one attribute:

"logicalType": "custom-duration"

the Avro-generated class has a field of our custom type:

private packagename.CustomDuration customDuration;

which is much more convenient to use than the standard version, which is just a named wrapper around a byte array:

private packagename.TwelveBytes customDuration;

Example 2—Int2LongMultimap

Maps in Avro can have only String-type keys. If, for some reason, we wanted to have a map with non-String keys, we could create a custom logical type that pretends its keys are, for example, integers.

Let’s say we would like to have a map like Map<Integer, List<Long>>.

A possible implementation of custom logical type, conversion, and factory classes is presented below:

public class Int2LongMultimap 
     extends LinkedHashMap<Integer, List<Long>> {
   public Map<String, List<Long>> toAvroMap() {
       ...
   }
   public static Int2LongMultimap fromAvroMap(Map<?, ?> avroMap) {
       ...
   }
}

public class Int2LongMultimapConversion
     extends Conversion<Int2LongMultimap> {

   @Override
   public Class<Int2LongMultimap> getConvertedType() {
       return Int2LongMultimap.class;
   }

   @Override
   public String getLogicalTypeName() {
       return "int-2-long-multimap";
   }

   @Override
   public Map<?, ?> toMap(Int2LongMultimap int2LongMultimap, 
                          Schema schema, LogicalType type) {
       return int2LongMultimap.toAvroMap();
   }

   @Override
   public Int2LongMultimap fromMap(Map<?, ?> vanillaAvroMap, 
                           Schema schema, LogicalType type) {
       return Int2LongMultimap.fromAvroMap(vanillaAvroMap);
   }
}

public class Int2LongMultimapLogicalTypeFactory
     implements LogicalTypes.LogicalTypeFactory {
   @Override
   public LogicalType fromSchema(Schema schema) {
       return new LogicalType("int-2-long-multimap") {};
   }

   @Override
   public String getTypeName() {
       return "int-2-long-multimap";
   }
}

The tricky part is the implementation of toAvroMap() and fromAvroMap() methods where Integer keys are converted to Strings and String keys are parsed to Integers, respectively:

public Map<String, List<Long>> toAvroMap() {
   Map<String, List<Long>> avroMap = new LinkedHashMap<>();
   for (Map.Entry<Integer, List<Long>> entry : entrySet()) {
       String stringKey = entry.getKey().toString();
       avroMap.put(stringKey, entry.getValue());
   }
   return avroMap;
}

public static Int2LongMultimap fromAvroMap(Map<?, ?> avroMap) {
   Int2LongMultimap int2LongMultimap = new Int2LongMultimap();
   for (Map.Entry<?, ?> entry : avroMap.entrySet()) {
       int intKey = Integer.parseInt(entry.getKey().toString());
       @SuppressWarnings("unchecked")
       List<Long> longs = (List<Long>) entry.getValue();
       int2LongMultimap.put(intKey, longs);
   }
   return int2LongMultimap;
}

Formally, only the Conversion class needs these two methods; we put them in the Int2LongMultimap class for convenience, so they are easy to unit-test.
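A round-trip check in the spirit of such a unit test might look like the sketch below. The conversion logic is repeated so that the block compiles on its own; Int2LongMultimapRoundTrip is a hypothetical helper. The sketch also shows why fromAvroMap() accepts Map<?, ?> and calls toString() on the key: depending on configuration, Avro may hand map keys back as a CharSequence (for example its Utf8 class) rather than as a String.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Same conversion logic as in the article, repeated to keep the sketch self-contained.
class Int2LongMultimap extends LinkedHashMap<Integer, List<Long>> {

    Map<String, List<Long>> toAvroMap() {
        Map<String, List<Long>> avroMap = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<Long>> entry : entrySet()) {
            avroMap.put(entry.getKey().toString(), entry.getValue());
        }
        return avroMap;
    }

    static Int2LongMultimap fromAvroMap(Map<?, ?> avroMap) {
        Int2LongMultimap result = new Int2LongMultimap();
        for (Map.Entry<?, ?> entry : avroMap.entrySet()) {
            // toString() copes with any CharSequence key, not only String
            int intKey = Integer.parseInt(entry.getKey().toString());
            @SuppressWarnings("unchecked")
            List<Long> longs = (List<Long>) entry.getValue();
            result.put(intKey, longs);
        }
        return result;
    }
}

// Hypothetical helper: converts to the Avro-facing map and back again.
final class Int2LongMultimapRoundTrip {
    static Int2LongMultimap roundTrip(Int2LongMultimap original) {
        return Int2LongMultimap.fromAvroMap(original.toAvroMap());
    }
}
```

If the round trip is correct, roundTrip(original).equals(original) holds, including for negative keys, because Integer.toString() and Integer.parseInt() are inverses of each other.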

Now we’re ready to use the new logical type in an Avro schema:

{
 "type": "map",
 "values": {
   "type": "array",
   "items": "long"
 },
 "logicalType": "int-2-long-multimap"
}

As a result, the generated classes use Map<Integer, …>, while for (de-)serialization Avro still sees a Map<String, …>.

Unusual cases

It’s possible to have a logical type that is a fully functional subclass of JsonNode (naturally backed by a string in vanilla Avro):

public final class AnyJsonNode extends JsonNode {
   private final JsonNode jsonNode;
   // should be invoked by dedicated Conversion class:
   public AnyJsonNode(JsonNode jsonNode) {
       this.jsonNode = jsonNode;
   }

   @Override
   public void serialize(JsonGenerator gen,
          SerializerProvider serializers) throws IOException {
       serializers.defaultSerializeValue(jsonNode, gen);
   }

   @Override
   public void serializeWithType(JsonGenerator gen,
           SerializerProvider serializers, TypeSerializer typeSer)
           throws IOException {
       serializers.defaultSerializeValue(jsonNode, gen);
   }
   // delegate all required methods to jsonNode
}

We may also create a logical type for Bitmap and store its contents efficiently using Avro bytes type (which requires a bit of encoding when serializing/deserializing).
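As a rough sketch of that encoding step, the helper below uses java.util.BitSet as a stand-in for the bitmap type, which is an assumption since the concrete bitmap class is not specified here. The two methods show what a Conversion's toBytes(…) and fromBytes(…) overrides could delegate to; BitmapBytesSketch and its method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.BitSet;

// Sketch only: BitSet stands in for whatever bitmap class is actually used.
final class BitmapBytesSketch {

    // Packs the bits into a buffer suitable for the Avro "bytes" type (8 bits per byte).
    static ByteBuffer toAvroBytes(BitSet bitmap) {
        return ByteBuffer.wrap(bitmap.toByteArray());
    }

    // Rebuilds the bitmap from the bytes read back by Avro.
    static BitSet fromAvroBytes(ByteBuffer bytes) {
        return BitSet.valueOf(bytes);
    }
}
```

With this representation, a bitmap whose highest set bit is 1000 occupies 126 bytes, instead of one array element per bit or per index.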

Limitations

As we’ve seen in Example 2, logical types can be built on top of complex types, not only primitive types. In theory, we could define logical types even at the record level. However, we found that this approach has many difficulties (e.g., the logical type class must implement SpecificRecord and provide a bunch of Builder-related methods).

Also, we can’t have a custom logical type that is a map (e.g., extends HashMap) but is backed by an Avro array. That would be a nice way to implement a logical type such as Map<Integer, Integer>, which in vanilla Avro would be represented as an array of pairs of integers. In this particular example, the idea fails when GenericData.compare(…) casts the map to Collection, because the type is deduced from the Schema.

avro-fastserde

At RTB House, we heavily use the avro-fastserde library (part of https://github.com/linkedin/avro-util) because it performs (de-)serialization several times faster than the standard Avro implementation. In production, we observed significantly lower (de-)serialization times, with the gain depending on schema complexity: the larger the schema, the greater the improvement (up to 8x faster).

Thanks to our contribution, avro-fastserde supports logical types as well. Also, check our article from a few years back, which introduced avro-fastserde to the public. The library is now maintained by LinkedIn.

Conclusions

Logical types in Avro provide a nice way to operate on higher-level data structures, while standard Avro types are still used behind the scenes. This approach allows adding logical types in small steps, because the same data can be read and written at the same time using both vanilla Avro types and logical types.
