The Architecture of Scala Collections -- Integrating new collections

What needs to be done if you want to integrate a new collection class, so that it can profit from all predefined operations at the right types? On the next few pages you'll be walked through two examples that do this.

abstract class Base

case object A extends Base

case object T extends Base

case object G extends Base

case object U extends Base

object Base {

val fromInt: Int => Base = Array(A, T, G, U)

val toInt: Base => Int = Map(A -> 0, T -> 1, G -> 2, U -> 3)

}

import collection.IndexedSeqLike

import collection.mutable.{Builder, ArrayBuffer}

import collection.generic.CanBuildFrom

final class RNA1 private (val groups: Array[Int],

val length: Int) extends IndexedSeq[Base] {

import RNA1._

def apply(idx: Int): Base = {

if (idx < 0 || length <= idx)

throw new IndexOutOfBoundsException

Base.fromInt(groups(idx / N) >> (idx % N * S) & M)

}

object RNA1 {

// Number of bits necessary to represent group

private val S = 2

// Number of groups that fit in an Int

private val N = 32 / S

// Bitmask to isolate a group

private val M = (1 << S) - 1

def fromSeq(buf: Seq[Base]): RNA1 = {

val groups = new Array[Int]((buf.length + N - 1) / N)

for (i <- 0 until buf.length)

groups(i / N) |= Base.toInt(buf(i)) << (i % N * S)

new RNA1(groups, buf.length)

}

def apply(bases: Base*) = fromSeq(bases)

}

Say you want to create a new sequence type for RNA strands, which are sequences of bases A (adenine), T (thymine), G (guanine), and U (uracil). The definitions for bases are easily set up as shown in the listing of RNA bases above.

Every base is defined as a case object that inherits from a common abstract class Base. The Base class has a companion object that defines two functions that map between bases and the integers 0 to 3. You can see in the examples two different ways to use collections to implement these functions. The toInt function is implemented as a Map from Base values to integers. The reverse function, fromInt, is implemented as an array. This makes use of the fact that both maps and arrays are functions because they inherit from the Function1 trait.

The next task is to define a class for strands of RNA. Conceptually, a strand of RNA is simply a Seq[Base]. However, RNA strands can get quite long, so it makes sense to invest some work in a compact representation. Because there are only four bases, a base can be identified with two bits, and you can therefore store sixteen bases as two-bit values in an integer. The idea, then, is to construct a specialized subclass of Seq[Base], which uses this packed representation.

The RNA strands class listing presents the first version of this class. It will be refined later. The class RNA1 has a constructor that takes an array of Ints as its first argument. This array contains the packed RNA data, with sixteen bases in each element, except for the last array element, which might be partially filled. The second argument, length, specifies the total number of bases on the array (and in the sequence). Class RNA1 extends IndexedSeq[Base]. Trait IndexedSeq, which comes from package scala.collection.immutable, defines two abstract methods, length and apply. These need to be implemented in concrete subclasses. Class RNA1 implements length automatically by defining a parametric field of the same name. It implements the indexing method apply with the code given in class RNA1. Essentially, apply first extracts an integer value from the groups array, then extracts the correct two-bit number from that integer using right shift (>>) and mask (&). The private constants S, N, and M come from the RNA1 companion object. S specifies the size of each packet (i.e. two); N specifies the number of two-bit packets per integer; and M is a bit mask that isolates the lowest S bits in a word.

Note that the constructor of class RNA1 is private. This means that clients cannot create RNA1 sequences by calling new, which makes sense, because it hides the representation of RNA1 sequences in terms of packed arrays from the user. If clients cannot see what the representation details of RNA sequences are, it becomes possible to change these representation details at any point in the future without affecting client code. In other words, this design achieves a good decoupling of the interface of RNA sequences and its implementation. However, if constructing an RNA sequence with new is impossible, there must be some other way to create new RNA sequences, else the whole class would be rather useless. In fact there are two alternatives for RNA sequence creation, both provided by the RNA1 companion object. The first way is method fromSeq, which converts a given sequence of bases (i.e., a value of type Seq[Base]) into an instance of class RNA1. The fromSeq method does this by packing all the bases contained in its argument sequence into an array, then calling RNA1's private constructor with that array and the length of the original sequence as arguments. This makes use of the fact that a private constructor of a class is visible in the class's companion object.

The second way to create an RNA1 value is provided by the apply method in the RNA1 object. It takes a variable number of Base arguments and simply forwards them as a sequence to fromSeq. Here are the two creation schemes in action:

scala> val xs = List(A, G, T, A)

xs: List[Product with Base] = List(A, G, T, A)

scala> RNA1.fromSeq(xs)

res1: RNA1 = RNA1(A, G, T, A)

scala> val rna1 = RNA1(A, U, G, G, T)

rna1: RNA1 = RNA1(A, U, G, G, T)