Adapting the result type of RNA methodsTopFactoring out common operationsIntegrating new collectionsContents

Integrating new collections

What needs to be done if you want to integrate a new collection class, so that it can profit from all predefined operations at the right types? On the next few pages you'll be walked through two examples that do this.

Integrating sequences

 

abstract class Base
case object A extends Base
case object T extends Base
case object G extends Base
case object U extends Base
  
object Base {
  val fromInt: Int => Base = Array(A, T, G, U)
  val toInt: Base => Int = Map(A -> 0, T -> 1, G -> 2, U -> 3)
}

RNA Bases.

 

  import collection.IndexedSeqLike
  import collection.mutable.{BuilderArrayBuffer}
  import collection.generic.CanBuildFrom
  
  final class RNA1 private (val groups: Array[Int],
      val length: Intextends IndexedSeq[Base] {
  
    import RNA1._
  
    def apply(idx: Int): Base = {
      if (idx < 0 || length <= idx)
        throw new IndexOutOfBoundsException
      Base.fromInt(groups(idx / N) >> (idx % N * S) & M)
    }
  }
  
  object RNA1 {
  
    // Number of bits necessary to represent group
    private val S = 2            
  
    // Number of groups that fit in an Int
    private val N = 32 / S       
  
    // Bitmask to isolate a group
    private val M = (1 << S) - 1 
  
    def fromSeq(buf: Seq[Base]): RNA1 = {
      val groups = new Array[Int]((buf.length + N - 1) / N)
      for (i <- 0 until buf.length)
        groups(i / N) |= Base.toInt(buf(i)) << (i % N * S)
      new RNA1(groups, buf.length)
    }
  
    def apply(bases: Base*) = fromSeq(bases)
  }

RNA strands class, first version.

Say you want to create a new sequence type for RNA strands, which are sequences of bases A (adenine), T (thymine), G (guanine), and U (uracil). The definitions for bases are easily set up as shown in the listing of RNA bases above.

Every base is defined as a case object that inherits from a common abstract class Base. The Base class has a companion object that defines two functions that map between bases and the integers 0 to 3. You can see in the examples two different ways to use collections to implement these functions. The toInt function is implemented as a Map from Base values to integers. The reverse function, fromInt, is implemented as an array. This makes use of the fact that both maps and arrays are functions because they inherit from the Function1 trait.

The next task is to define a class for strands of RNA. Conceptually, a strand of RNA is simply a Seq[Base]. However, RNA strands can get quite long, so it makes sense to invest some work in a compact representation. Because there are only four bases, a base can be identified with two bits, and you can therefore store sixteen bases as two-bit values in an integer. The idea, then, is to construct a specialized subclass of Seq[Base], which uses this packed representation.

The RNA strands class listing presents the first version of this class. It will be refined later. The class RNA1 has a constructor that takes an array of Ints as its first argument. This array contains the packed RNA data, with sixteen bases in each element, except for the last array element, which might be partially filled. The second argument, length, specifies the total number of bases on the array (and in the sequence). Class RNA1 extends IndexedSeq[Base]. Trait IndexedSeq, which comes from package scala.collection.immutable, defines two abstract methods, length and apply. These need to be implemented in concrete subclasses. Class RNA1 implements length automatically by defining a parametric field of the same name. It implements the indexing method apply with the code given in class RNA1. Essentially, apply first extracts an integer value from the groups array, then extracts the correct two-bit number from that integer using right shift (>>) and mask (&). The private constants S, N, and M come from the RNA1 companion object. S specifies the size of each packet (i.e. two); N specifies the number of two-bit packets per integer; and M is a bit mask that isolates the lowest S bits in a word.

Note that the constructor of class RNA1 is private. This means that clients cannot create RNA1 sequences by calling new, which makes sense, because it hides the representation of RNA1 sequences in terms of packed arrays from the user. If clients cannot see what the representation details of RNA sequences are, it becomes possible to change these representation details at any point in the future without affecting client code. In other words, this design achieves a good decoupling of the interface of RNA sequences and its implementation. However, if constructing an RNA sequence with new is impossible, there must be some other way to create new RNA sequences, else the whole class would be rather useless. In fact there are two alternatives for RNA sequence creation, both provided by the RNA1 companion object. The first way is method fromSeq, which converts a given sequence of bases (i.e., a value of type Seq[Base]) into an instance of class RNA1. The fromSeq method does this by packing all the bases contained in its argument sequence into an array, then calling RNA1's private constructor with that array and the length of the original sequence as arguments. This makes use of the fact that a private constructor of a class is visible in the class's companion object.

The second way to create an RNA1 value is provided by the apply method in the RNA1 object. It takes a variable number of Base arguments and simply forwards them as a sequence to fromSeq. Here are the two creation schemes in action:

scala> val xs = List(A, G, T, A)
xs: List[Product with Base] = List(A, G, T, A)
  
scala> RNA1.fromSeq(xs)
res1: RNA1 = RNA1(A, G, T, A)
  
scala> val rna1 = RNA1(A, U, G, G, T)
rna1: RNA1 = RNA1(A, U, G, G, T)

Next: Adapting the result type of RNA methods


Adapting the result type of RNA methodsTopFactoring out common operationsIntegrating new collectionsContents