Integrating new sets and mapsTopAdapting the result type of RNA methodsDealing with map and friendsContents

Dealing with map and friends

However, there is another class of methods in collections that are not dealt with yet. These methods do not always return the collection type exactly. They might return the same kind of collection, but with a different element type. The classical example of this is the map method. If s is a Seq[Int], and f is a function from Int to String, then s.map(f) would return a Seq[String]. So the element type changes between the receiver and the result, but the kind of collection stays the same.

There are a number of other methods that behave like map. For some of them you would expect this (e.g., flatMap, collect), but for others you might not. For instance, the append method, ++, also might return a result of different type as its arguments--appending a list of String to a list of Int would give a list of Any. How should these methods be adapted to RNA strands? Ideally we'd expect that mapping bases to bases over an RNA strand would yield again an RNA strand:

scala> val rna = RNA(A, U, G, G, T)
rna: RNA = RNA(A, U, G, G, T)
  
scala> rna map { case A => T case b => b }
res7: RNA = RNA(T, U, G, G, T)

Likewise, appending two RNA strands with ++ should yield again another RNA strand:

scala> rna ++ rna
res8: RNA = RNA(A, U, G, G, T, A, U, G, G, T)

On the other hand, mapping bases to some other type over an RNA strand cannot yield another RNA strand because the new elements have the wrong type. It has to yield a sequence instead. In the same vein appending elements that are not of type Base to an RNA strand can yield a general sequence, but it cannot yield another RNA strand.

scala> rna map Base.toInt
res2: IndexedSeq[Int] = Vector(03221)
  
scala> rna ++ List("missing""data")
res3: IndexedSeq[java.lang.Object] = 
  Vector(A, U, G, G, T, missing, data)

This is what you'd expect in the ideal case. But this is not what the RNA2 class provides. In fact, if you ran the first two examples above with instances of this class you would obtain:

scala> val rna2 = RNA2(A, U, G, G, T)
rna2: RNA2 = RNA2(A, U, G, G, T)
  
scala> rna2 map { case A => T case b => b }
res0: IndexedSeq[Base] = Vector(T, U, G, G, T)
  
scala> rna2 ++ rna2
res1: IndexedSeq[Base] = Vector(A, U, G, G, T, A, U, G, G, T)

So the result of map and ++ is never an RNA strand, even if the element type of the generated collection is a Base. To see how to do better, it pays to have a close look at the signature of the map method (or of ++, which has a similar signature). The map method is originally defined in class scala.collection.TraversableLike with the following signature:

def map[B, That](f: A => B)
  (implicit cbf: CanBuildFrom[Repr, B, That]): That

Here A is the type of elements of the collection, and Repr is the type of the collection itself, that is, the second type parameter that gets passed to implementation classes such as TraversableLike and IndexedSeqLike. The map method takes two more type parameters, B and That. The B parameter stands for the result type of the mapping function, which is also the element type of the new collection. The That appears as the result type of map, so it represents the type of the new collection that gets created.

How is the That type determined? In fact it is linked to the other types by an implicit parameter cbf, of type CanBuildFrom[Repr, B, That]. These CanBuildFrom implicits are defined by the individual collection classes. In essence, an implicit value of type CanBuildFrom[From, Elem, To] says: "Here is a way, given a collection of type From, to build with elements of type Elem a collection of type To."

 

final class RNA private (val groups: Array[Int]val length: Int
  extends IndexedSeq[Base] with IndexedSeqLike[Base, RNA] {
  
  import RNA._
  
  // Mandatory re-implementation of `newBuilder` in `IndexedSeq`
  override protected[thisdef newBuilder: Builder[Base, RNA] = 
    RNA.newBuilder
  
  // Mandatory implementation of `apply` in `IndexedSeq`
  def apply(idx: Int): Base = {
    if (idx < 0 || length <= idx)
      throw new IndexOutOfBoundsException
    Base.fromInt(groups(idx / N) >> (idx % N * S) & M)
  }
  
  // Optional re-implementation of foreach, 
  // to make it more efficient.
  override def foreach[U](f: Base => U): Unit = {
    var i = 0
    var b = 0
    while (i < length) {
      b = if (i % N == 0) groups(i / N) else b >>> S
      f(Base.fromInt(b & M))
      i += 1
    }
  }
}

RNA strands class, final version.

 

  object RNA {
  
    private val S = 2            // number of bits in group
    private val M = (1 << S) - 1 // bitmask to isolate a group
    private val N = 32 / S       // number of groups in an Int
  
    def fromSeq(buf: Seq[Base]): RNA = {
      val groups = new Array[Int]((buf.length + N - 1) / N)
      for (i <- 0 until buf.length)
        groups(i / N) |= Base.toInt(buf(i)) << (i % N * S)
      new RNA(groups, buf.length)
    }
  
    def apply(bases: Base*) = fromSeq(bases)
  
    def newBuilder: Builder[Base, RNA] = 
      new ArrayBuffer mapResult fromSeq
  
    implicit def canBuildFrom: CanBuildFrom[RNA, Base, RNA] = 
      new CanBuildFrom[RNA, Base, RNA] {
        def apply(): Builder[Base, RNA] = newBuilder
        def apply(from: RNA): Builder[Base, RNA] = newBuilder
      }
  }

RNA companion object--final version.

Now the behavior of map and ++ on RNA2 sequences becomes clearer. There is no CanBuildFrom instance that creates RNA2 sequences, so the next best available CanBuildFrom was found in the companion object of the inherited trait IndexedSeq. That implicit creates IndexedSeqs, and that's what you saw when applying map to rna2.

To address this shortcoming, you need to define an implicit instance of CanBuildFrom in the companion object of the RNA class. That instance should have type CanBuildFrom[RNA, Base, RNA]. Hence, this instance states that, given an RNA strand and a new element type Base, you can build another collection which is again an RNA strand. The two listings above on class RNA and its companion object show the details. Compared to class RNA2 there are two important differences. First, the newBuilder implementation has moved from the RNA class to its companion object. The newBuilder method in class RNA simply forwards to this definition. Second, there is now an implicit CanBuildFrom value in object RNA. To create such an object you need to define two apply methods in the CanBuildFrom trait. Both create a new builder for an RNA collection, but they differ in their argument list. The apply() method simply creates a new builder of the right type. By contrast, the apply(from) method takes the original collection as argument. This can be useful to adapt the dynamic type of builder's return type to be the same as the dynamic type of the receiver. In the case of RNA this does not come into play because RNA is a final class, so any receiver of static type RNA also has RNA as its dynamic type. That's why apply(from) also simply calls newBuilder, ignoring its argument.

That is it. The final RNA class implements all collection methods at their natural types. Its implementation requires a little bit of protocol. In essence, you need to know where to put the newBuilder factories and the canBuildFrom implicits. On the plus side, with relatively little code you get a large number of methods automatically defined. Also, if you don't intend to do bulk operations like take, drop, map, or ++ on your collection you can choose to not go the extra length and stop at the implementation shown in for class RNA1.

The discussion so far centered on the minimal amount of definitions needed to define new sequences with methods that obey certain types. But in practice you might also want to add new functionality to your sequences or to override existing methods for better efficiency. An example of this is the overridden foreach method in class RNA.  foreach is an important method in its own right because it implements loops over collections. Furthermore, many other collection methods are implemented in terms of foreach. So it makes sense to invest some effort optimizing the method's implementation. The standard implementation of foreach in IndexedSeq will simply select every i'th element of the collection using apply, where i ranges from 0 to the collection's length minus one. So this standard implementation selects an array element and unpacks a base from it once for every element in an RNA strand. The overriding foreach in class RNA is smarter than that. For every selected array element it immediately applies the given function to all bases contained in it. So the effort for array selection and bit unpacking is much reduced.

Next: Integrating new sets and maps


Integrating new sets and mapsTopAdapting the result type of RNA methodsDealing with map and friendsContents