
Is there a way to decouple the AST generation from the Grammar?

4 replies
John Ky
Hi all,

I'm interested in knowing if there is a way to decouple the AST generation from the Grammar.

This way, I could write the grammar once, and then overlay an AST generator over it, or overlay an interpreter over it depending on the situation.

At the moment, I'm generating the AST and processing the grammar in the same place, which means I can't reuse the grammar for other purposes.

Cheers,

-John

extempore
Re: Is there a way to decouple the AST generation from the Grammar

On Thu, Dec 31, 2009 at 09:52:35AM +1100, John Ky wrote:
> I'm interested in knowing if there is a way to decouple the AST
> generation from the Grammar.

Wildly guessing you're asking about combinators here... sure. Here's
actual authentic source code, so fresh it's still dripping bits.

First a couple tastes of the grammar in JVMS3 4.3.4:

  FieldTypeSignature:
    ClassTypeSignature
    ArrayTypeSignature
    TypeVariableSignature

  FormalTypeParameter:
    Identifier ClassBound InterfaceBound*

Now some code:

  lazy val FieldSignature = (
      ClassTypeSignature
    | ArrayTypeSignature
    | TypeVariableSignature
  ) ^^ mkFieldSignature

  lazy val FormalTypeParameter =
    Identifier ~ ClassBound ~ rep(InterfaceBound) ^^ mkFormalTypeParameter

I defy you to further decouple without inducing catastrophic
overdecoupling collapse.

So now all you need to do is write some functions. Put them wherever
you want, give them funny names, mix them in from any direction.

  def mkFieldSignature: TypeSignature => FieldTypeSignature
  def mkFormalTypeParameter: (String, Bound, List[Bound]) => FormalTypeParameter

You can add a bunch more layers of abstraction if "mkFieldSignature" is
too concrete for you, but I figure if the point is to reuse the grammar
then it's safe to take the grammar as a given. I actually do keep
finding more and more unexpected places to interject abstraction when
writing combinators, although it's at least as much for the
entertainment value as any significant wins. ("Unexpected" mostly
because before scala I never used a language with a low enough
"abstraction activation energy" to enable effective reuse of tiny
snippets in similar but different contexts.)
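The split can be sketched end to end. Everything below is a hypothetical illustration in the spirit of the mk* functions above: one toy grammar, two overlays (an AST builder and a direct interpreter), with a hand-rolled parser standing in for the combinator library so the sketch stays self-contained.

```scala
// Hypothetical sketch: one grammar for digit sums like "1+2+3", with
// result construction left abstract, in the spirit of mkFieldSignature.
trait SumGrammar {
  type Result                                // abstract: AST, Int, String...
  def mkNum(n: Int): Result                  // the grammar promises structure;
  def mkAdd(a: Result, b: Result): Result    // the mixin decides meaning

  // A tiny hand-rolled parser standing in for the combinator machinery.
  def parse(s: String): Result =
    s.split('+').toList.map(tok => mkNum(tok.trim.toInt)).reduceLeft(mkAdd)
}

// Overlay 1: build an AST.
sealed trait Expr
case class Num(n: Int) extends Expr
case class Add(a: Expr, b: Expr) extends Expr

object AstBuilder extends SumGrammar {
  type Result = Expr
  def mkNum(n: Int) = Num(n)
  def mkAdd(a: Expr, b: Expr) = Add(a, b)
}

// Overlay 2: interpret on the fly, no tree at all.
object Evaluator extends SumGrammar {
  type Result = Int
  def mkNum(n: Int) = n
  def mkAdd(a: Int, b: Int) = a + b
}
```

With this, `AstBuilder.parse("1+2+3")` yields `Add(Add(Num(1), Num(2)), Num(3))` while `Evaluator.parse("1+2+3")` yields `6`; the grammar itself never changes.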

  // overkill? Not if you ever tried to consistently pretty print the
  // full range of gobbledygook which emerges from bytecode...
  trait BracketArgs {
    type BracketArgType
    def bracketArgs: List[BracketArgType]

    protected def bracketArgStr =
      if (bracketArgs.isEmpty) ""
      else bracketArgs map (_.toString) mkString("[", ", ", "]")
  }
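For a sense of what that buys, here is a self-contained usage sketch: the trait reproduced in dot-notation, plus a `TypeRef` class invented purely for illustration.

```scala
trait BracketArgs {
  type BracketArgType
  def bracketArgs: List[BracketArgType]

  protected def bracketArgStr =
    if (bracketArgs.isEmpty) ""
    else bracketArgs.map(_.toString).mkString("[", ", ", "]")
}

// Invented for illustration: a named type with optional type arguments.
case class TypeRef(name: String, args: List[TypeRef] = Nil) extends BracketArgs {
  type BracketArgType = TypeRef
  def bracketArgs = args
  override def toString = name + bracketArgStr
}
```

Nested applications then print consistently, e.g. `TypeRef("Map", List(TypeRef("String"), TypeRef("List", List(TypeRef("Int")))))` renders as `Map[String, List[Int]]`.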

Parsers are really flexible. I was blowing the stack trying to use
"rep" reading constant pools, so I rolled it up a little differently:

  def collect_cp_info(size: Int) = new Parser[constant_pool] {
    private val pool = new Array[cp_info](size - 1)
    private var index = 0

    @tailrec
    final def apply(in: Input): ParseResult[constant_pool] = {
      if (index >= (size - 1)) Success(new constant_pool(pool), in)
      else parse_cp_info(in) match {
        case Success(entry, rest) =>
          pool(index) = entry
          index += entry.size
          apply(rest)
        case ns: NoSuccess => ns
      }
    }
  }
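The `index += entry.size` line is doing real work: per JVMS 4.4.5, CONSTANT_Long and CONSTANT_Double entries each occupy two constant-pool slots. A library-free sketch of the same stack-safe loop (the `Narrow`/`Wide` entry types are invented for illustration):

```scala
import scala.annotation.tailrec

// Hypothetical stand-ins for cp_info entries: longs/doubles take two slots.
sealed trait Entry { def size: Int }
case class Narrow(tag: String) extends Entry { def size = 1 }
case class Wide(tag: String) extends Entry { def size = 2 }

// Fill a (size - 1)-slot pool tail-recursively, mirroring collect_cp_info.
def fillPool(size: Int, entries: List[Entry]): Array[Entry] = {
  val pool = new Array[Entry](size - 1)

  @tailrec
  def loop(index: Int, rest: List[Entry]): Array[Entry] =
    if (index >= size - 1) pool
    else rest match {
      case e :: more =>
        pool(index) = e
        loop(index + e.size, more)  // a Wide entry skips its second slot
      case Nil => sys.error("ran out of entries")
    }

  loop(0, entries)
}
```

Here `fillPool(4, List(Narrow("Utf8"), Wide("Long")))` fills slots 0 and 1 and leaves slot 2 null, exactly as a real constant pool leaves the slot after a long unusable.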

The whole classfile is read like so:

  def collect[T](p: Parser[T]): Parser[List[T]] = u2 >> (x => repN(x, p))

  lazy val classFile = (
    magicNumber ~>
    version ~
    (u2 >> collect_cp_info) ~
    access_flags ~
    this_class ~
    super_class ~
    collect(parse_interface_info) ~
    collect(parse_field_info) ~
    collect(parse_method_info) ~
    collect(one_attribute ^^ class_attribute_info)
  )
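The `u2 >> (x => repN(x, p))` trick is the classic length-prefixed read: consume a count, then consume exactly that many items. A library-free sketch of the same shape (the reader functions are invented for illustration, and Ints stand in for bytes to keep it readable):

```scala
// readU2 consumes a big-endian two-byte count from the front of the input;
// readN then consumes that many items - the shape of u2 >> repN.
def readU2(in: List[Int]): (Int, List[Int]) = in match {
  case hi :: lo :: rest => ((hi << 8) + lo, rest)
  case _                => sys.error("truncated u2")
}

def readN[T](n: Int, read1: List[Int] => (T, List[Int]))(in: List[Int]): (List[T], List[Int]) =
  (1 to n).foldLeft((List.empty[T], in)) { case ((acc, rest), _) =>
    val (item, next) = read1(rest)
    (acc :+ item, next)
  }

// Length-prefixed read: the count comes from the stream itself.
def collect[T](read1: List[Int] => (T, List[Int]))(in: List[Int]): (List[T], List[Int]) = {
  val (n, rest) = readU2(in)
  readN(n, read1)(rest)
}
```

So `collect(readU2)(List(0, 2, 0, 7, 0, 9))` first reads the count 2, then two u2 values, giving `(List(7, 9), Nil)`.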

I never get tired of bits like "u2 >> collect_cp_info", especially
anytime I am reminded of what typical parsers in java look like. It's
been pretty comical taking apart java bytecode tools. By the time
you're done stripping off the boilerplate and applying some higher order
logic, you're left with three lines of essential complexity and 50 KB of
pom-or-build.xml.

John Ky
Re: Is there a way to decouple the AST generation from the Grammar
Hi Paul,

Thanks for your assistance.

The following bit is where I am confused:

2009/12/31 Paul Phillips <paulp [at] improving [dot] org>
So now all you need to do is write some functions.  Put them wherever
you want, give them funny names, mix them in from any direction.

 def mkFieldSignature: TypeSignature => FieldTypeSignature
 def mkFormalTypeParameter: (String, Bound, List[Bound]) => FormalTypeParameter

As I understand it, the parsers in my grammar need to be able to access the mk*** declarations, which means I need to put them into the same trait as my parsers, which in turn means I need to fix their types: in this case to TypeSignature => FieldTypeSignature and (String, Bound, List[Bound]) => FormalTypeParameter.

... Or does that not matter if I make them all empty traits that I mixin to my actual AST classes?

Sounds like it could work.

Cheers,

-John


extempore
Re: Is there a way to decouple the AST generation from the Grammar

BTW sorry about that broken EOF code in the byte parser, I just hadn't
gotten around to worrying about EOF yet. Here's what that looks like in
my "minimal" byte oriented parser now. I have a lot more convenience
going elsewhere. You should use ControlException for things like this
because it suppresses stack trace creation, and also for other reasons I
won't get into right now.

[More below]

  object ParserEOFException extends ControlException

  trait BinaryParsers extends Parsers {
    type Elem = Byte

    /** The standard library parsers are written with the assumption that end of stream
     *  can be recognized with inband data (i.e. EofCh) which is not true for binary data.
     *  So we are forced to override the methods which may encounter end of stream and
     *  throw a control exception, then catch that and translate it to parse Failure.
     */
    private def eofWrapper[T](p: Parser[T], alt: Input => ParseResult[T]): Parser[T] = Parser { in =>
      try p(in)
      catch { case ParserEOFException => alt(in) }
    }

    override def acceptIf(p: Elem => Boolean)(err: Elem => String): Parser[Elem] =
      eofWrapper[Elem](super.acceptIf(p)(err), Failure("EOF", _))

    override def acceptMatch[U](expected: String, pf: Elem =>? U): Parser[U] =
      eofWrapper[U](super.acceptMatch(expected, pf), Failure("EOF", _))
  }
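In the standard library this idiom survives as `scala.util.control.ControlThrowable`, which suppresses stack-trace capture, so throwing it is cheap. A minimal sketch of the throw-then-translate pattern, with hypothetical reader functions:

```scala
import scala.util.control.ControlThrowable

// A cheap-to-throw EOF signal: ControlThrowable suppresses stack traces.
object EOFSignal extends ControlThrowable

// Hypothetical low-level reader that signals end of input by throwing.
def readByte(bytes: Array[Byte], pos: Int): Byte =
  if (pos < bytes.length) bytes(pos) else throw EOFSignal

// The translation layer: catch the signal, turn it into a plain result,
// mirroring eofWrapper's conversion to a parse Failure.
def tryRead(bytes: Array[Byte], pos: Int): Either[String, Byte] =
  try Right(readByte(bytes, pos))
  catch { case EOFSignal => Left("EOF") }
```

The point of the pattern is that the low-level code never threads an "are we at EOF" flag through every call; only the boundary layer knows about the exception.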

On Tue, Jan 05, 2010 at 11:23:32PM +1100, John Ky wrote:
> As I understand, the parsers in my grammar need to be able to access
> the mk*** declaration, which means that I need to put them into the
> same trait as my parsers, which means I need to fix their types - in
> this case to TypeSignature => FieldTypeSignature and (String, Bound,
> List[Bound]) => FormalTypeParameter.

But all the types can be abstract types. You may "fix" them in terms of
giving names to the abstractions without limiting what they can be. And
really you don't even have to put the mkXXX functions in the grammar, I
just think it makes more sense since by its very nature the grammar is
imposing a specific structure, so it may as well deliver things in the
structure it's imposing.

By way of example let's have the grammar do the absolute minimum. No
processing whatsoever, all it's going to do is recognize structure and
pour out the elements. This is 100% off the top of my fingers, no
testing, so don't hold me accountable for exactness.

  type Elem = Byte
  val any = elem("any", _ => true)

  lazy val rule1 = repN(3, any)
  lazy val rule2 = rule1 ~ rule1
  lazy val rule3 = 'a'.toByte ~ rule2

  phrase(rule3)(new ByteReader("abcdefg".getBytes))

Now you'll get back from this something like

~('a', ~(List('b', 'c', 'd'), List('e', 'f', 'g')))

This is happy to run and give you that mess of ~s without knowing
anything whatsoever about what you'll do with it. So it's evident that
part is completely separable. Now in real life you most likely don't
want to deal with ~ because we have better representations for
structured data, so you do just a little processing in the rules.

  case class MyMeaningfulTokenGroup(x: List[Byte], y: List[Byte])
  lazy val rule2 = rule1 ~ rule1 ^^ MyMeaningfulTokenGroup

(That last will only work as is if you have flattening implicits mixed
in.) But that could as well be abstract:

  type Something // abstract
  def myMeaningfulTokenCreator: (List[Byte], List[Byte]) => Something
  lazy val rule2 = rule1 ~ rule1 ^^ myMeaningfulTokenCreator

Now you define those things when you mix in the grammar.
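What that `~` nesting looks like can be mimicked with a local stand-in for the library's `~` result case class (this is a sketch, not the library's actual definition, and it uses Chars instead of Bytes for readability):

```scala
// Local stand-in for the combinator library's ~ result pairs.
case class ~[+A, +B](a: A, b: B)

// The shape rule3 would hand back for "abcdefg".
val result = new ~('a', new ~(List('b', 'c', 'd'), List('e', 'f', 'g')))

// Pattern matching peels the nesting apart; this is the job ^^ does inline.
case class TokenGroup(x: List[Char], y: List[Char])
val group = result match {
  case _ ~ (xs ~ ys) => TokenGroup(xs, ys)
}
```

So `group` comes out as `TokenGroup(List('b', 'c', 'd'), List('e', 'f', 'g'))`; the grammar produced pure structure, and one pattern match gave it meaning.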

Here is a sort of related example from what I'm working on right now. I
want "toInt" and "toLong" functions, but those mean different things
depending on what I'm parsing. So:

  trait ByteInterpretation extends BinaryParsers {
    def toInt(bytes: Seq[Byte]): Int
    def toLong(bytes: Seq[Byte]): Long

    /** Even this isn't a sure thing thanks to "Modified UTF-8" */
    def toUTF8String(bytes: Seq[Byte]) = Codec toUTF8 bytes.toArray mkString

    /** A reasonable supposition */
    def toFloat(xs: Seq[Byte]) = intBitsToFloat(toInt(xs))
    def toDouble(xs: Seq[Byte]) = longBitsToDouble(toLong(xs))
  }

  /** Eight bits to a byte, four bytes to an Int, etc.
   */
  trait StandardByteInterpretation extends ByteInterpretation {
    def toInt(xs: Seq[Byte]): Int = xs.foldLeft(0)((x, b) => (x << 8) + (b & 0xFF))
    def toLong(xs: Seq[Byte]): Long = xs.foldLeft(0L)((x, b) => (x << 8) + (b & 0xFF))
  }

  /** Variable number of bytes, high bit used to signal end of number.
   */
  trait PicklerByteInterpretation extends ByteInterpretation {
    def toInt(xs: Seq[Byte]): Int = xs.foldLeft(0)((x, b) => (x << 7) + (b & 0x7F))
    def toLong(xs: Seq[Byte]): Long = xs.foldLeft(0L)((x, b) => (x << 7) + (b & 0x7F))
  }

And then:

trait JVMParsers extends StandardBinaryParsers with StandardByteInterpretation with ImplicitConversions

But:

object ScalaSigParser extends StandardBinaryParsers with PicklerByteInterpretation with ImplicitConversions
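A standalone check that the two byte interpretations above really diverge on the same input (trait plumbing stripped, fold logic identical):

```scala
// The same folds as plain functions: 8 payload bits per byte vs 7.
def standardToInt(xs: Seq[Byte]): Int =
  xs.foldLeft(0)((x, b) => (x << 8) + (b & 0xFF))

def picklerToInt(xs: Seq[Byte]): Int =
  xs.foldLeft(0)((x, b) => (x << 7) + (b & 0x7F))

val bytes = List(0x12, 0x34).map(_.toByte)
```

On the bytes `0x12 0x34`, the standard reading gives 0x1234 (4660) while the pickler reading gives (0x12 << 7) + 0x34 = 2356, which is exactly why the interpretation has to be a mixin choice rather than baked into the grammar.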

John Ky
Re: Is there a way to decouple the AST generation from the Grammar
Hi Paul,

Thanks for the follow up with the EOF.  It's nice to be able to do this the proper way.

On Tue, Jan 05, 2010 at 11:23:32PM +1100, John Ky wrote:
> As I understand, the parsers in my grammar need to be able to access
> the mk*** declaration, which means that I need to put them into the
> same trait as my parsers, which means I need to fix their types - in
> this case to TypeSignature => FieldTypeSignature and (String, Bound,
> List[Bound]) => FormalTypeParameter.

But all the types can be abstract types.  You may "fix" them in terms of
giving names to the abstractions without limiting what they can be.  And
really you don't even have to put the mkXXX functions in the grammar, I
just think it makes more sense since by its very nature the grammar is
imposing a specific structure, so it may as well deliver things in the
structure it's imposing.

Great Scott!  You're right!  The mkXXX functions are just fine with the abstract types.

After learning Scala, programming will never be the same for me.

Thanks!

-John

Copyright © 2012 École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland