6.2k
views15
comments

Decorating abstract syntax trees with line numbers

Hello all

I am fairly new in the functional world, so please forgive me if this is a basic question. I am implementing a compiler using fsyacc and currently have a working parser, type checker and code generator. Now I want to improve the type checker so that it reports line numbers in error messages, and my question is how best to do this in a functional manner.

My first idea would be to extend the abstract syntax tree with position (line/column) information. Say for example that my abstract syntax tree has the following disjoint union type:

program =
  | Parallel of program * program
  | Sequential of program * program
  | Nil

I would then extend each AST node constructor with an additional element for position information, e.g.

program =
  | Parallel of program * program * position
  | Sequential of program * program * position
  | Nil * position

However, this would require me to re-factor the code generator to handle the extended data types (and the extension is irrelevant for code generation). I would like the extension to be transparent to everything but the type checker.

So, my question: is there a better way of doing this? If I took an object oriented approach, each kind of AST node would be a child of some superclass, and I could simply add position information to this superclass without needing to refactor the code generator. Is there a functional counterpart of this?

Any suggestions would be much appreciated.

Thanks and best regards,

Michael.

I do not see the way type augumentation helps here, since you cannot augument union type with a field, and that is what one needs here. Can you elaborate on that?

Now, here are some ideas to solve the original problem.
I am afraid with the design you have you cannot add position data without some refactoring, but here is what you can do to accomodate further changes. Structure your data in this way:

type programNode = 
| Parallel of program * program
| Sequential of program * program
| Nil
and program = { node : programNode; pos : position }

Now all nodes in your program share pos field and you have an extension point to add furher adornments to your program. You can still pattern-match rather nicely:

1
2
3

match prog with
  { node = Parallel {node = Seqential _ } } -> (* do stuff *)
| ...

If this is too ugly for you, you may want to define active patterns to make pattern matching simplier:

let (|Par|Seq|Nill|) prog = match prog with
| { node = Parallel (x,y) } -> Par(x,y)
| { node = Sequential(x,y) } -> Seq(x,y)
| { node = Nil } -> Nill

and then pattern-match in the usual way.

However, if you are 100% sure that you won't need any more data in your AST nodes, I guess the easiest would be to go with the solution you yourself came up with.

Here are my thoughts.
Are there any better ideas?

Hope this helps,
Friendly,
Dmitry

By DmitryLomov on 4/14/2008 12:37 PM (permalink)

Hello Dmitry,

Thanks for your suggestions and input. At least it is reassuring that there is no obvious better way of doing things.

Best,
Michael.

By michaeldavid on 4/15/2008 6:39 AM (permalink)

Hi Michael,

The challenge here is that a discriminated union is just a structure -- there is no way to "hang" additional information on to it (as Dmitry correctly observed). One can add "member" augmentations; so you could add a "position" getter member, but you couldn't add a field to store the "position", so the getter would have nowhere to get it from.

IMHO, this is a significant limitation; I would love someday a way to fully augment DUs; internally this would probably be a class with a DU member, but I'd like the compiler to hide that implementation -- let us use it just like a DU structure today.

I like Dmitry's approach, if you know in advance that you will want to adorn your nodes.

If you don't want to rework your existing objects, consider a dictionary to associate each node with its position information. Not really a "functional" technique, but works fine if there is a natural point at which you will be done with the set of objects, and will drop the dictionary. (Otherwise, dictionary holds onto a reference to each object, so they never get garbage collected.)

For your case:

 
#light
module CGen = System.Collections.Generic

type position = {line:int; column:int}

/// key: node, value: position
let position_of = new CGen.Dictionary<obj, position>()

type node =
  interface
  end

type program =
  | Parallel of program * program
  | Sequential of program * program
  | Nil
  with interface node

// Returns "node1", for pass-thru declarations.
let adorn (node1:#node) (pos1:position) =
  position_of.Add(node1, pos1)
  node1

// Return position of node.
let pos (node1:#node) = position_of.Item node1

// ---------- tests ----------
let pam msg a = printfn "%s = %A" msg a

let pro1 = adorn (Nil) {line=1; column=15}
pos pro1 |> pam "pos pro1"
printf "----- Done: Press any key. -----"
System.Console.ReadKey(false) |> ignore

Notes:

* The "node" stuff is optional;
I just did this so "adorn" and "pos" are only called with types
that are intended. I like to catch mistakes at compile-time :)

* "adorn" returns the node, so that it can be used inline when defining a node. See "let pro1 = ...". The idea is that if position is known at the time the node is defined, this is a clean way to define both simultaneously.

~TMSteve

By ToolmakerSteve on 4/16/2008 11:03 PM (permalink)

I experimented with Dmitry's approach, and there is an inconvenience as soon as there is more than one type: the data constructors have to have different labels or the compiler gets confused. I couldn't even have one be "{node:program; pos:position}" and another be "{f:foo; pos:position}", I had to make ALL the names different, e.g. the 2nd became "{f:foo; pos1:position}".

So I would instead define a class for each adorned type:

 
#light
type position = {line:int; column:int}

type program0 =
  | Parallel of program * program
  | Sequential of program * program
  | Nil
and program(node:program0, pos:position) =
  member it.node = node
  member it.pos = pos

type foo0 =
  | NilFoo
type foo(node:foo0, pos:position) =
  member it.node = node
  member it.pos = pos

// ---------- tests ----------
let ps s = printfn "%s" s
let pam msg a = printfn "%s = %A" msg a

let ado1 = program(Nil, {line=1; column=15})
ado1.node |> pam "ado1.node"
match ado1.node with
  | Parallel(p1,p2) -> ps "Parallel"
  | Sequential(p1,p2) -> ps "Sequential"
  | Nil -> ps "Nil"
ado1.pos |> pam "ado1.pos"

let ado2 = foo(NilFoo, {line=2; column=4})
ado2.node |> pam "ado1.node"
match ado2.node with
  | NilFoo -> ps "NilFoo"
ado2.pos |> pam "ado2.pos"

printf "----- Done: Press any key. -----"
System.Console.ReadKey(false) |> ignore

NOTE: I think this approach is worth considering. It means passing around the adorned versions, so doing "ado1.node" to get at the DU when it is needed, but it is a clean design that does not mess with the code generator types.

I just realized that it means doing ".node" at each level when recursing through the tree -- slightly more inconvenient than I had at first thought.

~TMSteve

By ToolmakerSteve on 4/17/2008 12:12 AM (permalink)

> NOTE: I think this approach is worth considering. It means passing

around the adorned versions, so doing

> "ado1.node" to get at the DU when

it is needed, but it is a clean design that does not mess with the code

generator types.>I just realized that it means doing ".node" at each level when

recursing through the tree -- slightly more
> inconvenient than I had at

first thought.I guess one can go with active patterns, so doing ".node" is not needed:

 
let (|Par|Seq|Nill|) ado = match ado.node with
  Parallel(x,y) -> Par(x,y)
| Sequential(x,y) -> Seq(x,y)
| Nil -> Nill

Now you can pattern-match over your 'program' class:

let ado1 = program(Nil, {line=1; column=15})
match ado1 with
  | Par(_,_) -> ps "Parallel"
  | Seq(_,_) -> ps "Sequential"
  | Nill -> ps "Nil"

Hope this helps,
Friendly,
Dmitry

By DmitryLomov on 4/23/2008 2:15 AM (permalink)

Hi Steve,

Thanks a lot for your suggestions and sorry for my late reply.

The simplicity of the hash table solution is tempting, but the "functional" solutions is probably wiser for my purposes.

I have a related problem which I suspect does not have a very nice solution (but I would hope that I'm wrong). Consider the disjoint union from previous posts:

type program =
  | Parallel of program * program
  | Sequential of program * program
  | Nil

I would now like to define a "normal form" program which is the same as above except that it has no sequential composition:

1
2
3

type nfprogram =
  | Parallel of nfprogram * nfprogram
  | Nil

I would then define a function f : program -> nfprogram computing the normal form of a program, and subsequent functions g : nfprogram -> ... would then only need to consider the subset of programs which are on normal form.

But this doesn't work because the definitions of type constructors Parallel and Nil in the nfprogram disjoint union "overwrite" the type constructors in the program disjoint union. The straightforward solution is to give the type constructors in nfprogram different names, e.g.

1
2
3

type nfprogram =
  | NFParallel of nfprogram * nfprogram
  | NFNil

This isn't very nice because my normalisation function would have to explicitly convert Parallel(_,_) to NFParallel(_,_). More importantly, I loose the information that nfprogram is a "subtype" (in the obvious sense) of program, i.e. I would like to be able to pass an nfprogram to functions which expect a program.

Is there any nice way of doing this?

Again, thank you all for your suggestions.

Best,
Michael.

By michaeldavid on 4/23/2008 1:47 AM (permalink)

The kind of thing where you have the union-subtyping can be done in Ocaml:

# type vari = [`A of int | `B of int | `C of vari];;
type vari = [ `A of int | `B of int | `C of vari ]

# let rec to_a = function | `A n | `B n -> `A n | `C v -> `C (to_a v);;
val to_a :
  ([< `A of 'b | `B of 'b | `C of 'a ] as 'a) ->
  ([> `A of 'b | `C of 'c ] as 'c) = <fun>

It's part of the concept of "variant types" which basically amounts to "anonymous unions with subtyping" as my understanding implies (I think "anonymous records with subtyping" are implemented by the structural subtyping existing in Ocaml's object orientated nature.) Note that the type definition "vari" above was not necessary, and in effect the first type ([< `A of 'b | `B of 'b | `C of 'a ] as 'a) corresponds to vari and the second type ([> `A of 'b | `C of 'c ] as 'c) corresponds to the normal-form version of vari (the ">" and "<" symbols correspond to the fact that the input/output is in fact polymorphic - its input can be any subtype of vari and it outputs to any supertype of vari-nf.)

This is possibly confusing though, as it doesn't currently exist in F# : I wonder if variant types ever been considered as a further extension of F#? I have occasionally found them useful in "normal form" situations like the above. A solution that doesn't use variant types that I've found most convinient is simply to use the same type and put a comment by the function saying that its output is in a normal form, and maybe specifying (by a comment next to the original type) which constructors are allowed in a normal form element of the datatype. However it's of course nice if the type of the function itself declares this.

By mdchurchill on 4/25/2008 4:17 AM (permalink)

Sorry about the overly terse response before.You can add LOC data in at least four different ways:

Put LOC data in with the other data for each type constructor.
Create a mutual recursion with another type (as Dmitry did).
Parameterize your type over generic metadata.
Keep metadata externally in a hash table or functional map.

Putting the LOC data in a separate type and introducing a mutual recursion as Dmitry did is probably the best solution for you right now. You can then write either a static loc_of function or augment the type with a loc member (which is what I was trying to suggest before) to get at the LOC data.You said that you wanted to keep the interface the same to avoid refactoring. This can be done but I would recommend a proper refactoring if possible. F# makes this refactoring as quick and painless as possible by identifying all source locations that are out of date as non-exhaustive pattern matches. If you really must avoid refactoring (e.g. if you are supporting user code), you can hide your new API behind active patterns.The workaround of retrofitting data onto a variant type hierarchy by storing an external mapping using something like a hash table should be kept as a last resort for when you cannot fix the problem at source (typically because the source is not available!). The functional equivalent is, of course, a functional map.
Steve also asked how to avoid identifier shadowing like record field names. The solution is to structure your code using modules.Then you asked about overlapping variant types. Polymorphic variants solve exactly that problem in OCaml but F# does not have them. I do not believe polymorphic variants will be added to F# although I would also like them because they provide many of the benefits of static typing whilst retaining static checking.

By jdh30 on 4/25/2008 3:13 PM (permalink)

Hi again,

Thank you all for your ideas. Refactoring is certainly possible so I will go for Dmitry's approach, but I am glad to know of the various options and that there is no obvious better way.

Best,
Michael.

By michaeldavid on 4/27/2008 7:46 AM (permalink)

Hi Michael,

One way to work with disjoint types, is to wrap those types into another DU. Here, wrap "program" and "nfprogram" into a "wrap_program".
"wrap_program" is a way to pass around EITHER type of program; this is the key to being able to write logic that works with either type.

"match" statements must then be used to determine which type is present.
Rather than scatter the matches throughout your code, write a function for each of the actions to be performed on the programs. Below, "leaves" and "get_field1" are such functions.

To an OO programmer, such functions are the equivalent of public methods on the base class or interface. In this functional style, the logic is grouped into a match per action, rather than being organized around each subtype (as it would in the OO style).

 
#light

// ----- Wrapping related types "program" and
//       "nfprogram" in a DU "wrap_program"    -----

type t_field1 = char  // For Demo

type simple_program = t_field1
  
type program =
  | Parallel of program * program
  | Sequential of program * program
  | Simple of simple_program  // Leaf
  | Nil

type simple_nfprogram = t_field1 * char

// Considered a subtype of program.
type nfprogram =
  | Parallel of nfprogram * nfprogram
  | Simple of simple_nfprogram  // Leaf
  | Nil

type wrap_program =
  | Program of program
  | Nfprogram of nfprogram

// The "Simple" programs, wrapped.
type wrap_simple =
  | SimpleProgram of simple_program
  | SimpleNfprogram of simple_nfprogram
  
// The first field of either type of program.
let get_field1 (w:wrap_simple) :t_field1 =
  match w with
    | SimpleProgram f1
    | SimpleNfprogram (f1,_) -> f1

// Return all the leaves, wrapped.
let rec program_leaves (pro:program) : wrap_simple list =
  match pro with
  | program.Parallel (pro1,pro2)
  | program.Sequential (pro1,pro2) ->
    (program_leaves pro1) @ (program_leaves pro2)
  | program.Simple sp -> [SimpleProgram sp]
  | program.Nil -> []

// Return all the leaves, wrapped.
let rec nfprogram_leaves (nfpro:nfprogram) : wrap_simple list =
  match nfpro with
  | nfprogram.Parallel (pro1,pro2) ->
    (nfprogram_leaves pro1) @ (nfprogram_leaves pro2)
  | nfprogram.Simple sn -> [SimpleNfprogram sn]
  | nfprogram.Nil -> []

// Return all the leaves, wrapped.
let leaves (wrap:wrap_program) : wrap_simple list =
  match wrap with
    | Program pro -> program_leaves pro
    | Nfprogram nfpro -> nfprogram_leaves nfpro

let test() =
  let pro0 = program.Sequential (program.Simple 'b', program.Simple 'c')
  let pro1 = program.Parallel (program.Simple 'a', pro0)
  let nfpro2 = nfprogram.Parallel (nfprogram.Simple ('d','z'),
                                   nfprogram.Simple ('e','y'))
  // A "mixed list" containing both program types.
  let wraps = [ Program pro1; Nfprogram nfpro2 ]
  // Using a mixed list of wrapped programs.
  for wrap in wraps do
    match wrap with
      | Program pro -> ()
      | Nfprogram nfpro -> ()
  // Mapping a function on the mixed list.
  let mixed_leaves = List.map_concat leaves wraps
  let field1s = List.map get_field1 mixed_leaves
  field1s
  
test()

== output =>
> val it : t_field1 list = ['a'; 'b'; 'c'; 'd'; 'e']

To show how to extract information from these recursive structures, even given a list that mixes the two types of programs, I've added a "Simple" branch to each of your definitions. The tree bottoms out on these Simple branches; each simple program has a "field1" with information to be extracted. Here, simply a character, to identify each program.

In order to work with "Simple"s that are either simple_program or simple_nfprogram, I've made a second wrapper DU, "wrap_simple". Each leaf is a "wrap_simple".

test() shows the creation of a "mixed list" "wraps" that includes both program types.
Then shows directly working with that list, using "for" and "match".
Then shows the preferred style, calling functions "leaves" and "get_field1", using map_concat and map.

NOTE: Also demonstrates using "program.Parallel" and "nfprogram.Parallel" to allow the same names to be used both places -- though this doesn't provide any "subclass"-like behaviour in F#.

By ToolmakerSteve on 4/30/2008 3:38 PM (permalink)

For comparison, here is a version that wraps each Du into a class.

Rather than making nfprogram a subtype of program, a common interface "face_simple" is implemented by both simple_program and simple_nfprogram, so that each can return their field1. And "face_program" is implemented by both program and nfprogram, so that each can return a list of their contained simple programs.

The "leaves" logic has moved from stand-alone functions to being part of each class.

Functional access is retained by having stand-alone functions "get_field1" and "leaves". These are now trivial -- they just call the corresponding member.

 
#light

// ----- Using classes with a DU field, and active patterns -----

type t_field1 = char  // For Demo

// ----- (inter)faces -----

// Any simple program has a field1.
type face_simple =
  abstract get_field1: t_field1

// Any program can return a list of contained simple programs.
type face_program =
  abstract leaves: face_simple list

// ----- program -----

type simple_program(field1:t_field1) =
  with interface face_simple with
    member it.get_field1 = field1
  
type program0 =
  | Parallel of program * program
  | Sequential of program * program
  | Simple of simple_program  // Leaf
  | Nil
and program(node:program0) =
  member it.node = node
  with interface face_program with
    member it.leaves =
      match it.node with
      | program0.Parallel (pro1,pro2)
      | program0.Sequential (pro1,pro2) ->
        (pro1 :> face_program).leaves @ (pro2 :> face_program).leaves
      | program0.Simple sp -> [ (sp :> face_simple) ]
      | program0.Nil -> []

// ----- nfprogram -----

type simple_nfprogram(field1:t_field1, field2:char) =
  member it.get_field2 = field2
  with interface face_simple with
    member it.get_field1 = field1

// Considered a subtype of program.
type nfprogram0 =
  | Parallel of nfprogram * nfprogram
  | Simple of simple_nfprogram  // Leaf
  | Nil
and nfprogram(node:nfprogram0) =
  member it.node = node
  with interface face_program with
    member it.leaves =
      match it.node with
      | nfprogram0.Parallel (pro1,pro2) ->
        (pro1 :> face_program).leaves @ (pro2 :> face_program).leaves
      | nfprogram0.Simple sn -> [ (sn :> face_simple) ]
      | nfprogram0.Nil -> []

// ----- functional access -----
 
// The first field of either type of program.
let get_field1 (sim:#face_simple) :t_field1 =
  sim.get_field1

// Return all the leaves.
let leaves (pro:#face_program) : face_simple list =
  pro.leaves

let test() =
  let proA = new program( program0.Simple (new simple_program 'a') )
  let proB = new program( program0.Simple (new simple_program 'b') )
  let proC = new program( program0.Simple (new simple_program 'c') )
  let pro0 = new program( program0.Sequential (proB, proC) )
  let pro1 = new program( program0.Parallel (proA, pro0) )
  let nfproD = new nfprogram( nfprogram0.Simple (new simple_nfprogram ('d','z')) )
  let nfproE = new nfprogram( nfprogram0.Simple (new simple_nfprogram ('e','y')) )
  let nfpro2 = new nfprogram( nfprogram0.Parallel (nfproD, nfproE) )
  // A "mixed list" containing both program types.
  let programs = [ (pro1 :> face_program); (nfpro2 :> face_program) ]
  // Using a mixed list of programs.
  for prog in programs do
    match prog with
      | :? program as pro -> ()
      | :? nfprogram as nfpro -> ()
      | _ -> failwith "unknown program subtype"
  // Mapping a function on the mixed list.
  let mixed_leaves = List.map_concat leaves programs
  let field1s = List.map get_field1 mixed_leaves
  field1s
  
test()

Compared to the previous, "wrapped functional" version:

OO Disadvantages:
1. Declaration is more verbose.
2. In the mixed list, if access is needed to information beyond the shared interface "face_program", a dynamic type cast ":?" is required to get back the full type.

OO Advantages:
I don't see any significant advantages. Traditionally the main benefit of OO here would be the ease of getting at field1 of any program type; in the "wrapped functional" it was essential to create -- and use -- the corresponding function:

 
  let get_field1 (w:wrap_simple) :t_field1 =
    match w with
      | SimpleProgram f1
      | SimpleNfprogram (f1,_) -> f1

Trade-off:
The usual "organizational" decision between the "loose" functions organizing logic "per action" in a match versus the type-centric organization with "rigid" interface declaration and logic in members "of each type". Each approach is preferable sometimes. It is easier to add new actions (functions) in the functional approach, but (arguably) easier to add new subtypes in the interface+class approach. In this sample, the verbosity of the OO approach strongly tilts the "ease" in favor of "functional", even if you expect to add more new subtypes than new functions.

By ToolmakerSteve on 4/30/2008 4:49 PM (permalink)

It's been a while since the last post, but I just came across this topic as I have a very similar problem with my compiler and I found this thread very helpful.

Here's some more information:

1)
I looked at the OCaml compiler and it is using the mutual recursion between a union and a record type (as proposed by Dmitry).

So while this is maybe the best solution it still makes the code less readable and if there was a simple way to specify an common attribute for a union type this would reduce the code size (complexity) e.g. of a typical lexer/parser quite a bit.

2)
And as a note for the suggestion using a hash table to map from the AST to a position: Doesn't work poroperly!
The problem is that you can't distinguish one particular instance of a union type from another one if they have the same 'content'.
Let me illustrate this with a very similar problem of adding positions to the Tokens generated by a lexer:

let tokenPositions = new System.Collections.Generic.Dictionary<token, position>()

let adorn (token:token) (lexbuf: lexbuf) =
tokenPositions.Add(token, lexbuf.StartPos);
token

Now when I'm feeding my lexer an input that generates two of the same symbols (e.g. OPENPAREN) I will get an exception when I try to add it a second time. There is already an OPENPAREN key in my dictionary!

This is a general problem and would also happen with your above grammar.

By daniel.furrer on 8/15/2008 2:21 AM (permalink)

>And as a note for the suggestion using a hash table to map from the AST to a position: Doesn't work properly!
The problem is that you can't distinguish one particular instance of a union type from another one if they have the same 'content'.
...
This is a general problem and would also happen with your above grammar.

For the hash table to work, the keys must be objects that are not re-used. In my example, I have "nodes", where a node instance is created for each use. The node is a unique identity, hashed by its memory location [NOTE: I'm not actually running that old example to verify that this is so as I wrote the code - just explaining the principle, in case anyone wishes to use the technique.]

By ToolmakerSteve on 10/29/2009 6:21 PM (permalink)

Hi again,

Thank you all for your ideas. Refactoring is certainly possible so I will go for Dmity's approach, but I am glad to know of the various options and that there is no obvious better way.

Best,
Michael.

By michaeldavid on 4/27/2008 7:45 AM (permalink)

Use an augmentation. See Expert F# or F# for Scientists for more details.

By jdh30 on 4/14/2008 10:06 AM (permalink)

Topic tags

Built with WebSharper

Home

Answers

Events

Courses

Groups and Conferences

Blogs

Jobs

Developers

Topic tags