The LexBuffer<'pos,'char> interface does not expose buffer_scan_length, so "putting back" the current regexp match is not an option. This is in contrast to OCaml, which allows you to manipulate lexbuf.lex_curr_pos for this purpose.
Is there any chance that a future version of the F# lexer exposes an interface for "putting back" (part of) the current regexp match in a lexer action?
Is there maybe any UGLY HACK (tm), other than changing the fslib, which would allow me to access a hidden field of LexBuffer?
Stephan
The following code shows how to use reflection to manipulate the protected fields in lexbuf in order to put back chars of the current match:
{
// (...)
open Lexing
open System.Reflection
let scanLengthField = (type Lexing.lexbuf).GetField("_buffer_scan_length", BindingFlags.Instance ||| BindingFlags.NonPublic)
let lexemeLengthField = (type Lexing.lexbuf).GetField("_lexemeLength", BindingFlags.Instance ||| BindingFlags.NonPublic)
// Shrink both the scan length and the lexeme length by n characters.
let putBack lb n =
    scanLengthField.SetValue(lb, (scanLengthField.GetValue(lb) :?> int) - n);
    lexemeLengthField.SetValue(lb, (lexemeLengthField.GetValue(lb) :?> int) - n)
// (...)
}
let text = // (...)
let markup = ['{' '}']
rule token = parse
| text markup { putBack lexbuf 1; TEXT(lexeme lexbuf) } // put back the markup char and return text
// (...)
Hi Stephan,
Would it be better to have two different lexer rules for this? You can create separate rules using the "and" keyword, then call them in your rule's action code. The idea is that as soon as you find the first character of your in-between bit, you hop into a new rule which gathers up all the tokens and then passes them to lex (the lexer supplied with F# uses this technique for parsing comments and strings). I haven't got time to put together a working sample, but a lexer like this in pseudocode would look something like:
rule token = parse
| "<starttag>" { STARTTAG }
| _ { middleBit (lexeme lexbuf) lexbuf }
and middleBit x = parse
| _ { middleBit (x ^ lexeme lexbuf) lexbuf }
| "<endtag>" { ENDTAG }
Hope that helps,
Rob
Hi Robert,
Thanks for your reply.
Actually, I'm already using separate rules, and I could use the "put back match" feature for the following kind of setup:
let regex1 = // a more or less complicated regex identifying a markup token
(...)
let non_markup = // any character that can not be the
// first char of regex1,...,regexn
rule token = parse
| regex1 { MARKUP1() }
| regex2 { MARKUP2() }
(...)
| _ { text (lexeme lexbuf) lexbuf }
and text str = parse
| non_markup* { TEXT(str ^ lexeme lexbuf) }
| _ { put back character on lexbuf; token lexbuf }
My problem is that my "endtag" is not context independent and may in a particular context just be normal text. The above lexer would allow me to parse the text relatively efficiently.
Stephan
I don't know how to do this in fslex / fsyacc, but in the past I've done something like the following:
1) on the 'entry' token, switch the lexer to a 'text' state.
2) in the text state, recognize strings that match any chars followed by F(end) as a token, and return the token (call it PARTIALTEXT or something like that), where F(end) is the first character of the 'exit' token.
3) in the text state, recognize strings that match F(end)...L(end) (for instance '<' ... '>'), where ... is anything valid for an 'exit' token, not necessarily the correct exit token. If the text matches the current 'entry' token, then exit the 'text' state when returning this token. If it doesn't match, return the text as a PARTIALTEXT token.
4) in the text state, return invalid end tokens as PARTIALTEXT tokens (i.e. '<' followed by something not valid for an end token).
5) in the parser, make a rule that combines a string of PARTIALTEXT into a single TEXT (or something like that).
Hope this helps,
Kelly Leahy
Milliman, Inc.
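The five steps above could be sketched roughly as follows. This is only an illustration in plain F#, written as a hand-rolled scanner rather than as fslex rules, and it is not Kelly's actual code. It assumes a tag syntax where the exit token looks like "</name>" (so F(end) is '<' and L(end) is '>'), and it reads step 2 as "a run of characters up to, but not including, the next '<'". The names Token, lexText and mergeText are made up for this example.

```fsharp
type Token =
    | PartialText of string   // a fragment of raw text (steps 2-4)
    | Exit of string          // the exit tag that matches the entry tag
    | Text of string          // merged text, produced in step 5

/// Scan the 'text' state of `input` starting at index `start`, i.e. just
/// after an entry tag named `entry` (step 1 is assumed to have happened).
/// Returns the tokens in order and the index after the last consumed char.
let lexText (entry: string) (input: string) (start: int) : Token list * int =
    let rec skipName i =
        if i < input.Length && System.Char.IsLetterOrDigit input.[i] then skipName (i + 1)
        else i
    let rec loop pos acc =
        if pos >= input.Length then
            (List.rev acc, pos)
        elif input.[pos] <> '<' then
            // Step 2: a run of characters up to the next '<'.
            let stop =
                match input.IndexOf('<', pos) with
                | -1 -> input.Length
                | i -> i
            loop stop (PartialText (input.Substring(pos, stop - pos)) :: acc)
        else
            let nameStart = pos + 2
            let nameEnd = skipName nameStart
            // Assumed shape of an exit token: "</" name ">".
            let looksLikeExitTag =
                pos + 1 < input.Length && input.[pos + 1] = '/'
                && nameEnd > nameStart
                && nameEnd < input.Length && input.[nameEnd] = '>'
            if not looksLikeExitTag then
                // Step 4: '<' not followed by a valid end token is plain text.
                loop (pos + 1) (PartialText "<" :: acc)
            else
                let name = input.Substring(nameStart, nameEnd - nameStart)
                if name = entry then
                    // Step 3, matching case: emit the exit token and leave the text state.
                    (List.rev (Exit name :: acc), nameEnd + 1)
                else
                    // Step 3, non-matching case: a well-formed but wrong exit tag is text.
                    let raw = input.Substring(pos, nameEnd + 1 - pos)
                    loop (nameEnd + 1) (PartialText raw :: acc)
    loop start []

/// Step 5 (done in the parser in the description above): merge each run of
/// adjacent PartialText tokens into a single Text token.
let mergeText (tokens: Token list) : Token list =
    let flush (buf: string list) acc =
        if List.isEmpty buf then acc
        else Text (String.concat "" (List.rev buf)) :: acc
    let rec go buf acc rest =
        match rest with
        | [] -> List.rev (flush buf acc)
        | PartialText s :: tail -> go (s :: buf) acc tail
        | other :: tail -> go [] (other :: flush buf acc) tail
    go [] [] tokens

// Usage: the text inside a hypothetical <code> ... </code> block.
let tokens, next = lexText "code" "if a < b then </b> done</code> rest" 0
let merged = mergeText tokens
// merged = [Text "if a < b then </b> done"; Exit "code"], next = 30
```

Nothing is ever put back in this scheme: a lone '<' or a non-matching end tag simply becomes one more PARTIALTEXT fragment, and the merge in step 5 restores a single text token.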
I'm trying to parse a documentation format with fslex & fsyacc, and I'm having trouble finding an efficient tokenization scheme for lex. Documentation in this format basically consists of some easily recognizable markers that define structure/format, with text in between the markers. Defining the regexes for the markers is easy; what I can't figure out is how to retrieve the text between the markers as a single token. Passing the text between the markers char-wise to yacc strikes me as rather inefficient.
The text between the markers has no structure and might contain any char, so one can't just scan for [A-Za-z1-9 \t]* or similar patterns. If one could put a matched string back on the lexbuf, one could probably solve the problem by introducing additional state, but there doesn't seem to be a documented way to put strings back into the lexbuf. Maybe there's some neat functional/recursive trick? Does anyone have an idea?
Thanks in advance for any hint.
Stephan