Hi,

I am trying to write a simple text parser in F# that will later be used from C#.
I implemented some rules in fslex and am now trying to implement a class with one method that takes a string and returns the parsed string using these rules.
Unfortunately, parsing does not work with Polish letters (and I'm sure it won't work with anything except ASCII), which is very important to me.

Could you please help me and tell me how to do this?
I have been fighting with it for a long time.
I tried a lot of ways to implement it (various Lexing.from_... functions, various ways of writing the string with various
encodings), but unfortunately nothing worked. Maybe I was doing something wrong?

Here is my simple (and here also simplified) way of using such a parser.

Let's say we have a rule main that takes an additional record containing a TextWriter, to which it writes the parsed text.

The method should look something like this (simplified):

member x.ConvertString(input : string) =
    // Collect the lexer's output in memory.
    let writer = new System.IO.StringWriter()
    let lexbuf = Lexing.from_string input
    // Run the main rule; the record field is named bw to match the rule's i.bw.
    main { bw = (writer :> System.IO.TextWriter) } lexbuf
    writer.Flush()
    writer.GetStringBuilder().ToString()

The rule looks something like this:

rule main i = parse 
	| newline    { record_newline lexbuf; i.bw.Write("\n"); main i lexbuf }
	| ...

I will be very thankful for any help,
Kind regards,
Zarzyk

Hi zarzyk

FsLex currently only accepts 8-bit inputs. This is not ideal, and Unicode lexing has long been on our TODO list. If you can work with an encoding where Polish characters are in the sub-256 character range (a single-byte code page such as ISO-8859-2), then you may be able to make things work very smoothly: just convert the string to bytes in that encoding using one of the System.Text encoding objects, and write your lexer using the byte encodings of the characters you want.
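
The conversion step could be sketched roughly like this. It is only a minimal illustration, assuming ISO-8859-2 (Latin-2) as the single-byte encoding and a modern .NET runtime, where code-page encodings have to be registered before use. The names `latin2`, `toLexerBytes` and `ofLexerBytes` are hypothetical, and handing the bytes to the lexer is left out.

```fsharp
open System.Text

// Assumption: on modern .NET runtimes, code-page encodings such as
// ISO-8859-2 are only available after registering this provider.
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance)

// Assumption: ISO-8859-2 is the chosen 8-bit encoding; every Polish
// letter maps to exactly one byte in it.
let latin2 = Encoding.GetEncoding("iso-8859-2")

// Hypothetical helper: turn the input string into the 8-bit form
// that a byte-oriented lexer can consume.
let toLexerBytes (input : string) : byte[] =
    latin2.GetBytes input

// Hypothetical helper: turn bytes produced by the lexer back into
// an ordinary .NET string.
let ofLexerBytes (bytes : byte[]) : string =
    latin2.GetString bytes

// Usage: a round trip should give back the original text.
let sample = "zażółć gęślą jaźń"
let roundTripped = ofLexerBytes (toLexerBytes sample)
printfn "%b" (roundTripped = sample)
```

Characters that the chosen encoding cannot represent are replaced during `GetBytes`, so this only suits input that is known to fit the code page.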

For F# we implement non-ASCII lexing by accepting an approximation of UTF-8 encodings; take a look at the UTF-8 encoded lex rules in lex.mll in the F# source. You may be able to do this as well. However, it's a fair bit of work, and you should make sure to get from bytes to Unicode strings as soon as possible.
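
To get a feel for what UTF-8 encoded rules have to match, a small sketch like the following lists the bytes behind individual letters and decodes bytes back to a string. The helper names `utf8Bytes`, `toHex` and `decodeLexeme` are hypothetical; this is not how lex.mll does it, only an illustration of the byte sequences involved.

```fsharp
open System.Text

// Hypothetical helper: the UTF-8 bytes an 8-bit lexer would see for a piece of text.
let utf8Bytes (text : string) : byte[] =
    Encoding.UTF8.GetBytes text

// Hypothetical helper: render bytes as hex, which is handy when
// writing byte-level lexer rules by hand.
let toHex (bytes : byte[]) : string =
    bytes |> Array.map (fun b -> sprintf "%02X" b) |> String.concat " "

// Hypothetical helper: decode the bytes of a matched lexeme back into a
// Unicode string as early as possible, so later code never sees raw bytes.
let decodeLexeme (bytes : byte[]) : string =
    Encoding.UTF8.GetString bytes

// Polish letters occupy two bytes each in UTF-8; ASCII letters stay one byte.
for letter in [ "ą"; "ł"; "ż"; "a" ] do
    printfn "%s -> %s" letter (toHex (utf8Bytes letter))
```

A byte-level rule for a Polish letter therefore has to match a two-byte sequence rather than a single character.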

Unicode lexing isn't actually too hard to implement for us. I'll try to take a look at it for the next release.

Kind regards

Don

By dsyme on 6/26/2007 7:12 PM (permalink)
