My get-started example is the implementation of Norvig's toy spelling corrector.

What a wonderful example program! This is so enticing that I can't help but have a go myself.

My first impressions are that Peter's Python implementations are not a good starting point for several reasons. The most important of which is that it is very idiomatic Python, to the extent that the program appears to be heavily optimized for an interpreted language like Python. I think F# can do vastly better but you'll have to start from scratch with a completely different approach.

My second concern is that I believe the algorithm could be better. Rather than searching for single or double errors, it should search for all errors but bias the search towards the highest probability errors. For example, the error from "ace" to "oace" is probably much less likely than the error "tomorrow" to "tommorow" despite the fact that the latter involves two separate errors.

I'll let you know how I get on.

Cheers,
Jon.

By on 5/13/2007 5:20 AM ()

Hi Sebastian,

It's a beautiful program!

I ran the program and got 2.8s for "something". My laptop is fairly quick though. But I'm certain we could do a lot better. Could you send us the full C# and F# code to fsbugs AT microsoft.com?

Sequence expressions are odd with regard to performance. The aim of the first release of the feature (in December 06) was absolutely to "get the feature in there in time for the books", with a focus on getting some (but not all) of the performance cases right. Since then we've mainly been fixing correctness problems (e.g. with regard to Dispose of database connections). But we'll certainly look into this particular issue.

"yield" is quite magical in C#: it's imperative programming on steroids. It's not the right model for F#, but it is an amazing construct. While the two look similar, the compilation strategies are indeed wildly different, and the C# team has a 6 year lead over us in terms of implementation effort on this one. So this is one area where I expect the "natural equivalent" C# program to currently be faster.

You can use Seq.map and Seq.filter over enormous sequences and things work out pretty well. However the sequence expression model is harder to compile than C# iterators, so I think that in this case the C# iterator code will always be at least as fast as the F# code. Also, I suspect most F# programmers intuitively switch to working over concrete data structures (arrays and lists) when performance really matters. I'm not sure if that's possible here.

Regards,

Don

By on 5/7/2007 4:24 PM ()

Sorry - have no reproduced your perf problem - I was using

1
let (|..>) x y = Seq.map_concat y x

instead of

1
let (|..>) x y = Seq.map y x

We'll look into where the perf problems are.

regards,

don

By on 5/7/2007 4:43 PM ()

We'll look into this further over the next few days. For starters, Seq.map_concat is considerably faster than your fold/append/concat combination. A very long chain of Seq.append's causes problems. This is exactly the sort of situation where C# iterators do better, since they compile to a state machine.

Also removing the "Set.of_seq" in "edits" makes things faster. But that's more of an algorithmic issue.

Am I right in thinking this is logically equivalent code?

1
2
3
4
5
6
7
let edits (word: string) =  
    { for i in {0 .. word.Length-1} -> word.Substring(0,i) + word.Substring(i+1) } +..                                               (* delete *)  
    { for i in {0 .. word.Length-2} -> word.Substring(0,i) + word.Substring(i+1,1) + word.Substring(i,1) + word.Substring(i+2) } +.. (* transpose *)  
    { for i in {0 .. word.Length-1} for c in {'a' .. 'z'} -> word.Substring(0,i) + (String.make 1 c) + word.Substring(i+1) } +..     (* alter *)  
    { for i in {0 .. word.Length} for c in {'a' .. 'z'} -> word.Substring(0,i) + (String.make 1 c) + word.Substring(i) }             (* insert *)  
    |> Set.of_seq  
let edits2 word = Set.of_seq (Seq.map_concat edits (edits word))

Regards,

don

By on 5/7/2007 5:01 PM ()

"A very long chain of Seq.append's causes problems"

This is definitely true, as you must have found when you removed the thousands of appends implied by edit2. When I replace the guts of edit2 with a Seq.map_concat, the generation of edits2("something") takes a little under 5 seconds... something like 10 or 15 times faster. Certainly the state machine implementation of Seq.append is trivial in C# with the magical yield operator.

1
2
3
4
5
6
7
8
9
public static IEnumerable<T> Sequence<T>(params IEnumerable<T>[] es)
{
    int i = 0;
    foreach (IEnumerable<T> e in es) 
    { 
        foreach (T t in e) yield return t;
        es[i++] = null; // oh why not be nice to the garbage collector
    }
}

I see that the underlying implementation of most of the seq functions is inherently stateful, and perhaps I am glossing over too many details, but the F# doesn't look that different. The performance drop is surprising. Perhaps because it is creating so many more enumerator objects instead of surfing through the one it started with? Anyway, fascinating.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
let append (ie1: #IEnumerable<'a>) (ie2: #IEnumerable<'a>) = 
  generate
    // the state holds two enumerators: one for the outer list and one for the current inner list
    (fun () -> (ref (ie1.GetEnumerator()), ref false))
    (fun (curr,right) -> 
        let rec take () = 
          // check the inner list
          match get !curr with 
          | None -> 
            // check the outer list
            if !right then None
            else 
                // don't forget to dispose of the enumerator for the inner list now we're done with it
                (!curr).Dispose(); 
                curr := ie2.GetEnumerator(); 
                right := true;
                take ()
          | res-> res
        take ())
     (fun (curr,_) -> (!curr).Dispose())
By on 5/7/2007 5:50 PM ()

Hi Sebastian,

I took a bit more of a look at this. I came up with a way to optimize Seq.append under the hood - it reduces runtimes dramatically. The overhead then becomes the use of Sets to implement "distinct" - I think a HashSet would be better here. When the use of sets is removed I get a runtime of 0.8s, but I'm certain further optimizations are possible.

The optimization is as follows. The problem with a naive implementation of a binary "append" operator on IEnumerables is that a long chain of appends creates a long chain of cascading IEnumerators - if you have a chain of length N and yur getting the last element then your still asking the IEnumerator for the top most "append", which asks teh IEnumerator for the next-top-most "append" etc. all the way down the chain. Thus iterating becomes quadratic.

The trick is to represent the results of Seq.append symbolically, rather than by creating an object that represents a fixed strategy for how the resulting IEnumerable will be implemented. If a zillion Seq.appends are then concatenated then you can optimize the symbolic representation (by concatenating all the sub-elements to append) and when you actually come to iterate you just use an integer state index to record which node you're yielding from. Effectively we're working out the best way to record the state "behind the scenes". I'd say this can be generalized to further behind-the-scenes deforesting, for example "map f (map g E)" can become "map (f o g) E". All very interesting. Here are some code snippets (note the use of active patterns to factor out some bits :-) ):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
 

    type ScriptedEnumerator<'a>(script: Script<'a>) = 
        class
            interface IEnumerable<'a> with
                override x.GetEnumerator() = script.GetEnumeratorForScript()
            interface IEnumerable with
                override x.GetEnumerator() = (script.GetEnumeratorForScript() :> IEnumerator)
            member x.Script = script
        end
                
    
    let (|HasScript|) (x : #IEnumerable<'a>) = 
        match (x :> IEnumerable<'a>) with 
        | :? ScriptedEnumerator<'a> as s -> s.Script 
        | ie -> Run(ie)

    let append (HasScript(s1)) (HasScript(s2)) = 
        (new ScriptedEnumerator<'a>(Script.MkAppend(s1,s2)) :> seq<'a>)
        

And the bit that interprets scripts (note: uses of "generate" are always a bit ugly - the downside of not having "yield", but then this goes in the library, right!)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
 

    type Script<'a> = 
        | Append of Script<'a> [] 
        | Run of IEnumerable<'a>
        with 
            static member MkAppend(s1,s2) = 
                let (|Flat|) = function Append(arr) -> arr | x -> [| x |]
                match s1,s2 with 
                | Flat(arr1),Flat(arr2) -> Append(Array.append arr1 arr2)

            // based on Seq.concat

            member script.GetEnumeratorForScript() =
                match script with 
                | Append(arr) ->
                    IEnumerator.generate
                      // the state holds two enumerators: one for the outer list and one for the current inner list
                      (fun () -> (ref 0, ref (IEnumerator.empty())))
                      (fun (n, curr) -> 
                          let rec take () = 
                            // check the inner list
                            match get !curr with 
                            | None -> 
                              // check the outer list
                              if !n >= arr.Length then None 
                              else 
                                  let subScript = arr.[!n]
                                  incr n;
                                  // don't forget to dispose of the enumerator for the inner list now we're done with it
                                  (!curr).Dispose(); 
                                  curr := subScript.GetEnumeratorForScript(); 
                                  take ()
                            | res-> res
                          take ())
                       (fun (iese,curr) -> (!curr).Dispose())
                 | Run(ie) -> ie.GetEnumerator()
        end

By on 5/11/2007 5:44 PM ()

Your changes to the edit function look right to me. It's always a fun tradeoff with these things knowing where to "stop the seq train". having edit produce a set rather than a stream of unique words makes as much sense as anything. I realized later I had not posted the |..> operator I toyed with. In my case it was

1
let (|..>) a b = a |> Seq.map b

I will email you the entire code separately, but I've pasted it in below for those playing along at home. I agree with the statement that yield is imperative programming on steroids. It's a wonderful construct and very intuitive for most imperative programmers, once they get over the shock of having the compiler write non-trivial code for them.

1
2
3
4
5
6
7
8
9
10
11
12
#light
let (|..>) a b = a |> Seq.map b
let (+..) = Seq.append
let edits (word: string) =
  { for i in {0 .. word.Length-1} -> word.Substring(0,i) + word.Substring(i+1) } +..                                               (* delete *)
  { for i in {0 .. word.Length-2} -> word.Substring(0,i) + word.Substring(i+1,1) + word.Substring(i,1) + word.Substring(i+2) } +.. (* transpose *)
  { for i in {0 .. word.Length-1} for c in {'a' .. 'z'} -> word.Substring(0,i) + (String.make 1 c) + word.Substring(i+1) } +..     (* alter *)
  { for i in {0 .. word.Length} for c in {'a' .. 'z'} -> word.Substring(0,i) + (String.make 1 c) + word.Substring(i) }             (* insert *)
  |> Set.of_seq
  |> Set.to_seq
let edits2 word = Set.of_seq (Seq.fold Seq.append Seq.empty (edits word |..> edits))
printf "%d" (Seq.length (edits2 "something"))

and the C# below, including the "F" class referenced within the code. With LINQ, much of the "F" class withers away in favor of the native implementations of the same concepts, but I haven't quite gotten there yet.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
class Program
{
	// what a shame you cannot yield from an anonymous function
	// or this would be inside Edits
	private static IEnumerable<string> EditsRaw(string word)
	{
		for(int i=0; i < word.Length; ++i)
			yield return word.Substring(0, i) + word.Substring(i + 1);
		for(int i=0; i < word.Length-1; ++i)
			yield return word.Substring(0, i) + word.Substring(i + 1, 1) + word.Substring(i, 1) + word.Substring(i + 2);
		for(int i=0; i < word.Length; ++i)
			for(char c='a'; c <= 'z'; ++c)
				yield return word.Substring(0,i) + c.ToString() + word.Substring(i + 1);
		for(int i=0; i <= word.Length; ++i)
			for(char c='a'; c <= 'z'; ++c)
				yield return word.Substring(0, i) + c.ToString() + word.Substring(i);
	}

	private static IEnumerable<string> Edits(string word)
	{
		return F.Distinct(EditsRaw(word));
	}

	private static IEnumerable<string> Edits2Raw(string word)
	{
		foreach(string edit in Edits(word))
			foreach(string edit2 in Edits(edit))
				yield return edit2;
	}

	private static IEnumerable<string> Edits2(string word)
	{
		return F.Distinct(Edits2Raw(word));
	}


	static void Main(string[] args)
	{
		Console.Out.WriteLine(F.Count(Edits2("something")));
	}
}

public static class F
{
	public static IEnumerable<T> Distinct<T>(IEnumerable<T> from)
	{
		if (from == null) yield break;
		Dictionary<T, byte> values = new Dictionary<T, byte>();
		foreach (T t in from)
		{
			if (!values.ContainsKey(t))
			{
				values.Add(t, 0);
				yield return t;
			}
		}
	}
	public static int Count<T>(IEnumerable<T> from)
	{
		int i = 0;
		foreach(T _ in from) ++i;
		return i;
	}
}
By on 5/7/2007 5:34 PM ()
IntelliFactory Offices Copyright (c) 2011-2012 IntelliFactory. All rights reserved.
Home | Products | Consulting | Trainings | Blogs | Jobs | Contact Us | Terms of Use | Privacy Policy | Cookie Policy
Built with WebSharper