4.9k
views20
comments

Suitability of Seq for this looping/recursive problem

Hi All,

I am wondering whether a seq is appropriate for the following problem. What I want to do is loop through a potentially huge seq and perform the following operations:

1
2

 
let x = { for i in "abcdefghijklmnopqrstuvwxyz" yield i }

Now, we have the following seq:

1
2

 
seq [ 'a'; 'b'; 'c'; 'd'; 'e'; 'f'; 'g'; 'h'; 'i'; ... ]

I want to run the following recursive function (pseudo-code):

let rec myfunc myseq len =
  match myseq with
  | x when Seq.nonempty x -> let y = hash Seq.take len x
                             // do some stuff with y
                             myfunc (Seq.tl x) len
  | _ -> () // seq is empty

Obviously, this is kind of simplified, but the crux is this: How do I get the tail (or pop off the first element) from a seq? Is the better option to use a list (since it's decomposition is also simpler in pattern matching)? Any thoughts greatly appreciated.

EDIT: I know that the Seq.take could fail after a certain point, but don't worry about that. The main problem is that I want to get the tail of a seq without forcing evaluation of the whole seq (as it could potentially be very large).

Thanks and regards,

If you would like to use recursion, a list is a much better construct. However, since you want to do some processing per individual element of the sequence, why not simply used Seq.iter?

Ralf Herbrich

By SniperEye on 11/20/2007 4:15 PM (permalink)

I wanted to do that initially (when I realized I couldn't use a sequence computation expression). However, Seq.iter works on 1 element at a time. The thing is that for each element in the sequence, I wanted to work with it + next 5 elements.

To illustrate which elements will be used in each step:

Full:  "abcdefghijklmnopqrstuvwxyz"
Step1: "abcde"
Step2:  "bcdef"
Step3:   "cdefg"
Step4:    "defgh"
...

I don't see how I could use Seq.iter, because Seq.take doesn't allow an offset into the sequence (as to where to start taking elements from).

EDIT: Having looked over the APIs for Seq, List and Array countless times over the last 2 hours, I'm beginning to think that Array is my best bet. List allows easy decomposition, but has no take function. Seq has take, but no easy way to get the tail.

EDIT2: On second thought, List will be best. Most of Seq's code works accepts a #seq as an argument. So, I'll go with that while keeping an eye out on this thread for more ideas.

Thanks,

By zakaluka on 11/20/2007 4:25 PM (permalink)

first off: you could use a LazyList, which would enable you to
do pretty much exactly what you were after to start with,
while still avoiding evaluating the whole list.

second, if you wanted something potentially more
efficient, i've found Seq.fold to be very useful for
this kind of thing. i like to think of
the value that threads through the function as like the accumulator
argument in a tail-recursive function.

so for this example:

Full:  "abcdefghijklmnopqrstuvwxyz"
Step1: "abcde"
Step2:  "bcdef"
Step3:   "cdefg"
Step4:    "defgh"
...

my immediate reaction would be to write something like the below
code. it calls f on a sliding window through a sequence,
calling it only when there are n elements in the window.
it threads a value through the computation in the same
way as fold, so you could easily use this to build a progressive
sliding window implementation. note that i haven't
even tried to compile this, so it's probably riddled with errors,
but i hope you get the gist of it.

let slidingwindow (n: int) (f: 'b -> 'a list -> 'b) (b: 'b) (s: seq<'a>) =
    Seq.fold (fun (b, (buf: Nbuf<'a>)) e s ->
        buf.add e
        if buf.len >= n then
            (f b buf.contents, buf)
        else
            (b, buf)) (b, new Nbuf(n)) s

where Nbuf is some implementation of a bounded-length queue.
for instance, here's a quick and inefficient one (a more
efficient one could use a ring buffer):

type Nbuf<'a> (max: int) =
    let p: 'a list ref
    member nb.contents = !p
    member nb.add x =
        let q = if (!p).Length < max then !p else List.tl !p
        p := q |> List.rev |> List.cons x |> List.rev

BTW, is there a nice efficient way of implementing a ring
buffer in a purely functional style (other than using monads, i guess)?

i'm only starting to feel my way into the F# idioms, so any comments
on the above style (e.g. is that a reasonable way to use an object
expression?) would be very welcome!

By rog on 11/23/2007 10:05 AM (permalink)

Your original code looks good to me. If your "main problem is that I want to get the tail of a seq without forcing evaluation of the whole seq" then you must use a Seq (or LazyList) because List and Array always have the whole collection evaluated at once.

So it seems the Seq library lacks a tail function so you'll have to write one yourself. You could do this quickly like so

let drop n s =
  let drop' () =
    let r = ref n
    let f x = if !r > 0 then (decr r; None) else Some x
    Seq.choose f s
  Seq.delay drop'

let tail s = drop 1 s

However this code has a huge performance problem but it is not easy to write an efficient drop function for seq.

I agree that it would be nice if there was some uniformity of what functions are available across the different collection libraries.

If you like pattern matching syntax then you can use active patterns to give you a cons/nil view of seq. I thought maybe these active patterns were included in fslib but it appears not, but they are there for LazyList though. But again writing an efficient cons/nil pattern for seq is difficult.

By gneverov on 11/20/2007 10:36 PM (permalink)

Ok... this code should do what you want <i>and<i> it should be efficient (meaning O(len * Seq.length myseq) operations). It uses the drop function from my previous post but does not use it in an inefficient manner as would happen if you used drop in a recursive decomposition of a seq. Hope this helps. <code lang=fsharp>let transpose (ss : seq<seq<_>>) = let opn () = ss |> Seq.map (fun s -> s.GetEnumerator()) |> Seq.cache_all let gen (es : seq<IEnumerator<_>>) = if es |> Seq.for_all (fun e -> e.MoveNext()) then es |> Seq.map (fun e -> e.Current) |> Some else None let cls (es : seq<IEnumerator<_>>) = es |> Seq.iter (fun e -> e.Dispose()) Seq.generate opn gen cls let myfunc len myseq = Seq.init_finite len (fun i -> drop i myseq) |> transpose |> Seq.map hash //|> do some stuff with y</code>

By gneverov on 11/20/2007 11:56 PM (permalink)

Here is another version. It is more efficient because myseq is only enumerated once whereas in my previous version it is enumerated len times. Which depending on what myseq is, may have been a problem. But on the other hand this code doesn't "look" as nice.

let chunks n s =
  let chunks' () =
    let r = ref (n+n-2)
    let b = Array.zero_create n
    let f x = 
      let i = !r
      if i >= n then
        // waiting to accumulate enough items for first chunk
        b.[-i+n+n-2] <- x
        decr r
        None
      else 
        // output a chunk
        b.[ i ] <- x
        let j = if i+1 >= n then 0 else i+1
        let c = Array.zero_create n
        Array.blit b j c 0 (n-j)
        if j <> 0 then Array.blit b 0 c (n-j) (i+1)
        r := j
        Some (c :> seq<_>)
    Seq.choose f s
  Seq.delay chunks'


let myfunc len myseq = 
  chunks len myseq
  |> Seq.map hash
  //|> do some stuff with y

By gneverov on 11/21/2007 3:43 AM (permalink)

Here is another version. It is more efficient because myseq is only enumerated once whereas in my previous version it is enumerated len times. Which depending on what myseq is, may have been a problem. But on the other hand this code doesn't "look" as nice.
1
2
// code from comment 4121
...

gneverov,

I've gone through this code and just wanted to thank you for an excellent example of rotating array blits, as well as the most illustrative example of Seq.delay I've come across so far. I never really played around with Seq.delay before, and this really opened my eyes to the possibilities.

In that vein, I want to contribute a version of your chunks function that I've tried to simplify. Instead of relying on two arrays, it uses the F# Array's in-place update functionality. Also, It reduces the number of counters that need to be kept track of and trades rotating array blits (as I'm calling them) for a shift & replace. Preliminary testing indicates that both functions work identically (using

Seq.compare compare x y

let chunks2 n s =
  let chunks2' () =
    let first = ref 0
    let b = Array.zero_create n
    let f x =
      // If we are working on the first value
      if !first < n then
        b.[!first] <- x
        incr first
        // an exception needs to be made for the first value in order to keep the 'else' part simpler.
        if !first = n then Some (b :> seq<_>) else None
      // The rest of the values
      else
        Array.blit b 1 b 0 (n - 1)
        b.[n - 1] <- x
        Some (b :> seq<_>)
    Seq.choose f s
  Seq.delay chunks2'

Again, thanks for the inspiration and detailed code.

Regards,

By zakaluka on 11/23/2007 2:47 AM (permalink)

Good work, zakaluka. However you cannot get rid of the "second" array. In F#/CLR arrays are reference objects, which means that chunks2 is using the exact same array for its internal state and every element in its output. This is clearly not what you want. You can see the results of this by executing this code:

{1..10} |> chunks2 3 |> Seq.take 3

So it is essential that a new array is created for every element in the output.

I guess the moral of this story is that mixing mutable state and lazy evaluation is a real mess. So beware!

By gneverov on 11/23/2007 8:37 PM (permalink)

gneverov,

Thanks for the heads-up and sage warning at the end. Before (and after) I posted the chunks2 code, I tried to find a way to see if the arrays were being copied or not because this exact scenario kept bothering me. I guess I didn't test hard enough. Here is an updated version:

 
let chunks2 n s =
  let chunks2' () =
    let first = ref 0
    let b = Array.zero_create n
    let f x =
      if !first < n then
        b.[!first] <- x
        incr first
        // could use 'Array.of_seq b'  -- need to profile
        if !first = n then Some (Array.sub b 0 n:> seq<_>) else None
      else
        Array.blit b 1 b 0 (n - 1)
        b.[n - 1] <- x
        // could use 'Array.of_seq b'  -- need to profile
        Some (Array.sub b 0 n :> seq<_>)
    Seq.choose f s
  Seq.delay chunks2'

As per Array.sub's definition, it creates a new array from an existing one.

Array.of_seq

would work as well, though I don't know which is faster.

I think the reasons I was moving away from mutability are slowly coming back to me ;).

Thanks and regards,

By zakaluka on 11/23/2007 9:48 PM (permalink)

Hi zakaluka,

As this discussion has shown, some useful transformations on Seq objects can be written using interior mutable state within the transformer. One common pattern is to syntactically isolate this state inside a sequence expression. For example, let's say we want to implement

 

    val pairs : #seq<'b> -> seq<'b * 'b>

where the function stutters pairs, e.g.

 

pairs [1;2;3;4] = seq [(1, 2); (2, 3); (3, 4)]

Here's one way to do that using a single sequence expression.

let pairs (s: #seq<_>) = 
    seq { let delayed = ref None
          let ie  = s.GetEnumerator() 
          while ie.MoveNext() do
             match !delayed with 
             | None -> 
                 do delayed := Some ie.Current
             | Some v -> 
                 yield (v,ie.Current)
                 do delayed := Some ie.Current }

This is executed lazily: the interior "while" loop is part of a sequence expression construction and hence is a generator loop (it ultimately becomes a call to Seq.generated).

Note the interior of the sequence expression uses mutable state, but in an isolated way: an enumerator over the input sequence (ie) and an isolated lookahead buffer (delayed). The interior is also in sequence expression syntax, hence the need to use a couple of additional "do" and "yield" keywords (sequence expression syntax is basically like normal expression syntax except imperative actions are always prefixed by do, so every line begins with a keyword: do, do!, yield, yield!, match, while etc.)

You can use this technique to write "chunks" in a similar way:

let chunks n (s: #seq<_>) = 
    seq { let chunk = Array.zero_create n
          let i = ref 0 
          let count = ref 0 
          let ie  = s.GetEnumerator() 
          while ie.MoveNext() do
             do chunk.[!i] <- ie.Current
             do i := (!i + 1) % n
             do count := !count + 1
             if !count >= n then 
                 yield Array.permute (Permutation.Rotate(size=n,distance= - !i))  chunk }

 

chunks 3 [1;2;3;4;5;6;7;8];;

<i>val it : seq<int array> = seq [[|1; 2; 3|]; [|2; 3; 4|]; [|3; 4; 5|]; [|4; 5; 6|]; ...]<i>

Here the lookahead buffer is now an array plus "count" to tell us when it fills up for the first time. Otherwise the code is much as before, except we use Array.permute to copy and transform the returned array to be the right shape.

There are many interesting seqeunce expression transformations you can write using variations on the above technique. It's one of the more reasonable ways to combine laziness and side effects, because the mutable objects are clearly isolated (technique: we analyze the yield expressions, noting we copy out the arrays using Array.permute. We could also just return lists copied from the array.)

The above could also have been written using direct calls to Seq.generate (which can be thought of as a sort of "while" loop for sequences).

Kind regards,

Don

By dsyme on 11/24/2007 10:41 PM (permalink)

Don,

Thanks for the illustrative use of isolated side effects within sequence expressions. Actually, this style of programming reminds me somewhat of monads, mainly in its ability to sequester mutability from the rest of the code.

Since I extensively use sequence expressions in the rest of my code, I'll definitely store this technique away for future use.

Thanks and regards,

By zakaluka on 11/25/2007 2:04 PM (permalink)

Hi,

I found this an interesting problem and learned a lot in this thread. [:)]

However, my initial feeling would be to use a pure functional solution. A good candidate is the scan function which is like a fold but generates a new sequence with all intermediate results.

First off, we need a function to shift a new element into a list (drop the head and append the element):

let shift xs x =
   match xs with
   | []   -> [x]
   | h::t -> t @ [x]

Next the function that implements the sliding window. I use a tuple to keep a count of the item (to skip the first entries for filling up the list) and the current list.

let window (n,l) x =
    match n with
    | 0 -> (  0, shift l x)   // Shift the next elements
    | _ -> (n-1, l @ [x])     // First n entries, fill up the list

let chunks n s =
    Seq.scan window (n,[]) s
    |> Seq.filter (fun (n,_) -> n = 0)
    |> Seq.map (fun (_,x) -> x)

The chunks function uses scan to generate the following sequence:

{ (5, []);
  (4, ['a']);
  (3, ['a'; 'b']);
  (2, ['a'; 'b'; 'c']);
  (1, ['a'; 'b'; 'c'; 'd']);
  (0, ['a'; 'b'; 'c'; 'd'; 'e']);
  (0, ['b'; 'c'; 'd'; 'e'; 'f']);
  (0, ['c'; 'd'; 'e'; 'f'; 'g']); ... }

Then we filter out the first entries where n is not equal to 0 and use map to extract the lists and drop the count. The output is

{ ['a'; 'b'; 'c'; 'd'; 'e']; ['b'; 'c'; 'd'; 'e'; 'f']; ... }

I didn't expect this to be very efficient but some timing measurements (a 5 element sliding window on a 10000000 sequence taking the length of the output to completely generate the sequence) shows that it is nearly as fast as the version with array mutation of zakaluka. The solution of Don Syme with the single seq expression only runs at half that speed. [:(]

This functional solution can still be improved. We append an element to the end of a list which means traversing the list every time. Another datastructure could rectify this. As rog already suggested, a functional ring buffer would be a possible solution. I suspect that a datastructure based on a zipper can be used to implement this but didn't try it (yet).

Personally I like this kind of implementation better, no explicit mutable state or copying, so less possibilities for error. [:)]

Just another view to add to the discussion,

-- Freddy

By fpotargent on 11/26/2007 2:31 AM (permalink)

Is no one using Firefox to read this forum? The present control crashes Firefox, at least for me.

I'm not sure what the performance of my implementation is (I'd have to come up with a way to determine only that the length of the sequence is >= than the desired window), but it's certainly short, with no side effects, etc.

>  let rec chunk n l =
- match l with
- | h::t when (Seq.length l) >= n -> [(Seq.take n l)] @ (chunk n t)
- | _ -> [];;

val chunk : int -> 'a list -> 'a list list

> chunk 5 [1..10];;
val it : int list list
= [[1; 2; 3; 4; 5]; [2; 3; 4; 5; 6]; [3; 4; 5; 6; 7]; [4; 5; 6; 7; 8];
   [5; 6; 7; 8; 9]; [6; 7; 8; 9; 10]]
>

In fact, I could easily implement Window as a parametrized active pattern (parametrized for the size of the window) applied to sequences. Then I never need to get the sequence length at all. Well, I shouldn't say "easily"; I'm just learning F#.

By SlideGuitarist on 12/8/2007 12:25 PM (permalink)

On Firefox, I don't personally use it, but I know that you can't compose messages using Opera. Now, onto the code.

The biggest problem is that your signature comes to

val chunk : int -> 'a list -> 'a list list

, which doesn't work in this case. The input is coming in as a seq, which is not conducive to the sort of pattern matching/decomposition used here. If the input was a list, most of the code posted in this thread could be greatly simplified.

The reason this distinction is important is that the generation of each additional input element can take a long time, so the generation is done on demand. This ensures that there isn't a huge performance hit on the first invocation of the function. So, I should be able to do

Seq.init_finite 100000 (fun x -> x) |> chunk 10 |> Seq.take 5

and not have it evaluate the entire input.

Thanks and regards,

By zakaluka on 12/9/2007 9:31 PM (permalink)

28 NOV 2007 13:41 - Updated znMisc with a few more tests (checking which Array functions are faster)
29 NOV 2007 13:09 - Added summarized observations for people who don't want to download the zip file.

First, hats off to you, sir, for providing the most elegant solution so far. Not only does it remove mutability, but it is very pleasing to look at (a major reason, I would say, to use F# and other functional languages).

Since so many people have provided solutions, I thought I would profile them using different kinds of data. For each run, I generate char/integer/seq<int> elements using Seq.init_finite. The search spaces are 1,000, 100,000 and 1,000,000 elements in size. The window sizes are 5, 5,000 and 50,000.

I've attached the entire code base for repeatability. The excel file (profiling data) is a work in progress, and contains raw numbers plus percentage differences. If you are plotting the values, try and remove extraneous cases (easy to identify) so that the graph is easier to read.

If data is missing in the graph, it is because the time required to run that test case took "too long" on my laptop. The times are provided in milliseconds, so you can judge how much longer some of the test cases would have taken.

NOTE1: The times recorded are those required to run the Main function. This way, the setup times for all the functions are the same. It also means that time requried to set up a domain and a default domain manager are not part of the benchmark.

NOTE2: Do not compare times between character, integer and seq<int> tests. For different tests, the CPU is throttled to various levels depending on time of day, heat factor, etc. Use percentages instead.

NOTE3: Do not test multiple functions in the same compiled binary. The reason is that caching really skews the results and provides very unreliable data (sometimes orders of magnitude off).

Computer configuration:

Dell e1705 notebook
1.66 Ghz Core 2 Duo Intel processor
2GB of RAM
Windows XP SP2
VS 2005 (used for editing/compiling)
Profiler: Yourkit Profiler 3.0 EAP for .NET -- I tried NProf (couldn't get raw timing data, data not properly labeled under each function), IJW Profiler (takes up more disk space than I have free (> 4.5GB)) and CLR Profiler (no timing data).

All the applications (one for each 'sensible' combination of function/window size/search space) were compiled with the default 'Release' settings in VS 2005.

EDIT: I realized I had a question I forgot to ask. Looking at the excel file, I see that Don's solution really bombed when the window was changed to 5000 within a 100,000 element string. This result was about 56000% slower than the fastest solution. If someone wants to hazard a guess as to why, I'd very much appreciate it (note: I haven't looked at the IL or anything like that yet).

SUMMARIZED OBSERVATIONS:

gneverov's function from post 4121 is the fastest function posted here, and is treated as the baseline for these observations.
My fastest variation of gneverov's function was slower by an average of 20% (1.21:1).
dsyme's function, ignoring the extreme satellite observation, was, on average 385% slower (4.85:1).
fpotargent's function from post 4163 was, on average, 215% slower (3.13:1).
gneverov's modification to fpotargent's function improved that to 200% slower (3.02:1).

For most of these observations (save gneverov's first function), I did not increase the window size due to time constraints. Doing so could shed further light on how well some of these functions scale. All the data used for these observations are available in the Excel file within the attached zip.

Thanks and regards to all,

By zakaluka on 11/28/2007 10:10 AM (permalink)

Hmm... I think fpotargent has taken the lead for best solution now. [:)]

I had thought of using scan early on but saw no way of not producing the unwanted initial part of the list. It didn't occur to me that this can be cleaned up after the fact.

One small tweak is that you could combine the filter and map into

Seq.choose (function (0,x) -> Some x | _ -> None)

Other misconception I had was that using lists for the internal state would be slow because the snoc operation (appending an element to the end of a list) is O(n) in the length of the list whereas adding to a mutable circular queue is O(1). However you need to copy the array for output anyway, which is O(n), so it's all the same. So lists are just great for performance here.

By gneverov on 11/26/2007 5:52 AM (permalink)

Thanks! [:)]

I didn't think of using Seq.choose although I noticed that you used it in your code! That's a nice tip, thanks.

And I agree that using a double ended queue or something alike wouldn't bring much since you need to copy out the array anyway. I guess we can't get better then O(n*m). [:(]

But it still amazes me how pure functional code can be concise, beautiful and still pretty efficient.

By fpotargent on 11/26/2007 6:55 AM (permalink)

But it still amazes me how pure functional code can be concise, beautiful and still pretty efficient.

Ahh... 'tis the way code was meant to be written. [:D]

By gneverov on 11/26/2007 7:39 AM (permalink)

Thanks to all who have replied so far.

Yes, my main issue is that I do not want to evaluate the whole seq, primarily for performance reasons. Between my last post and this one, I implemented a version that uses Lists, but of course this is less than ideal.

I did try to write a tail function for Seq, but stopped soon because I was trying to stay as functional as possible (read: I'm lazy) and wanted to get the rest of the logic sorted out first. I have taken your advice and am looking at the LazyList data type.

Let me more adequately describe the scenario so that I can explain my motivation for this thread. I am implementing a number of string search algorithms. These algorithms, instead of working on the raw string, are passing it through a filter. This filter is what I am trying to implement as a Seq. So, this filter reads in a character at a time, checks it against various options (for example, in English, check if punctuation on/off, whitespace on/off, etc) and decides whether to pass that character through or not. Currently, the filter is implemented as a comprehension. Using a seq will hopefully avoid long initial delays (since the whole string won't be filtered at once).

Now, as I described in my previous post, the filtered string is taken and each of the (element + n more characters) are subjected to some processing.

My goal, as of now, is to avoid having two copies of the entire string in active memory at the same time. As I mentioned, the string could be very long, and it is extremely inefficient to have copies floating around.

Having said this, I am currently looking at the chunks function posted above. Although I haven't really looked at Seq.delay in the past, I understand what you are trying to do here. I need to look at it a little more, but preliminary tests show that it works perfectly (and handles fringe cases as well).

Thanks for all the help,

By zakaluka on 11/21/2007 6:59 PM (permalink)

I do not know how the compiler handles this but if you wanted to use list, you could use something like :

let rec run f l =
  match l with 
  | a :: b :: c :: d :: e :: tl -> 
      f (a,b,c,d,e)
      run f tl
  | _ -> ()

let f (a,b,c,d,e) = 
  print_int (a+b+c+d+e)
  print_strign "\n"

let _ =
  run f  [for i in 0 .. 8 -> i]

Of course, as was mentioned in the posts above, the catch is that you do not get the lazy evaluation of your collection...

By Julien on 11/21/2007 3:13 AM (permalink)

Topic tags

Built with WebSharper

Home

Answers

Events

Courses

Groups and Conferences

Blogs

Jobs

Developers

Topic tags