I am writing code for PhD research and am starting to use Scala. I often have to do text processing. I am used to Python, whose 'yield' statement is extremely useful for i…
The implementation below provides a Python-like generator.
Notice that the code below uses a function called `_yield`, because `yield` is already a keyword in Scala (one which, by the way, has nothing to do with the `yield` you know from Python).
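Before the implementation itself, a short aside on what Scala's own `yield` keyword does, since it is easy to confuse the two. This is only a minimal sketch in plain Scala (no continuations plugin needed); the object and value names are made up for illustration.

```scala
object ScalaYieldDemo {
  val sectors: List[String] = List("Financials", "Materials")

  // Scala's built-in `yield` belongs to for comprehensions: the whole
  // expression is rewritten to a `map` call and returns a new collection.
  // Nothing is suspended or resumed, unlike Python's yield.
  val upper: List[String] = for (s <- sectors) yield s.toUpperCase

  // Equivalent form without the keyword.
  val sameThing: List[String] = sectors.map(_.toUpperCase)

  def main(args: Array[String]): Unit =
    println(upper) // prints List(FINANCIALS, MATERIALS)
}
```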
    import scala.annotation.tailrec
    import scala.collection.immutable.Stream
    import scala.util.continuations._

    object Generators {
      // A generator is either finished (Done) or holds a value plus the rest of the computation
      sealed trait Trampoline[+T]
      case object Done extends Trampoline[Nothing]
      case class Continue[T](result: T, next: Unit => Trampoline[T]) extends Trampoline[T]

      class Generator[T](var cont: Unit => Trampoline[T]) extends Iterator[T] {
        def next: T = {
          cont() match {
            case Continue(r, nextCont) => cont = nextCont; r
            case _ => sys.error("Generator exhausted")
          }
        }

        // Note: this also runs cont(), so the code up to the next _yield is evaluated
        // once here and again in next; keep generator bodies free of side effects
        def hasNext = cont() != Done
      }

      type Gen[T] = cps[Trampoline[T]]

      def generator[T](body: => Unit @Gen[T]): Generator[T] = {
        new Generator((Unit) => reset { body; Done })
      }

      // Suspends the body: captures the rest of it as cont and hands back the value t
      def _yield[T](t: T): Unit @Gen[T] =
        shift { (cont: Unit => Trampoline[T]) => Continue(t, cont) }
    }

    object TestCase {
      import Generators._

      def sectors = generator {
        def tailrec(seq: Seq[String]): Unit @Gen[String] = {
          if (!seq.isEmpty) {
            _yield(seq.head)
            tailrec(seq.tail)
          }
        }

        val list: Seq[String] = List("Financials", "Materials", "Technology", "Utilities")
        tailrec(list)
      }

      def main(args: Array[String]): Unit = {
        for (s <- sectors) { println(s) }
      }
    }
It works pretty well, including for the typical usage of for loops.
Caveat: we need to remember that Python and Scala differ in the way continuations are implemented. Below we see how generators are typically used in Python and compare to the way we have to use them in Scala. Then, we will see why it needs to be like so in Scala.
If you are used to writing code in Python, you would probably expect to write a generator like this:
    // This is Scala code that does not compile :(
    // This code naively tries to mimic the way generators are used in Python
    def myGenerator = generator {
      val list: Seq[String] = List("Financials", "Materials", "Technology", "Utilities")
      list foreach {s => _yield(s)}
    }
The code above does not compile. Skipping the convoluted theory, the reason is that "the type of the for loop" does not match the type involved in the continuation: `foreach` expects an ordinary function, while `s => _yield(s)` carries the `@Gen[String]` continuation annotation. I'm afraid this explanation does not help much on its own, so let me try again:
If you had coded something like shown below, it would compile fine:
    def myGenerator = generator {
      _yield("Financials")
      _yield("Materials")
      _yield("Technology")
      _yield("Utilities")
    }
This code compiles because the generator can be decomposed into a sequence of `yield`s and, in this case, a `yield` matches the type involved in the continuation. To be more precise, the code can be decomposed into chained blocks, where each block ends with a `yield`. Just for the sake of clarification, we can think of the sequence of `yield`s as being expressed like this:
    { some code here; _yield("Financials")
        { some other code here; _yield("Materials")
            { eventually even some more code here; _yield("Technology")
                { ok, fine, you've got the idea, right?; _yield("Utilities") }}}}
Again, without going deep into convoluted theory, the point is that after a `yield` you need to either provide another block that ends with a `yield`, or close the chain. This is what we are doing in the pseudo-code above: after the `yield` we open another block, which in turn ends with a `yield`, followed by another block that ends with another `yield`, and so on. Obviously this must end at some point, and then the only thing we are allowed to do is close the entire chain.
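To make the idea of chained blocks more concrete, here is a rough sketch of what such a chain looks like when written by hand, without the continuations plugin. It redeclares the `Trampoline` types so that it compiles on its own; `ChainDemo` and `drain` are names I made up for illustration, and this is only an approximation of what `reset`/`shift` build for us, not the plugin's actual output.

```scala
object ChainDemo {
  // Same shape as the Trampoline type in Generators, redeclared here
  // so that the sketch is self-contained.
  sealed trait Trampoline[+T]
  case object Done extends Trampoline[Nothing]
  case class Continue[T](result: T, next: Unit => Trampoline[T]) extends Trampoline[T]

  // Each block hands over one value plus "the rest of the chain".
  // The innermost block closes the chain with Done.
  val chain: Trampoline[String] =
    Continue("Financials", (_: Unit) =>
      Continue("Materials", (_: Unit) =>
        Continue("Technology", (_: Unit) =>
          Continue("Utilities", (_: Unit) => Done))))

  // Walk the chain, collecting results until Done is reached.
  def drain[T](t: Trampoline[T]): List[T] = t match {
    case Continue(r, next) => r :: drain(next(()))
    case Done              => Nil
  }
}
```

Calling `ChainDemo.drain(ChainDemo.chain)` walks the four blocks in order and stops at `Done`, which is the "closing the entire chain" step described above.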
OK. But... how can we `yield` multiple pieces of information? The answer is a little obscure but makes a lot of sense once you know it: we need to employ tail recursion, and the last statement of a block must be a `yield`.
    def myGenerator = generator {
      def tailrec(seq: Seq[String]): Unit @Gen[String] = {
        if (!seq.isEmpty) {
          _yield(seq.head)
          tailrec(seq.tail)
        }
      }

      val list = List("Financials", "Materials", "Technology", "Utilities")
      tailrec(list)
    }
Let's analyze what's going on here:
- Our generator function `myGenerator` contains some logic that obtains or generates information. In this example, we simply use a sequence of strings.
- Our generator function `myGenerator` calls a recursive function which is responsible for `yield`-ing multiple pieces of information, obtained from our sequence of strings.
- The recursive function must be declared before use, otherwise the compiler crashes.
- The recursive function `tailrec` provides the tail recursion we need.
The rule of thumb here is simple: substitute a for loop with a recursive function, as demonstrated above.
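If it helps, the same rule of thumb can be seen without the continuations plugin at all. The sketch below is my own plain-Scala approximation, with hypothetical names `RecursionDemo`, `fromSeq` and `take`: instead of looping over the sequence, the function emits the head and defers the recursive call on the tail until the consumer asks for more.

```scala
object RecursionDemo {
  // Redeclared so that the sketch compiles on its own; same shape as in Generators.
  sealed trait Trampoline[+T]
  case object Done extends Trampoline[Nothing]
  case class Continue[T](result: T, next: Unit => Trampoline[T]) extends Trampoline[T]

  // Hypothetical helper: the recursion replaces the loop. The recursive
  // call sits inside the function value, so it only runs on demand.
  def fromSeq[T](seq: Seq[T]): Trampoline[T] =
    if (seq.isEmpty) Done
    else Continue(seq.head, (_: Unit) => fromSeq(seq.tail))

  // Consumer side: pull at most n values, leaving the rest unevaluated.
  def take[T](t: Trampoline[T], n: Int): List[T] = t match {
    case Continue(r, next) if n > 0 => r :: take(next(()), n - 1)
    case _                          => Nil
  }
}
```

For instance, `RecursionDemo.take(RecursionDemo.fromSeq(Seq("Financials", "Materials", "Technology")), 2)` pulls only the first two values; the call for the remaining tail is never made.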
Notice that `tailrec` is just a convenient name we picked, for the sake of clarity. In particular, the call to `tailrec` does not need to be the last statement of our generator function. The only restriction is that you have to provide a sequence of blocks which match the type of a `yield`, as shown below:
    def myGenerator = generator {
      def tailrec(seq: Seq[String]): Unit @Gen[String] = {
        if (!seq.isEmpty) {
          _yield(seq.head)
          tailrec(seq.tail)
        }
      }

      _yield("Before the first call")
      _yield("OK... not yet...")
      _yield("Ready... steady... go")

      val list = List("Financials", "Materials", "Technology", "Utilities")
      tailrec(list)

      _yield("done")
      _yield("long life and prosperity")
    }
Going one step further, you may be wondering what real-life applications look like, in particular if you are employing several generators. It is a good idea to standardize your generators around a single pattern that proves convenient for most circumstances.
Let's examine the example below. We have three generators: `sectors`, `industries` and `companies`. For brevity, only `sectors` is shown in full. This generator employs a `tailrec` function, as demonstrated above. The trick here is that the same `tailrec` function is also employed by the other generators. All we have to do is supply a different `body` function.
    import scala.collection.immutable
    import scala.xml.NodeSeq

    type GenP = (NodeSeq, NodeSeq, NodeSeq)
    type GenR = immutable.Map[String, String]

    def tailrec(p: GenP)(body: GenP => GenR): Unit @Gen[GenR] = {
      val (stats, rows, header) = p
      if (!stats.isEmpty && !rows.isEmpty) {
        val heads: GenP = (stats.head, rows.head, header)
        val tails: GenP = (stats.tail, rows.tail, header)
        _yield(body(heads))
        // tail recursion
        tailrec(tails)(body)
      }
    }

    def sectors = generator[GenR] {
      def body(p: GenP): GenR = {
        // unpack arguments (the parentheses matter: without them, each name is bound to the whole tuple)
        val (stat, row, header) = p
        // obtain name and url
        val name = (row \ "a").text
        val url  = (row \ "a" \ "@href").text
        // create map and populate fields: name and url
        val m = new scala.collection.mutable.HashMap[String, String]
        m.put("name", name)
        m.put("url", url)
        // populate other fields
        (header, stat).zipped.foreach { (k, v) => m.put(k.text, v.text) }
        // returns an immutable map, as required by GenR
        m.toMap
      }

      val root  : scala.xml.NodeSeq = cache.loadHTML5(urlSectors) // obtain entire page
      val header: scala.xml.NodeSeq = ... // code is omitted
      val stats : scala.xml.NodeSeq = ... // code is omitted
      val rows  : scala.xml.NodeSeq = ... // code is omitted
      // tail recursion
      tailrec((stats, rows, header))(body)
    }

    def industries(sector: String) = generator[GenR] {
      def body(p: GenP): GenR = {
        //++ similar to 'body' demonstrated in "sectors"
        // returns an immutable map
        m.toMap
      }

      //++ obtain NodeSeq variables, like demonstrated in "sectors"
      // tail recursion
      tailrec((stats, rows, header))(body)
    }

    def companies(sector: String) = generator[GenR] {
      def body(p: GenP): GenR = {
        //++ similar to 'body' demonstrated in "sectors"
        // returns an immutable map
        m.toMap
      }

      //++ obtain NodeSeq variables, like demonstrated in "sectors"
      // tail recursion
      tailrec((stats, rows, header))(body)
    }
Credits to Rich Dougherty and huynhjl.
See this SO thread: Implementing yield (yield return) using Scala continuations
Credits to Miles Sabin, for putting some of the code above together: http://github.com/milessabin/scala-cont-jvm-coro-talk/blob/master/src/continuations/Generators.scala