Scala Parser Combinators: Parsing in a stream

Submitted by 我是研究僧i on 2019-12-07 18:12:56

Question


I'm using the native parser combinator library in Scala, and I'd like to use it to parse a number of large files. I have my combinators set up, but the file that I'm trying to parse is too large to be read into memory all at once. I'd like to be able to stream from an input file through my parser and write the results back to disk, so that I don't need to store everything in memory at once. My current system looks something like this:

import scala.io.Source

val f = Source.fromFile("myfile")
// parser.document.+ yields a List of documents; write each one out
parser.parse(parser.document.+, f.reader()).get.foreach(_.writeToFile)
f.close()

This reads the whole file in as it parses, which I'd like to avoid.


Answer 1:


There is no easy or built-in way to accomplish this using Scala's parser combinators, which provide a facility for implementing parsing expression grammars.

Operators such as ||| (longest match) are largely incompatible with a stream parsing model, as they require extensive backtracking capabilities. In order to accomplish what you are trying to do, you would need to re-formulate your grammar such that no backtracking is required, ever. This is generally much harder than it sounds.

As mentioned by others, your best bet would be to look into a preliminary phase where you chunk your input (e.g. by line) so that you can handle a portion of the stream at a time.
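
As a rough illustration of that chunking approach, here is a minimal sketch that parses one line at a time and writes each result straight to an output file, so only the current line is held in memory. The grammar ("key = 123" records), the LineParser and StreamingParse names, and the tab-separated output format are all assumptions made up for the example; they are not from the question. It also assumes the scala-parser-combinators module is on the classpath.

```scala
import java.io.PrintWriter
import scala.io.Source
import scala.util.parsing.combinator.RegexParsers

// Hypothetical grammar: one record per line, of the form "key = 123"
object LineParser extends RegexParsers {
  val key: Parser[String] = """[A-Za-z_]\w*""".r
  val number: Parser[Int] = """-?\d+""".r ^^ (_.toInt)
  val record: Parser[(String, Int)] =
    key ~ ("=" ~> number) ^^ { case k ~ v => (k, v) }
}

object StreamingParse {
  // Reads inPath lazily, parses each line on its own, and appends the
  // result to outPath. Lines that fail to parse are reported on stderr.
  def run(inPath: String, outPath: String): Unit = {
    val source = Source.fromFile(inPath, "UTF-8")
    val out = new PrintWriter(outPath, "UTF-8")
    try {
      for ((line, idx) <- source.getLines().zipWithIndex) {
        LineParser.parseAll(LineParser.record, line) match {
          case LineParser.Success((k, v), _) =>
            out.println(s"$k\t$v")
          case failure: LineParser.NoSuccess =>
            Console.err.println(s"line ${idx + 1}: ${failure.msg}")
        }
      }
    } finally {
      out.close()
      source.close()
    }
  }
}
```

This only works when every record fits on a single line. If a record can span several lines, the chunking step has to find record boundaries before the parser is called.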




Answer 2:


One easy way of doing it is to grab an Iterator from the Source object and then walk through the lines like so:

import scala.io.Source

val source = Source.fromFile("myFile")
val lines = source.getLines() // lazy Iterator[String]; lines are read on demand
for (line <- lines) {
    // Do magic with the line-value
}
source.close() // Close the file

Of course, your parser then has to be able to consume the input one line at a time.

Source: https://groups.google.com/forum/#!topic/scala-user/LPzpXo3sUVE




Answer 3:


You might try the StreamReader class that is part of the parsing package.

You would use it something like:

import scala.io.Source
import scala.util.parsing.input.StreamReader

val f = StreamReader(Source.fromFile("myfile", "UTF-8").reader())

parseAll(parser, f)



Answer 4:


As one poster above mentioned, longest match is a problem. Combined with the regex parsers calling source.subSequence(0, source.length), which needs the length of the entire input, it means that even StreamReader doesn't help.

The best kludgy answer I have is to use getLines, as others have mentioned, and to chunk the input, as the accepted answer suggests. My particular input required me to chunk 2 lines at a time. You could build an iterator out of the chunks to make it slightly less ugly.
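
One possible way to build such an iterator of chunks is sketched below, using only the standard library's Iterator.grouped. The Chunking object, the chunks and processFile helpers, and the parseChunk and handle parameters are hypothetical names introduced for the example; parseChunk stands in for a call into your own parser.

```scala
import scala.io.Source

object Chunking {
  // Groups an iterator of lines into chunks of `size` lines, joined by
  // newlines. The last chunk is shorter if the number of lines is not a
  // multiple of `size`.
  def chunks(lines: Iterator[String], size: Int): Iterator[String] =
    lines.grouped(size).map(_.mkString("\n"))

  // Hypothetical driver: feeds each chunk to `parseChunk` and passes the
  // result to `handle` (for example, a function that writes it to disk).
  def processFile[A](path: String, size: Int)(parseChunk: String => A)(handle: A => Unit): Unit = {
    val source = Source.fromFile(path, "UTF-8")
    try chunks(source.getLines(), size).map(parseChunk).foreach(handle)
    finally source.close()
  }
}
```

Because both getLines and grouped are lazy, only one chunk is materialised at a time. For the two-line case described above, the call would be Chunking.chunks(source.getLines(), 2).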



Source: https://stackoverflow.com/questions/19014207/scala-parser-combinators-parsing-in-a-stream
