What is the best way to read a very large file (such as a text file with 100,000 names, one per line) into a list lazily, loading it as needed, in Clojure?
Basic
You need to use line-seq. An example from clojuredocs:
;; Count lines of a file (loses head):
user=> (with-open [rdr (clojure.java.io/reader "/etc/passwd")]
         (count (line-seq rdr)))
But with a lazy list of strings, you cannot efficiently perform operations that need the whole list at once, such as sorting. If you can express your operations in terms of filter or map, you can consume the list lazily. Otherwise, an embedded database may be a better fit.
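As a minimal sketch of the filter/map case, the following counts the names that start with a given prefix while consuming the lines lazily inside with-open. The helper name count-matching and the sample data are hypothetical, made up for illustration.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical helper: counts the lines of `file` that start with `prefix`.
;; The lazy seq is fully consumed inside with-open and nothing holds on to
;; its head, so already-processed lines can be garbage collected.
(defn count-matching [file prefix]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         (map str/trim)
         (filter #(str/starts-with? % prefix))
         count)))

;; Usage with a throwaway temp file standing in for the large names file.
(def sample-file (java.io.File/createTempFile "names" ".txt"))
(spit sample-file "Alice\nBob\nAlbert\nCarol\n")
(count-matching sample-file "Al")
```

Because count returns a plain number, nothing lazy escapes the with-open scope.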
Also note that you should not hold on to the head of the list, otherwise the whole list will be loaded in memory.
Furthermore, if you need to do more than one operation, you'll need to read the file again and again. Be warned, laziness can make things difficult sometimes.
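When several results are needed, one option is to compute them together in a single reduce, so the file is read only once. This is a sketch under assumptions: the helper name line-stats and the two statistics it collects are hypothetical examples.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: computes the line count and the longest line
;; in one pass over the file, without retaining the lines.
(defn line-stats [file]
  (with-open [rdr (io/reader file)]
    (reduce (fn [acc line]
              (-> acc
                  (update :lines inc)
                  (update :longest #(if (> (count line) (count %)) line %))))
            {:lines 0 :longest ""}
            (line-seq rdr))))

;; Usage with a throwaway temp file.
(def stats-file (java.io.File/createTempFile "names" ".txt"))
(spit stats-file "Al\nBarbara\nCy\n")
(line-stats stats-file)
```

The accumulator map holds only the running results, so memory use stays small regardless of file size.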
See my answer here, which streams the lines through a core.async channel:
(ns user
  (:require [clojure.core.async :as async :refer :all
             :exclude [map into reduce merge partition partition-by take]]))

(defn read-dir [dir]
  (let [directory (clojure.java.io/file dir)
        files (filter #(.isFile %) (file-seq directory))
        ch (chan)]
    (go
      (doseq [file files]
        (with-open [rdr (clojure.java.io/reader file)]
          (doseq [line (line-seq rdr)]
            (>! ch line))))
      (close! ch))
    ch))
You can then use it like so:
(def aa "D:\\Users\\input")

(let [ch (read-dir aa)]
  (loop []
    (when-let [line (<!! ch)]
      (println line)
      (recur))))
Andrew's solution worked well for me, but nested defns are not so idiomatic, and you don't need to do lazy-seq twice. Here is an updated version without the extra prints and using letfn:
(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader file))))

(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
There are various ways of doing this, depending on exactly what you want.
If you have a function that you want to apply to each line in a file, you can use code similar to Abhinav's answer:
(with-open [rdr ...]
  (doall (map function (line-seq rdr))))
This has the advantage that the file is opened, processed, and closed as quickly as possible, but forces the entire file to be consumed at once.
If you want to delay processing of the file you might be tempted to return the lines, but this won't work:
(map function ; broken!!!
     (with-open [rdr ...]
       (line-seq rdr)))
because the file is closed when with-open returns, which is before you lazily process the file.
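One way around this is to pull the entire file into memory with slurp and split it into lines (mapping over the result of slurp directly would iterate over characters, since it returns a single string):
(map function (clojure.string/split-lines (slurp filename)))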
That has an obvious disadvantage - memory use - but guarantees that you don't leave the file open.
An alternative is to leave the file open until you get to the end of the read, while generating a lazy sequence:
(ns ...
  (:use clojure.test))

(defn stream-consumer [stream]
  (println "read" (count stream) "lines"))

(defn broken-open [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (line-seq rdr)))

(defn lazy-open [file]
  (defn helper [rdr]
    (lazy-seq
      (if-let [line (.readLine rdr)]
        (cons line (helper rdr))
        (do (.close rdr) (println "closed") nil))))
  (lazy-seq
    (do (println "opening")
        (helper (clojure.java.io/reader file)))))

(deftest test-open
  (try
    (stream-consumer (broken-open "/etc/passwd"))
    (catch RuntimeException e
      (println "caught " e)))
  (let [stream (lazy-open "/etc/passwd")]
    (println "have stream")
    (stream-consumer stream)))

(run-tests)
Which prints:
caught #<RuntimeException java.lang.RuntimeException: java.io.IOException: Stream closed>
have stream
opening
closed
read 29 lines
Showing that the file wasn't even opened until it was needed.
This last approach has the advantage that you can process the stream of data "elsewhere" without keeping everything in memory, but it also has an important disadvantage: the file is not closed until the end of the stream is read. If you are not careful you may have many files open at the same time, or never close them at all (by not reading the stream completely).
The best choice depends on the circumstances - it's a trade-off between lazy evaluation and limited system resources.
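A possible middle ground, sketched here under assumptions, is to pass the processing function into the scope of with-open, so the reader is closed even when only part of the file is consumed. The helper name process-lines is hypothetical and is not a library function.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: calls `f` with a lazy seq of the lines while the
;; file is open, then closes the reader whether or not `f` read everything.
;; A lazy result is forced with doall (top level only) so that no reads
;; are attempted after the reader has been closed.
(defn process-lines [file f]
  (with-open [rdr (io/reader file)]
    (let [result (f (line-seq rdr))]
      (if (seq? result) (doall result) result))))

;; Usage with a throwaway temp file: keep only the first two lines.
(def demo-file (java.io.File/createTempFile "names" ".txt"))
(spit demo-file "Ann\nBen\nCat\nDan\n")
(process-lines demo-file #(take 2 %))
```

The trade-off is that the caller can no longer carry the lazy sequence "elsewhere": all work on the lines has to happen inside f.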
PS: Is lazy-open defined somewhere in the libraries? I arrived at this question trying to find such a function and ended up writing my own, as above.