Processing a file character by character in Clojure

礼貌的吻别 2020-12-20 21:38

I'm working on writing a function in Clojure that will process a file character by character. I know that Java's BufferedReader class has the read() method that reads one

3 Answers
  • 2020-12-20 22:04
    (with-open [reader (clojure.java.io/reader "path/to/file")] ...
    

    This is how I prefer to get a reader in Clojure. And by "character by character", do you mean at the file-access level, like read, which lets you control how many characters are read at a time?

    Edit

    As @deterb pointed out, let's check the source code of line-seq

    (defn line-seq
      "Returns the lines of text from rdr as a lazy sequence of strings.
       rdr must implement java.io.BufferedReader."
      {:added "1.0"
       :static true}
      [^java.io.BufferedReader rdr]
      (when-let [line (.readLine rdr)]
        (cons line (lazy-seq (line-seq rdr)))))
    

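    Following the same pattern, I mocked up a char-seq:

    (defn char-seq
      "Returns the characters from rdr as a lazy sequence."
      [^java.io.Reader rdr]
      (let [chr (.read rdr)]          ; .read returns an int, or -1 at end of stream
        (when (>= chr 0)
          (cons (char chr) (lazy-seq (char-seq rdr))))))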
    

    I know this char-seq reads all chars into memory[1], but I think it shows that you can call .read directly on a BufferedReader. So you can write your code like this:

    (let [chr (.read rdr)]
      (if (>= chr 0)
        ;do your work here
      ))
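
To read every character rather than just one, the same .read call can be put in a loop. The following is a minimal sketch, not the asker's code: process-chars is a hypothetical name, and a StringReader stands in for a real file so the example is self-contained.

```clojure
(require '[clojure.java.io :as io])

(defn process-chars
  "Calls f on each character read from source, one at a time.
  source is anything clojure.java.io/reader accepts (a path, File, Reader...).
  process-chars is a hypothetical helper name used for illustration."
  [source f]
  (with-open [rdr (io/reader source)]
    (loop [c (.read rdr)]
      ;; .read returns the character as an int, or -1 at end of stream
      (when-not (neg? c)
        (f (char c))
        (recur (.read rdr))))))

;; Usage, with an in-memory reader standing in for a file:
(process-chars (java.io.StringReader. "abc") prn)
```

Because loop/recur holds only the current character, this never keeps more than one character of the file around at a time.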
    

    What do you think?

    [1] According to @dimagog's comment, char-seq does not read all chars into memory, thanks to lazy-seq.
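
The laziness mentioned in the footnote can be checked directly. This is a small sketch under assumptions: lazy-chars is a hypothetical variant of char-seq that returns characters, and the reader is an in-memory StringReader rather than a file.

```clojure
(defn lazy-chars
  "Hypothetical variant of char-seq: a lazy seq of characters from rdr."
  [^java.io.Reader rdr]
  (lazy-seq
    (let [c (.read rdr)]
      (when-not (neg? c)
        (cons (char c) (lazy-chars rdr))))))

(def rdr (java.io.StringReader. "hello world"))

;; Realize only the first three elements of the sequence.
(def first-three (doall (take 3 (lazy-chars rdr))))

;; The reader has advanced by exactly three characters, so the next
;; direct .read picks up at the fourth one.
(def next-char (char (.read rdr)))
```

Here first-three holds h, e, l and next-char is the second l, so nothing beyond what was consumed has been read. The flip side is that a lazy sequence over a reader has to be consumed before the reader is closed.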

  • 2020-12-20 22:10

    You're pretty close - keep in mind that Strings are a sequence. (concat "abc" "def") results in the sequence (\a \b \c \d \e \f).

    mapcat is another really useful function for this - it will lazily concatenate the results of applying the mapping fn to the sequence. This means that mapcating the result of converting all of the line strings to a seq will be the lazy sequence of characters you're after.

    I did this as (mapcat seq (line-seq reader)).
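
Put together, that might look like the sketch below. line-chars is a hypothetical name, and a StringReader stands in for the file. One thing to be aware of: line-seq strips line terminators, so newline characters do not appear in the result.

```clojure
(require '[clojure.java.io :as io])

(defn line-chars
  "Lazy seq of the characters of every line read from rdr.
  line-seq drops line terminators, so no newlines are included."
  [rdr]
  (mapcat seq (line-seq rdr)))

(def chars-read
  (with-open [rdr (io/reader (java.io.StringReader. "ab\ncd"))]
    ;; doall forces the lazy seq before with-open closes the reader
    (doall (line-chars rdr))))
```

chars-read ends up as the four characters a, b, c, d.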

    For other advice:

    • For creating the reader, I would recommend using the clojure.java.io/reader function instead of directly creating the classes.
    • Consider separating the reading of the file from the processing (in this case printing) of the strings. While it is important to keep the full file parsing inside the with-open form, being able to test the actual processing code outside of the file reading code is quite useful.
    • When navigating multiple (potentially nested) sequences consider using for. for does a nice job handling nested for loop type cases.

      (take 100 (for [line (repeat "abc") char (seq line)] (prn char)))

    • Use prn for debugging output. It gives you real output, as compared to user output (which hides certain details which users don't normally care about).
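
As an illustration of the advice about separating reading from processing, here is a rough sketch. Both function names are hypothetical, and counting a character is just a stand-in for whatever processing is actually needed.

```clojure
(require '[clojure.java.io :as io])

(defn count-char
  "Pure processing step: counts occurrences of ch in any seq of characters."
  [ch chars]
  (count (filter #{ch} chars)))

(defn count-char-in
  "I/O step: opens source and runs the pure step inside with-open.
  count is eager, so the reader is fully consumed before it is closed."
  [ch source]
  (with-open [rdr (io/reader source)]
    (count-char ch (mapcat seq (line-seq rdr)))))

;; The processing code can be tested on a plain string, no file needed:
(count-char \a "banana")
```

Only count-char-in touches the reader, so count-char can be exercised at the REPL with ordinary strings.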

  • 2020-12-20 22:12

    I'm not familiar with Java or the read() method, so I won't be able to help you out with implementing it.

    One first thought is to simplify by using slurp, which returns the text of the entire file as a string with just (slurp filename). However, this reads the whole file into memory, which may not be what you want.

    Once you have a string of the entire file text, you can process any string character by character by simply treating it as though it were a sequence of characters. For example:

    => (doseq [c "abcd"]
         (println c))
    a
    b
    c
    d
    => nil
    

    Or:

    => (remove #{\c} "abcd")
    => (\a \b \d)
    

    You could use map, reduce, or any other sequence function. Note that once you treat the string as a sequence, the result is a sequence rather than a string, but you can wrap the expression in (reduce str ...) to turn it back into a string at the end:

    => (reduce str (remove #{\c} "abcd"))
    => "abd"
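
As an aside, two other common ways to turn a sequence of characters back into a string are apply str and clojure.string/join. A short sketch (the var names are only for illustration):

```clojure
(require '[clojure.string :as string])

(def without-c (remove #{\c} "abcd"))

;; One call to str with all the characters as arguments
(def joined-a (apply str without-c))

;; Same result via clojure.string/join with no separator
(def joined-b (string/join without-c))
```

Both produce the same string as the reduce version, without building an intermediate string at each step.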
    

    As for the problem with your specific code, I think it lies in what words is: a vector of strings. When you print words, you are printing a whole vector. If you replaced the final (println words) with (doseq [w words] (println w)), then it should work great.

    Also, based on what you say you want your output to look like (a vector of all the different words in the file), you wouldn't want to just call (println w) at the base of your expression, because that prints the values and returns nil. You would simply want w. You would also want to replace your doseqs with fors, again to avoid returning nil.
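
The difference between doseq and for can be seen in a small sketch. The input lines here are made up for illustration and are not from the original question.

```clojure
(require '[clojure.string :as string])

(def lines ["the quick" "brown fox"])  ; hypothetical input lines

;; doseq is for side effects: it prints each word and returns nil
(def printed
  (doseq [line lines
          w    (string/split line #"\s+")]
    (println w)))

;; for returns a lazy seq of the body's values; vec makes it a vector
(def words
  (vec (for [line lines
             w    (string/split line #"\s+")]
         w)))
```

printed is nil, while words is the vector of the four words, which is closer to the output described in the question.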

    As for improving your code, it looks generally great to me. Building on the first change I suggested above (but not the others, since I don't want to write it all out explicitly), you could shorten it with a fun little trick:

    (doseq [item s]
      (let [words (split item #"\s")]
        (doseq [w words]
          (println w))))

    ;; Could be rewritten as...

    (doseq [item s
            :let [words (split item #"\s")]
            w words]
      (println w))
