Programmatically converting/parsing LaTeX code to plain text

梦毁少年i 2021-02-04 10:06

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However,

8 Answers
  • 2021-02-04 10:44

    I understand this is an old post, but since it comes up often in LaTeX/Python parsing searches (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), I'm leaving this here for folks down the line. Here's a LaTeX parser in Python that supports searching and modifying the parse tree: https://github.com/alvinwan/texsoup. Taken from the README, here is some sample text and how you can interact with it via TexSoup.

    from TexSoup import TexSoup
    soup = TexSoup(r"""
    \begin{document}
    
    \section{Hello \textit{world}.}
    
    \subsection{Watermelon}
    
    (n.) A sacred fruit. Also known as:
    
    \begin{itemize}
    \item red lemon
    \item life
    \end{itemize}
    
    Here is the prevalence of each synonym.
    
    \begin{tabular}{c c}
    red lemon & uncommon \\
    life & common
    \end{tabular}
    
    \end{document}
    """)
    

    Here's how to navigate the parse tree.

    >>> soup.section  # grabs the first `section`
    \section{Hello \textit{world}.}
    >>> soup.section.name
    'section'
    >>> soup.section.string
    'Hello \\textit{world}.'
    >>> soup.section.parent.name
    'document'
    >>> soup.tabular
    \begin{tabular}{c c}
    red lemon & uncommon \\
    life & common
    \end{tabular}
    >>> soup.tabular.args[0]
    'c c'
    >>> soup.item
    \item red lemon
    >>> list(soup.find_all('item'))
    [\item red lemon, \item life]
    

    Disclaimer: I wrote this library, though for similar reasons. Regarding the post by Little Bobby Tales about \def: TexSoup doesn't handle such definitions.

  • 2021-02-04 10:48

    I would try pandoc (http://johnmacfarlane.net/pandoc/index.html). It is written in Haskell, but it is a really nice LaTeX-to-whatever converter.
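    For a C++/Python project, one way to use pandoc is to call the standalone executable as a subprocess. The following is only a minimal sketch, assuming pandoc is installed and on the PATH; the helper names pandoc_command and latex_to_plain are made up for illustration and are not part of pandoc.

```python
import shutil
import subprocess


def pandoc_command():
    """Return the argument list for a LaTeX-to-plain-text pandoc run."""
    # --from/--to select pandoc's reader and writer; "plain" is its plain-text writer.
    return ["pandoc", "--from=latex", "--to=plain"]


def latex_to_plain(latex_source):
    """Pipe a LaTeX string through pandoc and return the plain-text result."""
    # Assumption: pandoc is installed separately; fail early if it is missing.
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        pandoc_command(),
        input=latex_source,   # pandoc reads from stdin when no file is given
        capture_output=True,
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

    With pandoc available, latex_to_plain(r"\emph{This} bit is \textbf{very} clever.") should return the sentence with the markup removed; how equations and custom macros come out depends on pandoc's LaTeX reader.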

  • 2021-02-04 10:54

    Building on the other post by Eduardo Leoni: I was looking at pandoc, and I see that it comes with a standalone executable, but this page also promises a way to build it as a C-callable system library. Perhaps that is something you can live with?

  • 2021-02-04 10:59

    Since you're considering using TeX itself for the rendering, I suspect that performance is not an issue. In that case you have a couple of options: use dvi2txt to extract the text from a single DVI file (be prepared to generate one for each label), or even render the DVI into raster images, if that's acceptable for you; that's how hevea and latex2html treat formulas.
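    The one-DVI-per-label idea could look roughly like the sketch below. It assumes that the latex and catdvi executables are installed (catdvi is swapped in here for the text extraction step because its basic "catdvi file.dvi" invocation prints the text to stdout); the helper names wrap_label and label_to_text are hypothetical.

```python
import pathlib
import subprocess
import tempfile


def wrap_label(label):
    """Wrap one LaTeX label in a minimal document so TeX can typeset it alone."""
    return (
        "\\documentclass{article}\n"
        "\\pagestyle{empty}\n"  # no page number, so it does not leak into the text
        "\\begin{document}\n"
        f"{label}\n"
        "\\end{document}\n"
    )


def label_to_text(label):
    """Typeset a label with latex, then extract the DVI's text with catdvi."""
    with tempfile.TemporaryDirectory() as tmp:
        tex_path = pathlib.Path(tmp) / "label.tex"
        tex_path.write_text(wrap_label(label), encoding="utf-8")
        # Run latex inside the temporary directory so label.dvi lands there.
        subprocess.run(
            ["latex", "-interaction=nonstopmode", "label.tex"],
            cwd=tmp, check=True, capture_output=True,
        )
        # Assumption: catdvi's output can be decoded with the default encoding.
        result = subprocess.run(
            ["catdvi", "label.dvi"],
            cwd=tmp, check=True, capture_output=True, text=True,
        )
    return result.stdout.strip()
```

    Each call starts a full TeX run, so this only makes sense when, as noted above, performance is not a concern.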

  • 2021-02-04 11:02

    Necroing this old thread, but found this nifty library called pylatexenc that seems to do almost exactly what the OP was after:

    from pylatexenc.latex2text import LatexNodes2Text
    
    
    # Raw string: backslashes reach pylatexenc unchanged; print() shows the converted text.
    print(LatexNodes2Text().latex_to_text(r"""
    \section{Euler}
    \emph{This} bit is \textbf{very} clever:
    \begin{equation}
        \mathrm{e}^{i \pi} + 1 = 0  % wow!!
    \end{equation}
    where
    \[
    \mathrm{e} = \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^n
    \]
    """))
    

    which produces

    
    § EULER
    
    This bit is very clever:
    
        e^i π + 1 = 0
    
    where
    
        e = lim_n →∞(1 + 1/n)^n
    

    As you can see, the result is not perfect for the equations, but it does a great job of stripping and converting the TeX commands.

  • 2021-02-04 11:03

    A word of caution: it is much more difficult to write a complete parser for plain TeX than you might think. The TeX-level (not LaTeX) \def command actually extends TeX's syntax. For example, after \def\foo #1.{{\bf #1}}, TeX expands \foo goo. into {\bf goo}, i.e. a bold "goo". Notice that the dot became a delimiter for the \foo macro! Therefore, if you have to deal with arbitrary TeX, with no restrictions on which packages may be used, relying on simple parsing is not recommended. You need TeX rendering. catdvi is what I use, although it is not perfect.
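    To make the point concrete, here is a deliberately naive, regex-based stripper (a sketch for illustration, not a recommended approach; naive_strip and SIMPLE_COMMAND are made-up names). It copes with ordinary \command{argument} markup, but it has no notion of macro definitions, so the delimited \foo macro passes through untouched.

```python
import re

# Matches \command{argument} where the argument contains no nested braces.
SIMPLE_COMMAND = re.compile(r"\\[A-Za-z]+\{([^{}]*)\}")


def naive_strip(latex):
    """Repeatedly replace \\command{arg} with arg until nothing changes."""
    previous = None
    while previous != latex:
        previous = latex
        # Innermost commands are unwrapped first; outer ones on later passes.
        latex = SIMPLE_COMMAND.sub(r"\1", latex)
    return latex


# Ordinary markup is handled.
print(naive_strip(r"\section{Hello \textit{world}.}"))

# The \def and the call using the dot delimiter are left as they are:
# nothing here knows that "\foo goo." should turn into "goo".
print(naive_strip(r"\def\foo #1.{{\bf #1}} \foo goo."))
```

    The first call prints "Hello world."; the second prints its input unchanged, which is exactly the kind of case where only a real TeX run gets the right answer.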
