Programmatically converting/parsing LaTeX code to plain text

梦毁少年i 2021-02-04 10:06

I have a couple of code projects in C++/Python in which LaTeX-format descriptions and labels are used to generate PDF documentation or graphs made using LaTeX+pstricks. However,

8 answers
  •  栀梦 (original poster)
    2021-02-04 10:44

    I realize this is an old post, but it comes up often in searches for parsing LaTeX with Python (as evidenced by "Extract only body text from arXiv articles formatted as .tex"), so I'm leaving this here for folks down the line. Here's a LaTeX parser in Python that supports searching and modifying the parse tree: https://github.com/alvinwan/texsoup. Taken from the README, here is some sample text and how you can interact with it via TexSoup.

    from TexSoup import TexSoup

    # Raw string, so that \b, \t, and \\ in the LaTeX source are not read as Python escapes.
    soup = TexSoup(r"""
    \begin{document}
    
    \section{Hello \textit{world}.}
    
    \subsection{Watermelon}
    
    (n.) A sacred fruit. Also known as:
    
    \begin{itemize}
    \item red lemon
    \item life
    \end{itemize}
    
    Here is the prevalence of each synonym.
    
    \begin{tabular}{c c}
    red lemon & uncommon \\
    life & common
    \end{tabular}
    
    \end{document}
    """)
    

    Here's how to navigate the parse tree.

    >>> soup.section  # grabs the first `section`
    \section{Hello \textit{world}.}
    >>> soup.section.name
    'section'
    >>> soup.section.string
    'Hello \\textit{world}.'
    >>> soup.section.parent.name
    'document'
    >>> soup.tabular
    \begin{tabular}{c c}
    red lemon & uncommon \\
    life & common
    \end{tabular}
    >>> soup.tabular.args[0]
    'c c'
    >>> soup.item
    \item red lemon
    >>> list(soup.find_all('item'))
    [\item red lemon, \item life]
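
The navigation above returns LaTeX nodes, not plain text. For the original goal of getting plain text out, here is a rough, dependency-free sketch using only regular expressions. It is not TexSoup's API and not a real parser: the function name `latex_to_text` and the `sample` string are made up for illustration, and it assumes simple input (no macro definitions, no math, no tables, no escaped characters like `\%`).

```python
import re

def latex_to_text(src: str) -> str:
    """Rough LaTeX-to-plain-text conversion; illustrative only, not TexSoup's API."""
    text = re.sub(r"(?<!\\)%.*", "", src)                   # strip comments (unescaped %)
    text = text.replace("\\\\", " ")                        # forced line breaks become spaces
    text = re.sub(r"\\(?:begin|end)\{[^}]*\}", "", text)    # drop \begin{...} / \end{...}
    text = re.sub(r"\\item\b\s*", "- ", text)               # list items become dashes
    # Unwrap \command{content}, innermost first, until none are left.
    wrapped = re.compile(r"\\[A-Za-z]+\*?\{([^{}]*)\}")
    while wrapped.search(text):
        text = wrapped.sub(r"\1", text)
    text = re.sub(r"\\[A-Za-z]+\*?", "", text)              # remove leftover bare commands
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)        # drop empty lines

# Hypothetical sample, a cut-down version of the document above.
sample = r"""
\begin{document}
\section{Hello \textit{world}.}
(n.) A sacred fruit.
\begin{itemize}
\item red lemon
\item life
\end{itemize}
\end{document}
"""

print(latex_to_text(sample))
# Hello world.
# (n.) A sacred fruit.
# - red lemon
# - life
```

Regexes like these break down quickly on nested or unusual input, which is the reason to reach for a parse tree in the first place.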
    

    Disclaimer: I wrote this library, and I wrote it for similar reasons. As for the post by Little Bobby Tales (regarding def): TexSoup doesn't handle definitions.
