I\'m looking for a way to parse c++ code to retrieve some basic information about classes. I don\'t actually need much information from the code itself, but I do need it to hand
Running Doxygen on the code would give you most of that, wouldn't it?
In what format do you want the output?
Exuberant Ctags will give you most of what you need, it's usually used by editors to provide code navigation.
May choke on some templates though...
Doxygen can also produce a detailed XML by setting an option in the configuration file. It is quite thorough, and very easy to use. From the doxygen home page:
The XML output consists of a structured "dump" of the information gathered by doxygen. Each compound (class/namespace/file/...) has its own XML file and there is also an index file called index.xml.
A file called combine.xslt XSLT script is also generated and can be used to combine all XML files into a single file.
Doxygen also generates two XML schema files index.xsd (for the index file) and compound.xsd (for the compound files). This schema file describes the possible elements, their attributes and how they are structured, i.e. it the describes the grammar of the XML files and can be used for validation or to steer XSLT scripts.
In the addon/doxmlparser directory you can find a parser library for reading the XML output produced by doxygen in an incremental way (see addon/doxmlparser/include/doxmlintf.h for the interface of the library)
Sounds like a job for gcc-xml in combination with the c++ xml-library or xml-friendly scripting language of your choice.
See also Ira Baxter here, where he cites his own product.
Warning: mind you, only Elsa "..I hear does a fairly good job.." at constructing a symbol table, which according to Ira Baxter is necessary for OP's original intent (see comments to this answer - I quote him because he is an expert in the field).
The DMS Software Reengineering Toolkit is general purpose program analysis and transformation machinery. Its C++ Front End builds on DMS to provide full featured C++ parsing for a variety of common C++ dialects, can process set of C++ classes simulataneously, and constructs full name/type/access information that you can use any way you want. Information is tagged as to precise origin file/line/column. (It includes a full preprocessor).
You are right; regex can't even come close to this.