I\'m trying to make \"a script\" - essentially an awk command - to extract the prototypes of functions of C code in a .c file to generate automatically a header .h. I\'m new wit
Note: the question has changed substantially since I wrote this answer.
Replace [:space:]
with [[:space:]]
:
$ awk '/^[a-zA-Z*_]+[[:space:]]+[a-zA-Z*_]+[[:space:]]*[(].*?[)]/{ print $0 }' dict3.c
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)
The reason is that [:space:]
will match any of the characters :
, s
, p
, a
, c
, or e
. This is not what you want.
You want [[:space:]]
which will match any whitespace.
The native Sun/Solaris awk is notoriously bug-filled. If you are on that platform, try nawk
or /usr/xpg4/bin/awk
or /usr/xpg6/bin/awk
.
A very similar approach can be used with sed
. This uses a regex based on yours:
$ sed -n '/^[a-zA-Z_*]\+[ \t]\+[a-zA-Z*]\+ *[(]/p' dict3.c
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)
The -n
option tells sed not to print unless we explicitly ask it to. The construct /.../p
tells sed to print the line if the regex inside the slashes is matched.
All the improvements to the regex suggested by Ed Morton apply here also.
The above can also be adopted to perl:
perl -ne 'print if /^[a-zA-Z_*]+[ \t]+[a-zA-Z*]+ *[(]/' dict3.c
The regexp you're trying to write would be:
$ awk '/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)
which written without character classes and making assumptions about your locale would be:
$ awk '/^[a-zA-Z_][a-zA-Z0-9_]*\**[ \t]+[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\([^)]*\)/' file
dictent_t* dictentcreate(const char * key, const char * val)
dict_t* dictcreate()
void dictdestroy(*dict_t d)
void dictdump(dict_t *d)
int dictlook(dict_t *d, const char * key)
int dictget(char* s, dict_t *d, const char *key)
dict_t* dictadd(dict_t* d, const char * key, const char * val)
dict_t dictup(dict_t d, const char * key, const char *newval)
dict_t* dictrm(dict_t* d, const char * key)
but:
int foo(int x /* always > 0 (I hope) */)
. When providing sample input/output you should always include some text that you think will be hard for a script to NOT select given it "looks" a lot like the text you do want to select but in the wrong context for your needs.Note that C symbols cannot start with a number and so the regexp to match one is not [[:alnum:]_]+
but is instead [[:alpha:]_][[:alnum:]_]*
. Also functions can and often do return pointers to pointers to pointers and the *
can be next to the function name instead of the function return type so you REALLY should be using a regexp like this (untested since you didn't provide input of the format that this would match) if your function declarations can be any of the normal formats:
awk '/^[[:alpha:]_][[:alnum:]_]*((\*[[:space:]]*)*|(\*[[:space:]]*)*|[[:space:]]+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file
That won't of course match declarations that span lines - that is a whole other can of worms.
In general you can't parse C without a C parser but if you want something cheap and cheerful then at least run a C beautifier on the code first to try to get all the various possible layouts into one consistent format (google "C beautifier" and you also need to strip out the comments (see for example https://stackoverflow.com/a/13062682/1745001).
Given your new requirements and your new sample input/output, this is what you are asking for:
$ awk 'match($0,/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) { print substr($0,RSTART,RLENGTH) ";" }' file
dict_t dictup(dict_t d, const char * key, const char * newval);
dict_t* dictrm(dict_t* d, const char * key);
but again - this is by no means robust given the possible layouts of C code in general. You need a C parser, a C beautifier, and/or a specialized tool to do this job (e.g. googl cscope
) robustly.