Awk doesn't match all my entries

爱一瞬间的悲伤 2021-01-22 19:40

I'm trying to make "a script" - essentially an awk command - to extract the function prototypes from C code in a .c file in order to automatically generate a .h header. I'm new wit

2 Answers
  • 2021-01-22 20:06

    Note: the question has changed substantially since I wrote this answer.

    Replace [:space:] with [[:space:]]:

    $ awk '/^[a-zA-Z*_]+[[:space:]]+[a-zA-Z*_]+[[:space:]]*[(].*[)]/{ print $0 }' dict3.c
    dictent_t* dictentcreate(const char * key, const char * val)  
    dict_t* dictcreate() 
    void dictdestroy(*dict_t d) 
    void dictdump(dict_t *d) 
    int dictlook(dict_t *d, const char * key) 
    int dictget(char* s, dict_t *d, const char *key)
    dict_t* dictadd(dict_t* d, const char * key, const char * val)
    dict_t dictup(dict_t d, const char * key, const char *newval) 
    dict_t* dictrm(dict_t* d, const char * key)
    

    The reason is that [:space:] on its own is an ordinary bracket expression, so it matches any one of the characters :, s, p, a, c, or e. This is not what you want.

    You want [[:space:]] which will match any whitespace.
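
    As a quick illustration of the difference, here is a minimal sketch. The sample line is made up for the demonstration (a tab between the two words, so none of the letters in "space" sit between them), and some awks print a warning for the second pattern, which is why stderr is discarded:

```shell
# Hypothetical sample line: "int" and "foo()" separated by a tab.
line="$(printf 'int\tfoo()')"

# [[:space:]] is the POSIX character class inside a bracket expression,
# so it matches the tab.
good=$(printf '%s\n' "$line" | awk '/^int[[:space:]]+foo/ { print "match" }')

# [:space:] is only a bracket expression listing the characters : s p a c e,
# so the tab is not matched.
bad=$(printf '%s\n' "$line" | awk '/^int[:space:]+foo/ { print "match" }' 2>/dev/null)

echo "with [[:space:]]: ${good:-no match}"
echo "with [:space:]:   ${bad:-no match}"
```

    The first pattern should report a match and the second should not, assuming your awk supports POSIX character classes.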

    Sun/Solaris

    The native Sun/Solaris awk is notoriously bug-filled. If you are on that platform, try nawk or /usr/xpg4/bin/awk or /usr/xpg6/bin/awk.

    Using sed

    A very similar approach can be used with sed. This uses a regex based on yours (note that \+ and \t inside brackets are GNU sed extensions):

    $ sed -n '/^[a-zA-Z_*]\+[ \t]\+[a-zA-Z_*]\+ *[(]/p' dict3.c
    dictent_t* dictentcreate(const char * key, const char * val)  
    dict_t* dictcreate() 
    void dictdestroy(*dict_t d) 
    void dictdump(dict_t *d) 
    int dictlook(dict_t *d, const char * key) 
    int dictget(char* s, dict_t *d, const char *key)
    dict_t* dictadd(dict_t* d, const char * key, const char * val)
    dict_t dictup(dict_t d, const char * key, const char *newval) 
    dict_t* dictrm(dict_t* d, const char * key)
    

    The -n option tells sed not to print unless we explicitly ask it to. The construct /.../p tells sed to print the line if the regex inside the slashes is matched.
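
    To see what -n and p do together, here is a small sketch on a made-up three-line sample (it stands in for dict3.c and is not the real file). The simpler /^void/ pattern is used only to keep the demonstration short:

```shell
# Hypothetical three-line sample standing in for dict3.c.
sample='#include <stdio.h>
void dictdump(dict_t *d)
    return 0;'

# Without -n, sed prints every line once by default and /.../p prints
# each matching line a second time, so we count one extra line.
without_n=$(printf '%s\n' "$sample" | sed '/^void/p' | wc -l | tr -d ' ')

# With -n, only the lines explicitly printed by p are shown.
with_n=$(printf '%s\n' "$sample" | sed -n '/^void/p')

echo "lines without -n: $without_n"
echo "output with -n:   $with_n"
```

    Without -n the matching line appears twice in the output, which is why the two always go together for this kind of filtering.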

    All the improvements to the regex suggested by Ed Morton apply here also.

    Using perl

    The above can also be adapted to perl:

    perl -ne  'print if /^[a-zA-Z_*]+[ \t]+[a-zA-Z_*]+ *[(]/' dict3.c
    
  • 2021-01-22 20:08

    The regexp you're trying to write would be:

    $ awk '/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file
    dictent_t* dictentcreate(const char * key, const char * val)
    dict_t* dictcreate()
    void dictdestroy(*dict_t d)
    void dictdump(dict_t *d)
    int dictlook(dict_t *d, const char * key)
    int dictget(char* s, dict_t *d, const char *key)
    dict_t* dictadd(dict_t* d, const char * key, const char * val)
    dict_t dictup(dict_t d, const char * key, const char *newval)
    dict_t* dictrm(dict_t* d, const char * key)
    

    which written without character classes and making assumptions about your locale would be:

    $ awk '/^[a-zA-Z_][a-zA-Z0-9_]*\**[ \t]+[a-zA-Z_][a-zA-Z0-9_]*[ \t]*\([^)]*\)/' file
    dictent_t* dictentcreate(const char * key, const char * val)
    dict_t* dictcreate()
    void dictdestroy(*dict_t d)
    void dictdump(dict_t *d)
    int dictlook(dict_t *d, const char * key)
    int dictget(char* s, dict_t *d, const char *key)
    dict_t* dictadd(dict_t* d, const char * key, const char * val)
    dict_t dictup(dict_t d, const char * key, const char *newval)
    dict_t* dictrm(dict_t* d, const char * key)
    

    but:

    1. Get/use an awk that has character classes because if it doesn't have that then who knows what else it's missing?
    2. It's always trivial to write a script to find the strings you want but MUCH harder to NOT find the strings you DON'T want. For example, the above will match text inside comments and would fail given a declaration like int foo(int x /* always > 0 (I hope) */). When providing sample input/output you should always include some text that you think will be hard for a script to NOT select given it "looks" a lot like the text you do want to select but in the wrong context for your needs.
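
    The two problems in point 2 can be reproduced with a small sketch. The input is made up for the demonstration, and match()/substr() is used here only so that the exact text the regexp matched becomes visible:

```shell
# Hypothetical input: a declaration whose comment contains parentheses,
# and a commented-out declaration that should NOT be reported.
input='int foo(int x /* always > 0 (I hope) */)
/* old version:
int dead(void)
*/'

out=$(printf '%s\n' "$input" | awk '
match($0, /^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) {
    # Print only the matched text, followed by a semicolon.
    print substr($0, RSTART, RLENGTH) ";"
}')

printf '%s\n' "$out"
```

    The first declaration is cut off at the ")" inside the comment, and the line inside the block comment is reported as if it were live code.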

    Note that C symbols cannot start with a number and so the regexp to match one is not [[:alnum:]_]+ but is instead [[:alpha:]_][[:alnum:]_]*. Also functions can and often do return pointers to pointers to pointers and the * can be next to the function name instead of the function return type so you REALLY should be using a regexp like this (untested since you didn't provide input of the format that this would match) if your function declarations can be any of the normal formats:

    awk '/^[[:alpha:]_][[:alnum:]_]*([[:space:]]*(\*[[:space:]]*)+|[[:space:]]+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/' file
    

    That won't of course match declarations that span lines - that is a whole other can of worms.
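
    Since no input with these layouts was provided, here is a sketch against made-up declarations. The separator between return type and function name is my own assumption of what "any of the normal formats" means: either plain whitespace, or one or more * with optional whitespace on either side:

```shell
# Hypothetical declarations covering the pointer layouts discussed above,
# plus two lines that should be rejected.
input='char **names(void)
dict_t* dictcreate()
dict_t * dictfind(dict_t *d)
int dictlook(dict_t *d, const char *key)
2fast foo(void)
dictdump(d)'

out=$(printf '%s\n' "$input" | awk '
# Return type, then either whitespace or a run of "*" with optional
# whitespace around it, then the function name and the argument list.
/^[[:alpha:]_][[:alnum:]_]*([[:space:]]*(\*[[:space:]]*)+|[[:space:]]+)[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/
')

printf '%s\n' "$out"
```

    The first four lines should be printed. The line starting with a digit is rejected by the [[:alpha:]_] anchor, and the bare call dictdump(d) is rejected because the separator cannot be empty.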

    In general you can't parse C without a C parser, but if you want something cheap and cheerful then at least run a C beautifier on the code first to try to get all the various possible layouts into one consistent format (google "C beautifier"). You also need to strip out the comments (see for example https://stackoverflow.com/a/13062682/1745001).
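
    As a rough idea of what stripping comments first buys you, here is a deliberately simplified sketch. It is not the method from the linked answer: the sed step is my own assumption and only handles /* ... */ comments that open and close on one line (with no * inside) and trailing // comments. It will get string literals and multi-line comments wrong. The input is made up:

```shell
# Hypothetical input with comments that would confuse the regexp.
input='int foo(int x /* always > 0 (I hope) */)
void bar(void) // trailing note (ignored)'

# Simplified comment removal: single-line /* ... */ first, then // to end of line.
stripped=$(printf '%s\n' "$input" | sed -e 's|/\*[^*]*\*/||g' -e 's|//.*$||')

out=$(printf '%s\n' "$stripped" | awk '
match($0, /^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) {
    print substr($0, RSTART, RLENGTH) ";"
}')

printf '%s\n' "$out"
```

    With the comments gone, the argument list of foo is no longer cut short at the ")" inside the comment.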

    Given your new requirements and your new sample input/output, this is what you are asking for:

    $ awk 'match($0,/^[[:alpha:]_][[:alnum:]_]*\**[[:space:]]+[[:alpha:]_][[:alnum:]_]*[[:space:]]*\([^)]*\)/) { print substr($0,RSTART,RLENGTH) ";" }' file
    dict_t dictup(dict_t d, const char * key, const char * newval);
    dict_t* dictrm(dict_t* d, const char * key);
    

    but again - this is by no means robust given the possible layouts of C code in general. You need a C parser, a C beautifier, and/or a specialized tool to do this job (e.g. google cscope) robustly.
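
    For anyone unfamiliar with match(), here is a small sketch of how RSTART and RLENGTH are used to trim a line in commands like the one above. The sample line and the shorter argument-list regexp are made up for the demonstration:

```shell
# Hypothetical definition line with an opening brace after the prototype.
line='dict_t* dictrm(dict_t* d, const char * key) {'

out=$(printf '%s\n' "$line" | awk '
match($0, /\([^)]*\)/) {
    # RSTART is the 1-based index where the match begins, RLENGTH its length.
    printf "RSTART=%d RLENGTH=%d\n", RSTART, RLENGTH
    # Keep everything up to the end of the argument list and append ";".
    print substr($0, 1, RSTART + RLENGTH - 1) ";"
}')

printf '%s\n' "$out"
```

    The second line of output is the prototype with the trailing brace dropped and a semicolon appended.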
