AWK Patterns and Actions
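A pattern consists of a regular expression, an expression that evaluates to true or false, or a
combination of the two. By default, awk prints every line for which the pattern is true. A pattern
expression carries an implicit if statement, so it does not need to be enclosed in braces. When the
if is written explicitly, the expression becomes an action statement and the syntax is different.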
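An action is a sequence of statements enclosed in braces and separated by semicolons. If a pattern
precedes the action, the pattern controls when the action is executed.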
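The difference between the pattern form and the explicit-if form can be sketched as follows. This is a minimal illustration on made-up input; the helper names `sample_input`, `pattern_form`, and `action_form` are hypothetical and not part of awk.

```shell
# Hypothetical sample data: a name and a number on each line.
sample_input() {
    printf 'alice 1\nbob 5\ncarol 9\n'
}

# Pattern form: the condition stands alone and the default action is "print".
pattern_form() {
    sample_input | awk '$2 > 3'
}

# Action form: the same test written as an explicit if inside braces.
action_form() {
    sample_input | awk '{ if ($2 > 3) print }'
}

pattern_form
action_form
```

Both functions select the same lines; the pattern form is simply the shorter way to write it.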
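Regular Expressions
\ Suppresses the special meaning of the following character.
^ Matches at the beginning of the string. ^ does not match the beginning of a line embedded in a
string: if ("line1\nLINE 2" ~ /^L/) ... is not true.
$ Matches at the end of the string. $ does not match the end of a line embedded in a string:
if ("line1\nLINE 2" ~ /1$/) ... is not true.
. Matches any single character, including the newline character.
[...] Matches any one of the characters in the specified set.
[^ ...] Matches any one character that is not in the specified set.
| Matches either of the alternatives on its two sides. It has the lowest precedence of all the
regexp operators. The alternation applies to the largest possible regexps on either side.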
(...) Parentheses are used for grouping in regular expressions, as in arithmetic. They can
be used to concatenate regular expressions containing the alternation operator.
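* Matches zero or more occurrences of the preceding character or group.
+ Matches one or more occurrences of the preceding character or group.
? Matches zero or one occurrence of the preceding character or group.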
{n}, {n,}, {n,m} One or two numbers inside braces denote an interval expression. If there is
one number in the braces, the preceding regexp is repeated n times. If there are two
numbers separated by a comma, the preceding regexp is repeated n to m times. If there
is one number followed by a comma, then the preceding regexp is repeated at least n
times:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’, only.
wh{2,}y
Matches ‘whhy’ or ‘whhhy’, and so on.
Interval expressions were not traditionally available in awk. They were added as part
of the POSIX standard to make awk and egrep consistent with each other.
However, because old programs may use ‘{’ and ‘}’ in regexp constants, by default
gawk does not match interval expressions in regexps. If either --posix or --re-interval is
specified, then interval expressions are allowed in regexps. (This default applies to gawk
releases before 4.0; from 4.0 on, interval expressions are enabled by default.)
For new programs that use ‘{’ and ‘}’ in regexp constants, it is good practice to
always escape them with a backslash. Then the regexp constants are valid and work the way
you want them to.
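In regexps, the ‘*’, ‘+’, and ‘?’ operators, as well as the braces ‘{’ and ‘}’, have the highest
precedence, followed by concatenation, and finally by ‘|’. As in arithmetic, parentheses can be
used to change how operators are grouped.
In POSIX awk and gawk, the ‘*’, ‘+’, and ‘?’ operators stand for themselves when nothing in the
regexp precedes them. Many other versions of awk treat such usage as an error.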
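The anchoring, quantifier, and precedence rules above can be checked directly from the shell. The following is a small sketch; the function names and test strings are made up for illustration, and the output assumes an awk that follows the behavior described in this section.

```shell
# ^ and $ anchor at the ends of the whole string, not at embedded newlines.
anchor_demo() {
    awk 'BEGIN {
        s = "line1\nLINE 2"
        print ((s ~ /^L/) ? "match" : "no match")   # L follows a newline only
        print ((s ~ /1$/) ? "match" : "no match")   # 1 precedes a newline only
        print ((s ~ /^l/) ? "match" : "no match")   # the string does start with l
    }'
}

# ? allows zero or one repetition, * allows any number.
quantifier_demo() {
    awk 'BEGIN {
        print ("color"   ~ /^colou?r$/)
        print ("colour"  ~ /^colou?r$/)
        print ("colouur" ~ /^colou?r$/)
        print ("colouur" ~ /^colou*r$/)
    }'
}

# | binds more loosely than concatenation and anchors.
alternation_demo() {
    awk 'BEGIN {
        s = "xyzzy"
        print (s ~ /^P|y/)      # alternatives are "^P" and "y"
        print (s ~ /^(P|y)/)    # grouping puts both alternatives under ^
    }'
}

anchor_demo
quantifier_demo
alternation_demo
```

In the last function, the ungrouped form succeeds because a `y` appears somewhere in the string, while the grouped form fails because the string starts with neither `P` nor `y`.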
gawk-Specific Regexp Operators
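\y Matches the empty string at either the beginning or the end of a word.
\B Matches the empty string within a word.
\< Matches the empty string at the beginning of a word (anchors the start of a word).
\> Matches the empty string at the end of a word (anchors the end of a word).
\w Matches any word-constituent character (a letter, a digit, or an underscore).
\W Matches any character that is not word-constituent.
\` Matches the empty string at the beginning of the string.
\' Matches the empty string at the end of the string.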
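A short sketch of two of these operators in use. It assumes GNU awk is installed under the name `gawk`, since other awk implementations generally do not recognize these escapes; the function names and input strings are hypothetical.

```shell
# \y matches at a word boundary, so only the standalone word is replaced.
word_boundary_demo() {
    echo 'stow away, stowaway' | gawk '{ gsub(/\yaway\y/, "AWAY"); print }'
}

# \w matches letters, digits, and underscore, so a_1 counts as one run.
word_char_demo() {
    echo 'a_1 + b2' | gawk '{ gsub(/\w+/, "W"); print }'
}

# Run the demos only when gawk is available.
if command -v gawk >/dev/null 2>&1; then
    word_boundary_demo
    word_char_demo
fi
```

The guard at the end keeps the script from failing on systems where only a non-GNU awk is present.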
The various command-line options control how gawk interprets characters in regexps:
No options :
In the default case, gawk provides all the facilities of POSIX regexps and the
previously described GNU regexp operators. However, interval expressions are not
supported (in gawk releases before 4.0; later releases enable them by default).
--posix :
Only POSIX regexps are supported; the GNU operators are not special (e.g., ‘\w’
matches a literal ‘w’). Interval expressions are allowed.
--traditional :
Traditional Unix awk regexps are matched. The GNU operators are not special, interval
expressions are not available, nor are the POSIX character classes ([[:alnum:]], etc.).
Characters described by octal and hexadecimal escape sequences are treated literally, even
if they represent regexp metacharacters. Also, gawk silently skips directories named on the
command line.
--re-interval :
Allow interval expressions in regexps, even if --traditional has been provided.
(--posix automatically enables interval expressions, so --re-interval is redundant when
--posix is used.)
Bracket Character Classes Added by POSIX
Class Meaning
[:alnum:] Alphanumeric characters.
[:alpha:] Alphabetic characters.
[:blank:] Space and TAB characters.
[:cntrl:] Control characters.
[:digit:] Numeric characters.
[:graph:] Characters that are both printable and visible.
[:lower:] Lowercase alphabetic characters.
[:print:] Printable characters (characters that are not control characters).
[:punct:] Punctuation characters
[:space:] Space characters (such as space, TAB, and formfeed, to name a few).
[:upper:] Uppercase alphabetic characters.
[:xdigit:] Characters that are hexadecimal digits.
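A character class is only valid inside a bracket expression, so it is written with two pairs of brackets, as in [[:digit:]]. The sketch below classifies a few made-up input lines; the function name `classify` is hypothetical, and very old awk implementations may not support these classes.

```shell
# Label each input line by the kind of characters it contains.
classify() {
    printf '42\nabc\nA1 b2\n\t \n' | awk '
        /^[[:digit:]]+$/ { print "digits";  next }
        /^[[:alpha:]]+$/ { print "letters"; next }
        /^[[:blank:]]+$/ { print "blank";   next }
                         { print "mixed" }
    '
}

classify
```

The last input line consists of a TAB and a space, which is why it falls under [:blank:].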
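Range Patterns
A range pattern consists of two patterns separated by a comma. It matches every line from a line
that matches the first pattern through the next line that matches the second pattern, inclusive.
If the second pattern never matches, the range runs to the end of the input; if the first pattern
never matches, no lines are selected. After a range ends, it starts again at the next line that
matches the first pattern. For example, $ awk '/root/,/mysql/' test prints all lines from the
first line containing root through the next line containing mysql.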
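A minimal sketch of a range pattern on made-up input, standing in for the file named test in the example above. The helper names `sample_lines` and `range_demo` are hypothetical.

```shell
# Hypothetical input: root appears twice, mysql only once.
sample_lines() {
    printf 'daemon\nroot\nbin\nmysql\nnobody\nroot\nsync\n'
}

# Print every line from a root line through the next mysql line.
range_demo() {
    sample_lines | awk '/root/,/mysql/'
}

range_demo
```

The first range covers root, bin, and mysql. The second root opens a new range that has no closing mysql line, so it runs to the end of the input and also prints sync; nobody falls between the two ranges and is skipped.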