Struct
GLibRegex
since: 2.14
Description
struct GRegex {
/* No available fields */
}
A GRegex is a compiled form of a regular expression.
After instantiating a GRegex, you can use its methods to find matches in a string, replace matches within a string, or split the string at matches.
GRegex implements regular expression pattern matching using syntax and semantics (such as character classes, quantifiers, and capture groups) similar to Perl regular expressions. See the PCRE documentation for details.
A typical scenario for regex pattern matching is to check if a string matches a pattern. The following statements implement this scenario.
const char *regex_pattern = ".*GLib.*";
const char *string_to_search = "You will love the GLib implementation of regex";
g_autoptr(GMatchInfo) match_info = NULL;
g_autoptr(GRegex) regex = NULL;

regex = g_regex_new (regex_pattern, G_REGEX_DEFAULT, G_REGEX_MATCH_DEFAULT, NULL);
g_assert (regex != NULL);

if (g_regex_match (regex, string_to_search, G_REGEX_MATCH_DEFAULT, &match_info))
  {
    int start_pos, end_pos;
    g_match_info_fetch_pos (match_info, 0, &start_pos, &end_pos);
    g_print ("Match successful! Overall pattern matches bytes %d to %d\n", start_pos, end_pos);
  }
else
  {
    g_print ("No match!\n");
  }
The constructor for GRegex includes two sets of bitmapped flags:
GRegexCompileFlags: These flags control how GLib compiles the regex. There are options for case sensitivity, multiline matching, ignoring whitespace, etc.
GRegexMatchFlags: These flags control GRegex’s matching behavior, such as anchoring and customizing definitions for newline characters.
Some regex patterns include backslash assertions, such as \d (digit) or \D (non-digit). When the pattern is written as a C string literal, those backslashes must themselves be escaped. For example, the pattern "\\d\\D" matches a digit followed by a non-digit.
GLib’s implementation of pattern matching includes a start_position argument for some of the match, replace, and split methods. Specifying a start position provides flexibility when you want to ignore the first n characters of a string, but want to incorporate backslash assertions at character n - 1. For example, a database field contains inconsistent spelling for a job title: healthcare provider and health-care provider.
The database manager wants to make the spelling consistent by adding a hyphen when it is missing. The following regex pattern tests for the string care at a position that is not a word boundary (that is, preceded by a word character rather than a hyphen) and followed by a space.
const char *regex_pattern = "\\Bcare\\s";
An efficient way to match with this pattern is to start examining at start_position 6 in the string healthcare or health-care.
const char *regex_pattern = "\\Bcare\\s";
const char *string_to_search = "healthcare provider";
g_autoptr(GMatchInfo) match_info = NULL;
g_autoptr(GRegex) regex = NULL;

regex = g_regex_new (
  regex_pattern,
  G_REGEX_DEFAULT,
  G_REGEX_MATCH_DEFAULT,
  NULL);
g_assert (regex != NULL);

g_regex_match_full (
  regex,
  string_to_search,
  -1,
  6, // position of 'c' in the test string.
  G_REGEX_MATCH_DEFAULT,
  &match_info,
  NULL);
The method g_regex_match_full() (and other methods that take a start_position argument) allows for lookback before the start position to determine if the previous character satisfies an assertion.
Unless you set G_REGEX_RAW as one of the GRegexCompileFlags, all the strings passed to GRegex methods must be encoded in UTF-8. The lengths and the positions inside the strings are in bytes and not in characters, so, for instance, \xc3\xa0 (i.e., à) is two bytes long but it is treated as a single character. If you set G_REGEX_RAW, the strings can be non-valid UTF-8 strings and a byte is treated as a character, so \xc3\xa0 is two bytes and two characters long.
Regarding line endings, \n matches a \n character, and \r matches a \r character. More generally, \R matches all typical line endings: CR + LF (\r\n), LF (linefeed, U+000A, \n), VT (vertical tab, U+000B, \v), FF (formfeed, U+000C, \f), CR (carriage return, U+000D, \r), NEL (next line, U+0085), LS (line separator, U+2028), and PS (paragraph separator, U+2029).
The behaviour of the dot, circumflex, and dollar metacharacters is affected by newline characters. By default, GRegex matches any newline character matched by \R. You can limit the matched newline characters by specifying the G_REGEX_NEWLINE_CR, G_REGEX_NEWLINE_LF, and G_REGEX_NEWLINE_CRLF compile options, and with the G_REGEX_MATCH_NEWLINE_ANY, G_REGEX_MATCH_NEWLINE_CR, G_REGEX_MATCH_NEWLINE_LF and G_REGEX_MATCH_NEWLINE_CRLF match options.
These settings are also relevant when compiling a pattern if G_REGEX_EXTENDED is set and an unescaped # outside a character class is encountered. This indicates a comment that lasts until after the next newline.
Because GRegex does not modify its internal state between creation and destruction, you can safely use the same GRegex instance from different threads. In contrast, GMatchInfo is not thread safe.
The regular expression low-level functionalities are obtained through the excellent PCRE library written by Philip Hazel.
Available since: 2.14
Constructors
g_regex_new
Compiles the regular expression to an internal form, and does the initial setup of the GRegex structure.
since: 2.14
Functions
g_regex_check_replacement
Checks whether replacement is a valid replacement string (see g_regex_replace()), i.e. that all escape sequences in it are valid.
since: 2.14
g_regex_escape_nul
Escapes the nul characters in string to “\x00”. It can be used to compile a regex with embedded nul characters.
since: 2.30
g_regex_escape_string
Escapes the special characters used for regular expressions in string, for instance “a.b*c” becomes “a\.b\*c”. This function is useful to dynamically generate regular expressions.
since: 2.14
g_regex_split_simple
Breaks the string on the pattern, and returns an array of the tokens. If the pattern contains capturing parentheses, then the text for each of the substrings will also be returned. If the pattern does not match anywhere in the string, then the whole string is returned as the first token.
since: 2.14
Instance methods
g_regex_get_has_cr_or_lf
Checks whether the pattern contains explicit CR or LF references.
since: 2.34
g_regex_get_max_backref
Returns the number of the highest back reference in the pattern, or 0 if the pattern does not contain back references.
since: 2.14
g_regex_get_max_lookbehind
Gets the number of characters in the longest lookbehind assertion in the pattern. This information is useful when doing multi-segment matching using the partial matching facilities.
since: 2.38
g_regex_get_pattern
Gets the pattern string associated with regex, i.e. a copy of the string passed to g_regex_new().
since: 2.14
g_regex_match
Scans for a match in string for the pattern in regex.
The match_options are combined with the match options specified when the regex structure was created, letting you have more flexibility in reusing GRegex structures.
since: 2.14
g_regex_match_all
Using the standard algorithm for regular expression matching, only the longest match in the string is retrieved. This function uses a different algorithm so it can retrieve all the possible matches. For more documentation see g_regex_match_all_full().
since: 2.14
g_regex_match_all_full
Using the standard algorithm for regular expression matching, only the longest match in the string is retrieved; it is not possible to obtain all the available matches. For instance, matching "<a> <b> <c>" against the pattern "<.*>" you get "<a> <b> <c>".
since: 2.14
g_regex_match_full
Scans for a match in string for the pattern in regex.
The match_options are combined with the match options specified when the regex structure was created, letting you have more flexibility in reusing GRegex structures.
since: 2.14
g_regex_replace
Replaces all occurrences of the pattern in regex with the replacement text. Backreferences of the form \number or \g<number> in the replacement text are interpolated by the number-th captured subexpression of the match; \g<name> refers to the captured subexpression with the given name. \0 refers to the complete match, but \0 followed by a number is the octal representation of a character. To include a literal \ in the replacement, write \\\\.
since: 2.14
g_regex_replace_eval
Replaces occurrences of the pattern in regex with the output of eval for that occurrence.
since: 2.14
g_regex_replace_literal
Replaces all occurrences of the pattern in regex with the replacement text. replacement is inserted literally; to include backreferences use g_regex_replace().
since: 2.14
g_regex_split
Breaks the string on the pattern, and returns an array of the tokens. If the pattern contains capturing parentheses, then the text for each of the substrings will also be returned. If the pattern does not match anywhere in the string, then the whole string is returned as the first token.
since: 2.14
g_regex_split_full
Breaks the string on the pattern, and returns an array of the tokens. If the pattern contains capturing parentheses, then the text for each of the substrings will also be returned. If the pattern does not match anywhere in the string, then the whole string is returned as the first token.
since: 2.14
g_regex_unref
Decreases the reference count of regex by 1. When the reference count drops to zero, it frees all the memory associated with the regex structure.
since: 2.14