-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------
PCRE2(3) Library Functions Manual PCRE2(3)
NAME
PCRE2 - Perl-compatible regular expressions (revised API)
INTRODUCTION
PCRE2 is the name used for a revised API for the PCRE library, which is
a set of functions, written in C, that implement regular expression
pattern matching using the same syntax and semantics as Perl, with just
a few differences. After nearly two decades, the limitations of the
original API were making development increasingly difficult. The new
API is more extensible, and it was simplified by abolishing the sepa-
rate "study" optimizing function; in PCRE2, patterns are automatically
optimized where possible. Since forking from PCRE1, the code has been
extensively refactored and new features introduced. The old library is
now obsolete and is no longer maintained.
As well as Perl-style regular expression patterns, some features that
appeared in Python and the original PCRE before they appeared in Perl
are available using the Python syntax. There is also support for some
.NET and Oniguruma syntax items, and there are options for requesting
minor changes that give better ECMAScript (JavaScript) compatibility.
The source code for PCRE2 can be compiled to support strings of 8-bit,
16-bit, or 32-bit code units, which means that up to three separate li-
braries may be installed, one for each code unit size. The size of a
code unit is not related to the bit size of the underlying hardware. In
a 64-bit environment that also supports 32-bit applications, versions
of PCRE2 that are compiled in both 64-bit and 32-bit modes may be
needed.
The original work to extend PCRE to 16-bit and 32-bit code units was
done by Zoltan Herczeg and Christian Persch, respectively. In all three
cases, strings can be interpreted either as one character per code
unit, or as UTF-encoded Unicode, with support for Unicode general cate-
gory properties. Unicode support is optional at build time (but is the
default). However, processing strings as UTF code units must be enabled
explicitly at run time. The version of Unicode in use can be discovered
by running
pcre2test -C
The three libraries contain identical sets of functions, with names
ending in _8, _16, or _32, respectively (for example, pcre2_com-
pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
32, a program that uses just one code unit width can be written using
generic names such as pcre2_compile(), and the documentation is written
assuming that this is the case.
In addition to the Perl-compatible matching function, PCRE2 contains an
alternative function that matches the same compiled patterns in a dif-
ferent way. In certain circumstances, the alternative function has some
advantages. For a discussion of the two matching algorithms, see the
pcre2matching page.
Details of exactly which Perl regular expression features are and are
not supported by PCRE2 are given in separate documents. See the
pcre2pattern and pcre2compat pages. There is a syntax summary in the
pcre2syntax page.
Some features of PCRE2 can be included, excluded, or changed when the
library is built. The pcre2_config() function makes it possible for a
client to discover which features are available. The features them-
selves are described in the pcre2build page. Documentation about build-
ing PCRE2 for various operating systems can be found in the README and
NON-AUTOTOOLS-BUILD files in the source distribution.
The libraries contain a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre2", which hopefully will not provoke
any name clashes. In some environments, it is possible to control which
external symbols are exported when a shared library is built, and in
these cases the undocumented symbols are not exported.
SECURITY CONSIDERATIONS
If you are using PCRE2 in a non-UTF application that permits users to
supply arbitrary patterns for compilation, you should be aware of a
feature that allows users to turn on UTF support from within a pattern.
For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
mode, which interprets patterns and subjects as strings of UTF-8 code
units instead of individual 8-bit characters. This causes both the pat-
tern and any data against which it is matched to be checked for UTF-8
validity. If the data string is very long, such a check might consume
enough resources to degrade your application's performance.
One way of guarding against this possibility is to use the pcre2_pat-
tern_info() function to check the compiled pattern's options for
PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
calling pcre2_compile(). This causes a compile time error if the pat-
tern contains a UTF-setting sequence.
The use of Unicode properties for character types such as \d can also
be enabled from within the pattern, by specifying "(*UCP)". This fea-
ture can be disallowed by setting the PCRE2_NEVER_UCP option.
If your application is one that supports UTF, be aware that validity
checking can take time. If the same data string is to be matched many
times, you can use the PCRE2_NO_UTF_CHECK option for the second and
subsequent matches to avoid running redundant checks.
The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
to problems, because it may leave the current matching point in the
middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C op-
tion can be used by an application to lock out the use of \C, causing a
compile-time error if it is encountered. It is also possible to build
PCRE2 with the use of \C permanently disabled.
Another way to degrade performance is to run a pattern that has a very
large search tree against a string that will never match.
Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
vides some protection against this: see the pcre2_set_match_limit()
function in the pcre2api page. There is a similar function called
pcre2_set_depth_limit() that can be used to restrict the amount of mem-
ory that is used.
USER DOCUMENTATION
The user documentation for PCRE2 comprises a number of different sec-
tions. In the "man" format, each of these is a separate "man page". In
the HTML format, each is a separate page, linked from the index page.
In the plain text format, the descriptions of the pcre2grep and
pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
respectively. The remaining sections, except for the pcre2demo section
(which is a program listing), and the short pages for individual func-
tions, are concatenated in pcre2.txt, for ease of searching. The sec-
tions are as follows:
  pcre2              this document
  pcre2-config       show PCRE2 installation configuration information
  pcre2api           details of PCRE2's native C API
  pcre2build         building PCRE2
  pcre2callout       details of the pattern callout feature
  pcre2compat        discussion of Perl compatibility
  pcre2convert       details of pattern conversion functions
  pcre2demo          a demonstration C program that uses PCRE2
  pcre2grep          description of the pcre2grep command (8-bit only)
  pcre2jit           discussion of just-in-time optimization support
  pcre2limits        details of size and other limits
  pcre2matching      discussion of the two matching algorithms
  pcre2partial       details of the partial matching facility
  pcre2pattern       syntax and semantics of supported regular
                       expression patterns
  pcre2perform       discussion of performance issues
  pcre2posix         the POSIX-compatible C API for the 8-bit library
  pcre2sample        discussion of the pcre2demo program
  pcre2serialize     details of pattern serialization
  pcre2syntax        quick syntax reference
  pcre2test          description of the pcre2test command
  pcre2unicode       discussion of Unicode and UTF support
In the "man" and HTML formats, there is also a short page for each C
library function, listing its arguments and results.
AUTHORS
The current maintainers of PCRE2 are Nicholas Wilson and Zoltan Her-
czeg.
PCRE2 was written by Philip Hazel, of the University Computing Service,
Cambridge, England. Many others have also contributed.
To contact the maintainers, please use the GitHub issues tracker or
PCRE2 mailing list, as described at the project page:
https://github.com/PCRE2Project/pcre2
REVISION
Last updated: 18 December 2024
Copyright (c) 1997-2021 University of Cambridge.
PCRE2 10.46-DEV 18 December 2024 PCRE2(3)
------------------------------------------------------------------------------
PCRE2API(3) Library Functions Manual PCRE2API(3)
NAME
PCRE2 - Perl-compatible regular expressions (revised API)
#include <pcre2.h>
PCRE2 is a new API for PCRE, starting at release 10.0. This document
contains a description of all its native functions. See the pcre2 docu-
ment for an overview of all the PCRE2 documentation.
PCRE2 NATIVE API BASIC FUNCTIONS
pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
pcre2_compile_context *ccontext);
void pcre2_code_free(pcre2_code *code);
pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
pcre2_general_context *gcontext);
pcre2_match_data *pcre2_match_data_create_from_pattern(
const pcre2_code *code, pcre2_general_context *gcontext);
int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
PCRE2_SIZE length, PCRE2_SIZE startoffset,
uint32_t options, pcre2_match_data *match_data,
pcre2_match_context *mcontext);
int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
PCRE2_SIZE length, PCRE2_SIZE startoffset,
uint32_t options, pcre2_match_data *match_data,
pcre2_match_context *mcontext,
int *workspace, PCRE2_SIZE wscount);
void pcre2_match_data_free(pcre2_match_data *match_data);
PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data);
PCRE2_SIZE pcre2_get_match_data_heapframes_size(
pcre2_match_data *match_data);
uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
pcre2_general_context *pcre2_general_context_create(
void *(*private_malloc)(PCRE2_SIZE, void *),
void (*private_free)(void *, void *), void *memory_data);
pcre2_general_context *pcre2_general_context_copy(
pcre2_general_context *gcontext);
void pcre2_general_context_free(pcre2_general_context *gcontext);
PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
pcre2_compile_context *pcre2_compile_context_create(
pcre2_general_context *gcontext);
pcre2_compile_context *pcre2_compile_context_copy(
pcre2_compile_context *ccontext);
void pcre2_compile_context_free(pcre2_compile_context *ccontext);
int pcre2_set_bsr(pcre2_compile_context *ccontext,
uint32_t value);
int pcre2_set_character_tables(pcre2_compile_context *ccontext,
const uint8_t *tables);
int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
uint32_t extra_options);
int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
PCRE2_SIZE value);
int pcre2_set_max_pattern_compiled_length(
pcre2_compile_context *ccontext, PCRE2_SIZE value);
int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
uint32_t value);
int pcre2_set_newline(pcre2_compile_context *ccontext,
uint32_t value);
int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
uint32_t value);
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
int (*guard_function)(uint32_t, void *), void *user_data);
int pcre2_set_optimize(pcre2_compile_context *ccontext,
uint32_t directive);
PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
pcre2_match_context *pcre2_match_context_create(
pcre2_general_context *gcontext);
pcre2_match_context *pcre2_match_context_copy(
pcre2_match_context *mcontext);
void pcre2_match_context_free(pcre2_match_context *mcontext);
int pcre2_set_callout(pcre2_match_context *mcontext,
int (*callout_function)(pcre2_callout_block *, void *),
void *callout_data);
int pcre2_set_substitute_callout(pcre2_match_context *mcontext,
int (*callout_function)(pcre2_substitute_callout_block *, void *),
void *callout_data);
int pcre2_set_substitute_case_callout(pcre2_match_context *mcontext,
PCRE2_SIZE (*callout_function)(PCRE2_SPTR, PCRE2_SIZE,
PCRE2_UCHAR *, PCRE2_SIZE,
int, void *),
void *callout_data);
int pcre2_set_offset_limit(pcre2_match_context *mcontext,
PCRE2_SIZE value);
int pcre2_set_heap_limit(pcre2_match_context *mcontext,
uint32_t value);
int pcre2_set_match_limit(pcre2_match_context *mcontext,
uint32_t value);
int pcre2_set_depth_limit(pcre2_match_context *mcontext,
uint32_t value);
PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
int pcre2_substring_copy_byname(pcre2_match_data *match_data,
PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
uint32_t number, PCRE2_UCHAR *buffer,
PCRE2_SIZE *bufflen);
void pcre2_substring_free(PCRE2_UCHAR *buffer);
int pcre2_substring_get_byname(pcre2_match_data *match_data,
PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
uint32_t number, PCRE2_UCHAR **bufferptr,
PCRE2_SIZE *bufflen);
int pcre2_substring_length_byname(pcre2_match_data *match_data,
PCRE2_SPTR name, PCRE2_SIZE *length);
int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
uint32_t number, PCRE2_SIZE *length);
int pcre2_substring_nametable_scan(const pcre2_code *code,
PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
int pcre2_substring_number_from_name(const pcre2_code *code,
PCRE2_SPTR name);
void pcre2_substring_list_free(PCRE2_UCHAR **list);
int pcre2_substring_list_get(pcre2_match_data *match_data,
PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
PCRE2_SIZE length, PCRE2_SIZE startoffset,
uint32_t options, pcre2_match_data *match_data,
pcre2_match_context *mcontext, PCRE2_SPTR replacement,
PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
PCRE2_SIZE *outlengthptr);
PCRE2 NATIVE API JIT FUNCTIONS
int pcre2_jit_compile(pcre2_code *code, uint32_t options);
int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
PCRE2_SIZE length, PCRE2_SIZE startoffset,
uint32_t options, pcre2_match_data *match_data,
pcre2_match_context *mcontext);
void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize,
size_t maxsize, pcre2_general_context *gcontext);
void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
pcre2_jit_callback callback_function, void *callback_data);
void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
PCRE2 NATIVE API SERIALIZATION FUNCTIONS
int32_t pcre2_serialize_decode(pcre2_code **codes,
int32_t number_of_codes, const uint8_t *bytes,
pcre2_general_context *gcontext);
int32_t pcre2_serialize_encode(const pcre2_code **codes,
int32_t number_of_codes, uint8_t **serialized_bytes,
PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
void pcre2_serialize_free(uint8_t *bytes);
int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
PCRE2 NATIVE API AUXILIARY FUNCTIONS
pcre2_code *pcre2_code_copy(const pcre2_code *code);
pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
PCRE2_SIZE bufflen);
const uint8_t *pcre2_maketables(pcre2_general_context *gcontext);
void pcre2_maketables_free(pcre2_general_context *gcontext,
const uint8_t *tables);
int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
void *where);
int pcre2_callout_enumerate(const pcre2_code *code,
int (*callback)(pcre2_callout_enumerate_block *, void *),
void *user_data);
int pcre2_config(uint32_t what, void *where);
PCRE2 NATIVE API OBSOLETE FUNCTIONS
int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
uint32_t value);
int pcre2_set_recursion_memory_management(
pcre2_match_context *mcontext,
void *(*private_malloc)(size_t, void *),
void (*private_free)(void *, void *), void *memory_data);
These functions became obsolete at release 10.30 and are retained only
for backward compatibility. They should not be used in new code. The
first is replaced by pcre2_set_depth_limit(); the second is no longer
needed and has no effect (it always returns zero).
PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS
pcre2_convert_context *pcre2_convert_context_create(
pcre2_general_context *gcontext);
pcre2_convert_context *pcre2_convert_context_copy(
pcre2_convert_context *cvcontext);
void pcre2_convert_context_free(pcre2_convert_context *cvcontext);
int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
uint32_t escape_char);
int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
uint32_t separator_char);
int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
uint32_t options, PCRE2_UCHAR **buffer,
PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);
void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);
These functions provide a way of converting non-PCRE2 patterns into
patterns that can be processed by pcre2_compile(). This facility is ex-
perimental and may be changed in future releases. At present, "globs"
and POSIX basic and extended patterns can be converted. Details are
given in the pcre2convert documentation.
PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
code units, respectively. However, there is just one header file,
pcre2.h. This contains the function prototypes and other definitions
for all three libraries. One, two, or all three can be installed simul-
taneously. On Unix-like systems the libraries are called libpcre2-8,
libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
inal PCRE libraries. Every PCRE2 function comes in three different
forms, one for each library, for example:
pcre2_compile_8()
pcre2_compile_16()
pcre2_compile_32()
There are also three different sets of data types:
PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32
The UCHAR types define unsigned code units of the appropriate widths.
For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
types are pointers to constants of the equivalent UCHAR types, that is,
they are pointers to vectors of unsigned code units.
Character strings are passed to a PCRE2 library as sequences of un-
signed integers in code units of the appropriate width. The length of a
string may be given as a number of code units, or the string may be
specified as zero-terminated.
Many applications use only one code unit width. For their convenience,
macros are defined whose names are the generic forms such as pcre2_com-
pile() and PCRE2_SPTR. These macros use the value of the macro
PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
An application must define it to be 8, 16, or 32 before including
pcre2.h in order to make use of the generic names.
Applications that use more than one code unit width can be linked with
more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
be 0 before including pcre2.h, and then use the real function names.
Any code that is to be included in an environment where the value of
PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
names. (Unfortunately, it is not possible in C code to save and restore
the value of a macro.)
If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
compiler error occurs.
When using multiple libraries in an application, you must take care
when processing any particular pattern to use only functions from a
single library. For example, if you want to run a match using a pat-
tern that was compiled with pcre2_compile_16(), you must do so with
pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().
In the function summaries above, and in the rest of this document and
other PCRE2 documents, functions and data types are described using
their generic names, without the _8, _16, or _32 suffix.
PCRE2 API OVERVIEW
PCRE2 has its own native API, which is described in this document.
There are also some wrapper functions for the 8-bit library that corre-
spond to the POSIX regular expression API, but they do not give access
to all the functionality of PCRE2 and they are not thread-safe. They
are described in the pcre2posix documentation. Both these APIs define a
set of C function calls.
The native API C data types, function prototypes, option values, and
error codes are defined in the header file pcre2.h, which also contains
definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
numbers for the library. Applications can use these to include support
for different releases of PCRE2.
In a Windows environment, if you want to statically link an application
program against a non-dll PCRE2 library, you must define PCRE2_STATIC
before including pcre2.h.
The functions pcre2_compile() and pcre2_match() are used for compiling
and matching regular expressions in a Perl-compatible manner. A sample
program that demonstrates the simplest way of using them is provided in
the file called pcre2demo.c in the PCRE2 source distribution. A listing
of this program is given in the pcre2demo documentation, and the
pcre2sample documentation describes how to compile and run it.
The compiling and matching functions recognize various options that are
passed as bits in an options argument. There are also some more compli-
cated parameters such as custom memory management functions and re-
source limits that are passed in "contexts" (which are just memory
blocks, described below). Simple applications do not need to make use
of contexts.
Just-in-time (JIT) compiler support is an optional feature of PCRE2
that can be built in appropriate hardware environments. It greatly
speeds up the matching performance of many patterns. Programs can re-
quest that it be used if available by calling pcre2_jit_compile() after
a pattern has been successfully compiled by pcre2_compile(). This does
nothing if JIT support is not available.
More complicated programs might need to make use of the specialist
functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
pcre2_jit_stack_assign() in order to control the JIT code's memory us-
age.
JIT matching is automatically used by pcre2_match() if it is available,
unless the PCRE2_NO_JIT option is set. There is also a direct interface
for JIT matching, which gives improved performance at the expense of
less sanity checking. The JIT-specific functions are discussed in the
pcre2jit documentation.
A second matching function, pcre2_dfa_match(), which is not Perl-com-
patible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a
given point in the subject), and scans the subject just once (unless
there are lookaround assertions). However, this algorithm does not re-
turn captured substrings. A description of the two matching algorithms
and their advantages and disadvantages is given in the pcre2matching
documentation. There is no JIT support for pcre2_dfa_match().
In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that has been matched by pcre2_match(). They are:
pcre2_substring_copy_byname()
pcre2_substring_copy_bynumber()
pcre2_substring_get_byname()
pcre2_substring_get_bynumber()
pcre2_substring_list_get()
pcre2_substring_length_byname()
pcre2_substring_length_bynumber()
pcre2_substring_nametable_scan()
pcre2_substring_number_from_name()
pcre2_substring_free() and pcre2_substring_list_free() are also pro-
vided, to free memory used for extracted strings. If either of these
functions is called with a NULL argument, the function returns immedi-
ately without doing anything.
The function pcre2_substitute() can be called to match a pattern and
return a copy of the subject string with substitutions for parts that
were matched.
Functions whose names begin with pcre2_serialize_ are used for saving
compiled patterns on disc or elsewhere, and reloading them later.
Finally, there are functions for finding out information about a com-
piled pattern (pcre2_pattern_info()) and about the configuration with
which PCRE2 was built (pcre2_config()).
Functions with names ending with _free() are used for freeing memory
blocks of various sorts. In all cases, if one of these functions is
called with a NULL argument, it does nothing.
STRING LENGTHS AND OFFSETS
The PCRE2 API uses string lengths and offsets into strings of code
units in several places. These values are always of type PCRE2_SIZE,
which is an unsigned integer type, currently always defined as size_t.
The largest value that can be stored in such a type (that is
~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
strings and unset offsets. Therefore, the longest string that can be
handled is one less than this maximum. Note that string lengths are al-
ways given in code units. Only in the 8-bit library is such a length
the same as the number of bytes in the string.
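To make the code-unit arithmetic concrete, the sketch below does not use
pcre2.h and models PCRE2_SIZE as size_t, as the text above says it is
currently defined. The names SIZE_SENTINEL and length_in_bytes() are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for ~(PCRE2_SIZE)0, assuming PCRE2_SIZE is size_t. This is the
   value reserved for zero-terminated strings and unset offsets. */
#define SIZE_SENTINEL (~(size_t)0)

/* Converts a string length in code units to a length in bytes for a
   library whose code units are width_bits wide (8, 16, or 32). Overflow
   is not checked, for brevity. */
size_t length_in_bytes(size_t code_units, unsigned width_bits)
{
    assert(width_bits == 8 || width_bits == 16 || width_bits == 32);
    assert(code_units != SIZE_SENTINEL);  /* reserved, not a real length */
    return code_units * (width_bits / 8);
}
```

For a string of five code units this gives 5 bytes in the 8-bit
library, 10 in the 16-bit library, and 20 in the 32-bit library.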
NEWLINES
PCRE2 supports five different conventions for indicating line breaks in
strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
ceding, or any Unicode newline sequence. The Unicode newline sequences
are the three just mentioned, plus the single characters VT (vertical
tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
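The single-character members of the Unicode newline set listed above can
be written as a small predicate. This is only an illustration of the
list, not PCRE2's internal code, and the function name
is_unicode_newline_char() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if code point c is one of the single-character Unicode
   newline sequences, else 0. The two-character sequence CRLF has to be
   recognized by the caller, because it spans two code points. */
int is_unicode_newline_char(uint32_t c)
{
    assert(c <= 0x10FFFF);  /* largest valid Unicode code point */
    switch (c) {
    case 0x000A:  /* LF,  linefeed */
    case 0x000B:  /* VT,  vertical tab */
    case 0x000C:  /* FF,  form feed */
    case 0x000D:  /* CR,  carriage return */
    case 0x0085:  /* NEL, next line */
    case 0x2028:  /* LS,  line separator */
    case 0x2029:  /* PS,  paragraph separator */
        return 1;
    default:
        return 0;
    }
}
```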
Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE2 is built, a default
can be specified. If it is not, the default is set to LF, which is the
Unix standard. However, the newline convention can be changed by an ap-
plication when calling pcre2_compile(), or it can be specified by spe-
cial text at the start of the pattern itself; this overrides any other
settings. See the pcre2pattern page for details of the special charac-
ter sequences.
In the PCRE2 documentation the word "newline" is used to mean "the
character or pair of characters that indicate a line break". The choice
of newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre2_match() options below.
The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches; this
has its own separate convention.
MULTITHREADING
In a multithreaded application it is important to keep thread-specific
data separate from data that can be shared between threads. The PCRE2
library code itself is thread-safe: it contains no static or global
variables. The API is designed to be fairly simple for non-threaded ap-
plications while at the same time ensuring that multithreaded applica-
tions can use it.
There are several different blocks of data that are used to pass infor-
mation between the application and the PCRE2 libraries.
The compiled pattern
A pointer to the compiled form of a pattern is returned to the user
when pcre2_compile() is successful. The data in the compiled pattern is
fixed, and does not change when the pattern is matched. Therefore, it
is thread-safe, that is, the same compiled pattern can be used by more
than one thread simultaneously. For example, an application can compile
all its patterns at the start, before creating multiple threads that
use them. However, if the just-in-time (JIT) optimization feature is
being used, it needs separate memory stack areas for each thread. See
the pcre2jit documentation for more details.
In a more complicated situation, where patterns are compiled only when
they are first needed, but are still shared between threads, pointers
to compiled patterns must be protected from simultaneous writing by
multiple threads. This is somewhat tricky to do correctly. If you know
that writing to a pointer is atomic in your environment, you can use
logic like this:
Get a read-only (shared) lock (mutex) for pointer
if (pointer == NULL)
  {
  Get a write (unique) lock for pointer
  if (pointer == NULL) pointer = pcre2_compile(...
  }
Release the lock
Use pointer in pcre2_match()
Of course, testing for compilation errors should also be included in
the code.
The reason for checking the pointer a second time is as follows: Sev-
eral threads may have acquired the shared lock and tested the pointer
for being NULL, but only one of them will be given the write lock, with
the rest kept waiting. The winning thread will compile the pattern and
store the result. After this thread releases the write lock, another
thread will get it, and if it does not retest pointer for being NULL,
will recompile the pattern and overwrite the pointer, creating a memory
leak and possibly causing other issues.
In an environment where writing to a pointer may not be atomic, the
above logic is not sufficient. The thread that is doing the compiling
may be descheduled after writing only part of the pointer, which could
cause other threads to use an invalid value. Instead of checking the
pointer itself, a separate "pointer is valid" flag (that can be updated
atomically) must be used:
Get a read-only (shared) lock (mutex) for pointer
if (!pointer_is_valid)
  {
  Get a write (unique) lock for pointer
  if (!pointer_is_valid)
    {
    pointer = pcre2_compile(...
    pointer_is_valid = TRUE
    }
  }
Release the lock
Use pointer in pcre2_match()
If JIT is being used, but the JIT compilation is not being done immedi-
ately (perhaps waiting to see if the pattern is used often enough),
similar logic is required. JIT compilation updates a value within the
compiled code block, so a thread must gain unique write access to the
pointer before calling pcre2_jit_compile(). Alternatively,
pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to ob-
tain a private copy of the compiled code before calling the JIT com-
piler.
Context blocks
The next main section below introduces the idea of "contexts" in which
PCRE2 functions are called. A context is nothing more than a collection
of parameters that control the way PCRE2 operates. Grouping a number of
parameters together in a context is a convenient way of passing them to
a PCRE2 function without using lots of arguments. The parameters that
are stored in contexts are in some sense "advanced features" of the
API. Many straightforward applications will not need to use contexts.
In a multithreaded application, if the parameters in a context are val-
ues that are never changed, the same context can be used by all the
threads. However, if any thread needs to change any value in a context,
it must make its own thread-specific copy.
Match blocks
The matching functions need a block of memory for storing the results
of a match. This includes details of what was matched, as well as addi-
tional information such as the name of a (*MARK) setting. Each thread
must provide its own copy of this memory.
PCRE2 CONTEXTS
Some PCRE2 functions have a lot of parameters, many of which are used
only by specialist applications, for example, those that use custom
memory management or non-standard character tables. To keep function
argument lists at a reasonable size, and at the same time to keep the
API extensible, "uncommon" parameters are passed to certain functions
in a context instead of directly. A context is just a block of memory
that holds the parameter values. Applications that do not need to ad-
just any of the context parameters can pass NULL when a context pointer
is required.
There are three different types of context: a general context that is
relevant for several PCRE2 operations, a compile-time context, and a
match-time context.
The general context
At present, this context just contains pointers to (and data for) ex-
ternal memory management functions that are called from several places
in the PCRE2 library. The context is named "general" rather than
specifically "memory" because in future other fields may be added. If
you do not want to supply your own custom memory management functions,
you do not need to bother with a general context. A general context is
created by:
pcre2_general_context *pcre2_general_context_create(
void *(*private_malloc)(PCRE2_SIZE, void *),
void (*private_free)(void *, void *), void *memory_data);
The two function pointers specify custom memory management functions,
whose prototypes are:
void *private_malloc(PCRE2_SIZE, void *);
void private_free(void *, void *);
Whenever code in PCRE2 calls these functions, the final argument is the
value of memory_data. Either of the first two arguments of the creation
function may be NULL, in which case the system memory management func-
tions malloc() and free() are used. (This is not currently useful, as
there are no other fields in a general context, but in future there
might be.) The private_malloc() function is used (if supplied) to ob-
tain memory for storing the context, and all three values are saved as
part of the context.
Whenever PCRE2 creates a data block of any kind, the block contains a
pointer to the free() function that matches the malloc() function that
was used. When the time comes to free the block, this function is
called.
A general context can be copied by calling:
pcre2_general_context *pcre2_general_context_copy(
pcre2_general_context *gcontext);
The memory used for a general context should be freed by calling:
void pcre2_general_context_free(pcre2_general_context *gcontext);
If this function is passed a NULL argument, it returns immediately
without doing anything.
The compile context
A compile context is required if you want to provide an external func-
tion for stack checking during compilation or to change the default
values of any of the following compile-time parameters:
What \R matches (Unicode newlines or CR, LF, CRLF only)
PCRE2's character tables
The newline character sequence
The compile time nested parentheses limit
The maximum length of the pattern string
The extra options bits (none set by default)
Which performance optimizations the compiler should apply
A compile context is also required if you are using custom memory man-
agement. If none of these apply, just pass NULL as the context argu-
ment of pcre2_compile().
A compile context is created, copied, and freed by the following func-
tions:
pcre2_compile_context *pcre2_compile_context_create(
pcre2_general_context *gcontext);
pcre2_compile_context *pcre2_compile_context_copy(
pcre2_compile_context *ccontext);
void pcre2_compile_context_free(pcre2_compile_context *ccontext);
A compile context is created with default values for its parameters.
These can be changed by calling the following functions, which return 0
on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
int pcre2_set_bsr(pcre2_compile_context *ccontext,
uint32_t value);
The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
Unicode line ending sequence. The value is used by the JIT compiler and
by the two interpreted matching functions, pcre2_match() and
pcre2_dfa_match().
int pcre2_set_character_tables(pcre2_compile_context *ccontext,
const uint8_t *tables);
The value must be the result of a call to pcre2_maketables(), whose
only argument is a general context. This function builds a set of char-
acter tables in the current locale.
int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
uint32_t extra_options);
As PCRE2 has developed, almost all the 32 option bits that are avail-
able in the options argument of pcre2_compile() have been used up. To
avoid running out, the compile context contains a set of extra option
bits which are used for some newer, assumed rarer, options. This func-
tion sets those bits. It always sets all the bits (either on or off),
replacing the existing setting rather than modifying it. The available
options are defined in the section entitled "Extra compile options"
below.
int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
PCRE2_SIZE value);
This sets a maximum length, in code units, for any pattern string that
is compiled with this context. If the pattern is longer, an error is
generated. This facility is provided so that applications that accept
patterns from external sources can limit their size. The default is the
largest number that a PCRE2_SIZE variable can hold, which is effec-
tively unlimited.
int pcre2_set_max_pattern_compiled_length(
pcre2_compile_context *ccontext, PCRE2_SIZE value);
This sets a maximum size, in bytes, for the memory needed to hold the
compiled version of a pattern that is compiled with this context. If
the pattern needs more memory, an error is generated. This facility is
provided so that applications that accept patterns from external
sources can limit the amount of memory they use. The default is the
largest number that a PCRE2_SIZE variable can hold, which is effec-
tively unlimited.
int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext,
uint32_t value);
This sets the maximum number of characters that can be matched by a
variable-length lookbehind assertion. The default is set when PCRE2 is
built, with the ultimate default being 255, the same as Perl. Lookbe-
hind assertions without a bounding length are not supported.
int pcre2_set_newline(pcre2_compile_context *ccontext,
uint32_t value);
This specifies which characters or character sequences are to be recog-
nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
A pattern can override the value set in the compile context by starting
with a sequence such as (*CRLF). See the pcre2pattern page for details.
When a pattern is compiled with the PCRE2_EXTENDED or PCRE2_EX-
TENDED_MORE option, the newline convention affects the recognition of
the end of internal comments starting with #. The value is saved with
the compiled pattern for subsequent use by the JIT compiler and by the
two interpreted matching functions, pcre2_match() and
pcre2_dfa_match().
int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
uint32_t value);
This parameter adjusts the limit, set when PCRE2 is built (default
250), on the depth of parenthesis nesting in a pattern. This limit
stops rogue patterns using up too much system stack when being com-
piled. The limit applies to parentheses of all kinds, not just captur-
ing parentheses.
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
int (*guard_function)(uint32_t, void *), void *user_data);
There is at least one application that runs PCRE2 in threads with very
limited system stack, where running out of stack is to be avoided at
all costs. The parenthesis limit above cannot take account of how much
stack is actually available during compilation. For finer control,
you can supply a function that is called whenever pcre2_compile()
starts to compile a parenthesized part of a pattern. This function can
check the actual stack size (or anything else that it wants to, of
course).
The first argument to the guard function gives the current depth of
nesting, and the second is user data that is set up by the last argu-
ment of pcre2_set_compile_recursion_guard(). The guard function should
return zero if all is well, or non-zero to force an error.
int pcre2_set_optimize(pcre2_compile_context *ccontext,
uint32_t directive);
PCRE2 can apply various performance optimizations during compilation,
in order to make matching faster. For example, the compiler might con-
vert some regex constructs into an equivalent construct which
pcre2_match() can execute faster. By default, all available optimiza-