## -*- Mode: CPerl -*-
## File: DTA::CAB.pm
## Author: Bryan Jurish <[email protected]>
## Description: robust morphological analysis: top-level
package DTA::CAB;
use DTA::CAB::Version;
use DTA::CAB::Common;
#use DTA::CAB::Analyzer; ##-- DEBUG
#use DTA::CAB::Analyzer::Common; ##-- DEBUG
#use DTA::CAB::Analyzer::Extra; ##-- DEBUG
eval "use DTA::CAB::Analyzer::Common";
#eval "use DTA::CAB::Server::HTTP";
#eval "use DTA::CAB::Client::HTTP";
#eval "use DTA::CAB::Server::XmlRpc";
#eval "use DTA::CAB::Client::XmlRpc";
use strict;
##==============================================================================
## Constants
##==============================================================================
our @ISA = qw(DTA::CAB::Logger); ##-- for compatibility
##==============================================================================
## Version Information
## \%moduleVersions => DTA::CAB->moduleVersions(%opts)
## + checks all loaded modules in %::INC for $VERSION
## + known %opts:
## (
## moduleMatch => $regex, ##-- only report modules matching $regex
## moduleIgnore => $regex, ##-- ignore modules matching $regex
## )
sub moduleVersions {
  no strict 'refs';
  my $that = UNIVERSAL::isa($_[0],__PACKAGE__) ? shift : __PACKAGE__;
  my %opts = @_;
  my $re_match  = $opts{moduleMatch};
  my $re_ignore = $opts{moduleIgnore};
  $re_match  = qr{$re_match}  if (defined($re_match)  && !ref($re_match));
  $re_ignore = qr{$re_ignore} if (defined($re_ignore) && !ref($re_ignore));
  my ($inc,$pkg,$ver,%versions);
  foreach $inc (sort keys %::INC) {
    next if ($inc !~ m/\.pm$/i);
    $pkg = $inc;
    $pkg =~ s{/}{::}g;
    $pkg =~ s{\.pm$}{}i;
    next if (($re_match && $pkg !~ m{$re_match}) || ($re_ignore && $pkg =~ m{$re_ignore}));
    next if ( !($ver = ${"${pkg}::VERSION"}) );
    $versions{$pkg} = "$ver";
  }
  return \%versions;
}
1; ##-- be happy
__END__
##==============================================================================
## PODS
##==============================================================================
=pod
=head1 NAME
DTA::CAB - "Cascaded Analysis Broker" for robust linguistic analysis
=head1 SYNOPSIS
 use DTA::CAB;
=cut
##==============================================================================
## Description
##==============================================================================
=pod
=head1 DESCRIPTION
The DTA::CAB suite provides an object-oriented API for
error-tolerant linguistic analysis of tokenized text.
The DTA::CAB package itself just loads the common API
from
L<DTA::CAB::Common|DTA::CAB::Common> and attempts
to load the common analysis modules from
L<DTA::CAB::Analyzer::Common|DTA::CAB::Analyzer::Common>
if present.
Earlier versions of the DTA::CAB suite used the DTA::CAB
package to represent a default analyzer class. The corresponding
class now lives in L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA>.
=cut
##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB: Constants
=pod
=head2 Package Constants
=over 4
=item $VERSION
Module version, imported from L<DTA::CAB::Version|DTA::CAB::Version>.
=item $SVNVERSION
SVN version from which this module was built, imported from L<DTA::CAB::Version|DTA::CAB::Version>.
=back
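Version strings of other loaded modules can be collected by scanning C<%INC>
for packages which define C<$VERSION>. The following is a minimal standalone
sketch of that scan, written so that it runs without DTA::CAB installed.
The function name C<loaded_module_versions> is hypothetical; only the option
names C<moduleMatch> and C<moduleIgnore> are taken from the C<moduleVersions()>
method in this file.

```perl
use strict;
use warnings;
use List::Util ();   # core module, loaded only so that there is something to report

# Hypothetical standalone helper: report $VERSION for every module in %INC.
sub loaded_module_versions {
    my %opts = @_;
    my ($match, $ignore) = @opts{qw(moduleMatch moduleIgnore)};
    my %versions;
    foreach my $inc (sort keys %INC) {
        next unless $inc =~ m/\.pm$/i;          # skip non-module entries
        (my $pkg = $inc) =~ s{/}{::}g;          # Foo/Bar.pm -> Foo::Bar.pm
        $pkg =~ s{\.pm$}{}i;                    # Foo::Bar.pm -> Foo::Bar
        next if defined($match)  && $pkg !~ $match;
        next if defined($ignore) && $pkg =~ $ignore;
        my $ver = do { no strict 'refs'; ${"${pkg}::VERSION"} };
        next unless $ver;                       # module defines no $VERSION
        $versions{$pkg} = "$ver";
    }
    return \%versions;
}

my $versions = loaded_module_versions(moduleMatch => qr/^List::/);
print "$_ = $versions->{$_}\n" foreach sort keys %$versions;
```

With DTA::CAB loaded, the analogous call would presumably be
C<< DTA::CAB->moduleVersions(moduleMatch => qr/^DTA::CAB/) >>, which returns a
HASH-ref of the same shape.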
=cut
##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB: Data Model
=pod
=head2 Data Model
DTA::CAB is designed for processing natural language data which are represented
internally by objects descended from the class L<DTA::CAB::Datum|DTA::CAB::Datum>.
Currently, the DTA::CAB data model explicitly supports the following
datum classes:
=over 4
=item L<DTA::CAB::Token|DTA::CAB::Token>
Represents a single word token as a HASH-ref with at least
a 'text' key, whose value should be a string representing the literal word text.
Additional keys may be defined by L<IO formats|/"I/O Formats">
and/or L<analyzers|/"Processing Model">.
=item L<DTA::CAB::Sentence|DTA::CAB::Sentence>
Represents a single sentence as a HASH-ref with at least
a 'tokens' key, whose value should be an ARRAY-ref of
L<DTA::CAB::Token|DTA::CAB::Token> structures.
Additional keys may be defined by L<IO formats|/"I/O Formats">
and/or L<analyzers|/"Processing Model">.
=item L<DTA::CAB::Document|DTA::CAB::Document>
Represents a text document as a HASH-ref with at least
a 'body' key, whose value should be an ARRAY-ref of
L<DTA::CAB::Sentence|DTA::CAB::Sentence> structures.
Additional keys may be defined by L<IO formats|/"I/O Formats">
and/or L<analyzers|/"Processing Model">.
=back
See the subclass documentation for details.
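The nesting described above can be illustrated with plain Perl data
structures. This is only a sketch: it uses unblessed HASH-refs so that it
runs without DTA::CAB installed, whereas real data would be objects of the
classes listed above. The example words are arbitrary.

```perl
use strict;
use warnings;

# Assumption: plain hashes stand in for DTA::CAB::Document, ::Sentence and ::Token.
my @words = qw(Dies ist ein Test);

my $doc = {
    body => [                                          # document: ARRAY-ref of sentences
        { tokens => [ map { +{ text => $_ } } @words ] },  # sentence: ARRAY-ref of tokens
    ],
};

# Walk the structure: document -> sentences -> tokens -> text
foreach my $sentence (@{ $doc->{body} }) {
    print join(' ', map { $_->{text} } @{ $sentence->{tokens} }), "\n";
}
```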
=cut
##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB: I/O Formats
=pod
=head2 I/O Formats
DTA::CAB supports a number of different I/O formats for
L<document data|/"Data Model">,
including
L<"CSV"|DTA::CAB::Format::CSV>,
L<"JSON"|DTA::CAB::Format::JSON>,
L<"Raw"|DTA::CAB::Format::Raw>,
L<"Text"|DTA::CAB::Format::Text>,
L<"TT"|DTA::CAB::Format::TT>,
L<"YAML"|DTA::CAB::Format::YAML>,
and
L<"XML"|DTA::CAB::Format::XmlNative>.
See L<DTA::CAB::Format> for details on the I/O format API,
and see L<DTA::CAB::Format/SUBCLASSES> for a list of currently
implemented format subclasses.
The command-line utility
L<dta-cab-convert.perl(1)|dta-cab-convert.perl>
is provided for converting between supported I/O formats.
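Because documents are nested HASH-refs and ARRAY-refs, they map naturally onto
formats such as JSON. The sketch below round-trips a plain-hash document
through the core module JSON::PP. It does not use
L<DTA::CAB::Format::JSON|DTA::CAB::Format::JSON>, whose actual output layout
and API may differ; it only shows the general idea.

```perl
use strict;
use warnings;
use JSON::PP ();

# Assumption: an unblessed stand-in for a DTA::CAB::Document.
my $doc = {
    body => [
        { tokens => [ { text => 'Hello' }, { text => 'world' } ] },
    ],
};

my $codec = JSON::PP->new->canonical(1);   # canonical: sort hash keys for stable output
my $json  = $codec->encode($doc);          # serialize the document
my $copy  = $codec->decode($json);         # parse it back into Perl structures

print $json, "\n";
print scalar(@{ $copy->{body}[0]{tokens} }), " tokens after round-trip\n";
```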
=cut
##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB: Processing Model
=pod
=head2 Processing Model
Input documents are processed by one or more
L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> objects,
each of which may insert, modify, and/or remove
arbitrary properties of the
analyzed L<data|/"Data Model">, e.g.
a morphological analyzer (L<DTA::CAB::Analyzer::Morph|DTA::CAB::Analyzer::Morph>)
might insert a token property 'morph'
which could be read in turn by a
part-of-speech tagger (L<DTA::CAB::Analyzer::Moot|DTA::CAB::Analyzer::Moot>).
See
L<DTA::CAB::Analyzer> for a specification of the basic analysis API,
see
L<DTA::CAB::Analyzer::Common> for some common analyzers,
see
L<DTA::CAB::Chain> and/or L<DTA::CAB::Chain::Multi>
for abstract encapsulations of serial analysis "pipelines",
and see
L<DTA::CAB::Chain::DTA> for the analysis chains used
in the I<Deutsches Textarchiv> project.
L<dta-cab-analyze.perl(1)|dta-cab-analyze.perl>
is a command-line utility for invoking
a local L<persistent|DTA::CAB::Persistent>
analyzer on
a L<document|/"Data Model"> in some supported L<format|/"I/O Formats">.
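The idea of serial analysis, in which a later analyzer reads a property
inserted by an earlier one, can be sketched with plain code-refs. This is
not the L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> API: the function
C<run_chain>, the two toy "analyzers", and the property names C<lc> and
C<len> are all hypothetical and chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for analyzers: each code-ref annotates one token in place.
my $lowercase = sub { my $tok = shift; $tok->{lc}  = lc $tok->{text}; };
my $length    = sub { my $tok = shift; $tok->{len} = length $tok->{lc}; };  # reads 'lc'

# Apply each analyzer in order to every token of every sentence.
sub run_chain {
    my ($doc, @analyzers) = @_;
    foreach my $analyzer (@analyzers) {
        foreach my $sentence (@{ $doc->{body} }) {
            $analyzer->($_) foreach @{ $sentence->{tokens} };
        }
    }
    return $doc;
}

my $doc = { body => [ { tokens => [ { text => 'Theil' }, { text => 'Seyn' } ] } ] };
run_chain($doc, $lowercase, $length);

foreach my $tok (@{ $doc->{body}[0]{tokens} }) {
    print join("\t", @$tok{qw(text lc len)}), "\n";
}
```

The order of the chain matters here: C<$length> depends on the C<lc> property,
just as a tagger may depend on properties inserted by a morphological analyzer.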
=cut
##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB: Server/Client
=pod
=head2 Server/Client Architectures
The DTA::CAB suite implements
two different server/client architectures.
These allow common processing pipelines to be shared between users,
and avoid repeated startup overhead for L<analyzers|/"Processing Model">
with long initialization times.
L<DTA::CAB::Server|DTA::CAB::Server> and L<DTA::CAB::Client|DTA::CAB::Client>
define the abstract server/client API.
=head3 XML-RPC Server/Client Protocol
B<DEPRECATED> in favor of raw L<HTTP|/"HTTP Server/Client Protocol">.
L<DTA::CAB::Server::XmlRpc|DTA::CAB::Server::XmlRpc> implements a simple
XML-RPC HTTP server which can be used to handle analysis requests for
one of a user-specified set of L<DTA::CAB::Analyzer|DTA::CAB::Analyzer>
objects formulated as XML-RPC procedure calls.
L<DTA::CAB::Client::XmlRpc|DTA::CAB::Client::XmlRpc> provides a wrapper class
for querying such a server.
See L<DTA::CAB::XmlRpcProtocol>
for a brief overview of the procedures available
and an XML-RPCish rehash of the DTA::CAB L<data model|/"Data Model">.
The command-line scripts
L<dta-cab-xmlrpc-server.perl(1)|dta-cab-xmlrpc-server.perl>
and
L<dta-cab-xmlrpc-client.perl(1)|dta-cab-xmlrpc-client.perl>
implement the (deprecated) XML-RPC server/client protocol.
=head3 HTTP Server/Client Protocol
L<DTA::CAB::Server::HTTP|DTA::CAB::Server::HTTP> implements a simple
HTTP server which can be used to handle analysis requests for
one of a user-specified set of L<DTA::CAB::Analyzer|DTA::CAB::Analyzer>
objects. The analysis requests themselves are handled by the
L<DTA::CAB::Server::HTTP::Handler::Query|DTA::CAB::Server::HTTP::Handler::Query>
handler class, which interprets incoming GET and/or POST requests as conventional HTTP
form data, invokes the specified analyzer on the query document, and returns a
formatted document in the HTTP response.
L<DTA::CAB::Client::HTTP|DTA::CAB::Client::HTTP> provides a wrapper class
for querying such a server. Additionally, both HTTP servers and clients support a
backwards-compatible L<XML-RPC mode|/"XML-RPC Server/Client Protocol">.
The command-line scripts
L<dta-cab-http-server.perl(1)|dta-cab-http-server.perl>
and
L<dta-cab-http-client.perl(1)|dta-cab-http-client.perl>
implement the HTTP server/client protocol.
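Since queries are interpreted as conventional HTTP form data, a request can
be built with any HTTP library. The sketch below uses the core module
HTTP::Tiny to construct a GET URL. The host, port, and C</query> path are
hypothetical, and the parameter names C<a> (analyzer), C<q> (query string)
and C<fmt> (output format) are assumptions; consult
L<DTA::CAB::Server::HTTP::Handler::Query|DTA::CAB::Server::HTTP::Handler::Query>
for the parameters actually supported.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical server location; adjust to your own installation.
my $base = 'http://localhost:8088/query';

# Assumed form parameters: analyzer name, query text, output format.
my %form = (a => 'default', q => 'Theil', fmt => 'json');

my $http = HTTP::Tiny->new(timeout => 10);
my $url  = $base . '?' . $http->www_form_urlencode(\%form);
print "GET $url\n";

# Uncomment to actually send the request to a running server:
# my $res = $http->get($url);
# print $res->{success}
#     ? $res->{content}
#     : "request failed: $res->{status} $res->{reason}\n";
```

The network call is left commented out so that the example runs without a
server. L<DTA::CAB::Client::HTTP|DTA::CAB::Client::HTTP> wraps this kind of
request in a client class.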
=head3 CLARIN-D WebLicht Protocol
A running L<DTA::CAB::Server::HTTP|DTA::CAB::Server::HTTP> server can be used directly
as a CLARIN-D WebLicht web-service by using the "tcf" or "tcf-orth" formats.
The "CAB historical text analysis"
and "CAB orthographic canonicalizer" WebLicht chain components are implemented
in this fashion; see L<http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/> for details.
=cut
##==============================================================================
## Footer
##==============================================================================
=pod
=head1 AUTHOR
Bryan Jurish E<lt>[email protected]E<gt>
=head1 COPYRIGHT AND LICENSE
Copyright (C) 2008-2019 by Bryan Jurish
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself, either Perl version 5.24.1 or,
at your option, any later version of Perl 5 you may have available.
=cut