
Develop a pluggable logging API for Cacti to be able to log to alternate logging formats #5852

Open
TheWitness opened this issue Oct 13, 2024 · 22 comments
Labels
enhancement General tag for an enhancement
Comments

@TheWitness
Member

Feature Request

Is your feature request related to a problem? Please describe

Support extensible logging to Splunk, Kafka, Elastic, etc.

Describe the solution you'd like

We need to determine the minimum logging attributes that are required to encompass these various logging solutions and then define the JSON, XML, etc. structures that are the minimum to perform such logging, and finally an API and class library that extends the output format using a pluggable and extensible output class function.

Describe alternatives you've considered

Continuing with the current one-size-fits-all logging format.

Additional context

This is an R&D project that should deliver a report on what should happen vs. simply coding an enhancement. So, it's a call for a design document.

@TheWitness TheWitness added the enhancement General tag for an enhancement label Oct 13, 2024
@TheWitness TheWitness added this to the v1.3.0 milestone Oct 13, 2024
@bernisys
Contributor

bernisys commented Oct 14, 2024

Thoughts about capabilities and formats

The local logging must be unified

  • each log line must have the same structure
  • each log line must have the same minimum information
  • log lines need to be easily extendable with additional information (for example for debugging)
  • adding verbosity should not alter the start of the lines, so that they can still be parsed by existing functions
  • a common log line format must be defined, and several elements need a static position and a separator that can be clearly identified
  • suggestion for a local-file log format: "<timestamp> <section/facility>:<subsection> - <log message> -- <key/value pairs> == <optional info>"
  • for standard syslog, we should try to match RFC5424 conformity (logs with structured data) -> https://datatracker.ietf.org/doc/html/rfc5424#section-6.5 example 3
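A minimal sketch of a renderer for the suggested local-file format (the function name and parameters are hypothetical, not existing Cacti API):

```php
<?php
// Hypothetical renderer for the suggested format:
// "<timestamp> <section/facility>:<subsection> - <log message> -- <key/value pairs> == <optional info>"
function format_local_log_line($timestamp, $facility, $subsection, $message, array $kv = array(), $optional = '') {
    $pairs = array();
    foreach ($kv as $key => $value) {
        $pairs[] = $key . '[' . $value . ']';   // keep the old bracket notation
    }

    $line = $timestamp . ' ' . $facility . ':' . $subsection . ' - ' . $message;

    if (count($pairs) > 0) {
        $line .= ' -- ' . implode(' ', $pairs);
    }

    if ($optional !== '') {
        $line .= ' == ' . $optional;
    }

    return $line;
}

// format_local_log_line('2024-10-14T08:00:00.000Z', 'Poller', 'boost', 'cycle complete', array('host_id' => 42))
//   => "2024-10-14T08:00:00.000Z Poller:boost - cycle complete -- host_id[42]"
```

Because the fixed-position elements and separators come first, adding more key/value pairs or optional info never alters the start of the line, so existing parsers keep working.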

Considerations for configuration

  • make it flexible
  • keep it simple (too many options might confuse the users - maybe add "expert" section?)
  • global options to enable/disable specific elements per section
  • log levels should be configurable at least per facility or per code-section
  • optional information needs to be configurable (for example: line numbers and code portions are usually only needed for debugging)

Additional considerations:

  • there are several logging functions in different parts of Cacti; this needs some unification
  • allow more detailed logging in a specific section, for example when there is something with boost -> switch on boost logging with this and that severity
  • transfer in output format should be done as final step (keep the logging item in a data structure as long as possible, separated by identifiers mentioned below)
  • we might need to think about the general concept of logging into different files. Is that really necessary or wouldn't it be better to keep everything in one file? And if we want to keep supporting multiple files, which information shall go into which file?

log items/identifiers

Minimum/mandatory set of information (identifiers)

  • timestamp in epoch or UTC (local time conversion later)
  • facility/section inside cacti (i.e. Spine, cmdphp, Thresholds, Core, Automate, SQL, ...)
  • subsection (TBD)
  • severity (similar to syslog: debug=7, info, notice, warning, error, critical, alert, emergency=0)
  • message

Optional items (where applicable / during debugging)

  • php-file in which the log was produced
  • line number
  • function name
  • object that produced the log
  • The message can include additional key/value pairs
    • let's keep the old bracket-notation (like: host_id[42])
    • if possible, make the keys identical to the DB-column names they refer to
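Put together, the mandatory and optional identifiers above could look like this as an internal record (the key names are suggestions for discussion, not a fixed API):

```php
<?php
// Illustrative log record combining the mandatory and optional identifiers
$record = array(
    // mandatory
    'timestamp'  => time(),           // epoch (UTC); local-time conversion happens on output
    'facility'   => 'Poller',         // Spine, cmdphp, Thresholds, Core, Automate, SQL, ...
    'subsection' => 'boost',          // TBD
    'severity'   => 7,                // syslog-style: debug=7 ... emergency=0
    'message'    => 'Boost cycle finished',

    // optional (mainly for debugging)
    'file'       => __FILE__,         // php file in which the log was produced
    'line'       => __LINE__,
    'function'   => 'poller_output',  // hypothetical producing function
    'metadata'   => array(
        'host_id' => 42,              // keys match the DB column names where possible
    ),
);
```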

Collection of possible (sub)sections:

  • [main] GUI operations
  • [main] Boost
  • [main] Polling (spine + cmdphp)
  • [main] workflow/controlling
  • [sub] database accesses
  • [sub] command calls
  • [sub] remote poller communication
    @TheWitness @netniV please contribute here, you know the code and its structure better

support for output plug-ins

The logging function should act as a hub for plug-ins that allow logging to not only a local file but also other destinations.
Possible output plug-ins:

  • local file (default plug-in to keep current functionality)
    • Config: file location(s), rotation settings [hourly/daily/size-based, compression, archive folder structure]
    • log-file should immediately have a creation-timestamp in its name
  • local syslog (default plug-in to keep current functionality)
    • Config: facility
  • remote syslog (optional plug-in)
    • Config: target-server, protocol, port, facility, CA
  • Elasticsearch (optional plug-in, TBD)
  • kafka bus (optional plug-in, TBD)
  • other plug-ins (community-provided)
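The hub idea could be sketched like this: the core formats a record once and fans it out to every registered destination. The interface and class names are hypothetical:

```php
<?php
// Hypothetical output plug-in contract: every destination (local file,
// syslog, Elasticsearch, Kafka, ...) implements one write() method.
interface CactiLogOutput {
    public function write(array $record);
}

// Default plug-in keeping the current local-file behavior
class LocalFileOutput implements CactiLogOutput {
    private $path;

    public function __construct($path) {
        $this->path = $path;
    }

    public function write(array $record) {
        file_put_contents($this->path, json_encode($record) . PHP_EOL, FILE_APPEND);
    }
}

// The logging hub fans a record out to all registered plug-ins
function log_dispatch(array $outputs, array $record) {
    foreach ($outputs as $output) {
        $output->write($record);
    }
}
```

A remote-syslog or Elasticsearch plug-in would implement the same interface with its own config (target server, protocol, port, and so on).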

Thoughts about implementation phases

prepare the refactoring

  • create a new specific library-file: /lib/logging.php
  • put all new logging-related functions into that file
  • create new core-function which must be used in future for all logging purposes
  • create a set of logging configuration entries in the "settings" table, prefixed logging_
  • create new GUI config elements for the new logging settings + code and sanity checks around them
  • keep the current logging-function available (old plug-ins should be supported as long as possible)
    • use the current code as source for analyzing the logging internals
    • rebuild the basic functionality in the new function, add all planned features
    • make the old code call the new code
    • old code then acts as meta-code for the transition to new logging
    • think about expiring old logging at some point in future
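The "old code calls the new code" transition step could be sketched as follows. Both function names here are placeholders, not the final API:

```php
<?php
// Transition sketch: the legacy string-based entry point becomes a thin
// shim over a new structured core function in /lib/logging.php.
function logging_write(array $record) {
    // new core function: in the real implementation this would format the
    // record and dispatch it to the configured output plug-ins
    return json_encode($record);
}

function cacti_log_shim($string, $environ = 'CMDPHP') {
    // old-style string call, lifted into the new structured form
    return logging_write(array(
        'timestamp' => time(),
        'facility'  => $environ,
        'message'   => $string,
    ));
}
```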

reforming the logging concept

  • create logging initialization procedure
    • if possible/applicable, called only once during start of the instance
    • initializes all logging-related variables
    • checks if log-files & folders are available and creates them if needed
    • long running processes might need a reload/re-init functionality to allow change of log levels (for example during a boost run)
  • include the new core logging lib file everywhere, where logging is needed (which is basically everywhere)
  • Start refactoring all "cacti_log" calls, replace them with the new core lib calls

@bernisys
Contributor

Whoever has ideas that need to be considered, please speak up! :)
This is an open discussion to take into account as many thoughts as possible from the start.
I will keep adding the information in compact form to my initial post.

@xmacan
Member

xmacan commented Oct 22, 2024

Subsections also for plug-in modules. It could be useful if I can filter messages only from a specific plugin, for example if I have debug enabled for it.

@TheWitness
Member Author

I hope to provide some comments next week.

@TheWitness
Member Author

All, I've had a detailed conversation with @Linegod, whose day job has everything to do with logging. Let me summarize here where we aligned. I will do this in code form as well as descriptively.

  1. KISS - Yup, make it easy.
  2. Modify the cacti_log() function to take an array in addition to a string as input.
  3. For legacy logging, use the special tags that we use today (DS[], Device[], GRAPH[]) from legacy messages and break them into an object if the input to the function is a string.
  4. If the input to the function is an object/array, then allow logging in JSON and whatever format is applicable to the output function. @Linegod says it's pretty much all JSON these days, with some variations on the timestamp format.
  5. The plugin dictates things like topic, endpoint ip address, and auth data. The format could simply be a JSON file defining the plugin vs. a table. That's up for discussion.

Now for my position. One way or another, the cacti.log file will remain the same, emulating the tags already defined regardless of the input type.

We will discuss more tomorrow in our call.

Larry
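Point 3 above (lifting the legacy bracket tags out of a string message) could look roughly like this. The function name is hypothetical and only a subset of tags is shown:

```php
<?php
// Sketch: when the input is a legacy string, pull the special tags
// (DS[], Device[], Poller[], ...) out into a metadata object.
function parse_legacy_tags($message) {
    $metadata = array();

    if (preg_match_all('/\b(DS|DQ|Device|Poller|GT|Graph|User)\[([\d, ]+)\]/', $message, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // key naming loosely follows the DB column convention (host_id, ...)
            $metadata[strtolower($match[1]) . '_id'] = trim($match[2]);
        }
    }

    return $metadata;
}

// parse_legacy_tags('WARNING: Device[42] DS[7] check failed')
//   => array('device_id' => '42', 'ds_id' => '7')
```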

@TheWitness
Member Author

We could store the plugins in a table as a name/value pair, of course, with a pre-defined JSON structure holding the important columns as above. This is pretty popular these days with people using name/value stores such as Redis. But why not the Cacti settings table (just kidding); we will create its own table format and cache the document at the session level. It'll be stored in JSON. I guess the 'template' could be stored on disk. Let's discuss.

@TheWitness
Member Author

So, I would propose something like the following as a default object structure; it can of course be extended by the caller:

$message = array(
  'timestamp' => /* unix timestamp, in integer or millisecond format (optional) */,
  'message'   => /* text */,
  'level'     => /* LOW|MED|HIGH|DEBUG|... */,
  'hostname'  => /* either Cacti or the device in question */,
  'app'       => /* third parameter of the current logging facility */,
  'metadata'  => array(
    'thing' => $value,
    ...
  )
);

Then, as additional optional metadata that we know of today, which can be reverse engineered into the object structure and is currently captured in the text_get_regex_array() function, we have:

$regex_array = array(
  1  => array('name' => 'DS',     'regex' => '( DS\[)([, \d]+)(\])',       'func' => 'text_regex_datasource'),
  2  => array('name' => 'DQ',     'regex' => '( DQ\[)([, \d]+)(\])',       'func' => 'text_regex_dataquery'),
  3  => array('name' => 'Device', 'regex' => '( Device\[)([, \d]+)(\])',   'func' => 'text_regex_device'),
  4  => array('name' => 'Poller', 'regex' => '( Poller\[)([, \d]+)(\])',   'func' => 'text_regex_poller'),
  5  => array('name' => 'RRA',    'regex' => "([_\/])(\d+)(\.rrd')",       'func' => 'text_regex_rra'),
  6  => array('name' => 'GT',     'regex' => '( GT\[)([, \d]+)(\])',       'func' => 'text_regex_graphtemplates'),
  7  => array('name' => 'Graph',  'regex' => '( Graph\[)([, \d]+)(\])',    'func' => 'text_regex_graphs'),
  8  => array('name' => 'Graphs', 'regex' => '( Graphs\[)([, \d]+)(\])',   'func' => 'text_regex_graphs'),
  9  => array('name' => 'User',   'regex' => '( User\[)([, \d]+)(\])',     'func' => 'text_regex_users'),
  10 => array('name' => 'User',   'regex' => '( Users\[)([, \d]+)(\])',    'func' => 'text_regex_users'),
  11 => array('name' => 'Rule',   'regex' => '( Rule\[)([, \d]+)(\])',     'func' => 'text_regex_rule'),
);

I noticed a few things missing from the 'standard' like Site[], Location[], Thold[], Alert[] for example. These would all be captured under metadata. At least that is where my head is at.

If we follow that structure, we can provide a human readable message in the cacti.log file, as well as JSON to any of the feed types.
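That dual rendering could be sketched as one record producing both outputs. The function name and exact legacy layout here are illustrative only:

```php
<?php
// Sketch: one $message array yields both the human-readable cacti.log line
// (legacy bracket tags) and the JSON line for the feed plug-ins.
function render_outputs(array $message) {
    $flat = $message['message'];

    if (isset($message['metadata'])) {
        foreach ($message['metadata'] as $key => $value) {
            $flat .= ' ' . $key . '[' . $value . ']';   // legacy XXX[] notation
        }
    }

    return array(
        'legacy' => $flat,                  // for cacti.log
        'json'   => json_encode($message),  // for Splunk/Kafka/Elastic feeds
    );
}
```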

@TheWitness
Member Author

@Linegod please jump in. Is this kind of where your head was too?

@Linegod
Member

Linegod commented Nov 10, 2024

@TheWitness - yes, that's pretty much what I was thinking.

A little bit of digging led me to this though: https://github.com/Seldaek/monolog

Most of the heavy lifting could be accomplished just by using monolog.

@TheWitness
Member Author

Oh, look 'composer'. Last time, that did not go very well. No worries.

@Linegod
Member

Linegod commented Nov 10, 2024

There are 4 other vendor libraries using composer, so I didn't think it would be an issue.

./include/vendor/cldr-to-gettext-plural-rules/composer.json
./include/vendor/gettext/composer.json
./include/vendor/phpmailer/composer.json
./include/vendor/flag-icons/composer.json

And monolog is PSR-3 compliant, so it adheres to the PHP logging standards. And the PHP logging standards are RFC5424 compliant, so they adhere to the syslog standard.

@bernisys
Contributor

bernisys commented Dec 9, 2024

Hmm, thinking again about Larry's idea of the hostname parameter, I come to a different conclusion ...

'hostname' => either Cacti or the device in question,

I believe to match the standard, it should rather be the name of the originating server that generates the log line.
This would then actually be the cacti server's hostname (either the main one or one of the remote ones), and the device name and ID should better be reflected in the metadata part only.

It's similar to for example firewall logs, where we have the FW itself writing the log line, which is also reflected as the hostname part in the 5424 syslog. But it might also contain a hostname of a device that connects across it, and that is logged in the (more or less) structured part of the log message.

And this metadata part:
4 => array('name' => 'Poller', 'regex' => '( Poller\[)([, \d]+)(\])', 'func' => 'text_regex_poller'),
would come into play, if the main poller registers something going on with the communication towards a remote poller for example. Or vice versa.

BTW I think we should start with some meta-code, getting all this structured part in order before we fill it with life, providing the actual code that processes everything.

Comments?

@bernisys
Contributor

bernisys commented Dec 9, 2024

Comment on the timestamp:
If possible, at least internally we should go for milliseconds, as it is much more precise and therefore makes it easier to reconstruct the sequence if, for example, the logs are sent through a syslog channel across a load-balancing device, which causes re-ordering of messages due to differences in internal processing time. I am thinking here about a clustered logging solution like Elastic, which can have a load-balanced Logstash or Beats phalanx in front. All logs arrive in sequence at the LB, but it is not guaranteed that the time a packet needs from the kernel to the application always has identical delays. Scheduling is on a 1000 ticks/s time scale, so we can already have at least 1 ms of difference if another task was processed at that time, and that happens constantly.

If we can rely on a precise originating timestamp, where logs are produced in a given logical sequence, this helps with the re-ordering in such cases.
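As a sketch of the millisecond idea: keep a plain epoch-milliseconds integer internally and only format it on output. The function names are hypothetical; the formatted example matches the RFC 5424 sample timestamp:

```php
<?php
// Capture a millisecond-precision timestamp internally (epoch + ms) ...
function log_timestamp_ms() {
    return (int) round(microtime(true) * 1000);
}

// ... and defer human-readable formatting to output time
function format_timestamp_rfc3339($epoch_ms) {
    $seconds = intdiv($epoch_ms, 1000);
    $millis  = $epoch_ms % 1000;

    return gmdate('Y-m-d\TH:i:s', $seconds) . sprintf('.%03dZ', $millis);
}

// format_timestamp_rfc3339(1065910455003) => '2003-10-11T22:14:15.003Z'
```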

@bernisys
Contributor

bernisys commented Dec 9, 2024

Regarding Graphs and Users:

I think we can always handle those as an array: either it contains one element or it contains more, and having it as an array in all cases makes the handling easier.

So we can say this:

  7  => array('name' => 'Graph',  'regex' => '( Graph\[)([, \d]+)(\])',    'func' => 'text_regex_graphs'),
  8  => array('name' => 'Graphs', 'regex' => '( Graphs\[)([, \d]+)(\])',   'func' => 'text_regex_graphs'),

would become this single line:

  <n>  => array('name' => 'Graphs',  'regex' => '( Graphs?\[)([, \d]+)(\])',    'func' => 'text_regex_graphs'),

Same here:

  9  => array('name' => 'User',   'regex' => '( User\[)([, \d]+)(\])',     'func' => 'text_regex_users'),
  10 => array('name' => 'Users',   'regex' => '( Users\[)([, \d]+)(\])',    'func' => 'text_regex_users'),
                                                     ^ you forgot the "s" there, I added it
  <n>  => array('name' => 'Users',   'regex' => '( Users?\[)([, \d]+)(\])',     'func' => 'text_regex_users'),

(adapt the numbers in front as necessary)
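A quick check of the merged pattern confirms that the optional "s" covers both the singular and plural tags:

```php
<?php
// The merged pattern with an optional "s" matches both tag spellings
$regex = '/( Graphs?\[)([, \d]+)(\])/';

preg_match($regex, 'Removing Graph[17]', $single);
preg_match($regex, 'Removing Graphs[17, 18]', $plural);

// $single[2] => '17', $plural[2] => '17, 18'
```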

@TheWitness
Member Author

Another comment or question on this. Should we not log to a table first and then have a proxy forward, so as to not impact Cacti if the log receiver is down?

@TheWitness
Member Author

Or better yet, just let rsyslogd do the forwarding for us. We just have to write the JSON format to an alternate location and let it do all the work. For those on Windows, I guess... do we care, maybe? I don't know...

@bernisys
Contributor

Logging into a DB in principle has the same constraint ... if the target DB is down, it fails.
But in the case of Cacti, which relies heavily on the DB, this could be a workable idea.
And if the logging end point is that critical, it would be load balanced anyway. In the case of ELK, for example, you could configure multiple end points that are used in a round-robin manner, or, if an end point does not react, we simply switch to the next instance (this can be implemented in the logging plug-in).

This brings me to something that we need to consider:
If the Cacti instance is in debugging mode, you would flood the logging end-point. Or we simply drop the "debug" level logs by default and let the user decide if they want to include debug logs in the end point. To point out that debugging is active, we could drop a warning once a minute: "Debugging is activated on poller ... for devices=... scripts=..."

And if we let syslog do the stuff, then
a) the user is forced to configure the connectivity on TCP basis, otherwise log reception is not guaranteed as UDP is stateless
b) we would introduce an intermediate log-format (text-based JSON or key-value format with prepended syslog header) that needs to be parsed again at the actual logging-endpoint

Since caching on failure is generally a very wise idea, I suggest keeping it simple:
If we detect that a configured receiver is down, then let's just put the plain JSON objects into a local file. This file can be replayed once the remote server is back again.
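The spool-and-replay idea could be sketched like this. Function names are placeholders; $sender stands in for whatever delivery mechanism the plug-in uses:

```php
<?php
// Sketch: if the remote receiver is down, append the plain JSON objects to
// a local spool file; replay them once the receiver is back.
function log_or_spool(array $record, $sender, $spool_file) {
    // $sender is a callable that returns true on successful delivery
    if (!$sender(json_encode($record))) {
        file_put_contents($spool_file, json_encode($record) . PHP_EOL, FILE_APPEND | LOCK_EX);
    }
}

function replay_spool($sender, $spool_file) {
    if (!file_exists($spool_file)) {
        return;
    }

    foreach (file($spool_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $sender($line);
    }

    unlink($spool_file);   // naive: assumes every replayed line was accepted
}
```

A real implementation would need to handle partial replay failures, but the principle stays the same: the spool file is just plain JSON lines.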

@TheWitness
Member Author

We should handle this like our syslog integration: be able to exclude by string match or level.

One thing at a time though. Packaging this week.

@TheWitness
Member Author

Okay, I'm winding down on projects. So, please let me know what you think.

  1. To keep it simple, create a JSON log path that gets rotated just like the other logs. Example: cacti.log.json
  2. All timestamps in millisecond precision using the RFC 3339 layout, for example: 2003-10-11T22:14:15.003Z.
  3. Existing string logging will be converted to JSON as we discussed.
  4. cacti_log() will take mixed input, either array or string. Arrays will be converted to JSON in the JSON logging, and flattened to our current standard of XXX[] in the legacy (visual) logging format.
  5. We de-scope the forwarding aspects and defer that to, say, rsyslog, which handles this perfectly today. Windows users will have to figure things out, I guess.

If we do this, I can have it finished (first pass), in just a few hours.
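Point 4 (mixed input) could be sketched as a small normalization step, so the per-output formatters only ever see an array. The function name is a placeholder:

```php
<?php
// Sketch: normalize mixed input (string or array) into one structure
// before any per-output formatting happens.
function normalize_log_input($input) {
    if (is_array($input)) {
        return $input;
    }

    // legacy string call: wrap it so downstream code has one shape to handle
    return array('message' => (string) $input);
}
```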

@bernisys @Linegod @netniV @xmacan @browniebraun @seanmancini, please chime in.

@netniV
Member

netniV commented Dec 15, 2024

Sounds like a plan

@TheWitness
Member Author

@Linegod, RFC5424 structured data is not quite JSON. Are we okay just running with JSON though, with the Zulu timestamp?

@bernisys
Contributor

@TheWitness: Sounds okay so far, I am just not too comfortable with that part: "cacti_log() will take mixed input either array or string"

I would create a new logging function (func-name TBD) that takes only JSON-like objects (hash-arrays internally), then point the old function to the new one, and in the code replace the logging calls everywhere, reviewing the calls and their contents at the same time.

This offers higher processing speed in the core, because we don't need to distinguish what kind of object we get when the function is called.

Regarding the timestamp: that (as well as all further parts of the given data) would be formatted only when the log is actually written to a destination, so internally we can keep a simple number-based time format, maybe epoch + millis or something like that. Whatever PHP offers and whatever is best to process without too much overhead.

And I believe the current cacti.log can (should) be kept, too, as it is more human-readable than JSON. It's just another internal output plug-in, and the formatting of the lines is necessary anyway when we want to write the lines to syslog; it's a rather similar format. So why not "re-use" that and write the good old cacti.log that the users already know. That avoids some confusion, I would say.
Though one thing I would change here: all files should be timestamped directly, and not rotated. That imposes basically no overhead when writing the log, we gain simpler internal handling at rotation time (nothing needed but a new filename with the new date-stamp once the date changes, no extra file renames), and the internal log view in the GUI can simply scan all (or the latest two) timestamped files if needed.
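The "timestamped files instead of rotation" idea boils down to deriving the target filename from the current date, so rollover needs no renames at all. A minimal sketch (path layout is illustrative):

```php
<?php
// Sketch: the log filename carries its own date stamp; once the (UTC) date
// changes, writes simply land in a new file, with no rename step needed.
function current_log_file($base_path, $epoch) {
    return $base_path . '/cacti-' . gmdate('Y-m-d', $epoch) . '.log';
}

// current_log_file('/var/log/cacti', 1065910455) => '/var/log/cacti/cacti-2003-10-11.log'
```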

Comments please :)
Better talk a bit more first and do it "a bit more right" in the end.

@TheWitness TheWitness modified the milestones: v1.3.0, v1.4.0 Jan 14, 2025