Split logging between stderr and stdout #37

Open
nvanva opened this issue Mar 22, 2023 · 1 comment

Comments

@nvanva
Contributor

nvanva commented Mar 22, 2023

Currently all logs are written to stderr. This makes it difficult to find error messages in those logs. Writing only error messages to stderr and the rest to stdout will make it simpler to investigate if there were any errors, especially when running many warc2text processes in parallel and redirecting stdout and stderr to different files.

@jelmervdl
Member

jelmervdl commented Mar 23, 2023

I've been using stdout for #34 and am in the camp of "stdout is for output, not UI" so that I can pipe things together. Using stdout for messages to the user would make that impossible.

If you want to split verbose logging from error messages, I propose a command line option, e.g. --log-file, that writes the verbose messages (a record was filtered by the URL filter, that kind of stuff) to a separate file. This would also make it optional, so it doesn't need any changes in bitextor. Edit: and if you really want the log messages to go to stdout, you can use --log-file=/dev/stdout or --log-file=-.
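To make the proposal concrete, here is a minimal sketch of the routing, written in Python purely for illustration (warc2text itself is C++, and this is not its code). The `--log-file` flag is the proposed, not yet existing, option; `build_logger`, `parse_args`, and the logger name are hypothetical. Errors always go to stderr, and verbose messages go to the log file only when one is given, with `-` meaning stdout.

```python
import argparse
import logging
import sys


def build_logger(log_file=None):
    """Route errors to stderr and verbose records to an optional log file.

    Hypothetical helper: it sketches the proposed behaviour only.
    """
    logger = logging.getLogger("warc2text-sketch")  # assumed name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False

    # Error messages always end up on stderr.
    errors = logging.StreamHandler(sys.stderr)
    errors.setLevel(logging.ERROR)
    logger.addHandler(errors)

    # Verbose messages are only written when a log file was requested.
    if log_file is not None:
        if log_file == "-":
            verbose = logging.StreamHandler(sys.stdout)
        else:
            verbose = logging.FileHandler(log_file)
        # Keep errors out of the verbose stream so they are not duplicated.
        verbose.addFilter(lambda record: record.levelno < logging.ERROR)
        logger.addHandler(verbose)

    return logger


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Proposed option from this thread; "-" is taken to mean stdout.
    parser.add_argument("--log-file", default=None)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    log = build_logger(args.log_file)
    log.info("record filtered by url filter")
    log.error("broken gzip record")
```

Without `--log-file` only the error line is printed, so stdout stays free for piping output.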

That being said, the only error message that doesn't terminate warc2text is the one printed when a warc archive contains broken gzip records (which could indicate file corruption). All others are the last message printed before warc2text dies with a non-zero exit code, which seems pretty reasonable to me.

A different annoyance I've had: if you're running multiple warc2text processes through parallel, warc2text does not prefix its logging messages with the name (and offset maybe?) of the warc that the message is about. Right now you need to collect all messages from a single warc2text process in order, and then go through them from top to bottom to figure out which warc is the source of each message. Running warc2text with just a single warc archive and letting parallel do the log grouping is also not an option, since then you can't easily combine the output of multiple warcs and you end up with many more files on disk.
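As a rough illustration of the prefixing idea, the sketch below tags every message with the archive name and, when known, a record offset. It is again Python for illustration only and not warc2text's code; the `WarcContext` class, the `[name@offset]` format, and the example file name are all assumptions.

```python
import logging
import sys


class WarcContext(logging.LoggerAdapter):
    """Prefix each message with the warc name and an optional offset.

    Hypothetical adapter; the "[name@offset]" format is an assumption.
    """

    def process(self, msg, kwargs):
        name = self.extra["warc"]
        offset = self.extra.get("offset")
        prefix = name if offset is None else f"{name}@{offset}"
        return f"[{prefix}] {msg}", kwargs


if __name__ == "__main__":
    base = logging.getLogger("prefix-sketch")  # assumed name
    base.setLevel(logging.INFO)
    base.addHandler(logging.StreamHandler(sys.stderr))

    # One adapter per archive (or per record, if the offset is tracked).
    log = WarcContext(base, {"warc": "example.warc.gz", "offset": 1024})
    log.info("record filtered by url filter")
```

With a prefix like this, interleaved lines from several processes could be grouped by archive with a plain `grep` or `sort`, without running one process per warc.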
