Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiple small issues with messages in the C code #6503

Open
7 of 8 tasks
aitap opened this issue Sep 17, 2024 · 14 comments
Open
7 of 8 tasks

Multiple small issues with messages in the C code #6503

aitap opened this issue Sep 17, 2024 · 14 comments
Labels
question translation issues/PRs related to message translation projects

Comments

@aitap
Copy link
Contributor

aitap commented Sep 17, 2024

This is a continuation of #6482. I think I won't find any more issues in the C code messages. Some of these are questions, a few indicate real translation hurdles.

  • This one is actually possible for me to translate in parts (so we can keep it as is if needed), but I don't like translating individual words and relying on the rest of the code not to need a different translation:

    data.table/src/fread.c

    Lines 1933 to 1934 in 9443409

    if (verbose) DTPRINT(_(" A line with too-%s fields (%d/%d) was found on line %d of sample jump %d. %s\n"),
    thisNcol<ncol ? _("few") : _("many"), thisNcol, ncol, jumpLine, jump, jump>0 ? _("Most likely this jump landed awkwardly so type bumps here will be skipped.") : "");

    Translation in parts could be done reliably with pgettext, which seems to be currently unavailable even to C code in R's built-in copy of gettext on Windows.
  • This one is also translatable in context, but isn't too painful to recombine into a single sentence:

    data.table/src/fsort.c

    Lines 255 to 259 in 9443409

    Rprintf(_("Reduced MSBsize from %zu to "), MSBsize);
    }
    while (MSBsize>0 && msbCounts[order[MSBsize-1]] < 2) MSBsize--;
    if (verbose) {
    Rprintf(_("%zu by excluding 0 and 1 counts\n"), MSBsize);
  • This one is very awkward to translate, and we're again relying on _("no") not to mean something else in a different part of the code:
    DTPRINT(_(" 'header' determined to be %s because there are%s number fields in the first and only row\n"), args.header?"TRUE":"FALSE", args.header?_(" no"):"");

    Unfortunately, expanding this sentence will duplicate the full line of the code.

The rest are questions unrelated to translations:

  • I think we shouldn't be calling it ASCII if its / is not immediately before 0. How about "The character '/' is not just before '0' in the source character set"?
    if ((uint_fast8_t)('0'-'/') != 1) error(_("The ascii character '/' is not just before '0'"));

    if ((uint_fast8_t)(':'-'9') != 1) error(_("The ascii character ':' is not just after '9'"));
  • This one probably used to be correct in the past, but now should refer to a different variable, .Last.updated:
    if (!isInteger(var) || LENGTH(var)!=1) error(_(".Last.value in namespace is not a length 1 integer"));
  • This message is technically correct because the code does expect only symbols at this point, but data.table:::list2lang converts character strings to symbols. Would it be more helpful to mention character strings in the error message too?
    error(_("Attempting to substitute '%s' element with object of type '%s' but it has to be 'symbol' type when substituting name of the call argument, functions 'as.name' and 'I' can be used to work out proper substitution, see ?substitute2 examples."), CHAR(STRING_ELT(arg_names, i)), type2char(TYPEOF(sym)));
  • Should this mention the issue tracker instead?
    if (isNull(dtnames)) error(_("names(data) is NULL. Please report to data.table-help"));
  • Is LENGTH(.) < 0 something that's possible in R ≥ 3.3.0?
    if (!isInteger(idx) || length(idx) < 0) error(_("concat: 'idx' must be an integer vector of length >= 0"));

(Checkboxes indicate either no action needed or a corresponding fix being suggested in #6504)

@aitap aitap added question translation issues/PRs related to message translation projects labels Sep 17, 2024
@tdhock
Copy link
Member

tdhock commented Sep 17, 2024

Should this mention the issue tracker instead?

yes

@MichaelChirico
Copy link
Member

This one is actually possible for me to translate in parts (so we can keep it as is if needed)

I guess you're referring to few/many, not the jump>0 complete sentence added at the end. If so: yes, I think we should branch it.

I think we shouldn't be calling it ASCII if its / is not immediately before 0. How about "The character '/' is not just before '0' in the source character set

Hmm, good point, though I worry "source character set" is a highly technical term --> relatively tough to translate. WDYT about "Unlike the very common case, e.g. ASCII, the character '/' is not just before '0'".

Would it be more helpful to mention character strings in the error message too?

Ping @jangorecki who has the best context here

Is LENGTH(.) < 0 something that's possible in R ≥ 3.3.0?

Good spot. Has it ever been possible? My best guess was this was a somewhat half-baked fix here:

437dc39

Author fixed the length=0 case to work and changed the message to adapt without stopping to think "length<0 is not possible" instead of "how do I adapt x<=0 | x>0 to include x=0 valid? x<0 | x>=0".

So I think we can just drop that condition and the corresponding part of the message.

@aitap
Copy link
Contributor Author

aitap commented Sep 17, 2024

I guess you're referring to few/many, not the jump>0 complete sentence added at the end. If so: yes, I think we should branch it.

Yes, I mean the part about few/many. Could you please clarify what you meant by "branch it"?

I worry "source character set" is a highly technical term --> relatively tough to translate. WDYT about "Unlike the very common case, e.g. ASCII, the character '/' is not just before '0'".

Agreed, this phrasing sounds fine.

My best guess was this was a somewhat half-baked fix here:

437dc39

...which started out as a check for "opposite of length() > 0". Will remove the condition and adjust the message.

@MichaelChirico
Copy link
Member

Could you please clarify what you meant by "branch it"?

thisNcol < ncol ? _("A line with too few fields...") : _("A line with too many fields...)"

@jangorecki
Copy link
Member

No idea really, but I can imagine some hacks in setting up length to negative value, that could have been in play.

@rikivillalba
Copy link
Contributor

rikivillalba commented Sep 17, 2024

some contribs

  • should not be translatated:
    DTPRINT(_(" NAstrings = ["));
  • messages like this are often hard to translate

    data.table/src/gsumm.c

    Lines 224 to 225 in 3b376da

    if (verbose) Rprintf(_("gather took ... "));
    switch (TYPEOF(x)) {
  • Better to write STOP(_("Internal error in %s: %s. Please report to the data.table issues tracker", __func__, internalErr);
    STOP("%s %s: %s. %s", _("Internal error in"), __func__, internalErr, _("Please report to the data.table issues tracker")); // # nocov
  • part of above message, perhaps difficult to translate out of context.
    DTPRINT(_(" with %d fields using quote rule %d\n"), topNumFields, quoteRule);

@MichaelChirico
Copy link
Member

[aside/FYI @rikivillalba if your permalink includes column numbers (C10, C23 in your 2nd bullet before my edit), it won't render inline in the issue. I don't know why GitHub sometimes includes the column numbers]

@aitap
Copy link
Contributor Author

aitap commented Sep 18, 2024

@jangorecki I'm not seeing calls to SETLENGTH in fmelt.c, but I do know that data.table uses negative TRUELENGTH for its internal purposes. Hmm.

@rikivillalba Thank you for reminding me! There's quite a lot of debugging printout that consists of argument/variable names that (arguably) should not be translated:

  • if (verbose) Rprintf(_("RHS_list_of_columns == %s\n"), RHS_list_of_columns ? "true" : "false");
  • if (iN && it!=xt) error(_("typeof x.%s (%s) != typeof i.%s (%s)"), CHAR(STRING_ELT(getAttrib(xdt,R_NamesSymbol),xcols[col]-1)), type2char(xt), CHAR(STRING_ELT(getAttrib(idt,R_NamesSymbol),icols[col]-1)), type2char(it));
  • if (isNull(jiscols) && (length(bynames)!=length(groups) || length(bynames)!=length(grpcols))) error(_("!length(bynames)[%d]==length(groups)[%d]==length(grpcols)[%d]"),length(bynames),length(groups),length(grpcols));
  • if (length(names) != length(SDall)) error(_("length(names)!=length(SD)"));
  • if (length(xknames) != length(xSD)) error(_("length(xknames)!=length(xSD)"));
  • data.table/src/dogroups.c

    Lines 162 to 163 in d7e95d1

    if (length(iSD)!=length(jiscols)) error(_("length(iSD)[%d] != length(jiscols)[%d]"),length(iSD),length(jiscols));
    if (length(xSD)!=length(xjiscols)) error(_("length(xSD)[%d] != length(xjiscols)[%d]"),length(xSD),length(xjiscols));
  • if (verbose) Rprintf(_("dogroups: growing from %d to %d rows\n"), length(VECTOR_ELT(ans,0)), estn);
  • Rprintf(_("nradix=%d\n"), nradix);
  • One more fragmented sentence, mixing an untranslatable variable name (jump0size==0) in the middle:

    data.table/src/fread.c

    Lines 1891 to 1896 in d7e95d1

    DTPRINT(_(" Number of sampling jump points = %d because "), nJumps);
    if (nrowLimit<INT64_MAX) DTPRINT(_("nrow limit (%"PRIu64") supplied\n"), (uint64_t)nrowLimit);
    else if (jump0size==0) DTPRINT(_("jump0size==0\n"));
    else DTPRINT(_("(%"PRIu64" bytes from row 1 to eof) / (2 * %"PRIu64" jump0size) == %"PRIu64"\n"),
    (uint64_t)sz, (uint64_t)jump0size, (uint64_t)(sz/(2*jump0size)));
    }
  • if (verbose) DTPRINT(_(" jumps=[%d..%d), chunk_size=%"PRIu64", total_size=%"PRIu64"\n"),
  • if (INTEGER(nThreadArg)[0]<1) error(_("nThread(%d)<1"), INTEGER(nThreadArg)[0]);
  • if (verbose) Rprintf(_("nth=%d, nBatch=%d\n"),nth,nBatch);
  • if (verbose) Rprintf(_("maxBit=%d; MSBNbits=%d; shift=%d; MSBsize=%zu\n"), maxBit, MSBNbits, shift, MSBsize);
  • data.table/src/fwrite.c

    Lines 638 to 639 in d7e95d1

    DTPRINT(_("\nargs.doRowNames=%d args.rowNames=%p args.rowNameFun=%d doQuote=%d args.nrow=%"PRId64" args.ncol=%d eolLen=%d\n"),
    args.doRowNames, args.rowNames, args.rowNameFun, doQuote, args.nrow, args.ncol, eolLen);
  • if (LENGTH(f) != ngrp) error(_("length(f)=%d != length(l)=%d"), LENGTH(f), ngrp);
  • This one uses untranslated default values in mygetenv, which may be not that important:
    Rprintf(_(" omp_get_num_procs() %d\n"), omp_get_num_procs());
    Rprintf(_(" R_DATATABLE_NUM_PROCS_PERCENT %s\n"), mygetenv("R_DATATABLE_NUM_PROCS_PERCENT", "unset (default 50)"));
    Rprintf(_(" R_DATATABLE_NUM_THREADS %s\n"), mygetenv("R_DATATABLE_NUM_THREADS", "unset"));
    Rprintf(_(" R_DATATABLE_THROTTLE %s\n"), mygetenv("R_DATATABLE_THROTTLE", "unset (default 1024)"));
    Rprintf(_(" omp_get_thread_limit() %d\n"), omp_get_thread_limit());
    Rprintf(_(" omp_get_max_threads() %d\n"), omp_get_max_threads());
    Rprintf(_(" OMP_THREAD_LIMIT %s\n"), mygetenv("OMP_THREAD_LIMIT", "unset")); // CRAN sets to 2
    Rprintf(_(" OMP_NUM_THREADS %s\n"), mygetenv("OMP_NUM_THREADS", "unset"));
    Rprintf(_(" RestoreAfterFork %s\n"), RestoreAfterFork ? "true" : "false");
  • if (length(order) != nrow) error(_("nrow(x)[%d]!=length(order)[%d]"),nrow,length(order));

Which of these should have their _ removed? Which should be translated anyway?

@MichaelChirico
Copy link
Member

MichaelChirico commented Sep 18, 2024

for those, I think we need a system for marking "notranslate" at the same time, so that CI/release doesn't keep finding these strings that are in translateable calls, just not useful to translate.

In potools, that's // # notranslate.

Let's open another separate issue for those, it's a bit different from the topic at hand.

@aitap
Copy link
Contributor Author

aitap commented Sep 19, 2024

  • This one indicates a problem with xgettext: the string eventually given to gettext() will not be translated because xgettext had split it into parts, including "%":

    data.table/src/assign.c

    Lines 944 to 945 in d7e95d1

    ? _("%"FMT" (type '%s') at RHS position %d "TO" when assigning to type '%s' (target vector)") \
    : _("%"FMT" (type '%s') at RHS position %d "TO" when assigning to type '%s' (column %d named '%s')"), \

Edit: hmm, I don't see "at RHS position" at all in the official *.pot file. Nevertheless, I don't think xgettext handles compile-time string concatenation in general.

@MichaelChirico
Copy link
Member

Re: the previous comment, any suggested fix? The first things that come to mind seem messy. Maybe we should functionalize the macro...

@aitap
Copy link
Contributor Author

aitap commented Sep 23, 2024

The core of the problem is that even if we do follow the recommendation to format the number into a temporary buffer and then use %s for all number formats, we still have a combinatorial explosion of 7 (TO) ⋅ 2 (target vector / column %d named '%s') strings that xgettext wants spelled out in the source code.

Can we cheat a little and split the clauses into two sentences? Then we'll have

_("%s (type '%s') at RHS position %d taken as TRUE")
_("%s (type '%s') at RHS position %d taken as 0")
_("%s (type '%s') at RHS position %d either truncated (precision lost) or taken as 0")
_("%s (type '%s') at RHS position %d out-of-range (NA)")
_("%s (type '%s') at RHS position %d out-of-range (NA) or truncated (precision lost)")
_("%s (type '%s') at RHS position %d had either imaginary part discarded or real part truncated (precision lost)")
_("%s (type '%s') at RHS position %d had imaginary part discarded")

and

_("Problem when assigning to type '%s' (target vector):")
_("Problem when assigning to type '%s' (column %d named '%s'):")

...which is more manageable. The strings could be rephrased further, making them truly separate sentences (The problem happened when assigning to type '%s' (target vector).).

@MichaelChirico
Copy link
Member

Would it be more helpful to mention character strings in the error message too?

cc @jangorecki. I'm not sure what change you have in mind exactly, but I think the call-out to consult ?substitute2 is the most helpful part. I worry adding to the message will make it over-complicated.

@MichaelChirico
Copy link
Member

part of above message, perhaps difficult to translate out of context.

Good spot, there's a similar fragment below. It's a bit hard to pull apart exactly how we'd make this friendlier. I think the best bet is to make each fragment a standalone sentence. I'll want to read more carefully how exactly those show up in the verbose output. Filing as a separate issue so it's not lost, there's already a lot going on here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question translation issues/PRs related to message translation projects
Projects
None yet
Development

No branches or pull requests

5 participants