Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_Generic() is not handled on elixir.bootlin.com #237

Open
andy-shev opened this issue Dec 4, 2022 · 11 comments
Open

_Generic() is not handled on elixir.bootlin.com #237

andy-shev opened this issue Dec 4, 2022 · 11 comments

Comments

@andy-shev
Copy link

There is no references to _Generic(). Since new kernel releases will more and more rely on the macros with this feature be enabled, it would be nice to at least have a list of the users of it in the source code.

@fstachura
Copy link
Collaborator

_Generic is never listed by ctags (because it's a keyword and it's not redefined anywhere in the code), therefore it's not in the definitions database and references to it are not recorded. Handling this would probably need to be special-cased in the code.

@tleb
Copy link
Member

tleb commented Jun 10, 2024

_Generic is never listed by ctags (because it's a keyword and it's not redefined anywhere in the code), therefore it's not in the definitions database and references to it are not recorded.

My understand is this: _Generic() is not listed by ctags because it can only be used in macros. ctags does list keywords, see your identifier database (#286) that contains keywords.

Handling this would probably need to be special-cased in the code.

I don't see any ctags kind that supports it. I tested using a short test.c file that uses a _Generic in a macro, then:

⟩ # For reference:
⟩ ctags --version
Universal Ctags 6.1.0(v6.1.0), Copyright (C) 2015-2023 Universal Ctags Team
Universal Ctags is derived from Exuberant Ctags.
Exuberant Ctags 5.8, Copyright (C) 1996-2009 Darren Hiebert
  Compiled: Mar 17 2024, 12:18:39
  URL: https://ctags.io/
  Output version: 0.0
  Optional compiled features: +wildcards, +regex, +iconv, +option-directory,
  +xpath, +json, +interactive, +sandbox, +yaml, +packcc, +optscript, +pcre2

⟩ # Enable all kinds. Full list comes from `ctags --list-kinds=C`.
⟩ ctags --c-kinds=defghlmpstuvxzLD test.c
⟩ grep _Generic tags
X	test.c	/^#define cbrt(X) _Generic((X), \\$/;"	D	macro:cbrt
⟩ # only result is for `X`, of kind:
⟩ # D  parameters inside macro definitions

That would mean we would need to contribute to ctags or add custom parsing. I'm not sure parsing C is something we want to get into.

Also, is this feature useful? I'd say no. But I might be biased because I have a copy of the kernel on my machine and can always grep from there if I need non keywords (which I also do to look for strings for example, which are not indexed by Elixir either).

@andy-shev
Copy link
Author

andy-shev commented Jun 10, 2024

Also, is this feature useful? I'd say no.

I disagree.

Yes, first of all we should consider Elixir service to be detached from whatever developer has, I often access it via my phone. Do I have a kernel Git source tree there? Nope, and 99.99+% of users of such a use case won't have it either.
Second, the Elixir interface will give "identifier not found" (or alike) error message making a false impression about some macros or even functions. In long term you will lose half of the generic (pun unintended) helpers in the kernel, because as I said, _Generic() becomes more widespread in there.

Yeah, I understand that ideally it's probably a ctags project that is lagging more and more with Linux kernel progression (and most of those lags are related to macros: _Generic(), __free(), ...).

@tleb
Copy link
Member

tleb commented Jun 10, 2024

I see your reasoning.

Yeah, I understand that ideally it's probably a ctags project that is lagging more and more with Linux kernel progression (and most of those lags are related to macros: _Generic(), __free(), ...).

__free is indexed properly, it is a plain kernel macro. Same thing with all other kernel macros.

Something odd: __free even gets indexed when used inside #define. This contrasts with _Generic which should look the same to ctags. See ident, referenced in include/linux/of.h, which I cannot reproduce locally:

⟩ cat test.c
#define __free(_name) __cleanup(__free_##_name)

⟩ # command taken from `script.sh`
⟩ ctags -x --kinds-c=+p+x --extras='-{anonymous}' test.c
__free           macro         1 test.c           #define __free(_name) __cleanup(__free_##_name)

⟩ # notice we get no entry for `__cleanup`

It might be only a bug/exception with _Generic but I cannot find an edge-case for it in ctags or Elixir. This needs further investigation.

  • What makes Elixir index __cleanup in the declaration of __free on the Elixir instance but not when ctags is used directly on my machine?
  • Why is _Generic different from other symbols inside macros?

@fstachura
Copy link
Collaborator

ctags does list keywords, see your identifier database (#286 (comment)) that contains keywords.

But ctags only lists these keywords because they are redefined as something else (label, member name...) somewhere in the kernel. See:

But no else or return for example.

@tleb
Copy link
Member

tleb commented Jun 10, 2024

How can ctags be aware of those definitions? It gets passed a file in /tmp that is extracted using git cat-file blob. It does not have access to the full kernel sources when running and does not get passed the existing database.

Take __cleanup. It is referenced in include/linux/cleanup.h but gets defined in include/linux/compiler-clang.h or include/linux/compiler_attributes.h. When ctags parses include/linux/cleanup.h it is not aware of the macro's potential definitions.

@fstachura
Copy link
Collaborator

Ctags is only used to parse definitions. References are parsed with a different, simpler method. Ctags is not involved when the database of references is updated.

See https://github.com/bootlin/elixir/blob/master/update.py#L237
And https://github.com/bootlin/elixir/blob/master/script.sh#L161

For references, a simple tokenizer is ran: https://github.com/bootlin/elixir/blob/master/script.sh#L97
Then, all identifiers returned by the tokenizer are cross-checked with the definitions database. If an identifier is already in the definitions database, then it's added to the reference database https://github.com/bootlin/elixir/blob/master/update.py#L318 (there are some exceptions to this, but that's roughly how it works in my understanding)

References update (for a particular tag) runs only after definitions update is done.
https://github.com/bootlin/elixir/blob/master/update.py#L277

@tleb
Copy link
Member

tleb commented Jun 10, 2024

Ah, that tokenizer was the part I was missing, thanks! The conditional statement in update_references() could be split up and would benefit from some comments.

Then to address _Generic, we should turn the logic from:

# This is what we currently have in update.py.

if (db.defs.exists(tok) and
    not ( (idx*idx_key_mod + line_num) in defs_idxes and
        defs_idxes[idx*idx_key_mod + line_num] == tok ) and
    (family != 'M' or tok.startswith(b'CONFIG_'))):
    # add reference
    pass

# First refactor to simplify. I _think_ this is equivalent.

def_exists = db.defs.exists(tok)
i = idx*idx_key_mod + line_num
if def_exists and defs_idxes.get(i) != tok and (family != 'M' or tok.startswith(b'CONFIG_')):
    # add reference
    pass

# Now add logic. I haven't tested this code. Also it could be slow.

compiler_tokens = [b'_Generic', b'_Atomic', b'_Static_assert'] # etc.
accept_tok = db.defs.exists(tok) or (family == 'C' and tok in compiler_tokens)
i = idx*idx_key_mod + line_num
if accept_tok and defs_idxes.get(i) != tok and (family != 'M' or tok.startswith(b'CONFIG_')):
    # add reference
    pass

Some notes about this:

  • We might encounter issues because family == 'C' matches all those filetypes: ['.c', '.cc', '.cpp', '.c++', '.cxx', '.h', '.s'].
  • We got lucky with some compiler-provided symbols. See _Static_assert that is indexed currently because some random files define it if it does not exist.
  • Where can we find a list of compiler provided symbols? I'm seeing a list in the C11 draft §6.4.1. Does GCC/Clang provide additional stuff that the kernel use?

What do you think about this solution?

@fstachura
Copy link
Collaborator

fstachura commented Jun 10, 2024

We might encounter issues because family == 'C' matches all those filetypes: ['.c', '.cc', '.cpp', '.c++', '.cxx', '.h', '.s'].

Do you mean that there could be a problem with assembler files?

Where can we find a list of compiler provided symbols? I'm seeing a list in the C11 draft §6.4.1. Does GCC/Clang provide additional stuff that the kernel use?

I heard that Linux heavily relies on GCC extensions, and while I don't have much experience with the codebase, I would guess that those are never defined in a way that would be picked up by ctags.
Maybe this could help, although it's not a simple list we can just copy from: https://gcc.gnu.org/onlinedocs/gcc/C-Extensions.html

See for example __builtin_unreachable:
https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005funreachable
https://elixir.bootlin.com/linux/latest/source/arch/arm64/kvm/hyp/nvhe/gen-hyprel.c#L166
But https://elixir.bootlin.com/linux/latest/A/ident/__builtin_unreachable

What do you think about this solution?

Your solution definitely makes sense. Are you planning to open a PR? Or should I do it?

@tleb
Copy link
Member

tleb commented Jun 11, 2024

Do you mean that there could be a problem with assembler files?

Yes, sorry, I forget giving precise thoughts. There is an edge-case with assembly files but I don't see how it could be an issue. Something else: we also must not forget to handle C++ properly, which I guess has many compiler-provided identifiers.

I heard that Linux heavily relies on GCC extensions, and while I don't have much experience with the codebase, I would guess that those are never defined in a way that would be picked up by ctags.

Indeed ctags does not output declarations for all compiler built-ins. This is the root of the issue. We need to know of all compiler built-ins to properly filter our tokenizer output.

I went looking into GCC and I found a reserved words list (eg contains _Generic) and a builtins list (eg contains __builtin_unreachable). That is a good start. Two questions:

  • Is that all for C? Maybe I missed stuff from GCC, does it include types like __int128? Does LLVM have additional reserved words/builtins?
  • How does it work for other languages supported by Elixir?

Your solution definitely makes sense. Are you planning to open a PR? Or should I do it?

I cannot allocate much time to this. Can you continue the investigation and send a PR? Thanks!

@andy-shev
Copy link
Author

Thanks, folks, for this investigation! Hopefully we will get something out sooner than later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants