Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OF-2954: New feature: Spam Reporting #2661

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

guusdk
Copy link
Member

@guusdk guusdk commented Jan 14, 2025

This commit provides a basic implementation of XEP-0377: Spam Reporting

The changes include:

  • persistent storage of spam reports
  • a provider implementation to allow other spam reporting persistence providers
  • an event listening mechanism
  • an optional notification of admins

@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from abbd110 to d4f9b04 Compare January 14, 2025 22:18
@guusdk
Copy link
Member Author

guusdk commented Jan 14, 2025

This isn't quite ready yet, but seems to be functional.

Pending work:

  • database install script modification for all databases except HSQLDB
  • database update scripts for all databases
  • make the provider implementation non-static (it hard-codes to the 'default' provider now)
  • forward to originating server (note: warn about end-user consent to do this)
  • forward to xmppbl (note: warn about end-user consent to do this)
  • add support for including the original, offending stanza (eg: urn:xmpp:forward:0)

This commit provides a basic implementation of XEP-0377: Spam Reporting

The changes include:
- persistent storage of spam reports
- a provider implementation to allow other spam reporting persistence providers
- an event listening mechanism
- an optional notification of admins
@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from d4f9b04 to 9b82d8e Compare January 15, 2025 09:28
@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from 05865cc to b6e74e7 Compare January 15, 2025 10:22
@guusdk guusdk force-pushed the OF-2954_Spam-Reporting branch from b6e74e7 to 83fb015 Compare January 15, 2025 10:22
The XML element representation that's in a SpamReport should be a copy, and detached. This allows it to be used elsewhere, without one consumer's modifications affected another's.
When admins are notified of a new spam report, the report itself should be included. Supporting clients can then render it.
reporter VARCHAR(1024) NOT NULL,
reported VARCHAR(1024) NOT NULL,
reason VARCHAR(255) NOT NULL,
created BIGINT NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we may use timestamp

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea, but not for now: We've had trouble before finding a consistent way to represent a timestamp in all of the databases that we support, which is why we use a number instead. I'm not sure if this is still as impossible as it was in 2004, by the way, but I'd like Openfire to be consistent.

If we do change number for timestamp (which would be a good thing), we should do it for all columns that currently use a number.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reporter VARCHAR(1024) NOT NULL,
reported VARCHAR(1024) NOT NULL,
reason VARCHAR(255) NOT NULL,
created BIGINT NOT NULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the db2 has the TIMESTAMP type. In other places of the file used char for date, which it very strange

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(as above: we currently never use a TIMESTAMP field, which is certainly something I'd like to improve on. If we do improve on that, we should do it consistently, everywhere).

reported NVARCHAR(1024) NOT NULL,
reason NVARCHAR(255) NOT NULL,
created INTEGER NOT NULL,
"raw" LONG VARCHAR NOT NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe rename the raw field?
It contains the "XML representation of the report" but maybe we just need only the reported message so that we can easily parse the spam and train spamd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That column is not great, indeed. I'm in the middle of a refactoring of the database structure. I think this column will be dropped (and instead we'll have a table that can be JOINed with, to have zero-to-many reported stanzas. More on this later!

@@ -130,13 +133,18 @@ else if ( iq.getType().equals( IQ.Type.set ) && "block".equals( iq.getChildEleme
}

final List<JID> toBlocks = new ArrayList<>();
final Set<SpamReport> reports = new HashSet<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the SpamReport doesn't have equals() so the Set won't check for duplicates. Maybe we can use a List here?

In general the HashSet eats a lot of memory and if it's possible to avoid it and use the ArrayList then it will reduce pressure on GC.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meh. Resizing ArrayLists when they are populated can be costly too. All such arguments border on premature optimization, which I'd like to avoid.

Instead, I prefer to look at the semantics:

  • usage of Set, unlike List, suggests that the entities are all different.
  • usage of List, unlike Set, suggests that entities are ordered.

In that sense, Set is a better fit for this use-case than List.

public synchronized Set<Text> getContext()
{
if (context == null) {
context = new HashSet<>(Text.allFromChildren(reportElement));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we have an allocation and copy, two heavy operations


public static List<Text> allFromChildren(final Element parentElement)
{
final List<Text> result = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can set an initial capacity to texts.size()

listeners.forEach(listener -> {
try {
listener.receivedSpamReport(report);
} catch (Throwable t) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe catch an Exception and not the Throwable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in learning the rationale for that. I've always been going back and forth between the two without a clear reason. I sometimes wish to catch Error instances, I think - but maybe that's not always appropriate.

notification.setBody(body);
notification.getElement().add(reportElement);

XMPPServer.getInstance().getAdmins().forEach(jid -> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor, style) you may use a for each loop that is easier to understand and faster to compile and analyze for linters

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style is in the eye of the beholder. :) The linter in Intellij is perfectly able to reason about this.

return null;
final List<StanzaID> sids = StanzaID.allFromChildren(packet.getElement());
return sids.stream()
.filter(stanzaID -> stanzaID.getBy().toString().equals(by))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the filter will be executed for all elements so this will work slower than previously.

@guusdk
Copy link
Member Author

guusdk commented Jan 15, 2025

Thanks for your feedback, @stokito! I've commented on most of them. Most of the others for me fall in similar categories of what I wrote elsewhere. This PR is still very much a work in progress, so do expect changes!

The data structure defined in XEP-0377 is only usable in context of IQ Blocking. It's not really usable for more generic incident reporting.

The refactoring moves the database structure towards a more re-usable structure:
- it no longer contains the XEP-0377 element as raw XML
- it references 0 to many stanzas in a different table
- the stanzas are minimally represented by a StanzaID, but may include the full XMPP stanza.
This adds a simple page to the admin console on which spam reports can be viewed.

The page is heavily based on the audit log viewer.
@guusdk
Copy link
Member Author

guusdk commented Jan 15, 2025

I have now applied a major refactoring.

@guusdk
Copy link
Member Author

guusdk commented Jan 15, 2025

Another thing that I can't get working yet is for Conversations (the client that I'm using to report spam) to include a stanza-ID that identifies a spam message.

This should be done by long-pressing a message from a stranger, as implemented here: https://codeberg.org/iNPUTmice/Conversations/src/branch/master/src/main/java/eu/siacs/conversations/ui/ConversationFragment.java#L1323-L1328

That fails for me. My theory is that Openfire doesn't include a stanza-ID to the messages. Does this need (changing in) the monitoring plugin?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants