Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store screenshots as attachments in NDB. #4473

Merged
merged 2 commits into from
Oct 24, 2024

Conversation

jrobbins
Copy link
Collaborator

This should resolve #4459.

This is a simpler approach for storing screenshots. Rather than add a dependency on using GCS, we can store them in NDB, which we already use extensively. I am expecting performance to be acceptable because only thumbnails of the images will be loaded into the feature detail page.

The security concern around serving images from our app (rather than directly from GCS) will be mitigated in an upcoming PR that deals with serving issues. Basically, the image content will be served from a different origin than where we store user session cookies.

In this PR:

  • Implement attachments.py with an NDB model and functions for working with attachments.

@jcscottiii
Copy link
Collaborator

jcscottiii commented Oct 22, 2024

Thanks for putting this PR together!

The security concern around serving images from our app (rather than directly from GCS) will be mitigated in an upcoming PR that deals with serving issues. Basically, the image content will be served from a different origin than where we store user session cookies.

Could you describe the security problem again? I know you mentioned the separate domain part to me before. But will this separate domain be a separate deployment that needs to happen now? If so, will that second deployment restrict the usage of the other routes to reduce the load on that deployment? (Admittedly, only a malicious actor would be poking at the other APIs on the same server)

After looking at the GCP link, I would be curious if we should look at your previous PR that used GCS instead. (I do see you are storing images that are 250x200)

Also, if we continue with Datastore, we should add comments that we shouldn't increase the thumbnail size as it could have negative impacts on the reads.

# Create and save a thumbnail too.
thumb_content = None
try:
thumb_content = images.resize(content, THUMB_WIDTH, THUMB_HEIGHT)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Will this code maintain the aspect ratio of the original image if the user uploads an image with different proportions than 250x200? For example, if someone uploads a square image, will it be distorted when resized to 250x200?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope. It will be somewhat distorted. After I get the backend PRs merged, I'll start experimenting with the frontend and coordination with Yann to see how some sample images will look. My thinking at this point is that all the thumbnails should be the same size so that they look nice when there are multiple, so we might end up with some combination of scaling and cropping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few quick alternatives to consider in case image dimensions change in the future.

  1. Resize with Padding: This will add a white/transparent background around the picture but will keep the original content without distortion
  2. Resize to Fit: Pick the largest dimension and resize it down to its limit (Width=250 or Height=200)

If the images already have the same aspect ratio as 250x200, option 2 is the simplest. These approaches will make our code more robust to future changes.

@jrobbins
Copy link
Collaborator Author

Could you describe the security problem again? I know you mentioned the separate domain part to me before. But will this separate domain be a separate deployment that needs to happen now? If so, will that second deployment restrict the usage of the other routes to reduce the load on that deployment? (Admittedly, only a malicious actor would be poking at the other APIs on the same server)

I just don't want the original images that any user uploaded to be served from our app's domain (the same one that has session cookies). The reason is that the uploaded file is user-controlled data that could be somehow malicious.

The serving approach that I have in mind is simply to use one of the infinite number of other domains that already reach our app. Our session cookies are on chromestatus.com, but we can serve the images from imgXXX-dot-cr-status.appspot.com. The "imgXXX" can be anything, so I was thinking of making it different for every image. Any malicious image content that somehow runs JS would not have access to the main app domain cookies, and therefore can't do anything to our site that an anon user couldn't do. This does not require an additional GAE module or deployment because strings like imgXXX's that don't match any app version get routed to the live version of the default GAE app service.

Using a different domain for each image would add another DNS lookup for each one. That's not ideal, but I don't know if performance will matter much at all for these images. The alternative of serving directly from GCS itself involves a REST API call, a redirect, and a DNS lookup. So, I wouldn't conclude anything about the relative performance of the two approaches yet.

Also, if we continue with Datastore, we should add comments that we shouldn't increase the thumbnail size as it could have negative impacts on the reads.

Let's see how the performance goes. I haven't tested performance at all yet. If needed, we will have the ability to change the data storage layer before this launches, but I'd like to start with this simpler approach first.

@jcscottiii
Copy link
Collaborator

I just don't want the original images that any user uploaded to be served from our app's domain (the same one that has session cookies). The reason is that the uploaded file is user-controlled data that could be somehow malicious.

This makes sense.

Let's see how the performance goes. I haven't tested performance at all yet

Let me know if you need help with this.

thumb_content = None
try:
thumb_content = images.resize(content, THUMB_WIDTH, THUMB_HEIGHT)
except images.LargeImageError:
Copy link
Collaborator

@jcscottiii jcscottiii Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to handle large files at the handler level so that users can't upload large files into memory in the first place.

https://flask.palletsprojects.com/en/3.0.x/patterns/fileuploads/#improving-uploads

By default Flask will happily accept file uploads with an unlimited amount of memory, but you can limit that by setting the MAX_CONTENT_LENGTH config key

Or you can do both.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to put this comment on the check_attachment_size method that currently checks the size.

# limitations under the License.

import logging
from google.appengine.api import images
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The images API is part of the legacy bundled services. I know we use existing appengine legacy services but this will be the first time we use the images API. Instead of adding another dependency on the legacy bundled service, we should consider using one of the libraries they mention in their migrating off legacy bundle services document

To resize, convert, and manipulate images, use an image processing library such as Pillow or a Python interface for ImageMagick.

At first glance, Pillow has a similar API.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion. I'll try to follow up with a PR to switch to Pillow and use max-aspect.

Copy link
Collaborator

@jcscottiii jcscottiii left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after the considerations around the bundled services and image resizing.

Regarding the performance between GCS vs Datastore, let me know if you need help with that.

@jrobbins jrobbins merged commit da48531 into main Oct 24, 2024
7 checks passed
@jrobbins jrobbins deleted the 20241021-store-attachments-in-ndb branch October 24, 2024 19:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants