Blobby: Files and Metadata
Overview
Blobby is responsible for all object (blob) storage based around the File
entity. There are many types of files that users need to be able to upload in an
authorized manner and receive in both an authorized and low-latency manner.
Such files include:
- Profile photos for users and organizations
- Cover photos for spaces
- Attached PDFs, images and other arbitrary files for messages and blocks
- Attached audio/video file for each audio/video message
- Audio/video recordings of rooms
Different use cases require various features:
- CDN caching
- Signed URLs
- Video streaming (both RTMP up and HLS down)
- Image resizing
Blobby relies on an S3 bucket (user_content) and the SaaS video streaming tool
Mux to store files.
Blobby provides signed S3 POST URLs for uploading files, signed Mux POST
URLs and RTMP stream keys for uploading videos/live streams, signed Mux JWTs
for streaming videos, and CloudFront URLs with signed JWTs designed to be read
by the file-proxy-viewer function, described later.
Blobby does not directly interact with the raw_recordings S3 bucket that
LiveKit writes recordings to (Stagehand does),
but is responsible for creating a file in its database and in Mux based on a
pre-signed S3 URL when prompted by some service.
Blobby must enforce file size maximums, per-user and per-organization upload rate limits, and content policies. Some of these may depend on calls to Wallstreet to check a specific user's or organization's tier / feature access.
Blobby is not responsible for 'AI' metadata for files, just basic metadata like
file format, size, userId of uploader (when known), and date created. Asimov
generates chapters, summaries, and transcripts for audio and video files based
on its own logic and stores them in its own database. The Friend service understands
that the transcript field on a File type is resolved from Asimov using the
Blobby fileId as the key.
API
Authorization
Crucially, Blobby does not know or care what application entity owned by another
service has 'attached' a file to a record. Blobby cannot authorize whether a
user should be able to read a file, because file access is authorized based on
application context, which is represented in various entities across services,
such as the MessageAttachedFile entity in Messenger and the profilePhotoId
field on the User entity in Facebox. Therefore, there are two key aspects of
Blobby's authorization model:
- Blobby rate limits file upload speed / quantity / size based on the provided userId, but other than that allows infinite file uploads (the creation of direct upload URLs or actual file creation via API request). It has no concept of end-user authentication or authorization to upload. As far as it is concerned, file uploads are permitted by default unless the provided userId or organizationId has specifically reached some limit.
- If a JWT to read a file is requested, it is provided. Blobby does not ask for the userId that is requesting the file, because this is not meaningful to Blobby. It knows which userId uploaded a file, so it could theoretically determine that a user can access their own files, but that's not very useful. So, it has no concept of file read authorization.
Therefore, while API services can technically directly call the Blobby
GetFileReadCredential method, they should not, because neither the API
service nor Blobby understands file context. This method provides data
necessary to read a file for a given file ID, be that a Mux playback ID and JWT
or just a File Proxy JWT.
Uploading Files
- A frontend client calls CreateFile, a simple Friend RPC that wraps Blobby.
- Friend calls Blobby and a File is inserted into the Blobby database, assuming the user has not exceeded their file upload quota (which is different from simply the number of recent File rows created, as it only counts actual files that were uploaded). Blobby returns a pre-signed upload URL along with the FileId. The pre-signed upload URL is either for S3 or Mux depending on whether the content is video/audio or anything else.
- The frontend client uploads the file via a POST request (or multiple requests for multi-part video uploads to Mux).
- S3 pushes the notification of the new file via SQS to Blobby, which prompts Blobby to update its database.
- Now that the frontend client has uploaded the file, it can call some other RPC like AttachUserProfilePhoto and pass the FileId. As a crucial practice, the service that exposes such an RPC must only allow users to Attach files they created.
- A service can make a gRPC request to Blobby to verify that the file exists, based on the FileId. Blobby will provide a response which includes whether the file is verified to have been uploaded. This might not have happened yet, because Blobby has to wait for S3 to push an event to SQS, which Blobby consumes. Once this other service uses Blobby's GetFileInfo method to verify that the FileId is valid (and crucially also verifies that it was uploaded by the same UserId who is requesting to attach it), the service that is attaching the file can simply save the FileId in its database, therefore allowing future queries to Blobby for that file.
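
The attach-time check described in the last two steps can be sketched as follows. This is a minimal illustration, not Blobby's actual API: the FileInfo shape, its field names, and the AttachError type are all hypothetical stand-ins for whatever GetFileInfo really returns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical shape of a GetFileInfo response; field names are assumed."""
    file_id: str
    uploader_user_id: str
    uploaded: bool  # True once Blobby has consumed the S3 event from SQS


class AttachError(Exception):
    """Raised when a file must not be attached (hypothetical error type)."""


def check_attachable(info: Optional[FileInfo], requesting_user_id: str) -> None:
    """Raise AttachError unless the requesting user may attach this file."""
    if info is None:
        raise AttachError("unknown FileId")
    if info.uploader_user_id != requesting_user_id:
        # Users may only attach files they created.
        raise AttachError("file was uploaded by a different user")
    if not info.uploaded:
        # The S3 -> SQS -> Blobby notification may not have arrived yet.
        raise AttachError("upload not yet verified; retry later")
```

A service such as Facebox would run a check like this before saving the FileId on its own entity. Whether an unverified upload should be rejected or retried is a choice for the calling service.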
Recording Processing
While Blobby is not responsible for interfacing with LiveKit, Blobby does
process the recordings from the raw_recordings S3 bucket that LiveKit writes
recordings to. The challenge is that something needs to prompt Blobby to process
those raw recordings (by uploading them to Mux and creating a new File in the
Blobby database). Stagehand is triggered by an AWS SNS/SQS notification that a
new file has been added to S3, and then determines the nature of this recording
based on its filename. Stagehand does not actually touch the raw recordings S3
bucket, Blobby does that.
Blobby is later called using its gRPC UploadAvFileFromUrl method by Blockhead,
Messenger, or some other service that has recognized its responsibility for some
Stagehand NATS message and is choosing to pass the raw_recordings S3 object
URL to Blobby for synchronous creation of a File and asynchronous creation of
a Mux asset.
Reading Files
When a service wants to provide a client with the ability to read a file, it
uses Blobby's GetFileReadCredential method. This is not designed to be exposed
via an external API, because it has no authorization capability. The following
two examples explain the lifecycle of reading a file:
Example A: User Profile Photo
Facebox is returning a User to Friend, which will return that user to the
client.
- Facebox uses GetFileReadCredential, passing in the FileId for the user's profile picture.
- Blobby does not need to run a database query. It simply uses the provided FileId, which is in the format UUID.fileType, to determine that the requested file is a JPG, which means that it needs to return a URL with a JWT signed with its File Proxy signing secret, assuming that there is a JPG in the root of the user_content S3 bucket named UUID.jpg.
- Facebox returns both the user_profile_photo_id it already had stored and the new user_profile_photo_url it has received to Friend.
- Friend returns its own User type to the client. The client simply GETs files.pivotusercontent.com/UUID.jpg?s={JWT string}&width=200&height=200&format=webp, which hits our CloudFront distribution.
- The first of two Lambda@Edge functions that we consider part of the 'File Proxy' receives this request and validates that the JWT signature is valid, that it is unexpired, and that it represents the same file ID that has been requested in the URL path.
- Because the request is valid but uncached, the request is passed to the second File Proxy function, which notes that the request is for an image and that resizing params are provided, and therefore fetches the original from the private S3 bucket, resizes it, and returns it back through CloudFront to the client with cache headers set.
The result? The user profile photo was only accessible to a client that was authenticated by Friend, authorized by Facebox, and given a signed JWT by Blobby. The photo was provided back to the client in the exact dimensions and format preferred by the client, and because that variant of the file had not been requested before, it was resized on the fly and cached.
Example B: Room Recording
Messenger is returning a RoomRecording to Friend, which will return that to
the client.
- Messenger uses GetFileReadCredential, passing in the FileId of the recording.
- Blobby uses the provided FileId, which is in the format UUID.fileType, to determine that the requested file is an MP4, which means that it needs to return a JWT signed with its Mux signing secret as well as a Mux playbackId.
- Blobby queries its Keyspaces database to get the mux_playback_id, generates the JWT, and sends both back to Messenger as a Mux URL.
- Messenger returns the FileId it already had stored as well as the new URL to Friend.
- Friend returns its own RoomRecording type to the client, which now determines that it needs to load an HLS player and retrieve the video from Mux. The client need not precisely understand the structure of the URL.
Note that an API service is not itself making a request to Blobby to read the file; the service that actually understands the context (Messenger, Blockhead, Facebox, etc.) is, so that the JWT is only returned if the userId requesting the Message/Block/etc. entity actually has access to that entity. An API service should not provide a generic 'GetFile' query, because neither the API service nor Blobby has any way of knowing about the authorization status of that file in some context.
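
Both examples hinge on Blobby choosing a credential type from the FileId alone. A minimal sketch of that dispatch is below. The set of extensions routed to Mux is an assumption (the source only names MP4 explicitly and says audio and video live in Mux), and CredentialKind is a hypothetical name.

```python
from enum import Enum


class CredentialKind(Enum):
    MUX = "mux"                # Mux playback ID plus a Mux-signed JWT
    FILE_PROXY = "file_proxy"  # CloudFront URL plus a File Proxy JWT


# Assumed list of audio/video extensions; only "mp4" appears in the source.
MUX_FILE_TYPES = frozenset({"mp4", "mov", "webm", "mp3", "m4a", "wav"})


def credential_kind(file_id: str) -> CredentialKind:
    """Pick the credential type from a FileId of the form UUID.fileType."""
    uuid_part, separator, file_type = file_id.rpartition(".")
    if not separator or not uuid_part or not file_type:
        raise ValueError(f"FileId is not in UUID.fileType form: {file_id!r}")
    if file_type.lower() in MUX_FILE_TYPES:
        return CredentialKind.MUX
    return CredentialKind.FILE_PROXY
```

Only the Mux branch would then need a database query (for mux_playback_id); the File Proxy branch can be answered from the FileId and the signing secret.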
CDN and Image Resizing with File Proxy
As described in the examples above, Blobby implicitly relies on another service (two Lambda@Edge functions referred to as File Proxy) to front S3, and assumes that all it has to do is provide a URL with a signed JWT.
Therefore, other than for audio and video assets in Mux, File Proxy is an essential service, which keeps issues like caching, image resizing and JWT validation away from Blobby.
Here's how it works:
- Files are uploaded to S3. Files are named fileId.format. This allows reasoning about a file simply by its Id. So an image would be named in S3 as uuid.jpg.
- The user_content S3 bucket requires specific IAM permissions to read objects. Blobby can read (and generate pre-signed POST URLs), as can CloudFront and the file-proxy-origin Lambda function.
- The client gets a URL pointing to CloudFront back from Blobby (via some other service). The URL wasn't checked by Blobby against S3, so the path might not exist, but in theory it should. Blobby also provided a JWT, signed with a secret shared between Blobby and the file-proxy-viewer function.
- Client makes a request to files.pivotusercontent.com/123.jpg?s=123&width=100&format=webp.
- If the s param is missing, file-proxy-viewer errors. If other params are provided, but the file Id represented by the path does not end with a valid image file type suffix, file-proxy-origin errors.
- Assuming no error, File Proxy validates the signature and exp and gets the file from S3 (which may be a CloudFront cache hit and never hit file-proxy-origin or the S3 bucket).
  - If resizing params are provided and the fileId is an image filetype, file-proxy-origin uses the Sharp library to resize and return (with cache headers) the requested image.
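
The request checks in the steps above can be sketched in one function. This is an illustration only: the real File Proxy splits the checks across two Lambda@Edge functions, while this sketch folds them together. HS256, the claim names file_id and exp, the HTTP status codes, and the list of image extensions are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Mapping, Optional

IMAGE_FILE_TYPES = frozenset({"jpg", "jpeg", "png", "webp", "gif"})  # assumed
RESIZE_PARAMS = frozenset({"width", "height", "format"})


class ProxyError(Exception):
    def __init__(self, status: int, message: str) -> None:
        super().__init__(message)
        self.status = status


def _b64url_decode(text: str) -> bytes:
    """Decode base64url text whose padding was stripped."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def verify_jwt(token: str, secret: bytes) -> dict:
    """Check an HS256 signature and return the claims."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            raise ProxyError(403, "bad signature")
        claims = json.loads(_b64url_decode(payload_b64))
    except ValueError as exc:  # wrong part count, bad base64, bad JSON
        raise ProxyError(403, "malformed token") from exc
    if not isinstance(claims, dict):
        raise ProxyError(403, "malformed token")
    return claims


def validate_request(
    path: str, query: Mapping[str, str], secret: bytes, now: Optional[float] = None
) -> str:
    """Return the file Id to fetch, or raise ProxyError."""
    current = time.time() if now is None else now
    file_id = path.lstrip("/")
    token = query.get("s")
    if not token:
        raise ProxyError(400, "missing s param")
    claims = verify_jwt(token, secret)
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= current:
        raise ProxyError(403, "token expired")
    if claims.get("file_id") != file_id:
        raise ProxyError(403, "token was issued for a different file")
    wants_resize = any(name in query for name in RESIZE_PARAMS)
    file_type = file_id.rpartition(".")[2].lower()
    if wants_resize and file_type not in IMAGE_FILE_TYPES:
        raise ProxyError(400, "resizing params on a non-image file")
    return file_id
```

Keeping the signature check ahead of any S3 access means an unsigned or expired request never reaches the private bucket.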
NATS
Publication
Blobby publishes a message to blobby.change_feed.* each time an entity it owns
is created or modified. Blobby uses the subjects audio, video, image,
pdf, word, and bytes to differentiate known content types. This allows
consuming services to consume only what they need. For example, Asimov
consumes blobby.change_feed.file.audio.created and video.created, but not
the others.
Note that for Blobby, created implies that a file actually exists, not just
that a presigned POST URL was created.
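
To illustrate how content-type tokens let consumers filter, the sketch below builds a change-feed subject and matches it against NATS-style subscription patterns. The subject layout is inferred from the single example above (blobby.change_feed.file.audio.created), and the helper names are hypothetical. The matcher is a local reimplementation of NATS wildcard rules for illustration, not a client library call.

```python
CONTENT_TYPES = ("audio", "video", "image", "pdf", "word", "bytes")


def change_feed_subject(content_type: str, event: str) -> str:
    """Build a subject such as blobby.change_feed.file.audio.created."""
    if content_type not in CONTENT_TYPES:
        raise ValueError(f"unknown content type: {content_type!r}")
    return f"blobby.change_feed.file.{content_type}.{event}"


def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style match: '*' is exactly one token, '>' is one or more trailing tokens."""
    pattern_tokens = pattern.split(".")
    subject_tokens = subject.split(".")
    for index, token in enumerate(pattern_tokens):
        if token == ">":
            # '>' is only valid as the final token and needs at least one token to match.
            return index == len(pattern_tokens) - 1 and len(subject_tokens) > index
        if index >= len(subject_tokens):
            return False
        if token != "*" and token != subject_tokens[index]:
            return False
    return len(pattern_tokens) == len(subject_tokens)
```

Under these rules a consumer interested in every created file could subscribe to blobby.change_feed.file.*.created, while a consumer like Asimov would subscribe only to the audio and video subjects.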
Consumption
Blobby is responsible for Mux and for handling callbacks when files are uploaded
to the user_content bucket, so it consumes
tunnel.incoming-webhook-events.mux and the user-content S3 SQS queue.
Databases
Blobby uses Amazon Keyspaces (managed Cassandra) to store file metadata. Raw
file bytes are stored in appropriate object storage services (S3 and Mux), not
Keyspaces.
- File – Whenever Blobby initializes a file, a row is added to this table, and the primary key of this row is used for future identification of that file in Blobby RPCs and therefore in other services. Blobby is responsible for managing the underlying objects in S3 / assets in Mux, and mapping those entities to the right File row. The primary key is a UUID, so even files with different extensions have unique IDs. When Blobby interacts with external services, it always appends a file type to the UUID.
Temporal Workflows
N/A