2025-04-25 10:31:00
github.com
UIT is a library for performant, modular, low-memory file processing at scale, in the Cloud. It works by offering a 4-step process to gather a file hierarchy from any desired modality, apply filters and transformations, and output it in any desired modality.
- performance: speed is of the essence when navigating and searching through large amounts of data
- low-memory: by applying streaming and parallelization, we can run this in low-memory environments such as Cloudflare workers
- modular: making it composable gives a clear high-level overview of all building blocks. Also, not all building blocks can be run in the same runtime or location.
UIT came about after many iterations of the uithub platform, which started as a simple Node-based parser of zip files. While building more and more features and add-ons, I kept running into memory limits because I was not streaming enough and was going back to JSON too early (because using the Streams API is tricky!). As features and complexity grew, so did the need for a more modular, extensible architecture with good serverless practices in mind.
FormData has a long history [RFC 1867 (1995)] [RFC 2388 (1998)] [RFC 7578 (2015)] and is deeply embedded into the web. It offers an excellent way to serve multiple files, binary and textual, over a single request. Although FormData does not yet support stream-reading directly from Request and other Web Standards, UIT leverages the fact that intermediate results can be read with the Streams API using multipart-formdata-stream-js.
UIT cleverly modularizes filters and transformations on file hierarchies by providing an elegant way to combine multiple UIT ‘modules’ together to get to a final result. Every UIT ‘module’ can apply path filters, content filters, and content transformations, to change the files in the file hierarchy, all while streaming, and even merge multiple file hierarchies together in the blink of an eye.
So far, UIT provides the following modules that can be combined to create powerful file processing pipelines:
- uithub.ingestzip – Ingests and processes ZIP files into normalized formdata format
- uithub.merge – Combines multiple formdata streams into a single unified stream
- uithub.outputmd – Transforms and outputs data as markdown files
- uithub.outputzip – Packages processed data into downloadable ZIP archives
- uithub.search – Provides search capabilities across file hierarchies
- uithub.ziptree – Highly performant zip file-hierarchy extractor
- uithub.otp – Source proxy that generates an OTP to minimize secret exposure to other modules.
- uithub – Brings several modules together, pipes through them, and shows in authenticated HTML interface.
Each module is designed to perform a specific step in the UIT 4-step process (ingest, filter/transform, merge, output) while maintaining performance and low memory usage.
Note that each of these modules can be hosted independently as a Cloudflare worker, but the spec doesn’t require Cloudflare: you can also host UIT modules in other runtimes, as long as they comply with the UIT Protocol.
Please also note that the diagrams above showcase many modules that don’t exist yet but could be beneficial. By open-sourcing UIT, I hope to empower developers to add the modules they need.
The UIT Protocol is the convention that characterizes any UIT module. As can be seen in the diagrams above, any UIT module must be one of these 4 module types:
- ingest module – streams any datastructure into a FormData stream
- merge module – streams several FormData sources into a single FormData stream
- filter/transform module – applies filters and transformations on files in a streaming fashion while in the FormData ‘modality’.
- output module – streams a FormData stream into any desired datastructure
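The four module types above can be sketched as a lazy pipeline. This is only an illustration under loud assumptions: `FilePart`, `Step`, `mergeSources`, `filterByExtension`, `mapContent`, `toMarkdown`, and `pipeline` are hypothetical names, not part of UIT, and synchronous generators over in-memory objects stand in for real FormData streams exchanged over HTTP.

```typescript
// Hypothetical in-memory stand-in for one file in a FormData stream.
interface FilePart {
  path: string; // original pathname, e.g. "docs/readme.md"
  content: string; // text content; real parts may also carry binary data
}

// A filter/transform step consumes parts lazily and yields parts lazily.
type Step = (parts: Iterable<FilePart>) => Iterable<FilePart>;

// Merge step: emit every part of each source, one source after another.
function* mergeSources(...sources: Iterable<FilePart>[]): Iterable<FilePart> {
  for (const source of sources) {
    yield* source;
  }
}

// Path filter: keep only files whose path ends with one of the extensions.
function filterByExtension(extensions: string[]): Step {
  return function* (parts: Iterable<FilePart>): Iterable<FilePart> {
    for (const part of parts) {
      if (extensions.some((ext) => part.path.endsWith(ext))) {
        yield part;
      }
    }
  };
}

// Content transformation: rewrite each file's content, leaving the path alone.
function mapContent(transform: (content: string) => string): Step {
  return function* (parts: Iterable<FilePart>): Iterable<FilePart> {
    for (const part of parts) {
      yield { path: part.path, content: transform(part.content) };
    }
  };
}

// Output step: render the remaining parts as one markdown-like string.
function toMarkdown(parts: Iterable<FilePart>): string {
  const sections: string[] = [];
  for (const part of parts) {
    sections.push(`## ${part.path}\n\n${part.content}`);
  }
  return sections.join("\n\n");
}

// Chain steps left to right over a source.
function pipeline(source: Iterable<FilePart>, ...steps: Step[]): Iterable<FilePart> {
  return steps.reduce((parts, step) => step(parts), source);
}

// Usage: two ingested hierarchies, merged, filtered, transformed, and output.
const repoA: FilePart[] = [
  { path: "a/readme.md", content: "hello" },
  { path: "a/index.ts", content: "export {}" },
];
const repoB: FilePart[] = [{ path: "b/notes.md", content: "world" }];

const markdown = toMarkdown(
  pipeline(
    mergeSources(repoA, repoB),
    filterByExtension([".md"]),
    mapContent((c) => c.toUpperCase()),
  ),
);
```

Because every step pulls one part at a time, no step needs to hold the whole hierarchy in memory, which is the property the streaming design is after.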
The only formalized convention you need to understand to create a UIT module is which FormData headers UIT modules work with. These FormData headers can be divided into standard and non-standard (custom) headers:
Header | Description | Required |
---|---|---|
Content-Disposition | Contains name (should equal filename) and filename (original pathname) | Yes |
Content-Type | Specifies the MIME type of the data | No |
Content-Length | Indicates the uncompressed size of the data | No |
Content-Transfer-Encoding | Specifies how the data is encoded: binary (required for binary files), 8bit (recommended for text-based/UTF-8 files), quoted-printable, base64, or 7bit (default) | No |
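As a rough illustration of the standard headers, the sketch below builds the header lines for a single part. `PartInfo`, `escapeQuoted`, and `buildPartHeaderLines` are hypothetical helpers, not UIT functions. Choosing `8bit` for every non-binary file and percent-escaping quotes and line breaks in the path are assumptions made for this example.

```typescript
// Hypothetical description of one file to be written as a multipart part.
interface PartInfo {
  path: string; // original pathname, used for both name and filename
  contentType?: string; // optional MIME type
  size?: number; // optional uncompressed size
  binary: boolean; // true for binary files, false for text
}

// Assumption: percent-escape characters that would break a quoted header value.
function escapeQuoted(value: string): string {
  return value.replace(/\r/g, "%0D").replace(/\n/g, "%0A").replace(/"/g, "%22");
}

// Build the header lines for one part, following the table above.
function buildPartHeaderLines(info: PartInfo): string[] {
  const path = escapeQuoted(info.path);
  const lines: string[] = [
    // Required: name should equal filename, and filename is the original pathname.
    `Content-Disposition: form-data; name="${path}"; filename="${path}"`,
  ];
  if (info.contentType !== undefined) {
    lines.push(`Content-Type: ${info.contentType}`);
  }
  if (info.size !== undefined) {
    lines.push(`Content-Length: ${info.size}`);
  }
  // binary is required for binary files; 8bit is recommended for text files.
  lines.push(`Content-Transfer-Encoding: ${info.binary ? "binary" : "8bit"}`);
  return lines;
}

// Usage: headers for a small text file.
const textHeaders = buildPartHeaderLines({
  path: "src/index.ts",
  contentType: "text/plain",
  size: 42,
  binary: false,
});
```

In a serialized multipart body these lines would be joined with CRLF and followed by an empty line and the file content.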
Header | Description | Format |
---|---|---|
x-url | Specifies the URL that locates the binary file. In some cases it may be desired to omit the binary data and only leave the URL to locate the file. | URL string |
x-file-hash | Stores the hash of the file | Hash string |
x-error | Indicates a processing error in the pipeline. When a module fails on a file, it should preserve the original incoming file content. Downstream modules should not filter or process a file that carries this header, so errors stay visible for every individual file, along with where they happened and what the file input was. | {handler-id};{status};{message} |
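The `x-error` format `{handler-id};{status};{message}` can be read and written with a few lines of code. The sketch below is hypothetical: `UitError`, `formatXError`, and `parseXError` are illustrative names, and it assumes that the status is a non-negative integer, that the handler id contains no semicolon, and that everything after the second semicolon belongs to the message.

```typescript
// Hypothetical structured form of an x-error header value.
interface UitError {
  handlerId: string; // module that reported the error
  status: number; // assumption: a non-negative integer status code
  message: string; // free text; may itself contain semicolons
}

// Serialize to the {handler-id};{status};{message} format.
function formatXError(error: UitError): string {
  return `${error.handlerId};${error.status};${error.message}`;
}

// Parse an x-error value; returns null when it does not match the format.
function parseXError(value: string): UitError | null {
  const first = value.indexOf(";");
  const second = first === -1 ? -1 : value.indexOf(";", first + 1);
  if (second === -1) {
    return null; // fewer than two separators
  }
  const statusText = value.slice(first + 1, second);
  if (!/^\d+$/.test(statusText)) {
    return null; // status is not a plain integer
  }
  return {
    handlerId: value.slice(0, first),
    status: Number(statusText),
    // Only the first two semicolons are separators; the rest is message text.
    message: value.slice(second + 1),
  };
}

// Usage: a downstream module can skip parts that already carry an error.
const parsedError = parseXError("uithub.search;500;Timeout; retry later");
```

Splitting on only the first two semicolons keeps messages that contain semicolons intact, at the cost of forbidding them in the handler id.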
UIT aims to be a convention for streaming, filtering, and transforming binary and textual file hierarchies in the Cloud, and maintains a curated list of first-party and third-party libraries that can be included into any UIT data-transformation flow.
As a first step I aim to create a plugin system that allows doing file filters and transformations with ease from the uithub UI. For intended plugins, check out plugins.json and the spec.
The multipart parser is designed to handle all FormData headers, including any non-standard ones, and can be a useful library for creating FormData filters and transformers. It extracts them from the raw header lines and makes them available in the Part object. The library also maintains the original headerLines as part of the parsed data structure.
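To give a feel for what extracting headers from raw header lines involves, here is a minimal sketch. It is not the library's implementation or API: `parseHeaderLines` and `getFilename` are hypothetical helpers, and lowercasing header names and ignoring lines without a colon are assumptions made for this example.

```typescript
// Turn raw "Name: value" lines into a lookup keyed by lowercased header name.
function parseHeaderLines(headerLines: string[]): Map<string, string> {
  const headers = new Map<string, string>();
  for (const line of headerLines) {
    const colon = line.indexOf(":");
    if (colon <= 0) {
      continue; // assumption: silently skip lines that are not headers
    }
    const name = line.slice(0, colon).trim().toLowerCase();
    // Split on the first colon only, so values such as URLs stay intact.
    const value = line.slice(colon + 1).trim();
    headers.set(name, value);
  }
  return headers;
}

// Read the original pathname from the Content-Disposition header, if present.
function getFilename(headers: Map<string, string>): string | undefined {
  const disposition = headers.get("content-disposition");
  if (disposition === undefined) {
    return undefined;
  }
  const match = /filename="([^"]*)"/.exec(disposition);
  return match ? match[1] : undefined;
}

// Usage: standard and custom headers end up in the same lookup.
const parsedHeaders = parseHeaderLines([
  'Content-Disposition: form-data; name="a.txt"; filename="a.txt"',
  "X-File-Hash: abc123",
  "x-url: https://example.com/a.txt",
  "not a header line",
]);
const originalPath = getFilename(parsedHeaders);
```

Keeping the raw lines alongside a parsed lookup, as the library does with headerLines, lets a filter pass unknown headers through unchanged.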
Please open a discussion, issue, pull request, or reach out if you want a new module to be added to this list or have any unmet requirements. To create your own plugin, follow the GETTING-STARTED.md and CONTRIBUTING.md. UIT is also looking for sponsors.
Important
MIT will be added after official launch
UIT is licensed under the MIT License. While the license only requires preservation of copyright notices, we kindly request attribution when using this project. See ATTRIBUTION.md for guidelines on how to provide attribution.
~ Being made with ❤️ by janwilmake