About MLSanitizer
Some information about this application.
MLSanitizer is a command-line program that strips most, if not all, of ManualsLib's watermarks from PDF files. For each file:
- It looks for %PDF-1.4 on the first line to confirm the file's format and version. ManualsLib-watermarked documents have always been PDF 1.4.
- It reads much of the file to detect omittable watermarks with crude heuristics.
- If watermarks are detected, it writes the file to the output while it:
  - omits said watermarks,
  - detects unomittable watermarks and edits them on the fly, and
  - calculates the file's cross-reference offset (startxref).
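The header check and the write pass described above can be sketched roughly as follows. This is a minimal illustration in Python, not MLSanitizer's actual code: the function names, the exact-match header comparison, and the `is_watermark` predicate are all assumptions made for the example, and the real detection heuristics are not shown. The sketch also covers only the startxref value, not any other offsets in the file.

```python
import io

PDF_HEADER = b"%PDF-1.4"


def has_pdf14_header(stream):
    """Return True if the first line of a binary stream is %PDF-1.4.

    Assumption: an exact match after stripping the line ending.
    """
    return stream.readline().rstrip(b"\r\n") == PDF_HEADER


def copy_and_locate_xref(lines, out, is_watermark):
    """Copy lines to `out`, skipping watermark lines, and return the byte
    offset in the output at which the `xref` keyword line starts.

    `is_watermark` is a hypothetical predicate standing in for the
    tool's heuristics. Returns None if no `xref` line was seen.
    """
    written = 0
    xref_offset = None
    for line in lines:
        if is_watermark(line):
            continue  # omitted lines do not advance the output offset
        if line.rstrip(b"\r\n") == b"xref":
            xref_offset = written
        written += out.write(line)
    return xref_offset


# Usage with toy data (not a valid PDF):
source = [b"%PDF-1.4\n", b"WATERMARK\n", b"xref\n", b"0 1\n"]
output = io.BytesIO()
offset = copy_and_locate_xref(
    source, output, lambda line: line.startswith(b"WATERMARK")
)
```

Because the omitted line never reaches the output, the recorded offset reflects the sanitized file rather than the input, which is why the offset has to be recalculated while writing.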
Each output file name is the input file name with -mlsanitized appended before its extension.
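As an illustration of the naming rule, here is a small Python sketch. The function name `output_name` is hypothetical, and how MLSanitizer treats names with several dots or with no extension is an assumption here (the sketch simply follows `os.path.splitext`).

```python
import os.path

SUFFIX = "-mlsanitized"


def output_name(input_name):
    """Insert the suffix between the file name's root and its extension."""
    root, ext = os.path.splitext(input_name)
    return root + SUFFIX + ext


print(output_name("manual.pdf"))  # manual-mlsanitized.pdf
```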
Because it does not parse PDF documents into memory, it reads a watermarked file twice: once for detecting and once for writing. As such, file system caching is recommended.
Its operation removes all annotations in a document. It does not restore a document to its original condition, nor does it add anything to it.
Compatibility
Sanitized PDFs are compatible with pretty much all readers.
PDF documents written by MLSanitizer have been found to work with the following readers, and possibly with others that use the same libraries, listed alphabetically:
- [Chromium], which uses [PDFium] and is the base for Google Chrome.
- [Evince], which uses freedesktop.org's [Poppler].
- [Firefox], which uses Mozilla's own [PDF.js].
- [PDFBox®], a Java PDF library by Apache, also available as a standalone application for inspecting PDFs.
- [SumatraPDF], which uses Artifex's [MuPDF], also used in their own readers.
Issues
It may crash from excessive memory use.
It reads a file line-by-line into a buffer, using the line feed character (0xA) as a delimiter. While the initial capacity of 640 KiB should suffice in most cases, PDFs with large, 0xA-less streams can still exceed that capacity and cause the buffer to grow by reallocation. In the extreme case, it may crash due to excessive memory use.
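To gauge whether a given file might trigger this, one could measure its longest 0xA-delimited line beforehand. The Python sketch below is a hypothetical helper, not part of MLSanitizer; it scans in fixed-size chunks so that the check itself does not need to hold a long line in memory. The 640 KiB constant is the initial capacity stated above; the function name and chunk size are assumptions.

```python
import io

INITIAL_CAPACITY = 640 * 1024  # 640 KiB, as stated above


def longest_line(stream, chunk_size=64 * 1024):
    """Return the length in bytes of the longest line in a binary stream,
    counting the trailing 0xA delimiter when present."""
    longest = 0
    current = 0  # bytes seen since the last 0xA
    while chunk := stream.read(chunk_size):
        start = 0
        while (pos := chunk.find(b"\n", start)) != -1:
            current += pos - start + 1
            longest = max(longest, current)
            current = 0
            start = pos + 1
        current += len(chunk) - start
    return max(longest, current)


# Usage: a stream with no 0xA at all counts as one long line.
sample = io.BytesIO(b"x" * (INITIAL_CAPACITY + 1))
needs_growth = longest_line(sample) > INITIAL_CAPACITY
```

A result above `INITIAL_CAPACITY` suggests the line buffer would have to be reallocated for that file; how far it can grow before a crash depends on the available memory.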