Backtrack: Zip

I think it's a good approach (and a good tool for building higher-level zip dialects).

(Maybe it could be called ZIPPER or something like that, to speak to its generality, and save ZIP for higher-level tools that could be very situationally specific to what people are doing in a particular project? It doesn't matter much, as everyone can redefine everything.)

Wherever this goes, I'd put in a plug for it supporting the creation of a URI! scheme for generalized reading (e.g. extracting a single file), which I've mentioned before:

data: read zip://wherever/something.zip/folder/file.dat

But that would be built on top of this.
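To sketch the idea in Python (the `zip://` scheme and the split-at-`.zip/` rule here are hypothetical conventions I'm making up for illustration, not any existing API): treat everything up to the first `.zip/` as the archive path, and the remainder as the member to extract:

```python
import zipfile

def read_zip_uri(uri):
    # Hypothetical zip:// reader: the part before the first ".zip/" names
    # the archive; the rest names the single member to extract.
    assert uri.startswith("zip://")
    archive, sep, member = uri[len("zip://"):].partition(".zip/")
    assert sep, "expected an archive path ending in .zip/"
    with zipfile.ZipFile(archive + ".zip") as zf:
        return zf.read(member)

# Build a throwaway archive, then read one member through the URI form.
with zipfile.ZipFile("something.zip", "w") as zf:
    zf.writestr("folder/file.dat", b"payload")

data = read_zip_uri("zip://something.zip/folder/file.dat")
```

A real scheme would also need rules for hosts and remote archives; this sketch only handles local paths.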

I've been skeptical of the arity-2 UNZIP, because I feel the "unzip and be done" form should take a file/url/binary and dump it into the "current directory". (But the current directory should be able to be an in-memory virtual filesystem...)
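A minimal sketch of that "unzip and be done into a virtual filesystem" idea, using a Python dict as the stand-in for an in-memory current directory (the function name is made up):

```python
import io
import zipfile

def unzip_into(archive, vfs=None):
    # Arity-1-style UNZIP: dump every entry into vfs, a dict standing in
    # for a "current directory" that happens to live in memory.
    vfs = {} if vfs is None else vfs
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            vfs[name] = zf.read(name)
    return vfs

# Make an in-memory archive and "extract" it without touching disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"A")
    zf.writestr("sub/b.txt", b"B")

fs = unzip_into(buf)
```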

archive: zip/load %archive.zip

My first thought was "hey, that's kind of like a generator or yielder" (note: those aren't merged into mainline builds yet, though the stackless abilities they depend on are)

So I thought "maybe it could be an 'unzipper'"

archive: unzipper %archive.zip  ; hmmm

But that makes ARCHIVE a function, which is hard to name well. Your approach is probably better: have the archive separately represent the state, and then call operations on it.

That said, a generator/yielder might be useful in the internal implementation. It may be easier to write ZIP/STEP if your enumeration can stay in the middle of whatever loop it's in and YIELD the data at each step, then communicate with the stepper.
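In Python terms (just an analogy, not the Ren-C implementation), the "unzipper" is a generator that pauses mid-loop and hands back one entry per step:

```python
import io
import zipfile

def unzipper(archive):
    # The generator stays inside its enumeration loop; each yield is one
    # "step", which is the shape a YIELD-based ZIP/STEP could take.
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            yield info.filename, zf.read(info)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("one.txt", b"1")
    zf.writestr("two.txt", b"2")

steps = unzipper(buf)
name, data = next(steps)  # one step: ("one.txt", b"1")
rest = dict(steps)        # remaining steps: {"two.txt": b"2"}
```

The nice property is that the loop's own state (position in the central directory, open decompression context) persists between steps without any explicit state object.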

write %thing.zip zip/build

So zips can not only be read as streams, but also written as streams. There's something interesting about that, involving a bug fixed in Ren-C's unzip:

; NOTE: The original rebzip.r did decompression based on the local file
; header records in the zip file.  But due to streaming compression
; these can be incomplete and have zeros for the data sizes.  The only
; reliable source of sizes comes from the central file directory at
; the end of the archive.  That might seem redundant to those not aware
; of the streaming ZIP debacle, because a non-streaming zip can be
; decompressed without it...but streaming files definitely exist!

I guess since streamed zip writing came later in the lifetime of the format, there were a lot of files that had all the compressed sizes up front. Enough so that some decompressors (at least unzip.reb) presumed the writer had been able to seek back and patch the size bytes at the earlier point in the file.
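Python's `zipfile` can demonstrate the situation that note describes: write to a non-seekable stream, and (as I understand its documented streaming behavior) the local file headers get zeros for the CRC and sizes, with general-purpose flag bit 3 set to say "the real values are in a trailing data descriptor (and the central directory)":

```python
import io
import struct
import zipfile

class NonSeekable(io.RawIOBase):
    # Wraps a BytesIO but refuses to seek, forcing zipfile into the
    # streaming mode that can't go back and patch the local headers.
    def __init__(self):
        self.buf = io.BytesIO()
    def writable(self):
        return True
    def write(self, b):
        return self.buf.write(b)
    def seekable(self):
        return False

stream = NonSeekable()
with zipfile.ZipFile(stream, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("folder/file.dat", b"hello " * 1000)
data = stream.buf.getvalue()

# Local file header layout: PK\x03\x04, version(2), flags(2), method(2),
# mod time/date(4), then CRC(4), compressed size(4), uncompressed size(4).
assert data[:4] == b"PK\x03\x04"
flags, = struct.unpack("<H", data[6:8])
crc, csize, usize = struct.unpack("<III", data[14:26])
assert flags & 0x08               # bit 3: sizes follow in a data descriptor
assert csize == 0 and usize == 0  # local header has zeros, as the note says

# The central directory at the end has the real sizes, and that's what
# a reader should trust.
info = zipfile.ZipFile(io.BytesIO(data)).infolist()[0]
assert info.file_size == 6000
```

Which is exactly why decompressing from the local file headers alone is unreliable once streamed archives exist.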

Anyway, the point is that streaming zip writing is a thing. Since you're looking at more granular ways of not putting everything in memory at once, that might be relevant to consider... in terms of how ZIP/BUILD might be able to emit information to a port as it goes.

Of course that's all in "we don't quite know how to do this" territory... but stackless and green threads are things I believe will play a big part in the streaming puzzle.