openide/docs/jar-format.puml
Vladimir Krivosheev 305fe4aabf add doc about zip with optimized metadata
GitOrigin-RevId: 743fc3f1356115308b8936d018f953fb42202000
2023-05-19 17:59:16 +00:00

@startuml
!include jb-plantuml-theme.puml
skinparam linetype ortho
top to bottom direction
header
A [[https://en.wikipedia.org/wiki/ZIP_(file_format) ZIP]] file format with optimized metadata.
endheader
component "File Entry 1" as FE1
component "File Entry N" as FE2
note right of FE1
The 'relative offset of local header' field of a central directory record points to the local file header,
not to the entry data. Because the size of the local file header can vary, locating the data takes two seeks:
one to read the header and one to reach the data that follows it.
As an optimization, you can attempt to precompute the data offset
when reading the central directory file header.
This optimization is implemented in the HashMapZipFile class.
However, ImmutableZipFile uses a special index for this purpose, as explained below.
end note
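/'
A minimal Java sketch of the two-seek lookup described in the note above, for illustration only.
The class and method names are hypothetical and are not taken from HashMapZipFile or ImmutableZipFile.
It relies only on the standard ZIP local file header layout: a 30-byte fixed part, with the
file name length at byte 26 and the extra field length at byte 28, both little-endian.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper, not part of the real implementation.
public final class LocalHeaderOffsets {
  static final int LOCAL_HEADER_SIGNATURE = 0x04034b50;
  static final int LOCAL_HEADER_FIXED_SIZE = 30;

  // Reads the local file header at localHeaderOffset and returns the offset of the entry data.
  static int dataOffset(ByteBuffer zip, int localHeaderOffset) {
    // duplicate() resets the byte order, so set little-endian explicitly
    ByteBuffer b = zip.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    if (b.getInt(localHeaderOffset) != LOCAL_HEADER_SIGNATURE) {
      throw new IllegalArgumentException("No local file header at offset " + localHeaderOffset);
    }
    int nameLength = b.getShort(localHeaderOffset + 26) & 0xFFFF;
    int extraLength = b.getShort(localHeaderOffset + 28) & 0xFFFF;
    return localHeaderOffset + LOCAL_HEADER_FIXED_SIZE + nameLength + extraLength;
  }

  // Precomputed variant: no read of the local header, so the extra field length is an assumption.
  static int precomputedDataOffset(int localHeaderOffset, int nameLength, int assumedLocalExtraLength) {
    return localHeaderOffset + LOCAL_HEADER_FIXED_SIZE + nameLength + assumedLocalExtraLength;
  }
}
```

Precomputing from the central directory is only an attempt because the extra field in the
local header may differ in length from the extra field in the central directory record.
'/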
FE1 -- FE2
component "File entry ~__index__" as INDEX {
component "A list of keys along with their corresponding offsets and sizes." as INDEX_M
note right of INDEX_M
A list of pairs of long values.
The first long of each pair is the key: a 64-bit XXH3 hash of the entry name.
The second long holds the offset and size of the entry data, two ints packed into a single long value.
This list enables the retrieval of data locations for all entries in a single bulk read operation.
It contains no file names or other unnecessary metadata.
end note
component "class package hashes" as INDEX_PC
note right of INDEX_PC
A list of long values representing the 64-bit XXH3 hash of a package name.
This list is not used by the ZipFile implementation but is consumed by the class loader.
It allows for a quick determination of whether a class name is located within a ZIP file or not.
While it does not provide much benefit for a single ZIP file, as name lookup can be done with a single map lookup,
it enables the clustering of multiple ZIP files.
This clustering helps avoid a linear search across all ZIP files in a classpath.
end note
component "resource package hashes" as INDEX_PR
note right of INDEX_PR
The same concept applies to resource package hashes.
However, there are two different sets of hashes since there is no correlation
between class packages and resource packages.
end note
component names {
component "name lengths" as INDEX_NL
note right of INDEX_NL
A list of name lengths represented as shorts.
Storing the lengths together allows all of them to be read in a single bulk read operation,
directly from native memory.
end note
component "names" as INDEX_NS
note right of INDEX_NS
List of strings.
end note
INDEX_NL -down- INDEX_NS
}
note bottom of names
Entry names. They are not loaded into memory when the ZipFile is opened;
instead, they are loaded only when requested.
This is useful, for instance, when you want to process entries based on their names,
such as finding entries by a specific prefix.
end note
INDEX_M -- INDEX_PC
INDEX_PC -- INDEX_PR
INDEX_PR -- names
}
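/'
A minimal Java sketch of the key and offset/size list described in the index notes above,
for illustration only. The class and method names are hypothetical. The bit layout (offset in
the high 32 bits, size in the low 32 bits) and the use of the buffer byte order are assumptions;
the notes say only that two ints are packed into one long. Keys are taken as precomputed
64-bit hashes, since XXH3 is not part of the JDK.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the real implementation.
public final class PackedIndexSketch {
  // Assumed layout: offset in the high 32 bits, size in the low 32 bits.
  static long pack(int offset, int size) {
    return ((long) offset << 32) | (size & 0xFFFFFFFFL);
  }

  static int offsetOf(long packed) {
    return (int) (packed >>> 32);
  }

  static int sizeOf(long packed) {
    return (int) packed;
  }

  // Reads entryCount (key, packed offset and size) pairs with one bulk get and builds a lookup map.
  static Map<Long, Long> readIndex(ByteBuffer indexData, int entryCount) {
    long[] pairs = new long[entryCount * 2];
    indexData.asLongBuffer().get(pairs); // a single bulk read of all pairs
    Map<Long, Long> result = new HashMap<>(entryCount * 2);
    for (int i = 0; i < pairs.length; i += 2) {
      result.put(pairs[i], pairs[i + 1]);
    }
    return result;
  }
}
```

A lookup then hashes the requested entry name, fetches the packed value by key, and unpacks
the offset and size without touching any file name data.
'/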
note top of INDEX
The ZIP specification is not violated.
The index data represents a regular file entry.
end note
FE2 -- INDEX
component "Central directory" as CD
note right of CD
The index format version is stored in the 'File comment' field.
Only the latest format is supported.
If a ZIP file does not have a comment or the index version is not equal to the latest,
a fallback implementation is used that is capable of reading any ZIP file.
end note
INDEX -- CD
@enduml