ChiFS

Share API

What is a Share

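A Share is an HTTP Onion service that provides a Share descriptor, a Share Index, Hash metadata for each published file, and the published files themselves, as described in the sections below.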

A few general rules:

  1. A Share must provide its HTTP service over port 80.
  2. A Share must make its metadata available under the /.chifs-share/ path.
  3. A Share must support HTTP GET, HEAD and range requests ("byte serving") on all published files.
  4. A Share must follow the requirements outlined in HttpRequirements.md.
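
Rule 3 matters most to clients that fetch a file piece by piece. As a rough illustration, the helper below builds the Range header value for one fixed-size chunk of a published file. The name chunk_range and the idea of a client-chosen chunk size are assumptions made for this sketch and are not defined by this specification.

```python
def chunk_range(index: int, chunk_size: int, file_size: int) -> str:
    """Return an HTTP Range header value covering chunk `index` of a file.

    Hypothetical helper: the chunk size is whatever the client has chosen.
    """
    if index < 0 or chunk_size <= 0:
        raise ValueError("index must be >= 0 and chunk_size must be > 0")
    start = index * chunk_size
    if start >= file_size:
        raise ValueError("chunk lies beyond the end of the file")
    # HTTP byte ranges are inclusive on both ends, hence the "- 1".
    end = min(start + chunk_size, file_size) - 1
    return f"bytes={start}-{end}"
```

A client would send the returned string as the value of the Range request header and expect a 206 (Partial Content) response.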

Rationale

Rules 1 and 2 ensure that a Share can be uniquely identified by its .onion hostname alone. This simplifies duplicate detection and removes the need for URL parsing and normalization in Hubs and clients. A single Tor node can host multiple Onion services, so I do not expect these rules to be a significant limitation.
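
Because the hostname is the only variable part, a Hub or client can derive every metadata URL from it. The following is a minimal sketch of that idea; share_urls is a hypothetical helper name, and lowercasing the hostname is my own assumption about how one might compare Share identities.

```python
def share_urls(hostname: str) -> dict:
    """Derive the fixed metadata URLs of a Share from its .onion hostname."""
    host = hostname.strip().lower()  # assumption: compare hostnames case-insensitively
    if not host.endswith(".onion") or "/" in host or ":" in host:
        raise ValueError("expected a bare .onion hostname without port or path")
    # Port 80 is implied by rule 1, the /.chifs-share/ prefix by rule 2.
    base = f"http://{host}/.chifs-share/"
    return {
        "descriptor": base + "meta.json",
        "index": base + "index.json.zst",
    }
```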

Share descriptor

A Share must provide a basic description of itself at /.chifs-share/meta.json. This file holds the following JSON structure:

{
  "version": 1,
  "updated": "2018-12-15T08:32:17Z",
  "title": "User-provided title of this Share",
  "contact": "Free-form contact info (forum URL, email address, etc)"
}

The version field is mandatory and should be the number '1'. The version number will be used to signal changes and additions for the entire Share API, not just this share descriptor. It is expected that future API changes remain backwards compatible.

The updated field is also mandatory and should indicate the RFC 3339 timestamp of the last update to any of the share metadata. This can be used by Hubs or clients to check if they should download a new version of the Share Index or other metadata. The timestamp must be in UTC.

The title and contact fields are optional free-form human-readable strings.
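
To illustrate how a Hub or client might consume the descriptor, here is a small sketch that checks the mandatory fields and parses the timestamp. The function name parse_descriptor is hypothetical. The sketch only accepts the "Z"-suffixed timestamp form without fractional seconds shown in the example above, which is narrower than what RFC 3339 allows, and it accepts any version of 1 or higher on the assumption that later versions stay backwards compatible.

```python
import json
from datetime import datetime, timezone


def parse_descriptor(text: str) -> dict:
    """Validate a Share descriptor and return it with 'updated' as a datetime."""
    desc = json.loads(text)
    if not isinstance(desc, dict):
        raise ValueError("descriptor must be a JSON object")
    version = desc.get("version")
    # bool is a subclass of int in Python, so check the exact type.
    if type(version) is not int or version < 1:
        raise ValueError("missing or invalid 'version'")
    if not isinstance(desc.get("updated"), str):
        raise ValueError("missing 'updated'")
    # Assumption: timestamps look like 2018-12-15T08:32:17Z (UTC, whole seconds).
    updated = datetime.strptime(desc["updated"], "%Y-%m-%dT%H:%M:%SZ")
    desc["updated"] = updated.replace(tzinfo=timezone.utc)
    for field in ("title", "contact"):
        if field in desc and not isinstance(desc[field], str):
            raise ValueError(f"'{field}' must be a string")
    return desc
```

A Hub could compare the returned updated value against the one it stored on its previous visit and only fetch the Share Index again when the new value is later.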

Share Index

A Share must provide a list of all its published files in a single index file available at the path /.chifs-share/index.json.zst. This Share Index holds a JSON array of objects, where each object represents a published file.

Example:

[
  {
    "path": ["/iso/ubuntu-18.10-live-server-amd64.iso"],
    "hash": ["b2t.QXEYZ7AZLWCKDY6YDCQNHK63FGCOMTPQSKNWO3DLXRZ7TQRDLXOQ"],
    "size": 923795456,
    "modified": "2018-11-27T10:49:17Z"
  },
  {
    "path": ["/video/How to Cook an Egg.mp4"],
    "hash": ["b2t.LBHE3MWFWYNZKB3VYDZY3BLC46KVSQ2MS4HK5GIUFSCNY2THCWJA"],
    "size": 1639423565,
    "modified": "2015-10-01T05:10:43Z"
  }
]

The supported object fields are documented below in the File metadata. The following fields are mandatory for each file entry in the Share Index:

All other fields are optional. For space considerations, the b2t-chunks field should not be included in the Share Index.
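
As a sketch of how Share management software might apply this, the snippet below derives Share Index entries from full per-file metadata by dropping the b2t-chunks field. The names index_entry and build_index are hypothetical, and compressing the result with Zstandard is left out because it needs a library outside the Python standard library.

```python
import json


def index_entry(file_meta: dict) -> dict:
    """Copy a file's metadata, leaving out the bulky b2t-chunks field."""
    return {key: value for key, value in file_meta.items() if key != "b2t-chunks"}


def build_index(all_meta: list) -> str:
    """Serialize the Share Index as compact JSON text (not yet compressed)."""
    entries = [index_entry(meta) for meta in all_meta]
    return json.dumps(entries, separators=(",", ":"))
```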

Rationale

I have considered various file formats. The major downside of JSON is that it is a streamed (i.e. non-indexed) format. Any modification to the published file list or included file metadata will require rewriting the entire file. Quick lookups for file hashes or a particular file path are also not possible.

The two most important users of this Share Index are the Share management software itself and Hubs. Hubs will periodically fetch and read a Share Index, and update their internal indices accordingly. This process requires reading the entire file anyway and can be done in a streaming fashion.
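
To make the streaming claim concrete, here is a rough sketch of a Hub-side reader that yields index entries one at a time without holding the whole array in memory. It assumes the input is an iterable of already-decompressed text chunks and that every entry is a JSON object; iter_index_entries is a hypothetical name, and a production reader would want stricter syntax checks.

```python
import json
from typing import Iterable, Iterator


def iter_index_entries(chunks: Iterable[str]) -> Iterator[dict]:
    """Yield the objects of a top-level JSON array as the text arrives."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    for chunk in chunks:
        buf += chunk
        while True:
            buf = buf.lstrip()
            if not buf:
                break  # need more input
            if not started:
                if buf[0] != "[":
                    raise ValueError("Share Index must be a JSON array")
                buf, started = buf[1:], True
            elif buf[0] == ",":
                buf = buf[1:]  # lenient: separators are skipped, not validated
            elif buf[0] == "]":
                return
            else:
                try:
                    entry, end = decoder.raw_decode(buf)
                except json.JSONDecodeError:
                    break  # entry is still incomplete, wait for the next chunk
                yield entry
                buf = buf[end:]
    raise ValueError("Share Index ended before the closing ']'")
```

Only the current, partially received entry is buffered, so memory use is bounded by the largest single entry rather than by the size of the Index.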

The Share management software is responsible for regular synchronization of this Share Index with the actual list of published files. This process could greatly benefit from fast path lookups to verify that the information in the Index is still correct. Likewise, a database format that supports insertion, deletion and updates of file metadata would avoid the need to completely rewrite the Share Index on each change. But there are two reasons to avoid such a database format:

Another downside of JSON is that binary data (i.e. file hashes) will need to be encoded using base32 or similar encodings. My hope is that the storage overhead is countered by the use of compression, and the additional CPU overhead is not significant enough to warrant a binary format.

JSON vs. XML or other formats: Largely a matter of preference. There's a fair amount of tooling available for JSON to aid manual inspection of Share Index files and JSON parsing is relatively simple and fast.

The Index can grow pretty large (say, hundreds of megabytes, perhaps even gigabytes) for Shares with many files. This may cause Hubs to be unable to download the full Index if the network is slow and/or unreliable. It may be possible to split the Index into multiple files, but this will complicate atomic updates and the Hub indexing process. See EfficientDirSync for a proposed solution.

As for the choice of Zstandard as compression algorithm:

Hash metadata

For each unique published file, a Share must provide a metadata file at the path /.chifs-share/meta/b2t.$HASH.json.zst, where $HASH is the full base32-encoded root hash of the file. This file holds a JSON object with metadata for the published file. The supported object fields are documented below in the File metadata. The following fields are required:

The Hash metadata is mainly of interest to clients that wish to download the file. It offers a simple and lightweight method for clients to grab the BLAKE2 intermediate hashes that they can use for incremental verification of downloaded files. Metadata that is relevant for searching and discovering files should be included in the Share Index (so that Hubs can just fetch that), but may also be replicated here.

Example path: /.chifs-share/meta/b2t.RYCJFTQ5OMJONYLUBGVEJVSE3MRVC6RUVVRM6JXZUDDQAEI77HQQ.json.zst.
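
The mapping from a file hash to its metadata path is mechanical. A minimal sketch, assuming the hash is given in the "b2t.$HASH" form used in the Share Index; hash_meta_path is a hypothetical helper name.

```python
import re

# "b2t." followed by unpadded, upper-case base32 characters.
_B2T_HASH = re.compile(r"b2t\.[A-Z2-7]+")


def hash_meta_path(file_hash: str) -> str:
    """Return the Hash metadata path for a hash such as 'b2t.RYCJ...'."""
    if not _B2T_HASH.fullmatch(file_hash):
        raise ValueError("expected a hash of the form b2t.<base32>")
    return f"/.chifs-share/meta/{file_hash}.json.zst"
```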

File metadata

The following fields are used to describe a file in the Share Index and Hash metadata. This list is not complete: Share implementations are free to add additional fields, and additional fields may be standardized later. Field names that are not standardized in this document must be prefixed with the name of the project that introduced them, to avoid name clashes with future standardized fields. For example, if myproject adds a field for the genre of a media type, this field should be named myproject:genre.
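
The prefix rule is easy to check mechanically. The sketch below lists field names that are neither standardized in this document nor written as project:name. The set of standard names is copied from the top-level fields documented in this section, and unprefixed_custom_fields is a hypothetical helper name.

```python
# Top-level field names standardized in this document.
STANDARD_FIELDS = {
    "path", "hash", "b2t-chunks", "size", "magic", "modified", "mime",
    "album", "artist", "title",
    "aspect-ratio", "audio", "chapters", "duration", "resolution",
    "subtitles", "video",
}


def unprefixed_custom_fields(entry: dict) -> list:
    """Return non-standard field names that lack a 'project:' prefix."""
    bad = []
    for name in entry:
        if name in STANDARD_FIELDS:
            continue
        project, sep, field = name.partition(":")
        if not (sep and project and field):
            bad.append(name)
    return sorted(bad)
```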

Common fields

path
Non-empty array of strings, the absolute paths of this file in the Share. Multiple paths may be provided when the same file, as identified by the hash, is available at multiple paths and if the other metadata fields apply to all the listed files.
hash

Non-empty array of hash values identifying this file. Each hash is a string in the format of hashtype.hashvalue. At the moment only the b2t hash type in base32 encoding is supported, see FileHash.md for details. Example:

"hash": ["b2t.LBHE3MWFWYNZKB3VYDZY3BLC46KVSQ2MS4HK5GIUFSCNY2THCWJA"],

This field is intended to (a) uniquely identify a file across the wider ChiFS network and (b) permit clients to verify the contents of downloaded file chunks. To fulfill requirement (a), Shares must agree on a common set of hash types in order to prevent a network split. To fulfill requirement (b), supported hash types must use a construction based on hash lists or trees. It is expected that b2t will remain the only supported hash type for the foreseeable future.
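
As an illustration of the hashtype.hashvalue format, the following sketch splits a hash string and decodes the base32 value to raw bytes. It assumes the value is unpadded, upper-case base32 as in the examples in this document; FileHash.md remains the authoritative description, and parse_hash is a hypothetical name.

```python
import base64


def parse_hash(value: str) -> tuple:
    """Split 'hashtype.hashvalue' and decode the base32 digest to bytes."""
    hashtype, sep, encoded = value.partition(".")
    if not (sep and hashtype and encoded):
        raise ValueError("expected hashtype.hashvalue")
    if hashtype != "b2t":
        raise ValueError(f"unsupported hash type: {hashtype}")
    # The examples omit base32 padding; restore it for the decoder.
    padded = encoded + "=" * (-len(encoded) % 8)
    # b32decode raises binascii.Error (a ValueError) on invalid input.
    return hashtype, base64.b32decode(padded)
```

With the 52-character example values used in this document, the decoded digest is 32 bytes long.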

b2t-chunks

Array of base32-encoded BLAKE2 intermediate hashes. This array of hashes must form one complete level in the Merkle hash tree and the 0-based index into this array must match the hash offset.

The depth of these hashes can be inferred from the number of hashes and the file size. Share implementations are free to choose a depth that provides a suitable compromise between the number of hashes (and thus the size of the metadata) and the granularity at which clients can verify the downloaded data. As a general recommendation, fewer than 256 hashes or a granularity of at least 4 MiB is sufficient.
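
One possible reading of this recommendation is sketched below: start at the finest level and move up the tree until either limit is met. This rests on two assumptions that are not taken from this document: that the tree is binary, so each level up doubles the number of bytes covered per hash, and that the leaf size is 1024 bytes. FileHash.md is authoritative for both; choose_granularity and its parameters are hypothetical.

```python
def choose_granularity(file_size: int, leaf_size: int = 1024,
                       max_hashes: int = 256,
                       max_chunk: int = 4 * 1024 * 1024) -> tuple:
    """Return (bytes covered per hash, number of hashes) for b2t-chunks."""
    if file_size < 0 or leaf_size <= 0:
        raise ValueError("file_size must be >= 0 and leaf_size must be > 0")

    def count(chunk: int) -> int:
        # Ceiling division; even an empty file gets one hash in this sketch.
        return max(1, -(-file_size // chunk))

    chunk = leaf_size
    # Go one level up while there are too many hashes and chunks are still small.
    while count(chunk) >= max_hashes and chunk < max_chunk:
        chunk *= 2
    return chunk, count(chunk)
```

Under these assumptions, the 923795456-byte file from the Share Index example ends up with 221 hashes at 4 MiB granularity, while a 1 MiB file ends up with 128 hashes at 8 KiB granularity.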

size
Size of the file, in bytes.
magic
A human-readable description of the file type, obtained from libmagic or similar methods.
modified
RFC 3339 timestamp of when the file was last modified. The timestamp must be in UTC.
mime
MIME type of the file. This field is generally only useful if the MIME type was derived from the contents of the file - if the MIME type is derived solely from the file extension, then the users of this metadata could derive it themselves from the file path.

File tags

Some file types support embedded metadata tags. This includes media files (ID3 tags, Vorbis comments) and document formats (e.g. PDF, HTML). At this point, the ChiFS-Share implementation only supports extraction of media tags using ffprobe(1). This list of fields may be expanded as support for other formats is implemented.

album
Album name, string.
artist
Artist, string.
title
Title, string.
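
As a sketch of how these tags could be pulled out of ffprobe(1) output, the function below reads the container-level tags from a dictionary parsed from `ffprobe -print_format json -show_format`. It is not the ChiFS-Share implementation; extract_tags is a hypothetical name, and matching tag names case-insensitively is my own assumption, since tag key casing differs between container formats.

```python
def extract_tags(probe: dict) -> dict:
    """Pick the album, artist and title tags from parsed ffprobe JSON."""
    tags = probe.get("format", {}).get("tags", {})
    # Assumption: treat "ARTIST", "Artist" and "artist" as the same tag.
    lowered = {key.lower(): value for key, value in tags.items()}
    result = {}
    for name in ("album", "artist", "title"):
        value = lowered.get(name)
        if isinstance(value, str) and value:
            result[name] = value
    return result
```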

Media information

These fields can be extracted from audio and video files using tools such as ffprobe(1) or mediainfo(1). Note that, at this point, the ChiFS-Share implementation only supports ffprobe(1) and that's what this specification has been built around. Some of these field definitions may be revised in order to improve compatibility with other tools.

aspect-ratio
Aspect ratio of the video file, as a "width:height" string (e.g. "16:9"). This is the display_aspect_ratio attribute of the video stream in ffprobe(1).
audio

Array, list of audio streams. Each audio stream is represented as a JSON object with the following fields, all optional:

  codec
  Codec name, string. This is the codec_name stream attribute in ffprobe(1). The full list of possible codecs and their human-readable names can be obtained from running ffmpeg -codecs.
  channels
  Integer, number of channels.
  channel-layout
  String, human readable channel layout (e.g. "2.0" or "stereo").
  bit-rate
  Integer, (average) bits per second.
  sample-rate
  Integer, number of samples per second (Hz).
  language
  String, audio language. Most formats and encoders seem to use ISO 639-2/B tags for language identification, but some use free-form human-readable strings.
chapters
Integer, number of chapters.
duration
Number, duration in seconds.
resolution
Pixel resolution of the video file, as a "widthxheight" string (e.g. "1280x720").
subtitles

Array, list of subtitles. Each subtitle stream is represented as a JSON object with the following fields, all optional:

  codec
  Codec name, string. See audio codec.
  language
  Subtitle language, string. See audio language.
video

Array, list of video streams. Each video stream is represented as a JSON object with the following fields, all optional:

  codec
  Codec name, string. See audio codec.
  bit-rate
  Integer, (average) bits per second.
  frame-rate
  Integer, (average) number of frames per second.
  pix-fmt
  String, pixel format used for this video stream. This is the pix_fmt attribute in ffprobe(1). The full list of supported pixel formats can be obtained by running ffmpeg -pix_fmts.
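
To show how the media fields relate to ffprobe(1) output, here is a rough sketch that maps a dictionary parsed from `ffprobe -print_format json -show_format -show_streams -show_chapters` onto the fields above. It is not the ChiFS-Share implementation. The name media_fields is hypothetical, and several choices are my own assumptions: rounding avg_frame_rate to an integer, and taking resolution and aspect-ratio from the first video stream.

```python
from fractions import Fraction


def _int(value):
    try:
        return int(value)  # ffprobe reports many numbers as strings
    except (TypeError, ValueError):
        return None


def _frame_rate(value):
    try:
        return round(Fraction(value))  # e.g. "30000/1001" -> 30
    except (TypeError, ValueError, ZeroDivisionError):
        return None


def media_fields(probe: dict) -> dict:
    out = {}
    for s in probe.get("streams", []):
        kind = s.get("codec_type")
        entry = {"codec": s.get("codec_name")}
        language = s.get("tags", {}).get("language")
        if kind == "audio":
            key = "audio"
            entry.update({"channels": _int(s.get("channels")),
                          "channel-layout": s.get("channel_layout"),
                          "bit-rate": _int(s.get("bit_rate")),
                          "sample-rate": _int(s.get("sample_rate")),
                          "language": language})
        elif kind == "video":
            key = "video"
            entry.update({"bit-rate": _int(s.get("bit_rate")),
                          "frame-rate": _frame_rate(s.get("avg_frame_rate")),
                          "pix-fmt": s.get("pix_fmt")})
            # Assumption: file-level fields describe the first video stream.
            if "resolution" not in out and s.get("width") and s.get("height"):
                out["resolution"] = f'{s["width"]}x{s["height"]}'
            if "aspect-ratio" not in out and s.get("display_aspect_ratio"):
                out["aspect-ratio"] = s["display_aspect_ratio"]
        elif kind == "subtitle":
            key = "subtitles"
            entry["language"] = language
        else:
            continue
        # All stream fields are optional, so drop whatever is unknown.
        out.setdefault(key, []).append(
            {k: v for k, v in entry.items() if v is not None})
    try:
        out["duration"] = float(probe["format"]["duration"])
    except (KeyError, TypeError, ValueError):
        pass
    if "chapters" in probe:
        out["chapters"] = len(probe["chapters"])
    return out
```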

Other considerations

The Share Index is not really suitable for clients that simply wish to browse through the published files of a Share - it may get too large, and users may have lost interest in browsing the Share long before they have been able to download the complete Index. This could be solved by offering an additional per-directory Index. But I'm not fond of increasing the metadata requirements of a Share - that increases management overhead and disk requirements. The per-directory indices could replace the global Share Index, but then indexing performed by a Hub would be slower and more resource-intensive. And in any case, I don't expect clients to directly use a Share for discovery purposes; that's what Hubs are for. See EfficientDirSync for a discussion.