API Reference

Public API

DataDeps.DataDep - Type
DataDep(
    name::String,
    message::String,
    remote_path::Union{String,Vector{String}...},
    [hash::Union{String,Vector{String}...},]; # Optional, if not provided will generate
    # keyword args (Optional):
    fetch_method=fetch_default # (remote_filepath, local_directory_path)->local_filepath
    post_fetch_method=identity # (local_filepath)->Any
)

Required Fields

  • name: the name used to refer to this datadep
    • Corresponds to the name of the folder where the datadep will be stored.
    • It can contain spaces or any other character that is allowed in a Windows filename (the characters allowed on Windows are a strict subset of those allowed in Unix filenames).
  • message: a message displayed to the user before they are asked if they want to download it
    • This is normally used to give a link to the original source of the data, a paper to be cited etc.
  • remote_path: where to fetch the data from
    • This is usually a string, or a vector of strings (or a vector of vectors... see Recursive Structure in the documentation for developers).

Optional Fields

  • hash: used to check whether the files downloaded correctly
    • By far the most common use is to just provide a SHA256 sum as a hex-string for the files.
    • If not provided, then a warning message with the SHA256 sum is displayed. This is to help package devs work out the sum for their files, without using an external tool. You can also calculate it using Preupload Checking in the documentation for developers.
    • If you want to use a different hashing algorithm, then you can provide a tuple (hashfun, targethex). hashfun should be a function which takes an IOStream and returns a Vector{UInt8}, such as any of the functions from SHA.jl (e.g. sha3_384, sha2_512) or md5 from MD5.jl.
    • If you want to use a different hashing algorithm, but don't know the sum, you can provide just the hashfun and a warning message will be displayed, giving the correct tuple of (hashfun, targethex) that should be added to the registration block.
    • If you don't want to provide a checksum because your data can change, pass in the type Any, which will suppress the warning messages. (But see the warnings about "what if my data is dynamic".)
    • Can take a vector of checksums, one for each file, or a single checksum, in which case the per-file hashes are XORed to get the target hash. (See Recursive Structure in the documentation for developers).
  • fetch_method=fetch_default: a function to run to download the files
    • Function should take 2 parameters (remote_filepath, local_directorypath), and must return the local filepath to the file downloaded.
    • Default (fetch_default) can correctly handle strings containing HTTP[S] URLs, or any remote_path type which overloads Base.basename and Base.download, e.g. AWSS3.S3Path.
    • Can take a vector of methods, being one for each file, or a single method, in which case that method is used to download all of them. (See Recursive Structure in the documentation for developers).
    • Overloading this lets you change things about how the download is done – the transport protocol.
    • The default is suitable for HTTP[S], without auth. Replacing it can add authentication or an entirely different protocol (e.g. git, Google Drive etc).
    • This function is also responsible for working out what the local file should be called (as this is protocol dependent).
  • post_fetch_method: a function to run after the files have been downloaded
    • Should take the local filepath as its first and only argument. Can return anything.
    • Default is to do nothing.
    • Can do what it wants from there, but most likely wants to extract the file into the data directory.
    • To this end, DataDeps.jl includes a function, unpack, which extracts a compressed archive and deletes the original.
    • It should be noted that post_fetch_method runs from within the data directory.
      • This means operations that act on the current working directory (like rm or mv or run(`SOMECMD`)) just work.
      • You can call pwd() to get the data directory for your own functions. (Or dirname(local_filepath)).
    • Can take a vector of methods, being one for each file, or a single method, in which case that same method is applied to all of the files. (See Recursive Structure in the documentation for developers).
    • You can check this as part of Preupload Checking in the documentation for developers.
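The fields described above can be combined as in the following minimal sketch. It assumes DataDeps.jl is installed; the name, message, and URL are hypothetical placeholders, not a real dataset. Nothing is downloaded when the DataDep is constructed or registered.

```julia
using DataDeps  # assumes DataDeps.jl is installed

# Hypothetical name, message, and URL, for illustration only.
dep = DataDep(
    "ExampleData",
    """
    Dataset: Example data (hypothetical)
    Website: https://example.com/data
    Please cite the original authors if you use this data.
    """,
    "https://example.com/data/example.zip";
    # No hash is given here, so a warning reporting the SHA256 sum
    # would be displayed the first time the data is fetched.
    post_fetch_method = unpack,  # extract the archive inside the data directory
)

register(dep)
```

Once the reported SHA256 sum is known, it would normally be added as the fourth positional argument so that later downloads are verified.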
DataDeps.ManualDataDep - Type
ManualDataDep(name, message)

A DataDep for when the installation needs to be handled manually. This can be done via Pkg/git if you put the dependency into the package repo's /deps/data directory. More generally, message should give instructions on how to set up the data.

DataDeps.register - Function
register(datadep::AbstractDataDep)

Registers the given datadep to be globally available to the program. This makes datadep"Name" work. register should be run within the __init__ function of your module.
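As a rough illustration of registering from __init__, the sketch below defines a hypothetical package module. MyPackage, the datadep name, and the URL are all made up for this example, and DataDeps.jl is assumed to be installed.

```julia
module MyPackage  # hypothetical package name

using DataDeps

function __init__()
    # Runs each time the module is loaded, so the registration is
    # in place before any datadep"..." string is resolved.
    register(DataDep(
        "MyPackageData",
        "Hypothetical dataset used by MyPackage.",
        "https://example.com/mypackage/data.csv",
    ))
end

# The datadep string is resolved when this function is called,
# not when the module is loaded.
data_path() = datadep"MyPackageData/data.csv"

end # module
```

Calling MyPackage.data_path() would then trigger resolution of the datadep, prompting for a download if the data is not already present.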

DataDeps.@datadep_str - Macro
`datadep"Name"` or `datadep"Name/file"`

Use this just like you would a file path, except that you can refer to the datadep by name. The name alone will resolve to the corresponding folder, even if that means it has to be downloaded first. Adding a path within it functions as expected.

Base.download - Function
Base.download(
    datadep::DataDep,
    localdir;
    remotepath=datadep.remotepath,
    skip_checksum=false,
    i_accept_the_terms_of_use=nothing)

A method to download a datadep. Normally, you do not have to download a data dependency manually. If you simply cause the string macro datadep"DepName" to be executed, it will be downloaded if not already present.

Invoking this download method manually is normally for debugging purposes. As such, it includes a number of parameters that most people will not want to use.

  • localdir: this is the local directory to save to.
  • remotepath: the remote path to fetch the data from, use this e.g. if you can't access the normal path where the data should be, but have an alternative.
  • skip_checksum: setting this to true causes the checksum to not be checked. Use this if the data has changed since the checksum was set in the registry, or for some reason you want to download different data.
  • i_accept_the_terms_of_use: use this to bypass the I agree to terms screen. Useful if you are scripting the whole process, or using another system to get confirmation of acceptance.
    • For automation purposes you can set the environment variable DATADEPS_ALWAYS_ACCEPT.
    • If this argument is not set, and DATADEPS_ALWAYS_ACCEPT is not set either, then the user will be prompted.
    • Strictly speaking these are not always terms of use, it just refers to the message and permission to download.

If you need more control than this, then your best bet is to construct a new DataDep object, based on the original, and then invoke download on that.

Helpers

DataDeps.unpack - Function
unpack(f; keep_originals=false)

Extracts the contents of an archive into the current directory, deleting the original archive unless the keep_originals flag is set.

Internal

DataDeps.preupload_check - Method
preupload_check(datadep, local_filepath[s])::Bool

Performs preupload checks on the local files without having to download them. This is a tool for creating or updating DataDeps, allowing the author to check the files before they are uploaded (or if they were downloaded directly). The checks include verifying the checksum and making sure the post_fetch_method runs without errors. It basically performs datadep resolution, but bypasses the step of downloading the files. The results of performing the post_fetch_method are not kept. As usual, if the DataDep being checked does not have a checksum, or if the checksum does not match, then a warning message will be displayed. Similarly, if the post_fetch_method throws an exception, a warning will be displayed.

Returns: true or false, depending on if the checks were all good, or not.

Arguments:

  • datadep: either an instance of a DataDep type, or the name of a registered DataDep as an AbstractString
  • local_filepath: a filepath or (recursive) list of filepaths. This is what would be returned by fetch in normal datadep use.
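A check along these lines might look like the following sketch. It assumes DataDeps.jl is installed and calls the function with its module prefix; the datadep name, URL, and file contents are hypothetical. Because no hash is given, a warning reporting the SHA256 sum should be expected.

```julia
using DataDeps  # assumes DataDeps.jl is installed

# Hypothetical datadep describing where the file will eventually live.
dep = DataDep(
    "PreuploadExample",
    "Hypothetical dataset, for illustration only.",
    "https://example.com/data.csv",
)

# A local copy of the file that will later be uploaded to the URL above.
local_file = joinpath(mktempdir(), "data.csv")
write(local_file, "x,y\n1,2\n")

# Runs the checksum and post_fetch_method checks without downloading anything.
ok = DataDeps.preupload_check(dep, local_file)
println("All checks passed: ", ok)
```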
DataDeps.resolve - Method
resolve("name/path", @__FILE__)

This is the function that lives directly behind the datadep"name/path" macro. If you are working with the names of the datadeps programmatically, and don't want to download them by mistake, it can be easier to work with this function.

Note though that you must include @__FILE__ as the second argument, as DataDeps.jl uses this to allow reading the package specific deps/data directory. Advanced usage could specify a different file or nothing, but at that point you are on your own.

DataDeps.resolve - Method
resolve(datadep, inner_filepath, calling_filepath)

Returns a path to the folder containing the datadep, even if that means downloading the dependency and putting it in there.

  • inner_filepath is the path to the file within the data dir
  • calling_filepath is a path to the file where this is being invoked from

This is basically the function that lives behind the string macro datadep"DepName/inner_filepath".

DataDeps._resolve - Method

The core of the resolve function, without any of the user-friendly file handling. Returns the directory.

DataDeps.accept_terms - Method
accept_terms(datadep, localpath, remotepath, i_accept_the_terms_of_use)

Ensures the user accepts the terms of use; otherwise errors out.

DataDeps.better_readline - Function
better_readline(stream = stdin)

A version of readline that does not immediately return an empty string if the stream is closed. It will attempt to reopen the stream, and if that fails it throws an error.

DataDeps.checksum - Method
checksum(hasher=sha2_256, filename[/s])

Executes the hasher on the file or files, and returns the hash as a UInt8 array. If there are multiple files, the per-file hashes are XORed together.
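The XOR combination can be illustrated with the SHA standard library alone. The helper below, xor_checksum, is a hypothetical name written for this example and is not the library's own implementation; it only shows the idea of hashing each file and XORing the digests.

```julia
using SHA  # Julia standard library

# Hypothetical helper: hash each file with `hasher`, then XOR the digests.
function xor_checksum(hasher, paths)
    digests = [open(hasher, p) for p in paths]  # one Vector{UInt8} per file
    return reduce((a, b) -> xor.(a, b), digests)
end

# Two small temporary files to hash.
dir = mktempdir()
paths = [joinpath(dir, "a.txt"), joinpath(dir, "b.txt")]
write(paths[1], "first file")
write(paths[2], "second file")

combined = xor_checksum(sha2_256, paths)
println(bytes2hex(combined))
```

Because XOR is commutative, a hash combined this way does not depend on the order in which the files are listed.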

DataDeps.checksum_pass - Method
checksum_pass(hash, fetched_path)

Ensures the checksum passes, and handles the dialog with the user when it fails.

DataDeps.env_bool - Function
env_bool(key)

Checks for an environment variable and fuzzily converts it to a Bool.

DataDeps.env_list - Function
env_list(key)

Checks for an environment variable and converts its colon-separated value to a list of strings.
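To make the behaviour of the two environment helpers above more concrete, here is a sketch of how such functions could be written. The names env_bool_sketch and env_list_sketch are hypothetical, and the exact rules (which strings count as false, what an unset variable yields) are assumptions for illustration rather than the library's documented behaviour.

```julia
# Hypothetical stand-ins, not the library's implementation.
# Assumption: unset, empty, "0", "false" and "no" count as false; anything else is true.
function env_bool_sketch(key; default=false)
    haskey(ENV, key) || return default
    return lowercase(strip(ENV[key])) ∉ ("", "0", "false", "no")
end

# Assumption: an unset variable gives an empty list.
function env_list_sketch(key)
    haskey(ENV, key) || return String[]
    return String.(split(ENV[key], ':'))
end

# Temporarily set two made-up variables and read them back.
withenv("EXAMPLE_FLAG" => "True", "EXAMPLE_DIRS" => "/data/a:/data/b") do
    @show env_bool_sketch("EXAMPLE_FLAG")
    @show env_list_sketch("EXAMPLE_DIRS")
end
```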

DataDeps.fetch_base - Method
fetch_base(remote_path, local_dir)

Download from remote_path to local_dir, via Base mechanisms. The download is performed using Base.download, and Base.basename(remote_path) is used to determine the filename. This is very limited in the case of HTTP, as the filename is not always encoded in the URL, but it does work for simple paths like "http://myserver/files/data.csv". In general, for those cases prefer fetch_http.

The more important feature is that this works for anything that has overloaded Base.basename and Base.download, e.g. AWSS3.S3Path. While this doesn't work for all transport mechanisms (so some datadeps will still need a custom fetch_method), it works for many.
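The overloading approach described above can be sketched without any network access. LocalMirror is a hypothetical path type invented for this example, standing in for something like AWSS3.S3Path, and fetch_base_sketch is an assumed reconstruction of the described logic, not the library's source.

```julia
# Hypothetical remote-path type; here the "remote" is just another local file.
struct LocalMirror
    path::String
end

# The two overloads the description above relies on.
Base.basename(m::LocalMirror) = basename(m.path)
Base.download(m::LocalMirror, localpath::AbstractString) = cp(m.path, localpath; force=true)

# Sketch of the described logic: name the file via basename, fetch via download.
function fetch_base_sketch(remote_path, local_dir)
    localpath = joinpath(local_dir, basename(remote_path))
    return download(remote_path, localpath)
end

# Create a source file, then "fetch" it into a fresh directory.
src = joinpath(mktempdir(), "data.csv")
write(src, "x,y\n1,2\n")
fetched = fetch_base_sketch(LocalMirror(src), mktempdir())
println(fetched)
```

A function with the same two-argument shape, returning the local file path, is what a custom fetch_method is expected to look like.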

DataDeps.fetch_http - Method
fetch_http(remotepath, localdir; update_period=5)

Pass in an HTTP[S] URL and a directory to save it to, and it downloads that file, returning the local path it was downloaded to. The filename is taken from the HTTP response headers, if that information is present.

update_period controls how often to print the download progress to the log. It is expressed in seconds, and progress is printed at @info level. By default it is once every 5 seconds, as in the signature above, though this depends on configuration.

DataDeps.handle_missing - Method
handle_missing(datadep::DataDep, calling_filepath)::String

This function is called when the datadep is missing.

DataDeps.input_choice - Method
input_choice

Prompts the user for one of a list of options. Takes a vararg of tuples of Letter, Prompt, Action (0 argument function)

Example:

input_choice(
    ('A', "Abort -- errors out", ()->error("aborted")),
    ('X', "eXit -- exits normally", ()->exit()),
    ('C', "Continue -- continues running", ()->nothing),
)

DataDeps.is_valid_name - Method
is_valid_name(name)

This checks if a datadep name is valid. This basically means it must be a valid folder name on Windows.

DataDeps.list_local_paths - Method
list_local_paths(name|datadep, [calling_filepath|module|nothing])

Lists all the local paths to a given datadep. This may be an empty list.

DataDeps.postfetch_check - Method
postfetch_check(post_fetch_method, local_path)

Executes the post_fetch_method on the given local path, in a temporary directory. Returns true if there are no exceptions. Runs in (async) parallel if multiple paths are given.

DataDeps.preferred_paths - Function
preferred_paths(calling_filepath; use_package_dir=true)

Returns the datadeps load path. In addition, if calling_filepath is provided, use_package_dir=true, and calling_filepath is inside a package directory, then the result also includes the path to the datadeps in that package's folder.

DataDeps.progress_update_period - Method
progress_update_period()

Returns the period between progress updates being logged. This is used by the default fetch_method, and it is generally a good idea to use it in any custom fetch method, if possible.

DataDeps.run_checksum - Method

Providing only a hash string, results in defaulting to sha2_256, with that string being the target

DataDeps.run_checksum - Method

If a vector of paths is provided and a vector of hashing methods (of any form) then they are all required to match.

DataDeps.run_checksum - Method

If only a function is provided then assume the user is a developer, wanting to know what hash-line to add to the Registration line.

DataDeps.run_checksum - Method

If nothing is provided then assume the user is a developer, wanting to know what sha2_256 hash-line to add to the Registration line.

DataDeps.run_checksum - Method
run_checksum(checksum, path)

This runs the checksum on the files at the given path, and returns true or false based on whether the checksum matches (always true if no target sum is given). It is fairly flexible, and accepts different kinds of arguments to give different kinds of results.

If path (the second parameter) is a Vector, then unless checksum is also a Vector, the result is the XOR of all the file checksums.

DataDeps.run_fetch - Method
run_fetch(fetch_method, remotepath, localdir)

Executes the fetch_method on the given remote path, downloading into the local directory, and returns the local paths. Runs in (async) parallel if multiple paths are given.

DataDeps.run_post_fetch - Method
run_post_fetch(post_fetch_method, fetched_path)

Executes the post_fetch_method on the given fetched path. Runs in (async) parallel if multiple paths are given.

DataDeps.splitpath - Method
splitpath(path)

The opposite of joinpath: splits a path into each of its directory names, with the filename as the last element.
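The round trip with joinpath can be shown using Base.splitpath, which is available in Julia 1.1 and later and serves the same purpose. That the internal DataDeps.splitpath behaves identically in every case is an assumption here, and the path components are made up.

```julia
# Build a relative path from hypothetical components, then split it again.
p = joinpath("data", "raw", "file.csv")
parts = splitpath(p)  # Base.splitpath, Julia 1.1+

@show parts

# Joining the parts gives back the original path.
roundtrip = joinpath(parts...)
@show roundtrip == p
```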

DataDeps.try_determine_package_datadeps_dir - Method
try_determine_package_datadeps_dir(filepath)

Takes a path to a file. If that path is in a package's folder, then this returns a path to the deps/data dir for that package (as a Nullable), which may or may not exist. If the path is not in a package, returns null.

DataDeps.uv_access - Method
uv_access(path, mode)

Check access to a path. Returns 2 results, first an error code (0 for all good), and second an error message. https://stackoverflow.com/a/47126837/179081
