# API Reference

## Public API
### DataDeps.DataDep — Type

    DataDep(
        name::String,
        message::String,
        remote_path::Union{String,Vector{String}...},
        [hash::Union{String,Vector{String}...},]; # optional; if not provided, will generate
        # keyword args (optional):
        fetch_method=fetch_default,      # (remote_filepath, local_directory_path) -> local_filepath
        post_fetch_method=identity       # (local_filepath) -> Any
    )
#### Required Fields

- `name`: the name used to refer to this datadep. Corresponds to the folder name where the datadep will be stored.
    - It can have spaces or any other character that is allowed in a Windows filename (which is a strict subset of the restrictions on Unix filenames).
- `message`: a message displayed to the user when they are asked if they want to download it.
    - This is normally used to give a link to the original source of the data, a paper to be cited, etc.
- `remote_path`: where to fetch the data from.
    - This is usually a string, or a vector of strings (or a vector of vectors... see Recursive Structure in the documentation for developers).
#### Optional Fields

- `hash`: used to check whether the files downloaded correctly.
    - By far the most common use is to just provide a SHA256 sum as a hex-string for the files.
    - If not provided, then a warning message with the SHA256 sum is displayed. This is to help package devs work out the sum for their files without using an external tool. You can also calculate it using Preupload Checking in the documentation for developers.
    - If you want to use a different hashing algorithm, then you can provide a tuple `(hashfun, targethex)`. `hashfun` should be a function which takes an `IOStream` and returns a `Vector{UInt8}`, such as any of the functions from SHA.jl, e.g. `sha3_384`, `sha1_512`, or `md5` from MD5.jl.
    - If you want to use a different hashing algorithm but don't know the sum, you can provide just the `hashfun`, and a warning message will be displayed giving the correct tuple of `(hashfun, targethex)` that should be added to the registration block.
    - If you don't want to provide a checksum because your data can change, pass in the type `Any`, which will suppress the warning messages. (But see the warnings above about "what if my data is dynamic".)
    - Can take a vector of checksums, one for each file, or a single checksum, in which case the per-file hashes are `xor`ed to get the target hash. (See Recursive Structure in the documentation for developers.)
- `fetch_method=fetch_default`: a function to run to download the files.
    - The function should take 2 parameters, `(remote_filepath, local_directorypath)`, and must return the local filepath to the file it downloaded.
    - The default (`fetch_default`) can correctly handle strings containing HTTP[S] URLs, or any `remote_path` type which overloads `Base.basename` and `Base.download`, e.g. `AWSS3.S3Path`.
    - Can take a vector of methods, one for each file, or a single method, in which case that method is used to download all of the files. (See Recursive Structure in the documentation for developers.)
    - Overloading this lets you change how the download is done, e.g. the transport protocol.
    - The default is suitable for HTTP[/S] without auth. Modifying it can add authentication or an entirely different protocol (e.g. git, Google Drive, etc.).
    - This function is also responsible for working out what the local file should be called (as this is protocol dependent).
- `post_fetch_method`: a function to run after the files have been downloaded.
    - Should take the local filepath as its first and only argument. Can return anything.
    - The default is to do nothing.
    - Can do whatever it wants from there, but most likely wants to extract the file into the data directory.
    - Towards this end, DataDeps.jl includes a command, `unpack`, which will extract a compressed archive, deleting the original.
    - It should be noted that `post_fetch_method` runs from within the data directory.
        - This means operations that just write to the current working directory (like `rm`, `mv`, or ``run(`SOMECMD`)``) just work.
        - You can call `pwd()` to get the data directory for your own functions. (Or `dirname(local_filepath)`.)
    - Can take a vector of methods, one for each file, or a single method, in which case that same method is applied to all of the files. (See Recursive Structure in the documentation for developers.)
    - You can check this as part of Preupload Checking in the documentation for developers.
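Putting these fields together, a typical registration might look like the following sketch. The name, URL, and checksum here are placeholders for illustration, not a real dataset:

```julia
using DataDeps

function __init__()
    register(DataDep(
        "MyExampleData",  # hypothetical name; becomes the storage folder name
        """
        Dataset: My Example Data
        Website: https://example.com/data (placeholder URL)
        Please cite the original authors if you use this data.
        """,
        "https://example.com/data/archive.tar.gz",  # placeholder remote_path
        "0000000000000000000000000000000000000000000000000000000000000000";  # placeholder SHA256
        post_fetch_method=unpack,  # extract the archive after download, deleting it
    ))
end
```

After registration, `datadep"MyExampleData"` resolves to the folder containing the extracted files, downloading them first if needed.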
### DataDeps.ManualDataDep — Type

    ManualDataDep(name, message)

A DataDep for when the installation needs to be handled manually. This can be done via Pkg/git if you put the dependency into the package repository's `deps/data` directory. More generally, `message` should give instructions on how to set up the data.
### DataDeps.register — Function

    register(datadep::AbstractDataDep)

Registers the given datadep to be globally available to the program. This makes `datadep"Name"` work. `register` should be run within the `__init__` function of your module.
### DataDeps.@datadep_str — Macro

    datadep"Name" or datadep"Name/file"

Use this just like you would a file path, except that you can refer to the datadep by name. The name alone will resolve to the corresponding folder, even if that means it has to be downloaded first. Adding a path within it functions as expected.
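For illustration, assuming a registered datadep named "MyExampleData" containing a file `data.csv` (both hypothetical), the macro can be used anywhere a path is expected:

```julia
using DataDeps

# Resolves to the datadep's folder, downloading it first if necessary.
folder = datadep"MyExampleData"

# A path within the datadep works the same way.
csv_path = datadep"MyExampleData/data.csv"
for line in eachline(csv_path)
    println(line)
end
```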
### Base.download — Function

    Base.download(
        datadep::DataDep,
        localdir;
        remotepath=datadep.remotepath,
        skip_checksum=false,
        i_accept_the_terms_of_use=nothing)

A method to download a datadep. Normally, you do not have to download a data dependency manually. If you simply cause the string macro `datadep"DepName"` to be executed, it will be downloaded if not already present. Invoking this `download` method manually is normally for purposes of debugging; as such, it includes a number of parameters that most people will not want to use.

- `localdir`: the local directory to save to.
- `remotepath`: the remote path to fetch the data from. Use this e.g. if you can't access the normal path where the data should be, but have an alternative.
- `skip_checksum`: setting this to true causes the checksum not to be checked. Use this if the data has changed since the checksum was set in the registry, or if for some reason you want to download different data.
- `i_accept_the_terms_of_use`: use this to bypass the "I agree to terms" screen. Useful if you are scripting the whole process, or using another system to get confirmation of acceptance.
    - For automation purposes you can set the environment variable `DATADEPS_ALWAYS_ACCEPT`.
    - If not set, and if `DATADEPS_ALWAYS_ACCEPT` is not set, then the user will be prompted.
    - Strictly speaking these are not always terms of use; it just refers to the message and permission to download.

If you need more control than this, then your best bet is to construct a new DataDep object based on the original, and then invoke `download` on that.
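As a sketch, a debugging download that skips both the checksum and the terms prompt might look like the following. The datadep, URL, and directory are hypothetical:

```julia
using DataDeps

# Hypothetical datadep; in practice use one you have constructed or registered.
dep = DataDep("MyExampleData", "placeholder message",
              "https://example.com/data/archive.tar.gz")

# Download into a scratch directory, skipping the checksum check
# and the interactive terms prompt.
download(dep, "/tmp/mydata";
         skip_checksum=true,
         i_accept_the_terms_of_use=true)
```

For unattended runs (e.g. CI), the prompt can instead be bypassed globally by setting `ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"` before resolution.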
## Helpers

### DataDeps.unpack — Function

    unpack(f; keep_originals=false)

Extracts the content of an archive into the current directory, deleting the original archive unless the `keep_originals` flag is set.
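`unpack` is most often passed as a `post_fetch_method`; to tweak its keyword arguments, wrap it in an anonymous function. A small sketch (the registration shown is hypothetical):

```julia
using DataDeps

# Keep the downloaded archive alongside the extracted files,
# instead of deleting it after extraction.
keep_and_unpack = f -> unpack(f; keep_originals=true)

# Used as, e.g.:
# register(DataDep("MyExampleData", "placeholder message",
#                  "https://example.com/archive.zip";
#                  post_fetch_method=keep_and_unpack))
```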
## Internal
### DataDeps.preupload_check — Method

    preupload_check(datadep, local_filepath[s])::Bool

Performs preupload checks on the local files without having to download them. This is a tool for creating or updating DataDeps, allowing the author to check the files before they are uploaded (or if downloaded directly). This checking includes verifying the checksum and making sure the `post_fetch_method` runs without errors. It basically performs datadep resolution, but bypasses the step of downloading the files. The results of performing the `post_fetch_method` are not kept. As normal, if the DataDep being checked does not have a checksum, or if the checksum does not match, then a warning message will be displayed. Similarly, if the `post_fetch_method` throws an exception, a warning will be displayed.

Returns: `true` or `false`, depending on whether the checks all passed or not.

Arguments:

- `datadep`: either an instance of a DataDep type, or the name of a registered DataDep as an `AbstractString`.
- `local_filepath`: a filepath or (recursive) list of filepaths. This is what would be returned by fetch in normal datadep use.
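A sketch of how an author might use this before uploading; the datadep name and staging path are hypothetical:

```julia
using DataDeps

# Check a locally prepared file against a registered datadep's
# checksum and post_fetch_method before uploading it.
ok = preupload_check("MyExampleData", "/tmp/staging/archive.tar.gz")
ok || @warn "File failed preupload checks; see warnings above."
```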
### DataDeps.resolve — Method

    resolve("name/path", @__FILE__)

This is the function that lives directly behind the `datadep"name/path"` macro. If you are working with the names of the datadeps programmatically, and don't want to download them by mistake, it can be easier to work with this function.

Note, though, that you must include `@__FILE__` as the second argument, as DataDeps.jl uses this to allow reading the package-specific `deps/data` directory. Advanced usage could specify a different file or `nothing`, but at that point you are on your own.
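For programmatic use, where the name is built at runtime rather than baked into a string macro (the datadep name below is hypothetical):

```julia
using DataDeps

# Equivalent to datadep"MyExampleData/data.csv", but the name
# can come from a variable.
name = "MyExampleData"
path = resolve("$name/data.csv", @__FILE__)
```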
### DataDeps.resolve — Method

    resolve(datadep, inner_filepath, calling_filepath)

Returns a path to the folder containing the datadep, even if that means downloading the dependency and putting it in there.

- `inner_filepath` is the path to the file within the data dir.
- `calling_filepath` is a path to the file where this is being invoked from.

This is basically the function that lives behind the string macro `datadep"DepName/inner_filepath"`.
### DataDeps.DisabledError — Type

For when functionality that is disabled is attempted to be used.

### DataDeps.NoValidPathError — Type

For when there is no valid location available to save to.

### DataDeps.UserAbortError — Type

For when a user has selected to abort.
### DataDeps._resolve — Method

The core of the `resolve` function without any user-friendly file stuff; returns the directory.
### DataDeps.accept_terms — Method

    accept_terms(datadep, localpath, remotepath, i_accept_the_terms_of_use)

Ensures the user accepts the terms of use; otherwise errors out.
### DataDeps.better_readline — Function

    better_readline(stream = stdin)

A version of `readline` that does not immediately return an empty string if the stream is closed. It will attempt to reopen the stream, and if that fails, throw an error.
### DataDeps.checksum — Method

    checksum(hasher=sha2_256, filename[/s])

Executes the hasher on the file/files and returns a `UInt8` array of the hash, `xor`ed if there are multiple files.
### DataDeps.checksum_pass — Method

    checksum_pass(hash, fetched_path)

Ensures the checksum passes, and handles the dialog with the user when it fails.
### DataDeps.determine_save_path — Function

    determine_save_path(name)

Determines the location to save a datadep with the given name to.
### DataDeps.ensure_download_permitted — Method

    ensure_download_permitted()

This function will throw an error if download functionality has been disabled; otherwise it will do nothing.
### DataDeps.env_bool — Function

    env_bool(key)

Checks for an environment variable and fuzzily converts it to a bool.

### DataDeps.env_list — Function

    env_list(key)

Checks for an environment variable and converts it to a list of strings, separated by colons.
### DataDeps.fetch_base — Method

    fetch_base(remote_path, local_dir)

Download from `remote_path` to `local_dir`, via `Base` mechanisms. The download is performed using `Base.download`, and `Base.basename(remote_path)` is used to determine the filename. This is very limited in the case of HTTP, as the filename is not always encoded in the URL, but it does work for simple paths like `"http://myserver/files/data.csv"`. In general, for those cases, prefer `http_download`.

The more important feature is that this works for anything that has overloaded `Base.basename` and `Base.download`, e.g. `AWSS3.S3Path`. While this doesn't work for all transport mechanisms (so some datadeps will still need a custom `fetch_method`), it works for many.
### DataDeps.fetch_default — Method

    fetch_default(remote_path, local_path)

The default fetch method. It tries to be a little bit smart, to work with things other than just HTTP. See also `fetch_base` and `fetch_http`.
### DataDeps.fetch_http — Method

    fetch_http(remotepath, localdir; update_period=5)

Pass in an HTTP[/S] URL and a directory to save it to, and it downloads that file, returning the local path. This uses the HTTP protocol's method of defining filenames in headers, if that information is present. Returns the local path that it was downloaded to.

`update_period` controls how often to print the download progress to the log. It is expressed in seconds, and the progress is printed at `@info` level in the log. By default it is once every 5 seconds, though this depends on configuration.
### DataDeps.handle_missing — Method

    handle_missing(datadep::DataDep, calling_filepath)::String

This function is called when the datadep is missing.
### DataDeps.input_bool — Function

    input_bool

Prompts the user for a yes or no answer.

### DataDeps.input_choice — Method

    input_choice

Prompts the user for one of a list of options.
### DataDeps.input_choice — Method

    input_choice

Prompts the user for one of a list of options. Takes a vararg of tuples of Letter, Prompt, Action (a 0-argument function).

Example:

    input_choice(
        ('A', "Abort -- errors out", ()->error("aborted")),
        ('X', "eXit -- exits normally", ()->exit()),
        ('C', "Continue -- continues running", ()->nothing),
    )
### DataDeps.is_valid_name — Method

    is_valid_name(name)

This checks if a datadep name is valid. This basically means it must be a valid folder name on Windows.
### DataDeps.list_local_paths — Method

    list_local_paths(name|datadep, [calling_filepath|module|nothing])

Lists all the local paths to a given datadep. This may be an empty list.
### DataDeps.postfetch_check — Method

    postfetch_check(post_fetch_method, local_path)

Executes the `post_fetch_method` on the given local path, in a temporary directory. Returns `true` if there are no exceptions. Performed in (async) parallel if multiple paths are given.
### DataDeps.preferred_paths — Function

    preferred_paths(calling_filepath; use_package_dir=true)

Returns the datadeps load path; additionally, if `calling_filepath` is provided, `use_package_dir=true`, and the file is currently inside a package directory, then it also includes the path to the datadeps in that folder.
### DataDeps.progress_update_period — Method

    progress_update_period()

Returns the period between progress updates being logged. This is used by the default `fetch_method`, and it is generally a good idea to use it in any custom fetch method, if possible.
### DataDeps.run_checksum — Method

Providing only a hash string results in defaulting to `sha2_256`, with that string being the target.

### DataDeps.run_checksum — Method

If a vector of paths is provided along with a vector of hashing methods (of any form), then they are all required to match.

### DataDeps.run_checksum — Method

If only a function is provided, then assume the user is a developer wanting to know what hash-line to add to the registration block.

### DataDeps.run_checksum — Method

If `nothing` is provided, then assume the user is a developer wanting to know what `sha2_256` hash-line to add to the registration block.

### DataDeps.run_checksum — Method

    run_checksum(checksum, path)

This runs the checksum on the files at the fetched path, and returns `true` or `false` based on whether the checksum matches (always `true` if no target sum is given). It is kinda flexible and accepts different kinds of behaviour to give different kinds of results.

If `path` (the second parameter) is a `Vector`, then unless `checksum` is also a `Vector`, the result is the `xor` of all the file checksums.

### DataDeps.run_checksum — Method

Use `Any` to mark as not caring about the hash. Use this for data that can change.
### DataDeps.run_fetch — Method

    run_fetch(fetch_method, remotepath, localdir)

Executes the `fetch_method` on the given remote path, into the local directory, returning the local path(s). Performed in (async) parallel if multiple paths are given.

### DataDeps.run_post_fetch — Method

    run_post_fetch(post_fetch_method, fetched_path)

Executes the `post_fetch_method` on the given fetched path. Performed in (async) parallel if multiple paths are given.
### DataDeps.splitpath — Method

    splitpath(path)

The opposite of `joinpath`: splits a path into each of its directory names / filename (for the last).
### DataDeps.try_determine_load_path — Method

    try_determine_load_path(name)

Tries to find a local path to the datadep with the given name. If it fails, then it returns nothing.
### DataDeps.try_determine_package_datadeps_dir — Method

    try_determine_package_datadeps_dir(filepath)

Takes a path to a file. If that path is in a package's folder, then this returns a path to the `deps/data` dir for that package (as a `Nullable`), which may or may not exist. If not in a package, returns null.
### DataDeps.uv_access — Method

    uv_access(path, mode)

Checks access to a path. Returns 2 results: first an error code (0 for all good), and second an error message. https://stackoverflow.com/a/47126837/179081