DataDeps.jl Documentation

What is DataDeps?

DataDeps is a package for simplifying the management of data in your Julia application. In particular, it is designed to simplify the process of getting static files from some server onto the local machine, and letting programs know where that data is.

For a few examples of its usefulness, see this blog post.

Usage in Brief:

I want to use some data I have in my project. What do?

The short version is:

  1. Stick your data anywhere with an open HTTP link. (Skip this if it is already online.)
  2. Write a DataDep registration block.
  3. Refer to the data using datadep"Dataname/file.csv" etc. as if it were a file path, and DataDeps.jl will sort out getting it onto your system.
  4. For CI purposes set the DATADEPS_ALWAYS_ACCEPT environment variable.
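
The steps above might look roughly like this inside a package. This is a minimal sketch, not a definitive layout: the module name MyPackage, the datadep name MyData, the URL, and the file name are all placeholders. No checksum is passed here, so DataDeps.jl will warn at download time; a real registration would normally supply one.

```julia
module MyPackage

using DataDeps

function __init__()
    # Step 2: register the datadep when the package is loaded.
    register(DataDep(
        "MyData",                               # placeholder datadep name
        """
        Dataset: MyData (placeholder description)
        Website: https://example.com/mydata
        """,                                    # message shown at the accept prompt
        "https://example.com/mydata/file.csv",  # placeholder URL (step 1)
    ))
end

# Step 3: use the datadep as if it were a file path.
# The download is triggered the first time the path is resolved.
first_line() = open(readline, datadep"MyData/file.csv")

end # module
```

For step 4, the CI environment would set DATADEPS_ALWAYS_ACCEPT to true so that the accept prompt is skipped.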

Where can I store my data online?

Wherever you want, so long as it gives an open HTTP(S) link to download it. **

** (Other protocols and authentication can be supported by using a different fetch_method.)
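
As an illustration of a different fetch_method, here is a sketch that copies a file from a mounted path instead of downloading over HTTP. It assumes that a fetch_method is a function taking the remote path and a local directory and returning the path of the fetched file. The function copy_fetch, the datadep name MountedData, and the path /mnt/share/data.csv are hypothetical.

```julia
using DataDeps

# Hypothetical fetcher: copy from a mounted location rather than using HTTP.
# Takes (remote_path, local_dir) and returns the path of the local copy.
function copy_fetch(remote_path, local_dir)
    local_path = joinpath(local_dir, basename(remote_path))
    cp(remote_path, local_path; force=true)
    return local_path
end

register(DataDep(
    "MountedData",                                     # placeholder name
    "Placeholder: data copied from a mounted share.",  # message shown at the prompt
    "/mnt/share/data.csv";                             # placeholder "remote" path
    fetch_method = copy_fetch,
))
```

The same shape should apply to an authenticated download: the function body changes, while the two arguments and the returned local path stay the same.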

Why not store the data in Git?

Git is good for files that meet 3 requirements:

There is certainly some room around the edges here: storing a few images in the repository is OK, but storing all of ImageNet is a no-go. For those edge cases, ManualDataDeps are good (see below).

DataDeps.jl is good for:

The main use case is downloading large datasets for machine learning, and corpora for NLP. In this case the data is not even normally yours to begin with. It lives on some website somewhere. You don't want to copy and redistribute it, and depending on licensing you may not even be allowed to.

But my data is dynamic

Well, how dynamic? If you are willing to tag a new release of your package each time the data changes, then maybe this is no worry, but maybe it is.

But the real question is whether DataDeps.jl is really suitable for managing your data in the first place. DataDeps.jl does not provide versioning of data: you can't force users to download new copies of your data using DataDeps. There are workarounds, such as using DataDeps.jl together with deps/build.jl to run rm(datadep"MyData", recursive=true, force=true) on every package update, or treating each version of the data as a different datadep with a different name. DataDeps.jl may form part of your overall solution, or it may not. That is a discussion to have on Slack or Discourse (feel free to tag me, I am @oxinabox on both). See also the list of related packages at the bottom.
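
The "different datadep per data version" workaround could be sketched like this. The name MyData_v2, the URL, and the file name are placeholders. The idea is only that a package release bumps the suffix whenever the data changes, so users resolve a name they have not downloaded yet.

```julia
using DataDeps

register(DataDep(
    "MyData_v2",   # placeholder name; bump the suffix when the data changes
    "Placeholder description of MyData, version 2.",
    "https://example.com/mydata/v2/file.csv",  # placeholder URL for this version
))

# Code in the package refers to the current versioned name.
# Older copies (e.g. MyData_v1) stay on disk until the user removes them.
data_path() = datadep"MyData_v2/file.csv"
```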

The other option applies if your data is a good fit for git. If it is in the overlapping area of plaintext and small (or close enough to those things), then you could add it as a ManualDataDep and include it in the git repo in the deps/data/ folder of your package. The ManualDataDep will not need manual installation if it is being installed via git.
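
A ManualDataDep registration for data shipped in the repository might look like the sketch below. The module name MySmallDataPackage, the datadep name MySmallData, and the file table.csv are placeholders. It assumes the files sit in a folder named after the datadep under deps/data/ in the package, which is an assumption about layout rather than something stated above.

```julia
module MySmallDataPackage

using DataDeps

function __init__()
    register(ManualDataDep(
        "MySmallData",   # placeholder name, assumed to match deps/data/MySmallData/
        """
        Placeholder message: this data ships with the package's git repository.
        If it is missing, copy it into a folder named MySmallData by hand.
        """
    ))
end

# Resolved like any other datadep; nothing is downloaded for a ManualDataDep.
table_path() = datadep"MySmallData/table.csv"

end # module
```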

Other similar packages:

DataDeps.jl isn't the answer to everyone's download needs. It is focused squarely on static data. It is opinionated about providing user-readable metadata at a prompt that must be accepted. It doesn't try to understand what the data means at all. It might not be good for your use case.

Alternatives that I am aware of are:

Outside of Julia's ecosystem is