This is just a quick post to show off DataDeps.jl. DataDeps.jl is the long discussed BinDeps for data. At it’s heart it is a tool for reproducible data science. It means anyone trying to run your code later, in a different environment isn’t faffing around trying to work out where to download the data from and how to connect it to your scripts.
I am not going to go into too much detail here, It is all documented in the package README. This is more of a demo.
Like most of my blog-posts this is available as Jupyter Notebook on my Github.
The key features of DataDeps.jl are:
- Repeatable and idempotent data setup/downloading
- Flexible sortage locations that do not require change to code
- Checksum validated downloads
- Messages to users before downloading the first time.
Enough promo, on with the examples:
Example 1: Word Embeddings, data for your model
Your system might need word embeddings. They are pretty important for a lot of NLP research. If you want to use pretrained ones, they can be pretty big though. Too big for adding to your repository. They are definitely data that your model depends on.
Embeddings.jl exposes a large number of pretrained embeddings, basically using the process demonstrated below. It is just a front-end over a few hundred datadeps.
Input:
First we are going to register a DataDep.
In a package this would go in your modules __init__
function.
We are going to declare a data dependency for some word embeddings.
Input:
Let’s see what we’ve got. Rather than needing to refer to your data by a path on disk, DataDeps.jl allows you to refer to it by name with a datadep string macro. This resolves into a path to it on disk – even if that mean downloading it first. (But because I’ve run this code before, it was already downloaded so no download occurs this time.)
Input:
Output:
1-element Array{String,1}:
"wiki.en.vec"
Now we are going to define a function to load up those word embeddings. DataDeps.jl doesn’t handle loading data – just downloading data. To load the data would require understanding a lot about its format. That is left to the user, or to other packages like MLDatasets.jl that know what the data they are consuming is.
Notice here the use of filepath=datadep"FastText en/wiki.en.vec"
as an optional argument.
This is a common pattern that I recommend using with DataDeps.jl.
It means if the user provides a path, the datadep string is never evaluated.
Which in turn means the data download will not be triggered (though in this case it has already been).
Input:
Output:
word_embeddings (generic function with 2 methods)
Input:
Input:
Input:
Output:
Dict{String,Array{Float32,1}} with 52 entries:
"tape" => Float32[0.31498, -0.041574, -0.096835, -0.087724, -0.078622, -0…
"egg" => Float32[0.23074, 0.014205, -0.3986, 0.057022, 0.032088, 0.51731…
"risk" => Float32[-0.22041, 0.043005, -0.16092, 0.42121, -0.31625, -0.129…
"banana" => Float32[-0.30111, -0.19338, 0.035946, 0.040627, 0.24098, -0.356…
"china" => Float32[0.065689, 0.22287, -0.02309, 0.22571, -0.40829, 0.20209…
"walk" => Float32[0.053497, 0.17538, -0.12849, 0.068115, -0.34802, -0.206…
"nails" => Float32[0.38722, -0.087961, -0.33036, 0.25719, -0.10132, 0.3656…
"rook" => Float32[0.16451, 0.044197, -0.31782, 0.04001, -0.1339, 0.26903,…
"down" => Float32[-0.17515, 0.021885, -0.25901, 0.20048, -0.19916, -0.056…
"glue" => Float32[0.22836, 0.14853, -0.36956, 0.27853, -0.40004, 0.12266,…
"face" => Float32[-0.14819, 0.16016, -0.31916, 0.28058, -0.34405, 0.01762…
"old" => Float32[-0.063426, -0.021367, 0.056441, 0.1353, 0.12058, 0.2589…
"turkey" => Float32[-0.51473, 0.02193, -0.24228, 0.22971, -0.61302, -0.1356…
"cheese" => Float32[0.20742, 0.04882, 0.078373, -0.24411, -0.24788, 0.35715…
"truck" => Float32[-0.049757, -0.24887, 0.077345, 0.38045, -0.50517, -0.20…
"cricket" => Float32[-0.15663, 0.18529, -0.074915, 0.80175, 0.37125, 0.01455…
"shed" => Float32[0.076611, -0.17981, -0.26234, 0.57905, -0.25095, -0.026…
"reward" => Float32[-0.05137, -0.096855, -0.13516, 0.029344, -0.13654, -0.2…
"run" => Float32[0.19979, 0.20623, -0.22006, 0.084749, -0.26972, -0.0459…
"baseball" => Float32[0.044589, -0.089292, 0.18082, 0.54954, -0.25423, -0.247…
"saw" => Float32[-0.027857, -0.016778, -0.28143, 0.42337, 0.14235, 0.063…
"danger" => Float32[-0.15803, -0.18926, -0.35727, 0.25279, -0.4704, -0.0768…
"red" => Float32[-0.1397, -0.19608, 0.44096, 0.084868, 0.28052, -0.16625…
"swim" => Float32[0.17168, 0.18579, -0.60043, 0.36278, -0.24944, -0.26992…
"green" => Float32[-0.27572, -0.099347, 0.30856, 0.24058, 0.1654, 0.031648…
⋮ => ⋮
Let’s visualise them. Had to do a bit of hacking around with Plots.jl to get the visualisation I want. Color according to category, text according to index
Input:
Input:
Output:
That worked pretty well, I think tweaking the perplexity on TSNe a bit more could get better result. Of course as with all dimentionality reduction some information is going to be lost and not expressed in the final form. Sill I think it has done well, the FastText embeddings are pretty good. Notice that it has located turkey and china together, pressumably because their embeddings reflect that they are countries. FastText actually doesn’t capture countries during training as far as I can tell, I believe they attempt to remove all proper nouns during preprocessing (E.g. England is not in their), but I guess China and Turkey slip through as they are also regular nouns. Notice also that the ball-sports are located together, separately from movement types like swim, walk and run. New is near Fresh and old is near stale.
Example 2: WordNet.jl: Data for your package
I love WordNet.jl.
WordNet is a pretty fundermental tool for NLP research (though it is getting a bit dated).
WordNet.jl is the julia binding.
Understandably, @jbn doesn’t want to include the WordNet database in the repository.
Because of concerns about the filesize, and about redistributing someone elses work.
However, it is fully dependent on having that data.
So by not automatically installing that it makes it hard to build anything on top of it.
Now this could be done with BinDeps for example or just by sticking a download
into /deps/build.jl
,
but that isn’t great for this.
There is no chance to display a message about the data’s real owner,
and the location of the data wouldn’t be flexible – a path would need to be hardcoded in.
DataDeps.jl expressly designed for these concerns.
(Update 2018: WordNet.jl now uses DataDeps.jl to do exactly this.)
What do we have to do to get WordNet.jl working without any manual data configuration by the user?
Input:
Input:
Output:
Input:
That is it, that declaration of the datadep via the registration block (Mostly just copy-pasted from the WordNet website),
and the addition of a method to the DB constructor, and we are done.
Input:
Output:
turkey.n
Input:
Output:
5-element Array{WordNet.Synset,1}:
(n) Meleagris gallopavo, turkey (large gallinaceous bird with fan-shaped tail; widely domesticated for food)
(n) Republic of Turkey, Turkey (a Eurasian republic in Asia Minor and the Balkans; on the collapse of the Ottoman Empire in 1918, the Young Turks, led by Kemal Ataturk, established a republic in 1923)
(n) joker, turkey (a person who does something thoughtless or annoying; "some joker is blocking the driveway")
(n) turkey (flesh of large domesticated fowl usually roasted)
(n) bomb, dud, turkey (an event that fails badly or is totally ineffectual; "the first experiment was a real turkey"; "the meeting was a dud as far as new business was concerned")
Input:
Output:
8-element Array{WordNet.Synset,1}:
(n) land, country, state (the territory occupied by a nation; "he returned to the land of his birth"; "he visited several European countries")
(n) administrative division, administrative district, territorial division (a district defined for administrative purposes)
(n) district, dominion, territory, territorial dominion (a region marked off for administrative or other purposes)
(n) region (a large indefinite location on the surface of the Earth; "penguins inhabit the polar regions")
(n) location (a point or extent in space)
(n) physical object, object (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects")
(n) physical entity (an entity that has physical existence)
(n) entity (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))
Example 3: 538: Aveneger’s Comic Book Characters: DataDepsGenerators.jl
So this last example is a change to show off DataDepsGenerators.jl. It does the kinda fragile webscraping to generate code for registration blocks, which you can then edit and include into your project that uses DataDeps.jl.
We are going to load-up 538’s dataset on Marvel Comic book characters.
Input:
Input:
Output:
Now DataDepsGenerators.jl isn’t perfect, you do have to check it by hand, and probably edit it a bit. For example because of how 538 laid out their github repo (see issue fivethirtyeight/data/#101, DataDepsGenerators thinks this data is MIT lisenced. It is actually Creative Commons Attribution 4.0 International License. We’re in complaince with that notice anyway, as it includes (I believe, but IANAL) all the attribution information we need.
Not also it has failed to give it a good datadep name.
You shouldn’t do this in your packages, but for an demo like this, we can register that generated data dep immediately. We’ll pass in the name to the generator this time too.
Input:
Time to load it up, and then we will do some visualisations.
Input:
Lets see what the distribution of how frequently characters the characters appear appear is:
Input:
Output:
Looks kinda Ziphian. Not surprising. So who are the heavy hitters?
Input:
Output:
That’s Spiderman, Captain America, Wolverine, Ironman and Thor. Cool cool. So that is a bunch of dudes. How is the distribution of appreances is you separate out by gender:
Input:
Output:
Input:
Output:
Ok, well that tells a story. Note that the scale (vertical and horisontal) for the ladies plot is less than half that as for the dudes.
Lets see when characters were introduced, this is the year the characters join the avengers no the year they were first published (unfortunately). I suspect they correlate to some degree though.
Input:
Output:
Anyway that is enough about comic books.
DataDeps.jl isn’t about processing data, or the stuff I can do with it.
It is about setting up data so you can do stuff with it
Conclusion
DataDeps.jl: get on it.
Sort out your data depenancies.
Make your scientific code easier for other people reproduce run by having it automatically download it’s data.
Make your packages easier to install by removing manual steps.
Spend less time worrying about setting up and managing your data,
and more time analysing it and advancing science.