From 8675ab4d22836a1eb358d500fe97d9956a0ae988 Mon Sep 17 00:00:00 2001
From: Francesc Campoy
Date: Fri, 9 Mar 2018 13:26:01 -0800
Subject: [PATCH] [WIP] a proposal to document all datasets and models

Signed-off-by: Francesc Campoy
---
 developer-community/datasets-and-models.md | 255 +++++++++++++++++++++
 developer-community/graph.png              | Bin 0 -> 4553 bytes
 2 files changed, 255 insertions(+)
 create mode 100644 developer-community/datasets-and-models.md
 create mode 100644 developer-community/graph.png

diff --git a/developer-community/datasets-and-models.md b/developer-community/datasets-and-models.md
new file mode 100644
index 00000000..ed6fea7b
--- /dev/null
+++ b/developer-community/datasets-and-models.md
@@ -0,0 +1,255 @@

# Datasets, Kernels, Models, and Problems

As we start publishing more datasets and models, it is important to keep in mind why we're doing this.

> We publish datasets because we want to contribute back to the Open Source and Machine Learning communities.

We consider datasets and models to be good when they are:

- discoverable,
- reproducible, and
- reusable.

Keeping all of this in mind, let me propose a way to write documentation for these artifacts.

## A Common Vocabulary

The relationship between datasets, models, and the other concepts involved is commonly expressed with the following graph.

![dataset graph](graph.png)

The following sections get into more detail on each concept, but let me first give a quick intro to all of them.

### Problems

Everything we do at source{d} revolves around solving problems and making predictions. Problems are the starting motivation and end point of most of our Machine Learning processes.

Problems have a clear objective and a measure of success that lets us rank different solutions to the problem in an objective way. Think about accuracy, recall, etc.
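To make the idea of an objective measure of success concrete, here is a minimal sketch (not part of any existing tooling; the function name is hypothetical) of how accuracy could be computed for a batch of predictions:

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the ground truth.

    Purely illustrative: any real problem definition would pin down
    its own metric (accuracy, recall, F1, ...).
    """
    if not ground_truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Two out of three next-key predictions are correct:
print(accuracy(["a", "b", "c"], ["a", "b", "x"]))  # → 0.6666666666666666
```

A metric like this is what lets us say one solution to a problem is objectively better than another.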
An example problem could be predicting the next key a developer will press, given what they've written so far.

### Models

Problems are solved using models. Models are trained to solve a specific problem by feeding a dataset to a kernel that optimizes a set of parameters. These parameters, once optimized, are what models are made of.

Models can be treated as black boxes, where the only things we care about are the input and output formats. This makes it possible to reuse a model, either to solve the same problem again or to somehow feed into a different model (by knowledge transfer or other techniques).

Given the previous problem of predicting the next key pressed, a model could take as input the sequence of all keys pressed so far, as ASCII codes, and output a single ASCII code with the prediction.

A secondary goal of models is to be reproducible, meaning that someone could repeat the same process we went through and expect to obtain a similar result. If the kernel that generated the model requires metaparameters (such as the learning rate), these values should also be documented.

This is normally documented in research papers, with references to the datasets and kernels that were used, as well as how much training time it took to obtain the resulting model.

### Kernels

Kernels are algorithms that feed on datasets and generate models. These algorithms are responsible for describing the model architecture chosen to solve a problem (e.g. RNN, CNN, etc.) and what metaparameters were used.

### Datasets

Datasets contain information retrieved from one or more data sources, then pre-processed so it can easily be used to answer questions, solve problems, train models, or even serve as the data source for another dataset.

The most important aspects of a dataset are its format, how to download it, how to reproduce it, and what exactly each version contains.
Datasets evolve over time, and it's important to have versions that can be explicitly referred to from trained models.

### Predictor

The last piece of the puzzle is what I call a predictor. A predictor uses a model (sometimes more than one, sometimes no model at all) to predict the answer to a question given some input.

For instance, given a model trained on a large dataset of the keystrokes of thousands of developers, we could write a predictor that uses that trained model to make predictions. That would be a pretty decent predictor.

But we could also use a simple function that outputs random ASCII codes, ignoring any other information available. This predictor would probably have a lower accuracy for the given problem.

## Documenting these Artifacts

So far we've documented models and some datasets to a certain extent, but I think it's time to provide a framework for all of these elements to be uniformly documented, improving the discoverability, reproducibility, and reusability of our results.

We will evolve our documentation over time into something that will hopefully delight every one of our engineers and users. But for now, let's keep it realistic and propose a reduced set of measures we can start applying today to evolve towards that perfect solution.

## Current Status

Currently we document only datasets and models, in two different repositories: github.com/src-d/datasets and github.com/src-d/models.

We also have the modelforge tool, which is intended to provide a way to discover and download existing models.

### Datasets

We currently have only one public dataset: Public Git Archive. For this dataset we document:

- how to download the current version of the dataset with the `pga` CLI tool
- how to reproduce the dataset with borges and GHTorrent

What are we missing?

- versioning of the resulting dataset: how do we download this and previous versions?
- the format of the dataset
- what other datasets (and versions) were used to generate this one?
- what models have been trained with this dataset
- LICENSE (the tools and scripts are licensed, but what about the datasets themselves?)

### Models

Models are already documented following some structure, thanks to the efforts put in place for [modelforge](https://github.com/src-d/modelforge).

Currently models have an ID, which looks like a long random string, e.g. `f64bacd4-67fb-4c64-8382-399a8e7db52a`.

Models are accompanied by an example of how to use them; unfortunately, the examples are a bit simpler than expected. They mostly look like this:

```python
from ast2vec import DocumentFrequencies
df = DocumentFrequencies().load()
print("Number of tokens:", len(df))
```

What are we missing?

- versioned models, corresponding to versioned datasets
- a reference to the code (kernel) that was used to generate the model
- a technical sheet with accuracy, recall, etc. for the given model and dataset
- the format of the input and output of the model
- at least one example using the model to make a prediction

## My Proposal

Since we care about individual versioning of datasets and models, using one git repository per dataset and per model seems like an obvious choice.

Problems, predictors, and kernels can, for now, be documented directly with a model. If we see that we start to have too much repetition because we have many models for a single problem, we will reassess this decision.
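Before getting into repository layout, the vocabulary above can be recapped in code. The following is a minimal, hypothetical sketch (none of these names exist in our repositories) of the next-key-prediction problem, with a stub model and the random baseline predictor mentioned earlier:

```python
import random


class NextKeyModel:
    """Stand-in for a trained model: a sequence of ASCII codes (the
    keys pressed so far) goes in, one predicted ASCII code comes out.
    A real model would apply its optimized parameters; this stub
    naively repeats the last key pressed."""

    def predict(self, keys_so_far):
        return keys_so_far[-1] if keys_so_far else ord(" ")


def random_predictor(keys_so_far):
    """Baseline predictor: ignores its input entirely and returns a
    random printable ASCII code. Probably low accuracy, but a useful
    floor to compare real predictors against."""
    return random.randint(32, 126)


model = NextKeyModel()
print(model.predict([ord(c) for c in "def mai"]))  # → 105, i.e. ord("i")
print(random_predictor([]))  # some printable ASCII code
```

Documenting the input/output contract (here: list of ASCII codes in, one ASCII code out) is exactly what makes a model reusable as a black box.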
### Dataset Repository

A dataset repository should contain the following information:

- short description
- long description, with links to papers and blog posts
- technical sheet
  - size of the dataset
  - schema(s) of the dataset
  - download link
- using the dataset:
  - downloading the dataset
  - related tools
- reproducing the dataset:
  - links to the original data sources
  - related tools

### Model Repository

A model repository should contain the following information:

- short description
- long description, with links to papers and blog posts
- technical sheet
  - size of the model
  - input/output schemas
  - download link
  - datasets used to train the model (including versions)
- using the model:
  - downloading the model
  - loading the model
  - prerequisites (TensorFlow? Keras?)
  - quick guide: making a prediction
- reproducing the model:
  - link to the original dataset
  - kernel used to train the model
  - training process
    - hardware and time spent
    - metaparameters, if any
    - any other relevant details

### General

As with any other source{d} repository, we need to follow the guidelines in [Documentation at source{d}](https://github.com/src-d/guide/blob/master/engineering/documentation.md). This includes having a CONTRIBUTING.md, a Code of Conduct, etc.

Every time a new version of a dataset or model is released, a new tag and associated release should be created in the repository. The release should include links to anything that has changed since the previous release, such as a new version of the dataset or changes in the kernel.

### github.com/src-d/datasets and github.com/src-d/models

These two repositories should simply contain what is common to all datasets or to all models. They will also provide all the tooling built on top of the documentation for datasets and models.
Since we imagine these tools extracting information from the repositories automatically, it is important to keep formatting in mind.

I'm currently considering whether a `toml` file should be defined containing the data common to all the datasets and models. For instance, we could have the download size for each dataset and model, as well as the associated schemas. A simple tool could then generate documentation based on these values.
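To illustrate the idea, such a `toml` file for a dataset repository might look like the following sketch. Every field name and value here is hypothetical; the actual schema would have to be agreed upon first:

```toml
# Hypothetical metadata file (e.g. dataset.toml) at the root of a
# dataset repository. All field names and values are illustrative.
name = "example-dataset"
version = "v1.2.0"
license = "CC0-1.0"
download-size-bytes = 123456789
download-url = "https://example.com/example-dataset/v1.2.0.tar.gz"

[schema]
format = "csv"
fields = ["repository", "stars", "language"]
```

A documentation generator could then read this file from every dataset and model repository and render a uniform technical sheet.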