From 8675ab4d22836a1eb358d500fe97d9956a0ae988 Mon Sep 17 00:00:00 2001
From: Francesc Campoy
Date: Fri, 9 Mar 2018 13:26:01 -0800
Subject: [PATCH] [WIP] a proposal to document all datasets and models

Signed-off-by: Francesc Campoy
---
 developer-community/datasets-and-models.md | 255 +++++++++++++++++++++
 developer-community/graph.png              | Bin 0 -> 4553 bytes
 2 files changed, 255 insertions(+)
 create mode 100644 developer-community/datasets-and-models.md
 create mode 100644 developer-community/graph.png

diff --git a/developer-community/datasets-and-models.md b/developer-community/datasets-and-models.md
new file mode 100644
index 00000000..ed6fea7b
--- /dev/null
+++ b/developer-community/datasets-and-models.md
@@ -0,0 +1,255 @@

# Datasets, Kernels, Models, and Problems

As we start publishing more datasets and models, it is important to keep in mind why we're doing this.

> We publish datasets because we want to contribute back to the Open Source and Machine Learning communities.

We consider datasets and models to be good when they are:

- discoverable,
- reproducible, and
- reusable.

Keeping all of this in mind, let me propose a way to write documentation for these artifacts.

## A Common Vocabulary

The relationship between datasets, models, and the other concepts involved is commonly expressed with the following graph.

![dataset graph](graph.png)

The following sections get into more detail on each concept, but let me first give a quick intro to all of them.

### Problems

Everything we do at source{d} revolves around solving problems and making predictions. Problems are the starting motivation and end point of most of our Machine Learning processes.

Problems have a clear objective and a measure of success that lets us rank different solutions to the problem in an objective way. Think about accuracy, recall, etc.
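To make the idea of an objective measure of success concrete, here is a minimal sketch (not part of any existing tooling; the function name is hypothetical) of how accuracy could be computed for a batch of predictions:

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that exactly match the ground truth.

    Purely illustrative: any real problem definition would pin down
    its own metric (accuracy, recall, F1, ...).
    """
    if not ground_truth:
        return 0.0
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Two out of three next-key predictions are correct:
print(accuracy(["a", "b", "c"], ["a", "b", "x"]))  # → 0.6666666666666666
```

A metric like this is what lets us say one solution to a problem is objectively better than another.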
An example problem could be predicting the next key a developer will press, given what they've written so far.

### Models

Problems are solved using models. Models are trained to solve a specific problem by feeding a dataset to a kernel that optimizes a set of parameters. These parameters, once optimized, are what models are made of.

Models can be treated as black boxes, where the only things we care about are the input and output formats. This makes it possible to reuse a model, either to solve the same problem again or to somehow feed into a different model (by knowledge transfer or other techniques).

Given the previous problem of predicting the next key pressed, a model could take as input the sequence of all keys pressed so far, as ASCII codes, and output a single ASCII code with the prediction.

A secondary goal of models is to be reproducible, meaning that someone could repeat the same process we went through and expect to obtain a similar result. If the kernel that generated the model requires metaparameters (such as the learning rate), these values should also be documented.

This is normally documented in research papers, with references to the datasets and kernels that were used, as well as how much training time it took to obtain the resulting model.

### Kernels

Kernels are algorithms that feed on datasets and generate models. These algorithms are responsible for describing the model architecture chosen to solve a problem (e.g. RNN, CNN, etc.) and what metaparameters were used.

### Datasets

Datasets contain information retrieved from one or more data sources, then pre-processed so it can easily be used to answer questions, solve problems, train models, or even serve as the data source for another dataset.

The most important aspects of a dataset are its format, how to download it, how to reproduce it, and what exactly each version contains.
Datasets evolve over time, and it's important to have versions that can be explicitly referred to from trained models.

### Predictor

The last piece of the puzzle is what I call a predictor. A predictor uses a model (sometimes more than one, sometimes no model at all) to predict the answer to a question given some input.

For instance, given a model trained on a large dataset of the keystrokes of thousands of developers, we could write a predictor that uses that trained model to make predictions. That would be a pretty decent predictor.

But we could also use a simple function that outputs random ASCII codes, ignoring any other information available. This predictor would probably have a lower accuracy for the given problem.

## Documenting these Artifacts

So far we've documented models and some datasets to a certain extent, but I think it's time to provide a framework for all of these elements to be uniformly documented, improving the discoverability, reproducibility, and reusability of our results.

We will evolve our documentation over time into something that will hopefully delight every one of our engineers and users. But for now, let's keep it realistic and propose a reduced set of measures we can start applying today to evolve towards that perfect solution.

## Current Status

Currently we document only datasets and models, in two different repositories: github.com/src-d/datasets and github.com/src-d/models.

We also have the modelforge tool, which is intended to provide a way to discover and download existing models.

### Datasets

We currently have only one public dataset: Public Git Archive. For this dataset we document:

- how to download the current version of the dataset with the `pga` CLI tool
- how to reproduce the dataset with borges and GHTorrent

What are we missing?

- versioning of the resulting dataset: how do we download this and previous versions?
- the format of the dataset
- what other datasets (and versions) were used to generate this one?
- what models have been trained with this dataset
- LICENSE (the tools and scripts are licensed, but what about the datasets themselves?)

### Models

Models are already documented following some structure, thanks to the efforts put in place for [modelforge](https://github.com/src-d/modelforge).

Currently models have an ID, which looks like a long random string, e.g. `f64bacd4-67fb-4c64-8382-399a8e7db52a`.

Models are accompanied by an example of how to use them; unfortunately, the examples are a bit simpler than expected. They mostly look like this:

```python
from ast2vec import DocumentFrequencies
df = DocumentFrequencies().load()
print("Number of tokens:", len(df))
```

What are we missing?

- versioned models, corresponding to versioned datasets
- a reference to the code (kernel) that was used to generate the model
- a technical sheet with accuracy, recall, etc. for the given model and dataset
- the format of the input and output of the model
- at least one example using the model to make a prediction

## My Proposal

Since we care about individual versioning of datasets and models, using one git repository per dataset and per model seems like an obvious choice.

Problems, predictors, and kernels can, for now, be documented directly with a model. If we see that we start to have too much repetition because we have many models for a single problem, we will reassess this decision.
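Before getting into repository layout, the vocabulary above can be recapped in code. The following is a minimal, hypothetical sketch (none of these names exist in our repositories) of the next-key-prediction problem, with a stub model and the random baseline predictor mentioned earlier:

```python
import random


class NextKeyModel:
    """Stand-in for a trained model: a sequence of ASCII codes (the
    keys pressed so far) goes in, one predicted ASCII code comes out.
    A real model would apply its optimized parameters; this stub
    naively repeats the last key pressed."""

    def predict(self, keys_so_far):
        return keys_so_far[-1] if keys_so_far else ord(" ")


def random_predictor(keys_so_far):
    """Baseline predictor: ignores its input entirely and returns a
    random printable ASCII code. Probably low accuracy, but a useful
    floor to compare real predictors against."""
    return random.randint(32, 126)


model = NextKeyModel()
print(model.predict([ord(c) for c in "def mai"]))  # → 105, i.e. ord("i")
print(random_predictor([]))  # some printable ASCII code
```

Documenting the input/output contract (here: list of ASCII codes in, one ASCII code out) is exactly what makes a model reusable as a black box.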
### Dataset Repository

A dataset repository should contain the following information:

- short description
- long description, with links to papers and blog posts
- technical sheet
  - size of the dataset
  - schema(s) of the dataset
  - download link
- using the dataset:
  - downloading the dataset
  - related tools
- reproducing the dataset:
  - links to the original data sources
  - related tools

### Model Repository

A model repository should contain the following information:

- short description
- long description, with links to papers and blog posts
- technical sheet
  - size of the model
  - input/output schemas
  - download link
  - datasets used to train the model (including versions)
- using the model:
  - downloading the model
  - loading the model
  - prerequisites (TensorFlow? Keras?)
  - quick guide: making a prediction
- reproducing the model:
  - link to the original dataset
  - kernel used to train the model
  - training process
    - hardware and time spent
    - metaparameters, if any
    - any other relevant details

### General

As with any other source{d} repository, we need to follow the guidelines in [Documentation at source{d}](https://github.com/src-d/guide/blob/master/engineering/documentation.md). This includes having a CONTRIBUTING.md, a Code of Conduct, etc.

Every time a new version of a dataset or model is released, a new tag and associated release should be created in the repository. The release should include links to anything that has changed since the previous release, such as a new version of the dataset or changes in the kernel.

### github.com/src-d/datasets and github.com/src-d/models

These two repositories should simply contain what is common to all datasets or to all models. They will also provide all the tooling built on top of the documentation for datasets and models.
Since we imagine these tools extracting information from the repositories automatically, it is important to keep formatting in mind.

I'm currently considering whether a `toml` file should be defined containing the data common to all the datasets and models. For instance, we could have the download size for each dataset and model, as well as the associated schemas. A simple tool could then generate documentation based on these values.
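To illustrate the idea, such a `toml` file for a dataset repository might look like the following sketch. Every field name and value here is hypothetical; the actual schema would have to be agreed upon first:

```toml
# Hypothetical metadata file (e.g. dataset.toml) at the root of a
# dataset repository. All field names and values are illustrative.
name = "example-dataset"
version = "v1.2.0"
license = "CC0-1.0"
download-size-bytes = 123456789
download-url = "https://example.com/example-dataset/v1.2.0.tar.gz"

[schema]
format = "csv"
fields = ["repository", "stars", "language"]
```

A documentation generator could then read this file from every dataset and model repository and render a uniform technical sheet.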