<!DOCTYPE html>
<html lang="en">
<!-- All of the metadata for the page belongs in the head tag -->
<head>
<!-- Google tag (gtag.js) -->
<script async src="https://www.googletagmanager.com/gtag/js?id=G-PWM4DVJLW0"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'G-PWM4DVJLW0');
</script>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="icon" href="../images/bio-photo.jpg">
<link rel="stylesheet" href="../css/html5reset.css">
<link rel="stylesheet" href="../css/style.css">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Roboto:wght@300&display=swap" rel="stylesheet">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">
<title>Personal Portfolio</title>
</head>
<body class="body_reads">
<div class="skip"><a href="#main">Skip to Main Content</a></div>
<header id="header">
<nav>
<ul>
<li><a href="../index.html">Home</a></li>
<li><a href="../pages/projects.html">Projects</a></li>
<li><a href="../pages/blogs.html">Blogs</a></li>
<li class="active"><a href="../pages/readings.html">My Reads</a></li>
</ul>
</nav>
<!-- <div class="mode_changer">
<p> Dark Mode </p>
<label class="switch">
<input type="checkbox" id="mode_changer">
<span class="slider round"></span>
</label>
</div> -->
</header>
<main id="main">
<div class="page_start">
<p>
I like to read in my free time, be it a research paper or a book. Here are some of the papers/books I find
interesting:
</p>
</div>
<h1 class="div_heads">Research Papers</h1>
<div class="all_projects" id="papers">
<div class="project">
<p class = "project_heading">
Perceiver IO: A general architecture for structured inputs & outputs
</p>
<div class = "project_image">
<figure>
<img src="../images/perceiver.png" alt = "Perceiver IO architecture" width="500">
<figcaption>Perceiver IO architecture (<a href="https://arxiv.org/pdf/2107.14795.pdf" aria-label="src_1" style="color: black; text-decoration: underline;">source</a>)</figcaption>
</figure>
</div>
<div class="paper_desc" id="proj1">
<span class="bolded">
- What does the paper do?
</span>
<p class='bio_para'>
    The authors ask the question: "Is the development of problem-specific models for
    each new set of inputs and outputs unavoidable?" Along those lines, they propose
    a follow-up work, Perceiver IO, that gives good results even on some complex vision
    and language tasks, as opposed to its predecessor, Perceiver.
</p>
<span class="bolded">
- How do they do it?
</span>
<p class='bio_para'>
    The overall model follows an encoder-decoder architecture. The encoder takes as
    input a byte array of the input (image, language, etc.) and maps it to a latent vector,
    which usually has far fewer dimensions than the input vector. This latent vector is then passed
    through an attention block that follows the same query-key-value attention and point-wise MLP
    structure as Transformers. In the decoder, the latent vector is mapped to the output using a
    learnable query vector. The output has the same dimensions as the query vector. This formulation
    allows arbitrarily sized outputs and gives you control over the generated output through the
    query vector. The structure and formulation of the queries depends on what specific task
    is being performed.
</p>
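<p class='bio_para'>
    As a rough sketch of what this encode-process-decode pattern looks like in code
    (my own illustration, not the authors' implementation; all shapes and hyperparameters
    are made up for the example):
</p>
<pre><code>
# Minimal sketch of the Perceiver IO encode / process / decode pattern (PyTorch).
# Dimensions are illustrative, not the paper's actual settings.
import torch
import torch.nn as nn

M, C = 1024, 64    # input byte array: M elements with C channels
N, D = 128, 256    # latent array: N latents with D channels (N much smaller than M)
O, E = 10, 256     # output query array: O queries with E channels

inputs  = torch.randn(1, M, C)
latents = nn.Parameter(torch.randn(1, N, D))   # learned latent array
queries = nn.Parameter(torch.randn(1, O, E))   # learned output queries

# Encode: cross-attention maps the large input array onto the small latent array.
encode = nn.MultiheadAttention(embed_dim=D, num_heads=1, kdim=C, vdim=C, batch_first=True)
z, _ = encode(latents, inputs, inputs)

# Process: Transformer-style self-attention plus MLP, entirely in latent space.
process = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
z = process(z)

# Decode: output queries cross-attend to the latents; the output takes the shape
# of the query array, so the output size is arbitrary and task-dependent.
decode = nn.MultiheadAttention(embed_dim=E, num_heads=1, kdim=D, vdim=D, batch_first=True)
out, _ = decode(queries, z, z)
print(out.shape)   # torch.Size([1, 10, 256])
</code></pre>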
<span class="bolded">
- Positive points of the paper
</span>
<p class='bio_para'>
    The paper is a novel contribution to the problem of task-specific models. It uses the
    concept of projecting the input space to a latent space, which makes it computationally
    efficient. The authors also performed various experiments on tasks like image classification,
    optical flow and other language tasks, and found performance on par with various baselines.
</p>
<span class="bolded">
- Negative points of the paper
</span>
<p class='bio_para'>
    The paper sets out to build a general model that doesn't take task specificity into account,
    but I think task specificity still gets embedded (and is manual to a certain extent)
    in how the latent and query vectors are formed. Also, the model architecture has no way of
    interpreting its results, which could mean that the model learnt something specific to particular
    datasets. More robust experimentation with different sorts of data would have been nice to verify that.
</p>
</div>
<button class="read_more" id="more1">Read More...</button>
</div>
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
Where2Act: From Pixels to Actions for Articulated 3D Objects
</p>
<div class = "project_image">
<figure>
<img src="../images/where2act.png" alt = "Teaser image for where2act" width="500">
<figcaption>Teaser image for the paper(<a href="https://arxiv.org/pdf/2101.02692.pdf" aria-label="src_2" style="color: black; text-decoration: underline;">source</a>)</figcaption>
</figure>
</div>
<div class="paper_desc" id="proj2">
<span class="bolded">
- What does the paper do?
</span>
<p class='bio_para'>
    The paper focuses on the idea of agent-environment interaction using visual perception.
    We as humans are not only able to interact with objects easily (pulling a door open, flipping
    a switch, etc.) but are also able to infer precisely how and where this interaction needs to
    take place. Taking motivation from this, given an image and its depth map, the authors predict,
    for each pixel, a set of actions (like pulling, pulling-up, pulling-down, etc.), a gripper
    end-effector 3-DoF orientation in SO(3) space (the proposed action), and a probability of
    success for this action.
</p>
<span class="bolded">
- How do they do it?
</span>
<p class='bio_para'>
    The input here is an RGB image or a point cloud. The authors use either a ResNet-18 (for images)
    or a PointNet++ (for point clouds) network to extract per-pixel features from the input. These
    features are then passed through three MLPs, each with a 128-dimensional hidden layer, to get the
    final outputs required for interaction, i.e., the per-pixel actionability, the proposed action,
    and the success likelihood of the proposed action. The final loss function is a scaled sum of the
    losses from the three MLP heads. Getting data for such specific interactions is difficult, so the
    authors use SAPIEN, a state-of-the-art physics-based simulator. Initially, the authors sampled data
    offline and trained the model on it, but they found that this data is highly imbalanced, since only
    a small fraction of random interactions succeed. To mitigate this, the authors propose an
    online adaptive data sampling strategy: they use the model learnt from the offline-sampled
    data to find interactions that give successful results and sample more data in those regions. The
    final experiments are done on fifteen object categories from the PartNet-Mobility dataset.
    Of these, ten are used for training and the remaining five for testing. The authors use two
    evaluation metrics. First, they use the F-score to measure the quality of the predicted actions.
    Second, they define a ratio between the true-positive action proposals and the total number of proposals.
</p>
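<p class='bio_para'>
    A rough sketch of the three prediction heads on top of the backbone features (my own
    illustration, not the authors' released code; the hidden sizes and the 6-d orientation
    encoding are assumptions):
</p>
<pre><code>
# Sketch of Where2Act-style decoding heads over per-pixel backbone features.
import torch
import torch.nn as nn

feat_dim = 128
f = torch.randn(4096, feat_dim)        # one ResNet-18 / PointNet++ feature per pixel or point

def head(in_dim, out_dim):
    # small MLP with a 128-unit hidden layer (size assumed for illustration)
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

actionability = head(feat_dim, 1)       # how actionable is this pixel at all?
proposal      = head(feat_dim, 6)       # proposed gripper orientation (6-d rotation encoding)
critic        = head(feat_dim + 6, 1)   # success likelihood of a specific proposal

a = torch.sigmoid(actionability(f))                   # per-pixel actionability map
R = proposal(f)                                       # proposed action per pixel
s = torch.sigmoid(critic(torch.cat([f, R], dim=1)))   # success score per proposal

# Training combines the three objectives as a scaled sum, e.g.
# loss = w1 * loss_actionability + w2 * loss_proposal + w3 * loss_success
</code></pre>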
<span class="bolded">
- Positive points of the paper
</span>
<p class='bio_para'>
Given the complexity of the task, the solution provided by the authors just uses off-the-shelf
feature extractors and MLPs, which is elegant. It is also one of the first papers performing such
fine-grained level task. The experiments are performed robustly, the authors test their algorithm
on unseen test data and perform ablation studies to prove their claim.
</p>
<span class="bolded">
- Negative points of the paper
</span>
<p class='bio_para'>
    The authors represent the world in a very simplistic manner, with just six actions. Usually,
    actions are a lot more complicated. Some actions might also require pushing with one hand while
    pulling with the other (e.g., rotating a table on its axis); their current model architecture
    might not be able to handle that. Also, though this is a huge step in terms of agent-environment
    interaction, it is still far from a real scene, as the authors used a simulation-based setup.
</p>
</div>
<button class="read_more" id="more2">Read More...</button>
</div>
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
Learning Transferable Visual Models From Natural Language Supervision
</p>
<div class = "project_image">
<figure>
<img src="../images/clip.png" alt = "CLIP model architecture" width="500">
<figcaption>CLIP model architecture(<a href="https://arxiv.org/pdf/2103.00020.pdf" aria-label="src_3" style="color: black; text-decoration: underline;">source</a>)</figcaption>
</figure>
</div>
<div class="paper_desc" id="proj3">
<span class="bolded">
- What does the paper do?
</span>
<p class='bio_para'>
    Computer vision systems are built through training and prediction on a very specific kind of
    dataset and, consequently, a specific task. Not only is it laborious to build these specific
    datasets for specific tasks, but it also limits the model's generality and usability. The paper
    aims at solving this problem with a simple pre-training task: predicting which caption goes with
    which image. This allows them to learn state-of-the-art image representations in a scalable and
    efficient way. The authors test this method by studying the performance of the pre-trained model
    on 30 different existing computer vision tasks.
</p>
<span class="bolded">
- How do they do it?
</span>
<p class='bio_para'>
    The paper derives its inspiration from two major ideas: the first is a paper called VirTex, and
    the second is the idea of contrastive learning. During training, image-text pairs are used as a
    supervision signal. We take an image, pass it through an image encoder and project it onto a
    linear subspace. Similarly, we take the corresponding text, pass it through a text encoder and
    project it onto a linear subspace. The image and text embeddings are then used to compute cosine
    similarities, and the encoders are trained with a contrastive loss. The authors don't predict the
    exact caption of the image; rather, they train using the similarity between the complete encoded
    text and the image. Inference is done in a zero-shot setting: a textual prompt is engineered, and
    the similarity is computed between the encoded test image and each <em>engineered prompt</em> to
    predict the final class of the image.
</p>
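<p class='bio_para'>
    A rough sketch of the symmetric contrastive objective over a batch of image-text pairs
    (my own illustration with stand-in encoder outputs, not OpenAI's code):
</p>
<pre><code>
# Sketch of a CLIP-style symmetric contrastive loss.
import torch
import torch.nn.functional as F

B, d = 8, 512                                     # batch size, embedding dim (illustrative)
img_emb = F.normalize(torch.randn(B, d), dim=1)   # image encoder output after projection
txt_emb = F.normalize(torch.randn(B, d), dim=1)   # text encoder output after projection

temperature = 0.07                                # learned in the paper; fixed here
logits = img_emb @ txt_emb.t() / temperature      # pairwise cosine similarities

labels = torch.arange(B)                          # matching pairs lie on the diagonal
loss_i = F.cross_entropy(logits, labels)          # each image classifies its matching text
loss_t = F.cross_entropy(logits.t(), labels)      # each text classifies its matching image
loss = (loss_i + loss_t) / 2                      # symmetric contrastive loss
</code></pre>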
<span class="bolded">
- Positive points of the paper
</span>
<p class='bio_para'>
    The paper introduces a simple yet elegant way of learning robust representations from a
    large-scale dataset. The experiments across 30 different datasets, along with zero-shot transfer
    learning, show how transformers can learn representations across a wide variety of tasks. The
    thoroughness of the experiments shows that this method is highly transferable and gives us
    representations that handle distribution shift better than ImageNet models.
</p>
<span class="bolded">
- Negative points of the paper
</span>
<p class='bio_para'>
    Though the CLIP model gives us a way to train models in a data- and task-agnostic manner, it
    comes at the cost of a huge data requirement for the pre-training task. The authors use
    400 million images. The largest ResNet-like image encoder took 18 days to train on 592 V100 GPUs,
    while the largest Vision Transformer took 12 days on 256 V100 GPUs. This is an enormous cost for
    nearly SOTA results. It would have been interesting to see how models trained with a more
    reasonable time and dataset budget perform, as that is something more accessible to the
    average researcher.
</p>
</div>
<button class="read_more" id="more3">Read More...</button>
</div>
</div>
<!-- ***************************************************************************************************************************************** -->
<h1 class="div_heads">Books</h1>
<div class="all_projects" id ="books">
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
Sapiens: A Brief History of Humankind by Yuval Noah Harari
</p>
<div class = "project_image">
<figure>
<img src="../images/sapiens.jpeg" alt = "book cover for Sapiens: A Brief History of Humankind" width="500">
<figcaption>book cover</figcaption>
</figure>
</div>
</div>
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
A Short History of Nearly Everything by Bill Bryson
</p>
<div class = "project_image">
<figure>
<img src="../images/biref_history.jpeg" alt = "book cover for A Short History of Nearly Everything" width="500">
<figcaption>book cover</figcaption>
</figure>
</div>
</div>
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
Animal Farm by George Orwell
</p>
<div class = "project_image">
<figure>
<img src="../images/animal_farm.jpeg" alt = "book cover for Animal farm" width="500">
<figcaption>book cover</figcaption>
</figure>
</div>
</div>
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
1984 by George Orwell
</p>
<div class = "project_image">
<figure>
<img src="../images/1984.jpeg" alt = "book cover for 1984" width="500">
<figcaption>book cover</figcaption>
</figure>
</div>
</div>
<!-- ***************************************************************************************************************************************** -->
<div class="project">
<p class = "project_heading">
Quiet by Susan Cain
</p>
<div class = "project_image">
<figure>
<img src="../images/quiet.jpeg" alt = "Book cover for Quiet" width="500">
<figcaption>book cover</figcaption>
</figure>
</div>
</div>
<!-- ********************************************************************************************* -->
<!-- <div class="project">
<p class = "project_heading">
Emotional Intelligence by Daniel Goleman
</p>
<div class = "project_image">
<figure>
<img src="../images/emotional_intelligence.jpeg" alt = "Book cover for Emotional Intelligence" width="500">
<figcaption>book cover</figcaption>
</figure>
</div>
</div> -->
</div>
<div class="move_to_top">
<a class = "skip-to-content-link" href="#header" style="text-decoration: none;">
Move to top
</a>
</div>
<script src="../js/readings.js"></script>
<!-- <script src="../js/index.js"></script> -->
</main>
<footer id ="footer">
© 2022
</footer>
</body>