<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title><![CDATA[CMU 15645 - Buffer Pools]]></title>
<url>%2F2020%2F02%2F16%2FCMU-15645-Buffer-Pools%2F</url>
<content type="text"><![CDATA[Locks vs. Latches
Locks protect the logical contents of the database (e.g., tuples, tables, databases), seen from the transaction's point of view. They are acquired by transactions and must support rollback.
Latches protect critical sections of the DBMS's internal data structures, seen from the thread's point of view. They are acquired by operations and do not need to support rollback.

Buffer Pool
The buffer pool is an in-memory cache of pages read from disk. It is a region of memory organized as an array of fixed-size pages; each array entry is called a frame. When the DBMS needs a page, that page is read from disk into a frame.

Metadata maintained by the buffer pool:
- Page table: an in-memory hash table that tracks the pages currently in memory, mapping page IDs to frame locations.
- Dirty flag: set when a thread modifies a page. It tells the storage manager that the page must be written back to disk.
- Pin counter: tracks the number of threads currently accessing the page (both reading and modifying). A thread must increment the counter before accessing the page; while the count is greater than zero, the memory manager is not allowed to evict the page.

Optimizations:
- Multiple buffer pools: the DBMS can maintain multiple buffer pools for different purposes. This helps reduce latch contention and improves locality.
- Pre-fetching: the DBMS can pre-fetch pages based on the query plan, most commonly when pages are accessed sequentially.
- Scan sharing: query cursors can attach to other cursors and scan pages together.

Allocation policies:
- Global policy: how the DBMS makes decisions for all active transactions.
- Local policy: allocate frames to a specific transaction without considering the behavior of concurrent transactions.

Replacement Policies
A replacement policy is an algorithm the DBMS implements to decide which page to evict from the buffer pool. Its goals are correctness, accuracy, speed, and low metadata overhead.

LRU:
- Maintain a timestamp for each page recording when it was last accessed.
- The DBMS evicts the page with the oldest timestamp.

CLOCK:
- An approximation of LRU that does not require a timestamp per page.
- Each page has a reference bit, set to 1 when the page is accessed.
- Pages are organized in a circular buffer with a "clock hand". On each sweep, check whether a page's bit is set to 1: if yes, set it to 0; if no, evict the page. The clock hand remembers its position between evictions.

Problems: LRU and CLOCK are susceptible to sequential flooding, where the buffer pool's contents are trashed by a sequential scan. The page LRU evicts may actually be important, because no metadata about how pages are used is tracked.

Better policies:
- LRU-K: take into account the history of the last K references.
- Priority hints: allow transactions to tell the buffer pool whether a page is important or not.
- Localization: choose pages to evict on a per-transaction/per-query basis.]]></content>
<categories>
<category>Database System</category>
</categories>
<tags>
<tag>CMU 15645</tag>
</tags>
</entry>
<entry>
<title><![CDATA[CMU 15645 - Data Storage(2)]]></title>
<url>%2F2020%2F02%2F15%2FCMU-15645-Data-Storage-2%2F</url>
<content type="text"><![CDATA[Data Representation
The main types of values that can be stored in a tuple are: integers, variable-precision numbers, fixed-point precision numbers, variable-length values, and dates/times.

Integers:
- Most DBMSs use native C/C++ types.
- Examples: INTEGER, BIGINT, SMALLINT, TINYINT.

Variable Precision Numbers:
- Some use native C/C++ types (the IEEE-754 standard).
- Examples: FLOAT, REAL.

Fixed Point Precision Numbers:
- Numeric types with arbitrary precision and scale, stored as a variable-length binary representation with metadata telling the system where the decimal point is.
- Typically used when rounding errors are unacceptable.
- Examples: NUMERIC, DECIMAL.

Variable Length Data:
- Byte arrays of arbitrary length, with a header recording the length of the string.
- Most DBMSs do not allow a tuple to exceed the size of a page, so oversized values are written out to separate overflow storage, with the tuple keeping a reference to them.
- Some systems allow large values to be stored in an external file, with the tuple containing a pointer to that file.
- Examples: VARCHAR, VARBINARY, TEXT, BLOB.

Dates and Times:
- Examples: TIME, DATE, TIMESTAMP.

System Catalogs: so that the DBMS can interpret these values, it maintains an internal catalog recording metadata about the database: the tables and columns it contains, along with their types and storage order.

Workloads
OLTP: On-line Transaction Processing
- Fast, short-running operations.
- Queries operate on a single entity at a time.
- More writes than reads.
- Repetitive operations.
- Usually the kind of application that people build first.
- Example: user interactions on Amazon. Users can add things to their cart and make purchases, but the actions only affect their own account.

OLAP: On-line Analytical Processing
- Long-running, more complex queries.
- Reads large portions of the database.
- Exploratory queries.
- Derives new data from data collected on the OLTP side.
- Example: compute the five most-bought items over a one-month period for a set of geographical locations.

Storage Models
There are several ways a page can store tuples.

N-Ary Storage Model (NSM): the DBMS stores all attributes of a single tuple contiguously. This suits OLTP (where a transaction typically operates on a single entity) and insert-heavy workloads: a single page fetch retrieves the whole tuple. (Row stores.)
Advantages:
- Fast inserts, updates, and deletes.
- Good for queries that need the entire tuple.
Disadvantages:
- Not good for scanning large portions of the table and/or a subset of the attributes, because it pollutes the buffer pool by fetching data that is not needed to process the query.

Two ways to organize an NSM database:
- Heap-Organized Tables: tuples are stored in blocks called a heap, and the heap does not necessarily define an order.
- Index-Organized Tables: tuples are stored in the primary-key index itself, which is not the same as a clustered index.
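The NSM idea of storing one tuple's attributes contiguously can be sketched with Python's standard struct module (a toy illustration of my own, not code from the course): a fixed-length row format means one read at a computed offset returns the whole tuple.

```python
import struct

# Toy NSM row: (id: int32, balance: float64, name: 8-byte string).
# All attributes of one tuple are packed side by side, so fetching
# a single row needs only one access -- the OLTP-friendly property.
ROW_FORMAT = "<id8s"
ROW_SIZE = struct.calcsize(ROW_FORMAT)

def pack_row(tuple_id, balance, name):
    return struct.pack(ROW_FORMAT, tuple_id, balance,
                       name.encode().ljust(8, b"\0"))

def unpack_row(page, slot):
    # Fixed-length rows: slot number * row size gives the byte offset.
    off = slot * ROW_SIZE
    tid, bal, raw = struct.unpack_from(ROW_FORMAT, page, off)
    return tid, bal, raw.rstrip(b"\0").decode()

page = bytearray()
page += pack_row(1, 100.0, "alice")
page += pack_row(2, 250.5, "bob")
assert unpack_row(page, 1) == (2, 250.5, "bob")
```

Variable-length attributes break this simple offset arithmetic, which is exactly why the fixed-length-offset scheme below pads such fields or replaces them with fixed-size dictionary codes.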
On primary keys and clustered indexes: a table usually has a column or set of columns whose values uniquely identify each row in the table; such a column or group of columns is called the table's primary key (PK). Uniqueness is the primary key's defining property. A table can have at most one clustered index, because the actual physical storage order of the data follows the clustered index. From Stack Overflow: a primary key is a logical concept, while a clustered index is a physical one that can affect the order in which records are stored on disk.

Decomposition Storage Model (DSM): the DBMS stores the same attribute of all tuples together. This suits OLAP, which scans large subsets of the attributes. (Column stores.)
Advantages:
- Reduces the amount of wasted work during query execution, because the DBMS only reads the data it needs for that query.
- Enables better compression, because all of the values for the same attribute are stored contiguously.
Disadvantages:
- Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.

To stitch tuples back together, two approaches can be used:
- Fixed-length offsets: assume all attributes are fixed-length. When the system wants an attribute of a specific tuple, it knows how to jump to that spot in the file. To accommodate variable-length fields, the system can pad them so that they are all the same length, or use a dictionary that maps a fixed-size integer to the value.
- Embedded tuple IDs: for every attribute value in the columns, store the tuple ID with it. The system also needs extra information to tell it how to jump to every attribute that has that ID.
Most systems use fixed-length offsets.

Row stores are usually better for OLTP, while column stores are better for OLAP.]]></content>
<categories>
<category>Database System</category>
</categories>
<tags>
<tag>CMU 15645</tag>
</tags>
</entry>
<entry>
<title><![CDATA[Effective Modern C++ Reading Notes]]></title>
<url>%2F2020%2F02%2F14%2FEffective-Modern-C%2F</url>
<content type="text"><![CDATA[Part One: Modern C++ (1)]]></content>
<categories>
<category>Programming Language</category>
</categories>
<tags>
<tag>C++</tag>
</tags>
</entry>
<entry>
<title><![CDATA[Graph Embedding Experiments]]></title>
<url>%2F2019%2F12%2F26%2FGraph-Embedding%E5%AE%9E%E9%AA%8C%2F</url>
<content type="text"><![CDATA[Graph Embedding Experiments

1. DeepWalk
Paper: https://arxiv.org/pdf/1403.6652
(1) Approach
- Use networkx and pandas to read the data files and build the graph.
- Starting from randomly chosen nodes, run random walks to produce sequences of a given length.
- Repeat many times; the resulting sentences form the training samples.
- Use gensim to build and train a word2vec model, yielding the embedding vectors.
(2) Core code

def random_walk(start, walk_length):
    path = [start]
    while len(path) < walk_length:
        cur = path[-1]
        neigh = list(g.neighbors(cur))
        if len(neigh) > 0:
            path.append(random.choice(neigh))
        else:
            break
    return [str(node) for node in path]

def build_deepwalk(num_walks, walk_length):
    # Start num_walks walks from every node of the graph.
    walks = []
    nodes = list(g.nodes)
    print("node: {}, walks_num: {}".format(len(nodes), len(nodes) * num_walks))
    for _ in range(num_walks):
        random.shuffle(nodes)  # shuffle in place (shuffling a copy has no effect)
        for n in nodes:
            walks.append(random_walk(n, walk_length))
    return walks

(3) Notes
Since the code is not fully optimized, it runs slower than the reference implementation released with the paper.

2. LINE
Paper: https://arxiv.org/pdf/1503.03578
(1) Approach
- Training is implemented with keras.
- Define the loss function, i.e., the distance between two probability distributions (optimized with negative sampling), and train with gradient descent.
- Define the input vectors and the embedding dimension, and add an embedding layer to the model.
- For second-order proximity, maintain two embedding vectors per vertex (the vertex itself and its context).
(2) Core code

def create_model(numNodes, factors):
    left_input = Input(shape=(1,))
    right_input = Input(shape=(1,))
    left_model = Sequential()
    left_model.add(Embedding(input_dim=numNodes + 1, output_dim=factors,
                             input_length=1, mask_zero=False))
    left_model.add(Reshape((factors,)))
    right_model = Sequential()
    right_model.add(Embedding(input_dim=numNodes + 1, output_dim=factors,
                              input_length=1, mask_zero=False))
    right_model.add(Reshape((factors,)))
    left_embed = left_model(left_input)
    # The original called left_model here; the context vertex should go
    # through right_model so the two embeddings stay distinct.
    right_embed = right_model(right_input)
    left_right_dot = Dot(axes=1)([left_embed, right_embed])
    model = Model(input=[left_input, right_input], output=[left_right_dot])
    embed_generator = Model(input=[left_input, right_input],
                            output=[left_embed, right_embed])
    return model, embed_generator

(3) Notes
The optimization and sampling tricks in the paper borrow ideas from word2vec and are a powerful way to simplify embedding of large networks; the paper also uses the O(1)-time alias sampling algorithm.

3. node2vec
Paper: https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf
(1) Approach
- Unlike LINE, a given node's embedding vector is the same in every context, so only one input layer and one embedding layer are needed.
- node2vec uses parameters p and q to control the walk strategy, producing sequences between BFS and DFS; with p = q = 1 it reduces to DeepWalk.
- After generating the walks, train a Skip-gram model on them.
- Use negative sampling as an optimization.
(2) Core code

def node2vec_walk(self, walk_length, start_node):
    G = self.G
    alias_nodes = self.alias_nodes
    alias_edges = self.alias_edges
    walk = [start_node]
    while len(walk) < walk_length:
        cur = walk[-1]
        cur_nbrs = list(G.neighbors(cur))
        if len(cur_nbrs) > 0:
            if len(walk) == 1:
                walk.append(cur_nbrs[alias_sample(alias_nodes[cur][0],
                                                  alias_nodes[cur][1])])
            else:
                prev = walk[-2]
                edge = (prev, cur)
                next_node = cur_nbrs[alias_sample(alias_edges[edge][0],
                                                  alias_edges[edge][1])]
                walk.append(next_node)
        else:
            break
    return walk

(3) Notes
node2vec uses the hyperparameters p and q to control the walk strategy at the current node, placing it between DeepWalk's DFS and LINE's BFS and capturing both structural and functional properties of nodes. The transition-probability tables must be precomputed, which also enables the alias-sampling optimization.

4. struc2vec
Paper: https://arxiv.org/pdf/1704.03165
(1) Approach
- struc2vec considers node relationships from a structural point of view rather than mere adjacency.
- Build ordered degree sequences for the nodes, and use Dynamic Time Warping to measure the distance between two sequences.
- Build the hierarchical weighted graph proposed in the paper.
- Run random walks on this graph to obtain the sampled sequences.
- Optimize.
(2) Core code

def create_context_graph(self, max_num_layers, workers=1, verbose=0):
    pair_distances = self._compute_structural_distance(
        max_num_layers, workers, verbose)
    layers_adj, layers_distances = self._get_layer_rep(pair_distances)
    pd.to_pickle(layers_adj, self.temp_path + 'layers_adj.pkl')
    layers_accept, layers_alias = self._get_transition_probs(
        layers_adj, layers_distances)
    pd.to_pickle(layers_alias, self.temp_path + 'layers_alias.pkl')
    pd.to_pickle(layers_accept, self.temp_path + 'layers_accept.pkl')

def prepare_biased_walk(self):
    sum_weights = {}
    sum_edges = {}
    average_weight = {}
    gamma = {}
    layer = 0
    while os.path.exists(self.temp_path + 'norm_weights_distance-layer-'
                         + str(layer) + '.pkl'):
        probs = pd.read_pickle(self.temp_path + 'norm_weights_distance-layer-'
                               + str(layer) + '.pkl')
        for v, list_weights in probs.items():
            sum_weights.setdefault(layer, 0)
            sum_edges.setdefault(layer, 0)
            sum_weights[layer] += sum(list_weights)
            sum_edges[layer] += len(list_weights)
        average_weight[layer] = sum_weights[layer] / sum_edges[layer]
        gamma.setdefault(layer, {})
        for v, list_weights in probs.items():
            num_neighbours = 0
            for w in list_weights:
                if w > average_weight[layer]:
                    num_neighbours += 1
            gamma[layer][v] = num_neighbours
        layer += 1
    pd.to_pickle(average_weight, self.temp_path + 'average_weight')
    pd.to_pickle(gamma, self.temp_path + 'gamma.pkl')

(3) Notes
struc2vec approaches graph embedding from structural features. In simple tests and comparisons it improves results, but it still needs more optimization in both time and space.]]></content>
<categories>
<category>Deep Learning</category>
</categories>
<tags>
<tag>Graph Embedding</tag>
</tags>
</entry>
<entry>
<title><![CDATA[CMU 15645 - Data Storage(1)]]></title>
<url>%2F2019%2F12%2F26%2FCMU-15645-Data-Storage-1%2F</url>
<content type="text"><![CDATA[1 Storage
The CMU 15645 course has students implement a disk-oriented DBMS, so data is assumed to live on non-volatile disk.

By volatility, storage devices fall into two classes:
- Volatile devices: "volatile" means that stored data is lost when the machine loses power. Such devices usually support fast, byte-addressable random access: a program can fetch the data at any byte address. Here this usually refers to memory (though volatile devices are not only memory; registers and CPU caches are too).
- Non-volatile devices: do not need continuous power to retain data (data survives power loss), and are usually block/page-addressable: to read data at some address, the whole containing block/page must be read. Access is usually sequential rather than random. Here this usually refers to disk.

Since the data is stored on disk, the DBMS is responsible for moving data between disk and memory.

2 Overview of a Disk-Oriented DBMS
Data in database files is organized in pages, with the first page being the directory page. To operate on the data, the DBMS brings it into memory through a buffer pool, which manages the movement of pages between memory and disk. The DBMS runs queries through an execution engine, which asks the buffer pool for specific pages; the buffer pool brings each requested page into memory and returns a pointer to it, ensuring that every page the execution engine operates on is resident in memory.

3 DBMS vs. OS
One design goal of a DBMS is to support databases larger than the available memory. Because disk reads and writes are expensive, the DBMS needs to be able to process other queries while fetching data from disk.

One way to achieve this is to use mmap (memory mapping) to map a file's contents into the process's address space, letting the OS handle paging in and out. But if mmap hits a page fault, the whole process blocks. It is therefore better for the DBMS, rather than the OS, to control this process: the DBMS knows far more about its own data access and query patterns than the OS does.

The OS can still help with:
- madvise: tell the OS it is about to read certain pages.
- mlock: tell the OS not to swap a memory range out to disk.
- msync: tell the OS to flush memory contents to disk.

Although the OS can provide some of the functionality a DBMS needs, implementing these mechanisms inside the DBMS gives it better control and performance.

4 Database Pages
A DBMS organizes the database in pages (fixed-size blocks of data). Pages can contain different kinds of data, such as tuples and indexes; most systems do not mix data types within the same page.

Each page has a unique identifier. If the database is a single file, the page ID can be an offset within the file. Most DBMSs use an indirection layer to map a page ID to a file path and offset: when a higher layer asks for a particular page number, the storage manager translates the number into the corresponding file and offset to locate the page.

Most DBMSs use fixed-size pages, to avoid the engineering overhead of supporting variable-size pages. There are three notions of "page" in a DBMS:
- Hardware page (usually 4 KB)
- OS page (4 KB)
- Database page (1-16 KB)

The storage device only guarantees atomic writes at the hardware page size. For example, with 4 KB hardware pages, a 4 KB write to disk either completes entirely or fails without writing. So if the database page is larger than the hardware page, the DBMS must take extra measures to write it to disk safely, because the system may crash mid-write and corrupt the data.

5 Database Heap
A heap file is an unordered collection of pages, with tuples stored within pages in no particular order. The DBMS can locate the on-disk page for a given page_id via a linked list of pages or via a page directory.
- Linked list: a header page stores pointers to the free pages and the data pages. To find a specific page, however, the DBMS must scan sequentially.
- Page directory: the DBMS maintains special pages recording the locations of the data pages and the free space on each page.

6 Page Layout
Every page has a header recording metadata about its contents:
- Page size
- Checksum
- DBMS version
- Transaction visibility
- Some systems require pages to be self-contained (e.g., Oracle)

A naive layout is to track how many tuples the page stores and append each new tuple at the end. This breaks down when a tuple is deleted or when tuples have variable-length attributes.

Two layout approaches: (1) slotted pages; (2) log-structured.

Slotted pages: map slots to offsets.
- The most common approach in DBMSs today.
- The header tracks the number of used slots, the offset of the start of the last used slot, and a slot array tracking the start of each tuple.
- When adding a tuple, the slot array grows from the beginning toward the end, while the tuples grow from the end toward the beginning; the page is full when the two meet.

Log-structured: the DBMS stores only log records.
- Records how the database changes (insert, delete, update).
- To read a record, the DBMS scans the log file.
- Fast writes, slow reads.
- Works well for append-only storage.
- To avoid long reads, an index can let the DBMS jump to a given spot in the log; the log can also be compacted periodically.

7 Tuple Layout
A tuple is a sequence of bytes; the DBMS interprets those bytes as attributes and values.

Tuple header: the tuple's metadata.
- Visibility information for the DBMS's concurrency control protocol (e.g., which transaction created/modified the tuple).
- Bit map.
- The DBMS does not need to store the database's schema metadata here.

Tuple data: the actual attribute data.
- Attributes are stored in the order specified at table creation.
- Most DBMSs do not allow a tuple to exceed the page size.

Identifier:
- Each tuple is assigned a unique identifier.
- Most common: page_id + (offset or slot).

Denormalized tuple data:
- If two tables are related, the DBMS can pre-join them, so the tables end up on the same page. This lets the DBMS read data faster (only one page load), but makes updates more expensive because the DBMS needs more space per tuple.]]></content>
<categories>
<category>Database System</category>
</categories>
<tags>
<tag>CMU 15645</tag>
</tags>
</entry>
<entry>
<title><![CDATA[Paper Notes | An Introduction to the DeepFool Algorithm]]></title>
<url>%2F2019%2F08%2F27%2F%E8%AF%BBPaper-DeepFool%E7%AE%97%E6%B3%95%E7%AE%80%E4%BB%8B%2F</url>
<content type="text"><![CDATA[Overview
DeepFool is another gradient-based white-box attack (similar to FGSM), proposed by Seyed-Mohsen Moosavi-Dezfooli et al. in "DeepFool: a simple and accurate method to fool deep neural networks". DeepFool needs no step size $\varepsilon$ and can compute smaller perturbations than FGSM while still achieving the attack.

The Algorithm
The paper presents DeepFool for both binary and multiclass classifiers, with experiments on several models and datasets.

Binary classification
The figure in the paper illustrates the binary case: to change the classifier's decision, the minimal perturbation to add to the image is the perpendicular distance $r_{*}(x)$ from $x_{0}$ to the boundary $f(x)$. For an affine classifier $f(x)=\mathbf{\omega}^\mathrm{T}x+b$ this has the closed form $r_{*}(x)=-\frac{f(x)}{\|\mathbf{\omega}\|_{2}^{2}}\mathbf{\omega}$, which is easy to compute; but this perturbation only brings the sample onto the decision boundary, not across it, so the final perturbation used is $r_{*}(1+\eta)$ with $\eta\ll1$ (experiments typically use 0.02).

Multiclass classification
In the multiclass case (illustrated in the paper: solid lines are the classifier's true decision hyperplanes, dashed lines their linear approximations), each iteration computes, around the current iterate, a set of approximate linear decision hyperplanes, derives the perturbation toward the closest one, and updates the iterate. Since each step is small, the parameter matrix can be replaced by the gradient.

Implementation

def deepfool_attack(sess, x, predictions, logits, grads, sample,
                    nb_candidate, overshoot, max_iter, clip_min, clip_max,
                    feed=None):
    """
    TensorFlow implementation of DeepFool.
    Paper link: https://arxiv.org/pdf/1511.04599.pdf
    :param sess: TF session
    :param x: The input placeholder
    :param predictions: The model's sorted symbolic output of logits, only the
                        top nb_candidate classes are contained
    :param logits: The model's unnormalized output tensor (the input to
                   the softmax layer)
    :param grads: Symbolic gradients of the top nb_candidate classes, produced
                  from gradient_graph
    :param sample: Numpy array with sample input
    :param nb_candidate: The number of classes to test against, i.e., deepfool
                         only considers nb_candidate classes when attacking
                         (thus accelerating speed). The nb_candidate classes
                         are chosen according to the prediction confidence
                         during implementation.
    :param overshoot: A termination criterion to prevent vanishing updates
    :param max_iter: Maximum number of iterations for DeepFool
    :param clip_min: Minimum value for components of the example returned
    :param clip_max: Maximum value for components of the example returned
    :return: Adversarial examples
    """
    adv_x = copy.copy(sample)
    # Initialize the loop variables
    iteration = 0
    current = utils_tf.model_argmax(sess, x, logits, adv_x, feed=feed)
    if current.shape == ():
        current = np.array([current])
    w = np.squeeze(np.zeros(sample.shape[1:]))  # same shape as original image
    r_tot = np.zeros(sample.shape)
    original = current  # use original label as the reference

    _logger.debug("Starting DeepFool attack up to %s iterations", max_iter)
    # Repeat this main loop until we have achieved misclassification
    while np.any(current == original) and iteration < max_iter:
        if iteration % 5 == 0 and iteration > 0:
            _logger.info("Attack result at iteration %s is %s",
                         iteration, current)
        gradients = sess.run(grads, feed_dict={x: adv_x})
        predictions_val = sess.run(predictions, feed_dict={x: adv_x})
        for idx in range(sample.shape[0]):
            pert = np.inf
            if current[idx] != original[idx]:
                continue
            for k in range(1, nb_candidate):
                w_k = gradients[idx, k, ...] - gradients[idx, 0, ...]
                f_k = predictions_val[idx, k] - predictions_val[idx, 0]
                # adding value 0.00001 to prevent f_k = 0
                pert_k = (abs(f_k) + 0.00001) / np.linalg.norm(w_k.flatten())
                if pert_k < pert:
                    pert = pert_k
                    w = w_k
            r_i = pert * w / np.linalg.norm(w)
            r_tot[idx, ...] = r_tot[idx, ...] + r_i
        adv_x = np.clip(r_tot + sample, clip_min, clip_max)
        current = utils_tf.model_argmax(sess, x, logits, adv_x, feed=feed)
        if current.shape == ():
            current = np.array([current])
        # Update loop variables
        iteration = iteration + 1

    # need more revision, including info like how many succeed
    _logger.info("Attack result at iteration %s is %s", iteration, current)
    _logger.info("%s out of %s become adversarial examples at iteration %s",
                 sum(current != original), sample.shape[0], iteration)
    # need to clip this image into the given range
    adv_x = np.clip((1 + overshoot) * r_tot + sample, clip_min, clip_max)
    return adv_x

More detailed code is available here.]]></content>
<categories>
<category>Deep Learning</category>
</categories>
<tags>
<tag>Adversarial Examples</tag>
</tags>
</entry>
<entry>
<title><![CDATA[Paper Notes | An Introduction to the FGSM Algorithm]]></title>
<url>%2F2019%2F08%2F21%2F%E8%AF%BBPaper-FGSM%E7%AE%97%E6%B3%95%E7%AE%80%E4%BB%8B%2F</url>
<content type="text"><![CDATA[Overview
FGSM (Fast Gradient Sign Method) is a white-box attack proposed by Ian J. Goodfellow in "Explaining and harnessing adversarial examples". The paper also proposes using the method as a form of regularization, improving a neural network's accuracy while increasing its resistance to adversarial attacks.

How It Works
Example: the paper first explains adversarial examples for linear models. For an input $x$, let the perturbed input be $\tilde{x}=x+\eta$, where $\eta$ is a small change with $||\eta||_{\infty}<\epsilon$. Since
$$\mathbf{\omega}^\mathrm{T}\tilde{x}=\mathbf{\omega}^\mathrm{T}x+\mathbf{\omega}^\mathrm{T}\eta,$$
where $\mathbf{\omega}$ is the weight vector, the adversarial perturbation changes the activation by the extra term $\mathbf{\omega}^\mathrm{T}\eta$. Choosing $\eta=\text{sign}(\omega)$ (so that the change is aligned with the gradient direction) accumulates the small effect of every dimension and can substantially change the final classification result.

A classic adversarial example: a GoogLeNet trained on ImageNet classifies the original image as a panda with 57.7% confidence, but after adding a perturbation it classifies the sample as a gibbon with 99.3% confidence, although the two images look nearly identical to the human eye.

The attack: let $\theta$ be the model parameters, $x$ the model input, $y$ the corresponding target, and $J(\theta,x,y)$ the loss function used to train the neural network:
$$\eta=\epsilon\,\text{sign}(\nabla_{x}J(\theta,x,y)).$$
In short: compute the gradient of the loss with respect to the input $x$, and add the perturbation above to $x$ so that $x$ moves in the direction of increasing loss, pushing the classification result away from the sample's original label.

Implementation
The TensorFlow implementation below follows cleverhans; see the original GitHub repo for details.

def fgm(x, logits, y=None, eps=0.3, ord=np.inf, clip_min=None, clip_max=None,
        clip_grad=False, targeted=False, sanity_checks=True):
    """
    TensorFlow implementation of the Fast Gradient Method.
    :param x: the input placeholder
    :param logits: output of model.get_logits
    :param y: (optional) A placeholder for the true labels. If targeted is
              true, then provide the target label. Otherwise, only provide
              this parameter if you'd like to use true labels when crafting
              adversarial samples. Otherwise, model predictions are used as
              labels to avoid the "label leaking" effect (explained in this
              paper: https://arxiv.org/abs/1611.01236). Default is None.
              Labels should be one-hot-encoded.
    :param eps: the epsilon (input variation parameter)
    :param ord: (optional) Order of the norm (mimics NumPy).
                Possible values: np.inf, 1 or 2.
    :param clip_min: Minimum float value for adversarial example components
    :param clip_max: Maximum float value for adversarial example components
    :param clip_grad: (optional bool) Ignore gradient components at positions
                      where the input is already at the boundary of the
                      domain, and the update step will get clipped out.
    :param targeted: Is the attack targeted or untargeted? Untargeted, the
                     default, will try to make the label incorrect. Targeted
                     will instead try to move in the direction of being more
                     like y.
    :return: a tensor for the adversarial example
    """
    asserts = []

    # If a data range was specified, check that the input was in that range
    if clip_min is not None:
        asserts.append(utils_tf.assert_greater_equal(
            x, tf.cast(clip_min, x.dtype)))
    if clip_max is not None:
        asserts.append(utils_tf.assert_less_equal(
            x, tf.cast(clip_max, x.dtype)))

    # Make sure the caller has not passed probs by accident
    assert logits.op.type != 'Softmax'

    if y is None:
        # Using model predictions as ground truth to avoid label leaking
        preds_max = reduce_max(logits, 1, keepdims=True)
        y = tf.to_float(tf.equal(logits, preds_max))
        y = tf.stop_gradient(y)
    y = y / reduce_sum(y, 1, keepdims=True)

    # Compute loss
    loss = softmax_cross_entropy_with_logits(labels=y, logits=logits)
    if targeted:
        loss = -loss

    # Define gradient of loss wrt input
    grad, = tf.gradients(loss, x)

    if clip_grad:
        grad = utils_tf.zero_out_clipped_grads(grad, x, clip_min, clip_max)

    optimal_perturbation = optimize_linear(grad, eps, ord)

    # Add perturbation to original example to obtain adversarial example
    adv_x = x + optimal_perturbation

    # If clipping is needed, reset all values outside of [clip_min, clip_max]
    if (clip_min is not None) or (clip_max is not None):
        # We don't currently support one-sided clipping
        assert clip_min is not None and clip_max is not None
        adv_x = utils_tf.clip_by_value(adv_x, clip_min, clip_max)

    if sanity_checks:
        with tf.control_dependencies(asserts):
            adv_x = tf.identity(adv_x)

    return adv_x
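The core update $\eta=\epsilon\,\text{sign}(\nabla_{x}J)$ can also be sketched in plain NumPy; the toy logistic-regression model and function name below are my own illustration, not part of cleverhans:

```python
import numpy as np

def fgsm_logistic(x, y, w, b, eps):
    """One-step FGSM against a logistic-regression classifier.

    For binary cross-entropy loss on p = sigmoid(w.x + b), the gradient
    of the loss w.r.t. the input x is (p - y) * w, so the attack simply
    adds eps * sign((p - y) * w) to the input.
    """
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # model's probability of class 1
    grad = (p - y) * w                       # d loss / d x
    return x + eps * np.sign(grad)           # move along the gradient sign

rng = np.random.default_rng(0)
w = rng.normal(size=4)
x = rng.normal(size=4)
y = 1.0  # true label

p_before = 1.0 / (1.0 + np.exp(-(x @ w)))
x_adv = fgsm_logistic(x, y, w, 0.0, eps=0.3)
p_after = 1.0 / (1.0 + np.exp(-(x_adv @ w)))
# The attack raises the loss, pushing p away from the true class:
assert p_after < p_before
```

With y set to a target label and the loss negated, the same step becomes a targeted attack, mirroring the targeted flag in the cleverhans code above.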
In brief: the targeted flag indicates whether the attack is targeted. If False, the sample's prediction is pushed in the direction of increasing loss, away from the given label logits; if True, the loss is negated so the sample moves toward the target label instead. The code then computes the gradient of the loss with respect to the sample x and adds the resulting perturbation to the input via optimize_linear.]]></content>
<categories>
<category>Deep Learning</category>
</categories>
<tags>
<tag>Adversarial Examples</tag>
</tags>
</entry>
<entry>
<title><![CDATA[A Collection of Papers on Adversarial Machine Learning]]></title>
<url>%2F2019%2F08%2F20%2F%E5%AF%B9%E6%8A%97%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-Adversarial-Machine-Learning-%E7%9B%B8%E5%85%B3%E7%9A%84%E8%AE%BA%E6%96%87%E5%90%88%E9%9B%86%2F</url>
<content type="text"><![CDATA[Preliminaries / Fundamentals
- Intriguing properties of neural networks
- Evasion Attacks against Machine Learning at Test Time
- Explaining and Harnessing Adversarial Examples (the FGSM/FGM algorithm)

Attacks
- The Limitations of Deep Learning in Adversarial Settings
- DeepFool: a simple and accurate method to fool deep neural networks (the DeepFool algorithm)
- Towards Evaluating the Robustness of Neural Networks

Transferability
- Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples
- Delving into Transferable Adversarial Examples and Black-box Attacks
- Universal adversarial perturbations

Detecting Adversarial Examples
- On Detecting Adversarial Perturbations
- Detecting Adversarial Samples from Artifacts
- Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods

Restricted Threat Model Attacks
- ZOO: Zeroth Order Optimization based Black-box Attacks to Deep Neural Networks
- Decision-Based Adversarial Attacks: Reliable Attacks Against Black-Box Machine Learning Models
- Prior Convictions: Black-Box Adversarial Attacks with Bandits and Priors

Physical-World Attacks
- Adversarial examples in the physical world
- Synthesizing Robust Adversarial Examples
- Robust Physical-World Attacks on Deep Learning Models

Verification
- Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks
- Certifying Some Distributional Robustness with Principled Adversarial Training

Defenses and Attacks on Them
- MagNet: a Two-Pronged Defense against Adversarial Examples
- MagNet and "Efficient Defenses Against Adversarial Attacks" are Not Robust to Adversarial Examples
- Towards Deep Learning Models Resistant to Adversarial Attacks
- Attacking the Madry Defense Model with L1-based Adversarial Examples
- Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
- Adversarial Risk and the Dangers of Evaluating Against Weak Attacks
- Reinforcing Adversarial Robustness using Model Confidence Induced by Adversarial Training
- Towards the first adversarially robust neural network model on MNIST
- Adversarial Attacks on Neural Network Policies
- Audio Adversarial Examples: Targeted Attacks on Speech-to-Text
- Seq2Sick: Evaluating the Robustness of Sequence-to-Sequence Models with Adversarial Examples
- Adversarial examples for generative models

For a more comprehensive list of papers on adversarial machine learning, see this blog post.]]></content>
<categories>
<category>Deep Learning</category>
</categories>
</entry>
</search>