Saturday, April 11, 2009

time

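7:00: Get up
8:00: Work
11:20: Lunch
12:00: Cultural reading
13:30: Rest
14:00: Work
17:20: Dinner
18:00: Listen
19:00: Read
22:00: Exercise
22:30: Shower
23:00: Cultural reading
23:30: Go to bed
24:00: Asleep

Total: eight hours of sleep, six hours of work during the day, three hours of reading in the evening; two hours of cultural reading; half an hour of exercise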


time

if only

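  Later, I finally learned how to love, but sadly you had long since
  gone, vanished into the sea of people.
  Later, through my tears I finally understood that some people, once
  missed, are gone for good.

                                          刘若英, "後来" (Later)

  Some say life is like an unbroken string of regrets, and that growing
  up is a process of making mistake after mistake; only when we have
  erred badly enough do we truly grow up. Yet once grown, we always
  wonder: if we had taken the right path from the start, would life
  have turned out differently?

  Time can never flow backward, and all we can do is write down regret
  after regret in our constant remorse. So the question of what would
  change if we could start over has become a common theme in film.
  Perhaps only in the world of film can the wish so many people have to
  return to the past be fulfilled.

  If love stands for selfless devotion and giving, then the noblest
  love is to give up one's own most precious life. When even the one
  life we have can be given away, is a love fulfilled by life itself
  heaven's cold irony, or a kind of flawed beauty?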

Tuesday, April 7, 2009

The Road Less Traveled: A Dialogue with the Soul (少有人走的路:与心灵的对话)

20090403
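1. What is life? Life is everything that happens outside of what you have already planned. So we should be grateful for change!
2. Most people think courage means not being afraid. Let me tell you now: not being afraid is not courage, it is a kind of brain damage. Courage is pressing on in spite of feeling afraid, and facing things directly in spite of feeling pain.
3. If we understand that everything that happens to us has been designed by Heaven, we will always be in an unbeatable position.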

20090405
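1. Yet healing a person's pain is often not a matter of trying to remove it, but of bearing it together with them.
2. We forgive entirely for our own sake, for our own health; those people may not even know they need to be forgiven. If we cling to anger, we stop growing and our souls wither.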

20090406
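Because we are too lazy, we can only live in illusion. If we could see our own ignorance clearly, we would have to admit our extreme foolishness or, at the very least, demand of ourselves a lifetime of hard study. Since most people are unwilling to admit their foolishness and unwilling to put effort into anything, it is very comfortable to live in a pleasant illusion, building a fantasy in which we believe we understand a great deal.
The problem is that this is an illusion; it is not real!
Today our culture avoids pain, so mental health is never encouraged. When someone suffers an emotional setback, we say: "Oh, poor Joe, his illusions have been shattered." What we should say is: "Lucky Joe, he has woken up."
When Jesus preached, the first thing he said was: "Confusion is a blessing." If you ask why Jesus would say that, I would tell you that confusion arouses people's motivation to seek answers, and the desire to seek in turn drives people to study hard.
In fact, all the evil in this world is caused by people who know very clearly what they are doing, not by those who are confused, not by the "poor in spirit".
Humility means knowing yourself. True modesty should be a matter of seeing things as they are.

I read the electronic version of "少有人走的路:与心灵的对话" (incomplete; apparently fewer than three chapters). After reading it I found it not as good as the first book: it feels somewhat long-winded, and much of its reasoning repeats the first book, so I do not plan to buy it at the bookstore.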

20090407
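Today I read the opening part of "全球通史" (A Global History). I had actually read this part before, but found it worth reading again. It contains not just plain historical knowledge but, more importantly, many lessons you can use to guide your life today; I think that is probably why the book is famous.
Nor is the book's aim to tell people what happened when, and that is what makes it valuable.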

Thursday, April 2, 2009

The Road Less Traveled: A Journey Toward Mental Maturity (少有人走的路---通向心智成熟的旅程)


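Often you must endure hardship first and enjoy later; is this what self-discipline means? Then how do you become self-disciplined? You must have love: you must love the thing you are doing. And how do you come to love it? Love means understanding that the thing can be lost. If you feel that by not studying hard now you are about to lose this chance and will never have it again, I think you may come to love studying hard.

Life is a process of facing problems and solving them. Problems awaken our wisdom and rouse our courage; problems are the dividing line between success and failure. The effort we put into solving them makes our thinking and our minds mature.

Only pain brings instruction.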

20090324
Love is a will to perfect oneself, held for the purpose of furthering one's own and others' mental maturity.

20090327
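A person who truly loves never criticizes others casually or clashes with them; they do their best to avoid giving an impression of arrogance.
Instead, they respect the other person's independence and wish to offer loving guidance.
Shallow water is noisy; deep pools are still.
A narcissist ignores the existence of others and treats them only as an extension of the self.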

20090401
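When we learn something new, we are really only discovering something that has been in our minds all along. The word education comes from the Latin educare, which literally means "to bring out" and "to lead forth". So when we educate others, if we take it seriously, we do not force something new into their minds. Rather, we draw it out of their minds, letting it pass from the unconscious into consciousness, because they already possess this knowledge.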

20090402
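Taking no action can itself be regarded as an action. In some situations not acting may be the best choice, while in others failing to act may bring disastrous consequences.
The power of the mind lies not only in being aware of the possibilities; as our awareness widens, we must also be able to make decisions promptly.
The more a person knows, the harder it is to act.
Laziness is as cunning as the devil: it makes a person not only skilled at disguise and deception but also eager to find every way of making laziness seem reasonable.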

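I finished "少有人走的路--通向心智成熟的旅程" today without trouble. Looking over my reading log, and leaving out the two days I had a cold, it took only 8 days. I am reading faster and faster. Behind this good sign, though, is the fact that I have not been focused enough on research lately. The key point is that I recently ran into a problem in my research and keep avoiding it. Perhaps this is the laziness the book describes, the so-called original sin of entropy at work. In fact, reading books, including research books, while disliking reading papers and writing programs, is itself a form of laziness: reading a book is easier than reading a paper, because classic books tend to be plain and easy to follow, and writing programs demands even more, since you must create something yourself (I do not mean programs made by casually tweaking existing code).
Everyone has an entropic nature that drags them down, and also an upward-striving force that makes them grow. These are two sides of every person; or one could say there are two people in each of us, a good one and an evil one (in every heart there are two selves, a sick self and a healthy self). A person's life is perhaps a wavering advance between the two. In general good still defeats evil, which is why humanity evolves. We need not be pessimistic about laziness; it is a law of nature. Entropy always tends toward its maximum, toward dispersal rather than orderly organization. By that law humanity could never have evolved and should have remained at the level of the most primitive single cell. Yet humanity relies on some mysterious force to resist the pull of entropy. Where does this force come from? Perhaps the best answer is: God.
Where is God? God is in every person's heart. That is not wrong: God is each person's unconscious, and each person's unconscious belongs not to that person alone but to all of humanity. It is like the roots of a towering tree, of which each person is only a few branches and leaves above the ground.
When you are agitated over something, or troubled by something, that may be exactly your chance to mature. God gives everyone equal chances; some people seize them, and others keep losing them. Even those who seize them may not withstand the pain that comes with the training toward maturity.
Do not shrink from pain; happiness without it is not sweet.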


Friday, March 28, 2008

Valuable papers

From 笑对人生,傲立寰宇

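The CVPR 2008 review period has just ended. This year I took a more tolerant attitude toward the papers I reviewed.

Vision is as lively as ever, but I feel that behind the noise the field is showing signs of fatigue. Year after year, hundreds or thousands of papers keep singing the same old tunes on the same few old stages. Take object recognition: countless carts flying the "novel framework" banner are still crowding onto the single-plank bridge of local feature extractor + classifier (SVM/AdaBoost ...). Is this really the only way?

Until I see someone open up a new road, I prefer papers that focus on solving one concrete small problem and propose an insightful method. Papers that look polished on the surface but in substance merely swap feature A for feature B, or model C for model D, generally get a low rating from me.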

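A professor here, talking about paper writing, mentioned a mistake that many people make. Take object recognition work again as the example: to complete the experiments you must do a great deal of work to build the whole framework, from data to features to classifier, writing lots and lots of code and spending lots and lots of time debugging. To do justice to that effort, many people want to write all of it into the paper, and as a result a lot of work that is not really novel gets submitted every year. In fact such work is not entirely without something new, but that bit of novelty is crowded out by the framework-style presentation.

To write a compelling paper you must be decisive about what to keep and what to cut. Some work is necessary to complete the experiments, but even if you spent half a year on it, if it lacks real academic value it should be kept as brief as possible in the paper, and most of the space should go to the parts that are truly meaningful (even if you came up with them in only 3 hours). Reviewing a paper is not selecting a model worker (though some reviewers may lean that way): the amount of labor should not determine how the paper's space is allocated, and the feeling of being unwilling to part with some piece of work should not be carried into the paper's presentation.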

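CVPR reviewing is over, and our reading group has started again. This semester John decided to let us take turns choosing papers and leading each week's reading. He said that, unless there is a compelling reason, we should not pick papers from the last five years. That was really his style last semester as well: many of his picks were from the 1950s and 1960s, seminal classics by the founders of information theory and statistical learning. These papers made me appreciate how much vitality our predecessors' work has: countless mainstream algorithms today still trace back to some paper from 40 years ago, and in fact have not moved very far from it. Technology changes by the day, but its core theory evolves far more slowly and with far more difficulty.

In a paper it is easy to claim that your work is new by comparing against a few recent works, but it is very hard for a piece of work to remain valuable when judged against the whole history of the discipline. This semester I started attending Alan's meetings; he is the director of LIDS, another large lab at MIT. Once, in a meeting with Alan, people brought up some recently published algorithms, and he said that these things "has been done 40 years ago". He is very nice, but it is hard for a piece of work to win his approval. That time I presented my new work to him for 40 minutes, and he regarded much of it as already solved in mathematics (although no such publication had appeared in vision). Fortunately there was still one point on which he remarked "I have never seen people working on this". Over the next two weeks I put a lot of time into thinking about that point and found that it really is a valuable problem.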

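The professors I have met here are all very nice and normally do not interfere much with their students' work, but they are very demanding when evaluating a piece of work. John told me to solve the hardest problems and leave the easy ones for others. For the past half year, free of CVPR's baton, I have been moving forward slowly along my own path, bit by bit, but on solid footing. When I first arrived I was not used to the atmosphere at MIT, where CVPR and NIPS alike seemed not to matter much. Only now do I gradually see that you have to step out from under the baton of the conferences to escape the restlessness and do exploration that is genuinely meaningful.

In two weeks it will be my turn to choose the paper for the reading group. Since we have spent so long discussing papers on information theory and statistics, I said I wanted a change and would look for a vision paper. John agreed, on the condition that it be a truly good paper that has stood the test of time. Right now I do not know which paper would meet that bar.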

Monday, February 18, 2008

Kernels, distances and strings

Kernels and distances are closely related. Given a kernel K(x,z), this induces an RKHS H such that K(x,z)=<f(x),f(z)>, where f is the mapping from the input space to H, and <.,.> is the dot product in H. Dot products induce norms in the obvious way: ||x||^2 = K(x,x). This, in turn, induces an obvious distance metric: d(x,z)^2 = K(x,x)+K(z,z)-2K(x,z).
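
The distance formula above is just ||f(x)-f(z)||^2 expanded with the kernel, so it can be computed without ever touching f. A minimal Python sketch, assuming inputs are tuples of floats; the helper names (linear_kernel, rbf_kernel, kernel_distance) and the gamma parameter are illustrative and not from the post:

```python
import math


def linear_kernel(x, z):
    """Plain dot product: K(x,z) = <x,z>."""
    return sum(a * b for a, b in zip(x, z))


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian kernel: K(x,z) = exp(-gamma * ||x-z||^2)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq)


def kernel_distance(k, x, z):
    """Distance induced by kernel k: sqrt(K(x,x) + K(z,z) - 2K(x,z))."""
    sq = k(x, x) + k(z, z) - 2.0 * k(x, z)
    # Clamp tiny negative values caused by floating-point rounding.
    return math.sqrt(max(sq, 0.0))


if __name__ == "__main__":
    print(kernel_distance(linear_kernel, (0.0, 0.0), (3.0, 4.0)))
    print(kernel_distance(rbf_kernel, (0.0, 0.0), (3.0, 4.0)))
```

With the linear kernel this reduces to the ordinary Euclidean distance. With the Gaussian kernel every point has K(x,x)=1, so the induced distance can never exceed sqrt(2).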

On the other hand, we also know how to turn distance metrics into kernels. If (X,d) is a metric space (i.e., X is a space and d is a metric on X), then K(x,z)=exp(-d(x,z)) is positive semi-definite when x,z are drawn from X.

Now, there are three questions that arise. The first is: in the distance->kernel setting, what does the RKHS "look like." I think the sense here is that it's basically the same as the RKHS induced by the Gaussian kernel: it's an infinite-dimensional feature space where different dimensions look like distances to some point. In practice, this "some point" is going to be one of the training points.

The second question follows the fact that we can iterate this construction. One way to ask this is: if we use a kernel to induce a distance, then use this distance to induce a new kernel, what does this new kernel look like? Or, alternatively, if we have a distance, kernelize it to induce a new distance, then what does this distance look like? I don't have a good intuitive sense for the answer to any of these. Obviously it's straightforward to write out what these things look like, but that doesn't directly give me a sense of what's going on.
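
One way to make the iterated construction concrete is to write it as a function that takes a kernel and returns the kernel built from its induced distance. The sketch below is an illustration under assumptions: it puts the squared induced distance in the exponent (the post writes exp(-d)), and the names iterate_kernel and induced_sq_distance are made up here.

```python
import math


def induced_sq_distance(k, x, z):
    """Squared distance induced by kernel k: K(x,x) + K(z,z) - 2K(x,z)."""
    return max(k(x, x) + k(z, z) - 2.0 * k(x, z), 0.0)


def iterate_kernel(k):
    """Return the kernel K'(x,z) = exp(-d_k(x,z)^2) built from k's distance.

    Using the squared distance is an assumption of this sketch.
    """
    def new_kernel(x, z):
        return math.exp(-induced_sq_distance(k, x, z))
    return new_kernel


def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))


# One step from the linear kernel, then a second step from the result.
once = iterate_kernel(linear_kernel)
twice = iterate_kernel(once)

if __name__ == "__main__":
    x, z = (0.0, 0.0), (1.0, 0.0)
    print(once(x, z), twice(x, z))
```

Starting from the linear kernel, one step gives exp(-||x-z||^2), which is the Gaussian kernel. A second step gives exp(-(2 - 2 exp(-||x-z||^2))), which is bounded below by exp(-2) because the Gaussian-induced squared distance never exceeds 2. This only writes the objects out; it does not by itself answer the intuition question.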

The final question is a bit "off topic" from the above two, but still relevant. There's been a lot of work in kernels for discrete data, like strings. The most popular string subsequence kernel is based on counting how many subsequences match between two strings, down-weighted by the length of the subsequences. It's well known, and recognized in string kernel papers, that a kernel of the form K(x,z) = 1-ned(x,z), where ned is a normalized string edit distance, is not a valid kernel. However, from what we know about distance metrics, K(x,z) = exp(-sed(x,z)), where sed is the string edit distance, should be a valid kernel. Yet I'm not aware of this being used anywhere. Is there any reason why? The reason I ask is that it's much more intuitive than the subsequence kernel, which I only have a vague sense about.
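
Whether exp(-sed) behaves like a kernel on a particular set of strings can at least be probed numerically: build the Gram matrix and test whether it is positive definite. The sketch below assumes unit-cost Levenshtein distance for sed; the function names and the scale parameter t are illustrative, and a passing check on one sample proves nothing about the kernel in general.

```python
import math


def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def edit_kernel(x, z, t=1.0):
    """Candidate kernel exp(-t * sed(x,z)); t is an assumed scale parameter."""
    return math.exp(-t * levenshtein(x, z))


def gram_matrix(strings, t=1.0):
    return [[edit_kernel(x, z, t) for z in strings] for x in strings]


def is_positive_definite(m, tol=1e-12):
    """Attempt a Cholesky factorization; it succeeds only for PD matrices."""
    n = len(m)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = m[i][i] - s
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (m[i][j] - s) / lower[j][j]
    return True


if __name__ == "__main__":
    sample = ["cat", "cart", "dog"]
    print(is_positive_definite(gram_matrix(sample)))
```

A single failing sample would be enough to show that the candidate is not a valid kernel, so this kind of check is more useful for finding counterexamples than for confirming validity.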

9 comments:

Fernando Pereira said...

Quick comment on your 3rd point: K(x,z) is proportional to the probability x -> z in a probabilistic mutation model with the log-odds of mutations given by the edit costs. Hum...

Suresh said...

don't you mean exp(-d^2), rather than exp(-d) ?

also, what about the Haussler convolution kernel for strings ?

hal said...

suresh: right, d^2.

fernando: so let's say we use edit distance to induce a kernel. based on your comment, we can think of the kernel value to be the probability of an automaton mapping one string to the other. the kernel-induced distance then looks like some distance between (probabilistic) automata that do the mapping. that actually sounds kind of interesting :). i have no idea what happens if you iterate again, though, and create a new kernel based on this.

Bob Carpenter said...

For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance. You get good character n-gram subsequence relations between "Smith, John" and "John Smith" even though they're miles apart in terms of character-level edit distances.

There are richer probabilistic edit distances like the ones introduced by Brill and Moore for spelling and by McCallum, Bellare and Pereira for word skipping and other general edits. These, in general, don't have negative logs that (when offset from match cost) form a proper metric like Levenshtein distance.

I don't know much about kernels, but if K(x,y) = exp(-d(x,y)**2) always produces a kernel if d is a proper metric, then the question arises of when a probabilistic string transducer defining p(s1|s2) defines a metric. I think that reduces to when:

d(s1,s2) = - log p(s1|s2) + log p(s2|s2)

forms a metric (the second term is so that d(s,s) = 0.0).

Plain Levenshtein distance with uniform edit costs defines a distance metric, but needs some fiddling to turn into a probability distribution (sum of all operations, including matching, must have probability 1.0).
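
The "Smith, John" versus "John Smith" contrast from this comment is easy to reproduce. The sketch below uses plain character-trigram counts with cosine similarity (no TF/IDF weighting) and unit-cost Levenshtein distance; both choices, and all function names, are assumptions made for illustration rather than the setup used by the commenter.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Multiset of overlapping character n-grams, spanning word boundaries."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ngram_similarity(x, z, n=3):
    return cosine(char_ngrams(x, n), char_ngrams(z, n))


def levenshtein(a, b):
    """Unit-cost edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    x, z = "Smith, John", "John Smith"
    print("trigram cosine:", ngram_similarity(x, z))
    print("edit distance: ", levenshtein(x, z))
```

The two strings share the trigrams inside "Smith" and "John", so the cosine similarity stays well above zero, while the character-level edit distance is at least six on strings of only ten or eleven characters.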

hal said...

bob:

"For applications in which token reorderings are likely, basic subsequence comparison works better than simple edit distance." -- is this true or is it just plausible? I.e., has this effect actually been verified? I definitely find it plausible, but are there cases where it actually works out that way? What about when you're talking about words instead of just characters?

There are also a ton of edit distances that William Cohen has proposed and even more that he's compared. If they're actually metrics, then these could also be easily kernelized.

Bob Carpenter said...

Yes, if you mean do character n-grams work better than edit distance for matching.

Last year, we worked on a database linkage and deduplication problem for film names and actor names, and indeed found character n-grams with TF/IDF weighting a reasonable comparison metric. It put almost all the string matching true positives above an easily identified threshold, with only a few residuals where you had things like names transliterated from the original language versus translated.

We've also used this technique for entity transliteration detection, as in finding variants of "Al Jazeera". These probably would've worked OK with edit distance, too.

Substring character n-grams neatly deal with issues such as diacritics (only a small penalty for mismatch), minor case variation (e.g. "University Of Michigan" vs. "Univ. of Michigan"), varying spellings of titles (e.g. "Star Wars IV" vs. "Star Wars Four"), and various token orders (e.g. "La Traviata" vs. "Traviata, La").

I've also used them for word-sense disambiguation in our tutorial, using both a TF/IDF form of classification and a k-nearest neighbors built on character n-gram dimensions. Again, you get significant robustness boosts over whole word matchers.

Note that we extract character n-grams across word boundaries, so you get some higher-order token-like effects for free. The bag of words assumption is particularly bad for text classifiers.

Character n-grams also work very well for general robust search over text. I'd like to see them compared to character n-gram language models for search. They're actually the norm for languages like Chinese that are impossible to tokenize reliably (i.e. state of the art is 97-98%). And they're also common for transcribed speech at a phonemic or syllabic lattice level.

There'd obviously be rule-based ways to handle all the things mentioned above, as well as variations due to pronunciation, whole word re-orderings, deletions (e.g. the affine edit distances used for genomic/proteomic matching).

I like the idea behind Cohen et al.'s soft TF/IDF:

http://www.cs.cmu.edu/~wcohen/postscript/kdd-2003-match-ws.pdf

But I can't understand either where the IDF is computed or whether the resulting "distance" is even symmetric.

The Jaro-Winkler string comparison is a custom model designed by Jaro and modified by Winkler for matching individual first or last names.

Huzefa said...

String kernels are highly popular for protein sequence classification problems (as an example). Here are some references below. The second is some of my doctoral work involving the use of string kernels with a biological similarity measure. Using such a biological measure makes the kernel not positive semi-definite. We work around this using an eigenvalue transformation.

Christina S. Leslie, Eleazar Eskin, Adiel Cohen, Jason Weston, and William Stafford Noble. Mismatch string kernels for discriminative protein classification. Bioinformatics 20: 467-476.

Huzefa Rangwala and George Karypis. Profile-based Direct Kernels for Remote Homology Detection and Fold Recognition. Bioinformatics 21(23): 4239-4247 (2005).

Andre Martins said...

It is not true that "if (X,d) is a metric space then K(x,z)=exp(-t * d^2(x,z)) is positive definite". d must be a "Hilbertian distance", that is, a distance arising from an inner product in an RKHS; not every metric (under the axioms of nonnegativity, symmetry and triangle inequality) is allowed. In particular the string edit distance is NOT Hilbertian, therefore K(x,z)=exp(-t * sed^2(x,z)) is not pd. See for example:

Corinna Cortes, Patrick Haffner and Mehryar Mohri, "Positive Definite Rational Kernels", Proceedings of The 16th Annual Conference on Computational Learning Theory (COLT 2003)

from http://nlpers.blogspot.com/2008/02/kernels-distances-and-strings.html