I'm an undergraduate student at Peking University. I worked as an intern at Knowledge Computing Group, Microsoft Research Asia (MSRA) from 2017.11 to 2018.6, where I was supervised by Chin-Yew Lin. I went to Johns Hopkins University as a visiting researcher from 2018.7 to 2018.10, where I was supervised by Jason Eisner. Currently, I'm remotely collaborating with Hongyuan Mei and Jason Eisner.
This toolkit is used in my EMNLP 2018 paper (Learning Latent Semantic Annotations for Grounding Natural Language to Structured Data). In this paper, we intend to assign tags to phrases, or text spans with various lengths, and we adopt the semi-Markov models as our alignment model. Our model is trained in an unsupervised fashion with the forward-backward algorithm since there are no available labels.
However, due to the notorious garbage collection issue, some infrequent tags tend to be aligned to meaningless words (such as "the", "and" or punctuations), instead of the "NULL" tag. Thus we adopt the posterior regularization approach, which could add soft statistical constraints into the Expectation step of the EM algorithm. We add a constraint to encourage our model to assign more "NULL" tags, which is very effective.
We also adopt the NULL-skip trick that is commonly used word alignment model in the statistical machine translation. That is, when calculating the transition probability, all the "NULL" tags will be omitted for they could intercept the Markov chain.
Our implementation is very efficient with off-the-shelf matrix computation package. We provide a very flexible interface for users to customize their models.[Click to hide details]
Training RNNs with extremely long sequences (e.g., T > 10,000) is infeasible due to the limit of memory. Normally our algorithm will memorize every intermediate state for backpropagation. However, our algorithm doesn't necessarily need to memorize all nodes. It could store only some key nodes, and reconstruct other intermediate states when they are needed.
I design a binary tree to maintain those stored key nodes recursively. (refer to the animation if the repo for details of this algorithm) This could reduce the space complexity from O(n) to O(log n), but increase the time complexity from O(n) to O(n log n) -- thus it is called sublinear backpropagation.
Interestingly, in practice, our algorithm is faster than the original algorithm. This is counterfactual but understandable. When the sequence is extremely, although our memory may be large enough to store all the intermediate states, it still may cost a huge amount of time to acquire a large memory space and do I/O operations on it.
This toolkit is released with a very flexible interface. Any customized RNNs are acceptable. Moreover, stacked RNNs and bidirectional RNNs are also acceptable. However, stacked bidirectional RNNs are not feasible due to theoretical limitations.[Click to hide details]