Chinese Lipreading with RNN

(This is a brief summary of my undergraduate thesis project, which won a university-level excellence award.)

Features of my thesis

System architecture

Figure: The architecture of the lipreading system. English translations of the labels in the figure: 1. frames 2. face detection 3. fitting contour model 4. segmentation 5. vocal segment 7. Chinese words and sentences

Each video frame is passed through the face detection and face alignment modules. At this point the original video has been converted into a sequence of lip features. This sequence is then split into two types of segments: vocal segments, in which the speaker is talking, and silent segments, in which the speaker stays silent. The vocal segments are fed to the RNN, which classifies them and outputs the corresponding Chinese words and sentences.
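The segmentation step can be illustrated with a minimal sketch. This summary does not describe the actual segmentation method, so the code below assumes a simple rule: a frame is vocal when a per-frame lip openness value exceeds a threshold, and runs shorter than a minimum length are discarded. The function name `find_vocal_segments`, the openness feature, and the parameters `threshold` and `min_frames` are all hypothetical.

```python
def find_vocal_segments(openness, threshold=0.5, min_frames=3):
    """Return (start, end) frame index pairs of vocal segments, end exclusive.

    Assumption: `openness` is one scalar lip feature per frame, and a frame
    counts as vocal when its value is above `threshold`. This is an
    illustrative rule, not the method used in the thesis.
    """
    segments = []
    start = None  # index where the current vocal run began, if any
    for i, value in enumerate(openness):
        if value > threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run just ended; keep it only if it is long enough.
            if i - start >= min_frames:
                segments.append((start, i))
            start = None
    # Close a run that lasts until the final frame.
    if start is not None and len(openness) - start >= min_frames:
        segments.append((start, len(openness)))
    return segments


features = [0.1, 0.1, 0.8, 0.9, 0.7, 0.1, 0.9, 0.1, 0.6, 0.6, 0.6]
print(find_vocal_segments(features))  # [(2, 5), (8, 11)]
```

Everything outside the returned ranges is treated as silent, and only the returned ranges would be passed on to the RNN. The single-frame spike at index 6 is dropped because it is shorter than `min_frames`.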

Corpus

Because public Chinese corpora are rare, the author recorded the corpus himself. There is only one speaker, so all experiments are speaker-dependent. The corpus contains nine words: three 2-character words, three 3-character words, and three 4-character words. In the training set, each word is repeated at least 10 times; in the validation set, each word is repeated exactly 2 times.

Overall Performance

If a lipreading result counts as correct if and only if both the segmentation result and the LSTM output are correct, the system achieves 84.85% accuracy on the training set and 91.67% accuracy on the validation set.
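The strict criterion above can be written down as a short sketch. The `Sample` record and its field names are hypothetical and chosen only for illustration; the example data is made up and is not taken from the thesis results.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    segmentation_ok: bool  # was the vocal segment located correctly?
    predicted: str         # word output by the LSTM
    expected: str          # ground-truth word


def joint_accuracy(samples):
    """Fraction of samples where segmentation AND classification are both right."""
    if not samples:
        raise ValueError("samples must not be empty")
    correct = sum(
        1 for s in samples if s.segmentation_ok and s.predicted == s.expected
    )
    return correct / len(samples)


# Made-up example: only the first and last samples satisfy both conditions.
demo = [
    Sample(True, "A", "A"),
    Sample(False, "B", "B"),  # right word, wrong segmentation: counted as wrong
    Sample(True, "C", "A"),   # right segmentation, wrong word: counted as wrong
    Sample(True, "B", "B"),
]
print(joint_accuracy(demo))  # 0.5
```

Under this definition a segmentation error makes the sample wrong even when the predicted word matches, so the joint accuracy can never exceed the accuracy of either stage alone.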

If accuracy is instead measured by the angle between word vectors, the angle is 10.07° on the training set and 16.10° on the validation set.
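For reference, the angle between two vectors follows from the cosine of their normalized dot product. The sketch below assumes the comparison is between a network output vector and a target word vector, which this summary does not state explicitly; the function name `angle_degrees` is hypothetical. A smaller angle means the two vectors point in more similar directions.

```python
import math


def angle_degrees(u, v):
    """Angle between vectors u and v in degrees, via the cosine formula."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


# Hypothetical one-hot target and an output vector that mostly agrees with it.
target = [1.0, 0.0, 0.0]
output = [0.9, 0.1, 0.1]
print(round(angle_degrees(output, target), 2))
```

Identical directions give 0° and orthogonal vectors give 90°, so a dataset-level figure would presumably be an average of such per-sample angles.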

Conclusion