Combining Common Sense Rules and Machine Learning to Understand Object Manipulation
Automatic situation understanding in videos has improved remarkably in recent years. However, state-of-the-art methods still have considerable shortcomings: they usually require training data for each object class present, and they may have high false positive or false negative rates, making them impractical for general applications. We study a case with a limited goal in a narrow context and discuss the complexity of the general problem. We propose to solve this problem by incorporating common sense rules and by exploiting various state-of-the-art deep neural networks (DNNs) as detectors for the conditions of those rules.
We deal with the manipulation of unknown objects at a remote table. Two action types are to be detected: `picking up an object from the table' and `putting an object onto the table'; due to the remote monitoring setting, we restrict ourselves to monocular observation. We quantitatively evaluate the performance of the system on manually annotated video segments and present precision and recall scores. We also discuss issues in machine reasoning. We conclude that the proposed neural-symbolic approach (a) diminishes the required size of the training data and (b) enables new applications where labeled data are difficult or expensive to obtain.
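To make the neural-symbolic idea concrete, the following minimal sketch shows one way a common sense rule could be evaluated over per-frame detector outputs. All names here are illustrative assumptions, not the paper's implementation: `hand_near_object` and `object_on_table` stand in for conditions that DNN detectors (e.g. hand and object-mask networks) would supply, and the rule fires a `pick up` or `put down` event when the on-table state flips while a hand is near the object.

```python
# Hypothetical sketch of a rule over detector outputs; field names and the
# rule itself are illustrative assumptions, not the paper's actual system.
from dataclasses import dataclass


@dataclass
class Frame:
    hand_near_object: bool  # condition, e.g. from a hand detector + object masks
    object_on_table: bool   # condition, e.g. from table-plane segmentation


def detect_events(frames):
    """Fire an event when the object's on-table state changes between
    consecutive frames while a hand is near the object."""
    events = []
    for prev, curr in zip(frames, frames[1:]):
        if curr.hand_near_object:
            if prev.object_on_table and not curr.object_on_table:
                events.append("pick up")
            elif not prev.object_on_table and curr.object_on_table:
                events.append("put down")
    return events
```

The DNN detectors only need to estimate the rule's conditions frame by frame; the symbolic rule supplies the temporal logic, which is why no action-specific training data is required in this scheme.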