Tutorials

The following tutorials have been accepted for presentation at ACM Multimedia 2019:

Monday 21 October 2019
Morning:
Learning from 3D (Point Cloud) Data
AutoML and Meta-learning for Multimedia
Afternoon:
Multimedia Forensics
A Journey towards Fully Immersive Media Access

Friday 25 October 2019
Morning:
Principle-to-program: Neural fashion recommendation with multi-modal input
Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data
Afternoon:
Medical Multimedia Systems and Applications
Multimodal Data Collection for Social Interaction Analysis In-the-Wild


Learning from 3D (Point Cloud) Data

Morning, Monday 21 October 2019

Winston H. Hsu (National Taiwan University, Taiwan)

Learning on (3D) point clouds is vital for a broad range of emerging applications such as autonomous driving, robot perception, VR/AR, gaming, and security. The need has grown recently due to the prevalence of 3D sensors such as LiDAR, 3D cameras, and RGB-D cameras. Point clouds consist of thousands to millions of points and are complementary to the traditional 2D cameras the vision (and multimedia) community has worked with for years. 3D learning algorithms on point cloud data are new and exciting for numerous core problems such as 3D classification, detection, semantic segmentation, and face recognition. The tutorial covers the characteristics of point cloud data, how such data is captured, 3D representations, emerging applications, core problems, state-of-the-art learning algorithms (e.g., voxel-based and point-based methods), and future research opportunities. We will also showcase our leading work on several 3D benchmarks such as ScanNet and KITTI.
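
For illustration, the point-based family of methods mentioned above (e.g., PointNet-style architectures) processes every point with a shared per-point network and then aggregates with a symmetric operation such as max pooling, which makes the result invariant to point ordering. The following is a minimal NumPy sketch of that idea, not the presenters' actual implementation; all weights and sizes are placeholders.

```python
import numpy as np

def point_features(points, w1, w2):
    """Shared per-point MLP followed by a symmetric max-pool, so the output
    does not depend on the order of the points (PointNet-style idea).

    points: (N, 3) array of xyz coordinates
    w1:     (3, 64) weights of the first shared layer (placeholder values)
    w2:     (64, 128) weights of the second shared layer (placeholder values)
    returns a (128,) global feature vector for the whole cloud
    """
    h = np.maximum(points @ w1, 0.0)  # per-point ReLU features, shape (N, 64)
    h = np.maximum(h @ w2, 0.0)       # per-point ReLU features, shape (N, 128)
    return h.max(axis=0)              # order-invariant aggregation

# Toy usage with a random cloud of 1024 points and random weights.
rng = np.random.default_rng(0)
cloud = rng.normal(size=(1024, 3))
w1 = rng.normal(scale=0.1, size=(3, 64))
w2 = rng.normal(scale=0.1, size=(64, 128))
print(point_features(cloud, w1, w2).shape)  # (128,)
```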

Prof. Winston Hsu is an active researcher dedicated to large-scale image/video retrieval and mining, visual recognition, and machine intelligence. He is a Professor in the Department of Computer Science and Information Engineering, National Taiwan University. He and his team have been recognized with technical awards in the multimedia and computer vision research communities, including the IBM Research Pat Goldberg Memorial Best Paper Award (2018), the Best Brave New Idea Paper Award at ACM Multimedia 2017, First Place in the IARPA Disguised Faces in the Wild Competition (CVPR 2018), First Prize in the ACM Multimedia Grand Challenge 2011, and the ACM Multimedia 2013/2014 Grand Challenge Multimodal Award. Prof. Hsu is keen on turning advanced research into business deliverables via academia-industry collaborations and co-founding startups. He was a Visiting Scientist at Microsoft Research Redmond (2014) and spent a one-year sabbatical (2016-2017) at the IBM T.J. Watson Research Center. He served as Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) and IEEE Transactions on Multimedia, two premier journals, and was on the Editorial Board of IEEE Multimedia Magazine (2010-2017).


AutoML and Meta-learning for Multimedia

Morning, Monday 21 October 2019

Wenwu Zhu (Tsinghua University, China)

Xin Wang (Tsinghua University, China)

Wenpeng Zhang (Tsinghua University, China)

This tutorial disseminates and promotes recent research achievements on AutoML and meta-learning, as well as their applications to multimedia, an exciting and fast-growing direction in the general field of machine learning and multimedia. We will present novel, high-quality research findings as well as innovative solutions to challenging problems in AutoML and meta-learning for multimedia. The tutorial contains five sections: (1) research and industrial motivation; (2) hyperparameter optimization for multimedia applications; (3) neural architecture search and its applications in multimedia; (4) meta-learning for multimedia; and (5) discussions and future directions.
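
As a taste of the hyperparameter optimization topic, the sketch below shows plain random search over a small, hypothetical search space; the parameter names and the stand-in validation function are illustrative assumptions and do not reflect the specific methods covered in the tutorial.

```python
import random

# Hypothetical search space for a multimedia classifier.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 3e-4, 1e-3, 3e-3],
    "batch_size": [16, 32, 64],
    "dropout": [0.0, 0.2, 0.5],
}

def validation_score(config):
    """Stand-in for training a model with `config` and returning its
    validation accuracy; replace with a real training/evaluation loop."""
    return random.random()

def random_search(space, n_trials=20, seed=0):
    """Sample configurations at random and keep the best-scoring one."""
    random.seed(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: random.choice(values) for name, values in space.items()}
        score = validation_score(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

print(random_search(SEARCH_SPACE))
```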

Biographies

Prof. Wenwu Zhu is currently a Professor and the Vice Chair of the Department of Computer Science and Technology at Tsinghua University. Prior to his current post, he was a Senior Researcher and Research Manager at Microsoft Research Asia. He was the Chief Scientist and Director at Intel Research China from 2004 to 2008, and worked at Bell Labs New Jersey as a Member of Technical Staff during 1996-1999. Wenwu Zhu is an AAAS Fellow, IEEE Fellow, SPIE Fellow, and a member of the Academy of Europe (Academia Europaea). He has published over 300 refereed papers in the areas of multimedia computing, communications and networking, and big data, and is inventor or co-inventor of over 50 patents. He has received seven Best Paper Awards, including at ACM Multimedia 2012 and from IEEE Transactions on Circuits and Systems for Video Technology in 2001. He served on the steering committees of IEEE Transactions on Multimedia (2015-2016) and IEEE Transactions on Mobile Computing (2007-2010), as TPC Co-Chair for ACM Multimedia 2014 and IEEE ISCAS 2013, and as General Co-Chair for ACM Multimedia 2018 and ACM CIKM 2019.

Dr. Xin Wang is currently a postdoctoral researcher in the Department of Computer Science and Technology, Tsinghua University. He received his Ph.D. and B.E. degrees in Computer Science and Technology from Zhejiang University, China, and also holds a Ph.D. in Computing Science from Simon Fraser University, Canada. His research interests include cross-modal multimedia intelligence and recommendation in social media. He has published several high-quality research papers in top conferences, including ICML, KDD, WWW, MM, SIGIR, AAAI, IJCAI, and CIKM.

Dr. Wenpeng Zhang obtained his Ph.D. in machine learning from Tsinghua University, China, in 2018. He has published several papers in top-tier conferences and journals, including ICML, WWW, KDD, and TKDE. His research interests now lie in online learning, meta-learning, and AutoML. He led the team Meta_Learners, which won second place in the NeurIPS 2018 AutoML Competition (https://arxiv.org/abs/1903.05263).


Multimedia Forensics

Afternoon, Monday 21 October 2019

Luisa Verdoliva (University Federico II of Naples, Italy)

Paolo Bestagini (Politecnico di Milano, Italy)

With the availability of powerful and easy-to-use media editing tools, falsifying images and videos has become widespread in the last few years. Coupled with ubiquitous social networks, this allows for the viral dissemination of fake news and raises huge concerns about multimedia security. The scenario has become even worse with the advent of deep learning: new, sophisticated methods have been proposed to accomplish manipulations that were previously unthinkable (e.g., deepfakes). This tutorial will present the most reliable methods for the detection of manipulated images and for source identification, important tools nowadays for fact checking and authorship verification. State-of-the-art solutions exploiting either model-based or data-driven techniques will be presented. The most innovative approaches based on deep learning will be described, considering both supervised and unsupervised settings. Results will be presented on challenging datasets and realistic scenarios.
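
As a flavor of the model-based techniques, many forensic detectors suppress scene content and analyse the remaining high-frequency noise residual for local inconsistencies. The sketch below computes a simple box-filter residual; it is only an illustration of the general idea, not one of the specific detectors presented in the tutorial.

```python
import numpy as np

def noise_residual(image):
    """Subtract a 3x3 local average from each pixel so that scene content is
    suppressed and the high-frequency noise component remains; forensic
    detectors look for regions whose residual statistics differ from the rest.

    image: 2-D grayscale array (float); returns an array of the same shape.
    """
    kernel = np.full((3, 3), 1.0 / 9.0)
    padded = np.pad(image, 1, mode="reflect")
    h, w = image.shape
    smoothed = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            smoothed += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return image - smoothed

# Toy usage on a random "image".
img = np.random.rand(64, 64)
res = noise_residual(img)
print(res.shape, float(res.std()))
```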

Luisa Verdoliva is an Associate Professor at University Federico II of Naples (Italy). Her scientific interests are in the field of image and video processing, with main contributions in the area of multimedia forensics. She is the Principal Investigator for the Research Unit of University Federico II of Naples in the DISPARITY (Digital, Semantic and Physical Analysis of Media Integrity) project funded by DARPA under the MEDIFOR program. She is an Associate Editor for IEEE Transactions on Information Forensics and Security. This year she is General Co-Chair of the ACM Workshop on Information Hiding and Multimedia Security and Technical Program Chair of the IEEE Workshop on Information Forensics and Security. She has led her research group in several international contests, including the 2018 IEEE Signal Processing Cup on camera model identification (first prize) and the 2013 IEEE Image Forensics Challenge (first prize). This year she received a Google Faculty Research Award.

Paolo Bestagini is an Assistant Professor at the Image and Sound Processing Group, Politecnico di Milano, where he teaches Multimedia Signal Processing and Audio Signals. His research interests focus on multimedia forensics, acoustic signal processing for microphone arrays, and machine learning for geophysical data processing. He is involved in multiple international research projects. He is Co-Principal Investigator for Politecnico di Milano in the projects “MEDIFOR – Media Forensics Integrity Analytics”, “Forensic Analysis of Scientific Images”, and “Forensic Analysis of Overhead Images”, funded by DARPA under the MEDIFOR program (2016-2020). He was a Scientific Investigator for the projects “SCENIC – Self-Configuring Environmental-aware Intelligent Acoustic Sensing” and “REWIND – Reverse Engineering of Audio-Visual Content Data”, funded by the European Commission under the FP7 framework (2010-2014). He has been an elected member of the IEEE Information Forensics and Security Technical Committee since 2017, and was a co-organizer of the IEEE Signal Processing Cup 2018 on image source attribution.


A Journey towards Fully Immersive Media Access

Afternoon, Monday 21 October 2019

Christian Timmerer (Alpen-Adria-Universität Klagenfurt and Bitmovin, Inc., Austria)

Ali C. Begen (Özyeğin University and Networked Media, Turkey)

Universal media access, as proposed almost two decades ago, is now a reality. We can generate, distribute, share, and consume any media content, anywhere, anytime, and with/on any device. A technical breakthrough was adaptive streaming over HTTP, which resulted in the standardization of MPEG-DASH, now successfully deployed in a plethora of environments. The next big thing in adaptive media streaming is virtual reality applications and, specifically, omnidirectional (360°) media streaming, which is currently built on top of the existing adaptive streaming ecosystems. This tutorial provides a detailed overview of adaptive streaming of both traditional and omnidirectional media. It focuses on the basic principles and paradigms for adaptive streaming as well as on already deployed content generation, distribution, and consumption workflows. Additionally, the tutorial provides insights into standards and emerging technologies in the adaptive streaming space. Finally, it covers the latest approaches for immersive media streaming enabling 6DoF DASH through Point Cloud Compression (PCC) and concludes with open research issues and industry efforts in this domain. More information is available at: https://multimediacommunication.blogspot.com/2019/07/acmmm19-tutorial-journey-towards-fully.html
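
To make the adaptation principle behind HTTP adaptive streaming concrete, the sketch below shows a minimal rate-based heuristic that picks, for each segment, the highest representation fitting within a safety fraction of the measured throughput. The bitrate ladder, safety factor, and measurements are hypothetical; deployed players use considerably more sophisticated logic (e.g., buffer-based or hybrid algorithms).

```python
# Hypothetical bitrate ladder (kbit/s), as it might be advertised in a DASH manifest.
REPRESENTATIONS = [400, 800, 1500, 3000, 6000]

def pick_representation(throughput_kbps, safety=0.8):
    """Rate-based heuristic: choose the highest representation whose bitrate
    fits within a safety fraction of the measured throughput."""
    budget = throughput_kbps * safety
    candidates = [r for r in REPRESENTATIONS if r <= budget]
    return candidates[-1] if candidates else REPRESENTATIONS[0]

# Simulated per-segment throughput measurements (kbit/s).
for throughput in [5200, 4100, 900, 2600, 7000]:
    print(f"throughput {throughput:>5} kbit/s -> request {pick_representation(throughput)} kbit/s")
```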

Christian Timmerer received his M.Sc. in January 2003 and his Ph.D. in June 2006, both from the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently an Associate Professor at the Institute of Information Technology (ITEC), where he heads the Christian Doppler (CD) Laboratory ATHENA. His research interests include immersive multimedia communication, streaming, adaptation, and Quality of Experience. He was the general chair of QoMEX’13, MMSys’16, and PV’18 and has participated in EC-funded projects (e.g., SocialSensor, QUALINET, ICoSOLE). He also participated in ISO/MPEG work (e.g., MPEG-21, MPEG-M, MPEG-V, and MPEG-DASH, where he also served as standard editor). In 2013, he co-founded Bitmovin (http://www.bitmovin.com/), where he holds the position of Chief Innovation Officer (CIO) – Head of Research and Standardization. He is a Senior Member of the IEEE and a member of the ACM. Further information is available at http://blog.timmerer.com.

Ali C. Begen is the co-founder of Networked Media, a technology company that offers consulting services in the IP video space. Between 2007 and 2015, he was with the Video and Content Platforms Research and Advanced Development Group at Cisco, where he designed and developed algorithms, protocols, products, and solutions in the service provider and enterprise video domains. Currently, he is also affiliated with Ozyegin University, where he teaches and advises students in the computer science department. Ali has a PhD in electrical and computer engineering from Georgia Tech. To date, he has received a number of academic and industry awards and has been granted 30+ US patents. He is a Senior Member of both the IEEE and ACM. In 2016, he was elected a Distinguished Lecturer by the IEEE Communications Society, and in 2018 he was re-elected for another two-year term. More details are at http://ali.begen.net.


Principle-to-program: Neural fashion recommendation with multi-modal input

Morning, Friday 25 October 2019

Muthusamy Chelliah (Flipkart, India)

Soma Biswas (Indian Institute of Science, India)

Lucky Dhakad (Flipkart, India)

Outfit recommendation automatically pairs user-specified reference clothing with the most suitable complementary items from online shops. Aesthetic compatibility is a key criterion for matching such fashion items. Fashion style says a lot about one’s personality and emerges from how people assemble outfits from seemingly disjoint items into a cohesive concept. Experts share fashion tips by showcasing their compositions to the public, where each item carries both an image and textual metadata. Retrieving products from online shopping catalogs in response to such real-world image queries is also essential for outfit recommendation. We cover style and compatibility in fashion recommendation, building on metric learning and deep learning approaches introduced elsewhere. We additionally present visual signals more broadly (e.g., cross-scenario retrieval, attribute classification) and combine textual input (e.g., interpretable embeddings). Each section concludes by walking through programs executed in Jupyter notebooks on real-world datasets.

We would like to offer the opportunity to execute code fragments from the original authors of a few of the papers covered in our session. If you are interested in taking advantage of this opportunity, please bring your laptop. You can install Jupyter Notebook in advance as described here: https://www.digitalocean.com/community/tutorials/how-to-install-run-connect-to-jupyter-notebook-on-remote-server
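
The hands-on code itself comes from the original authors of the covered papers; as a warm-up for the metric-learning view of compatibility, the toy sketch below scores two item embeddings by cosine similarity. The embeddings here are made up; in practice they would be learned from item images and text so that compatible items score higher.

```python
import numpy as np

def compatibility(u, v):
    """Cosine similarity between two item embeddings; metric-learning
    approaches train the embedding space so compatible items score higher."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Made-up embeddings standing in for a shirt, a matching skirt, and boots.
shirt = [0.9, 0.1, 0.3]
skirt = [0.8, 0.2, 0.4]
boots = [-0.2, 0.9, 0.1]
print(compatibility(shirt, skirt))  # higher score -> more compatible
print(compatibility(shirt, boots))  # lower score  -> less compatible
```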

Dr. Muthusamy Chelliah heads external research collaboration for Flipkart, an online shopping pioneer from India with a recent majority-stake investment from Walmart. Chelliah holds a PhD in Computer Science from Georgia Tech, Atlanta, with a focus on distributed systems. He has over 20 years of experience at HP and Yahoo in corporate research and product engineering. He now helps engage academia on problems relevant to his industry (i.e., e-tailing), leveraging research in ML, IR, NLP, and data mining. Similar tutorials at other recent forums such as WebConf ‘19, ECIR ‘19, and RecSys ‘17 reflect his passion in this regard.

Dr. Soma Biswas is an Assistant Professor in the Department of Electrical Engineering at the Indian Institute of Science, Bangalore, India. Her research interests are in computer vision, pattern recognition, image processing, and related areas. She has published extensively in conferences (e.g., CVPR) and journals relevant to computer vision and multimedia. She is a Senior Member of the IEEE and was awarded the “IEEE Late Shri Pralhad P Chhabria Award for Best Female Engineer” in 2018.

Lucky Dhakad holds an MS in machine learning from the Indian Institute of Science, Bangalore. Presently, she works as a Data Scientist at Flipkart in the field of recommendation and ranking. She has published key results from her master’s thesis at CIKM ‘17.


Reproducibility and Experimental Design for Machine Learning on Audio and Multimedia Data

Morning, Friday 25 October 2019

Gerald Friedland (Lawrence Livermore Lab and University of California, Berkeley, USA)

This tutorial provides an actionable perspective on experimental design for machine learning experiments on multimedia data. The tutorial consists of lectures and hands-on exercises. The lectures provide a theoretical introduction to machine learning design and signal processing. The thought framework presented is derived from the traditional experimental sciences, which require published results to be self-contained with regard to reproducibility. In the practical exercises, we will work on calculating and measuring quantities such as capacity or generalization ratio for different machine learners and datasets, and discuss how these quantities relate to reproducible experimental design.
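
As a rough warm-up for the exercises (and not the formal definitions used in the tutorial), the sketch below contrasts training and held-out accuracy for two scikit-learn models on synthetic data and reports their ratio as a crude proxy for how well each learner generalizes; the data, the models, and the use of this ratio are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data; in the exercises this would be audio/multimedia features.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    # Crude proxy: how much of the training performance survives on held-out data.
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f} ratio={test_acc / train_acc:.2f}")
```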

Please bring paper, pencil, and a laptop.

The tutorial is based on a UC Berkeley graduate class; more information is available here: http://www.icsi.berkeley.edu/~fractor/spring2019/

I recommend printing and bringing this cheat sheet: http://www.icsi.berkeley.edu/~fractor/spring2019/ewExternalFiles/ML_Cheat_Sheet_0.97.pdf

Dr. Gerald Friedland is at Lawrence Livermore National Lab and also teaches as an adjunct professor in the Electrical Engineering and Computer Sciences department at the University of California, Berkeley. His work includes large-scale video retrieval, geo-location, and privacy. Recently he added a focus on the experimental design of machine learning experiments. He is the lead figure behind the Multimedia Commons initiative, a collection of 100M images and 1M videos for research, and has published more than 200 peer-reviewed articles in conferences, journals, and books. He also co-authored a new textbook on multimedia computing with Cambridge University Press. Dr. Friedland received his master’s degree and his doctorate (summa cum laude) in computer science from Freie Universitaet Berlin, Germany, in 2002 and 2006, respectively.


Medical Multimedia Systems and Applications

Afternoon, Friday 25 October 2019

Michael Riegler (Simula Metropolitan Center for Digital Engineering, Norway)

Pål Halvorsen (Simula Metropolitan Center for Digital Engineering, Norway)

Klaus Schoeffmann (Alpen-Adria-Universität Klagenfurt, Austria)

Over the last decade, we have observed an increasing need among clinicians for help from the multimedia community. The reason is that more and more videos and images are stored in hospital information systems for post-operative use. Although the storage itself is quite challenging due to the massive amount of data, even more serious problems arise when surgery videos are to be used for purposes such as teaching and training, retrospective analysis, visual analytics, and diagnostic decision support. In this tutorial we will give a broad overview of the medical multimedia field, including its requirements and characteristics, discuss existing work on medical video/image analysis, and outline open challenges and opportunities. We will cover several medical fields, such as laparoscopic gynecology, cholecystectomy, colonoscopy, and ophthalmology.

Dr. Michael Riegler is a senior researcher at SimulaMet and an Associate Professor at Kristiania University College. He received his Master’s degree from Klagenfurt University with distinction and finished his PhD at the University of Oslo in two and a half years. His PhD thesis topic was efficient processing of medical multimedia workloads. His research interests are medical multimedia data analysis and understanding, image processing, image retrieval, parallel processing, crowdsourcing, social computing, and user intent. He is involved in several initiatives, such as the MediaEval Benchmarking Initiative for Multimedia Evaluation, which runs the Medico task. Furthermore, he is part of an expert group for the Norwegian Council of Technology on machine learning for healthcare.

Dr. Pål Halvorsen is a chief research scientist at SimulaMet, a professor at Oslo Metropolitan University (OsloMet), an adjunct professor at the University of Oslo, Norway, and the CEO of ForzaSys AS. He received his doctoral degree (Dr.Scient.) in 2001. His research focuses mainly on complete distributed multimedia systems, including operating systems, processing, storage and retrieval, and communication and distribution, from a performance and efficiency point of view. He is a member of the IEEE and ACM. More information can be found at http://home.ifi.uio.no/paalh

Dr. Klaus Schoeffmann is an Associate Professor at the Institute of Information Technology (ITEC) at Klagenfurt University, Austria, where he received his habilitation (venia docendi) in Computer Science in 2015. He holds a PhD (with distinction) and an MSc (with distinction) in Computer Science. His research focuses on video content analysis (in particular of medical/surgery videos), multimedia retrieval, interactive multimedia, and deep learning. He has co-authored more than 120 publications on various topics in multimedia, including more than 40 on different aspects of medical video analysis. He has co-organized several international conferences, workshops, and special sessions in the field of multimedia. Klaus Schoeffmann is a co-founder and co-organizer of the Video Browser Showdown (VBS), an international annual live evaluation competition of interactive video search systems that started in 2012.


Multimodal Data Collection for Social Interaction Analysis In-the-Wild

Afternoon, Friday 25 October 2019

Hayley Hung (Delft University of Technology, The Netherlands)

Chirag Raman (Delft University of Technology, The Netherlands)

Ekin Gedik (Delft University of Technology, The Netherlands)

Stephanie Tan (Delft University of Technology, The Netherlands)

Jose Vargas-Quiros (Delft University of Technology, The Netherlands)

Abstract

The benefits of exploiting multi-modality in the analysis of human-human social behaviour have been demonstrated widely in the community. An important aspect of this problem is the collection of datasets that provide a rich and realistic representation of how people actually socialize with each other in real life. These subtle coordination patterns are influenced by individual beliefs, goals, and desires related to what an individual stands to lose or gain in the activities they perform in their everyday life. These conditions cannot be easily replicated in a lab setting and require a radical re-thinking of both how and what to collect. This tutorial provides a guide on how to create such multi-modal, multi-sensor datasets, considering the entire experimental design and data collection process. It will also include a debriefing for the ConfLab experiment run at the conference to encourage a community discussion on issues of privacy, data sharing, and ethical practices.

Hayley Hung is an Associate Professor at Delft University of Technology, The Netherlands, where she leads the Socially Perceptive Computing Lab. Her research interests are in social signal processing, multi-sensor processing, machine learning, and ubiquitous computing. Specifically, her research focuses on devising novel pattern recognition and machine learning methods to automatically interpret group social and affective behavior during face-to-face human interactions. In 2015, she was awarded the prestigious Dutch Research Foundation (NWO) Career Talent Award for experienced researchers (Vidi). Her research contributions have also been recognized via an invited talk at the ACM Multimedia 2016 Rising Star Session. She was nominated for outstanding paper at ICMI 2011 and was named an outstanding reviewer at ICME 2014.

Chirag Raman is a PhD Candidate in the Socially Perceptive Computing Lab at the Delft University of Technology. His research interests lie at the intersection of multimodal machine learning and social psychology, with an emphasis on anticipating human behavior. Prior to pursuing his PhD, Chirag worked as a Senior Research Engineer at Carnegie Mellon University within the Multicomp Lab. He also has several years of experience in engineering, computer vision, and experience design, gained during his years at Disney Research, ProductionPro, and Walt Disney Parks and Resorts. He completed his Bachelor of Engineering – Information Technology (2010) from the University of Mumbai, India, and his Master of Entertainment Technology (2013) from Carnegie Mellon University, USA.

Ekin Gedik received his bachelor’s (2010) and master’s (2013) degrees from Middle East Technical University, Turkey, and his Ph.D. (2018) from Delft University of Technology, The Netherlands. He is currently a postdoctoral researcher in the Socially Perceptive Computing Lab at Delft University of Technology. His research interests are social signal processing, wearable sensing, affective computing, computational behavioral science, human activity detection, domain adaptation, and transfer learning.

Stephanie Tan is a PhD candidate in the Socially Perceptive Computing Lab at Delft University of Technology. She received her BS (2015) from the California Institute of Technology and her MSc (2017) from Imperial College London. Her research interests are multimodal machine learning, with an emphasis on human pose estimation methods, and social scene analysis using computer vision and social signal processing.

Jose Vargas received his BS (2015) from the University of Costa Rica and his MSc (2017) from Utrecht University in The Netherlands. He is currently a PhD student in the Socially Perceptive Computing Lab at Delft University of Technology. His research interests are computer vision and signal processing of wearable sensor signals, applied to the recognition of short-term social actions, especially gesturing and laughter, from body motion alone. He is also interested in the role of these actions in different aspects of social interaction, especially turn-taking behavior, and in phenomena like interactional synchrony and mimicry.