DIAS post-event discussions

SIGMOD/PODS 2017

Another year has gone by, and early summer brings one of the most exciting data management conferences of the season: SIGMOD/PODS.

If for some reason you wandered through the web and landed here by chance:

SIGMOD/PODS is the leading international ACM conference in the area of database management and data engineering.

We are lucky to have had the opportunity to attend most SIGMODs of the past 4 years (thank you, Anastasia), and this year's SIGMOD actually ranks close to the top (Matt's No. 1 pick remains SIGMOD@Snowbird).

But let's go through it in a more structured way. This year's SIGMOD took place in the city center of Chicago, a wonderful (and windy!) classic American city (find all the procedural details here). Chicago was a last-minute decision (find out why here), so the organizers had limited options for the venue. Eventually, they chose the Chicago Hilton which, in our opinion, was underwhelming.

However, (thankfully) this had no impact on the quality and innovation of the work presented. The talks were interesting and well delivered. Attendance was high, and it was great to meet some of the most well-known scientists and fellow students. The conference had many super-interesting sessions and workshops, along with several enlightening keynotes on the future challenges of transaction processing systems and approximate query processing.

Among the multitude of papers, however, we recognized a few patterns in where current research is heading. As always, novel hardware spawns interest in the community, with the prospective advent of non-volatile memory as the main driver, along with the increasing use of GP-GPUs and FPGAs. Anastasia Ailamaki and Andy Pavlo, during their respective keynotes, discussed the progress of traditional research on transaction processing. Andy concentrated on the need for academic research to turn to real life for problems, rather than make assumptions that do not necessarily hold, while Anastasia concluded with the increasing popularity of hybrid transactional and analytical processing (HTAP) workloads and how we should treat them efficiently. During the second keynote session, Surajit Chaudhuri, Barzan Mozafari, and Tim Kraska discussed the reincarnation of the need for approximation due to ever-increasing datasets, and how we can achieve it without requiring DB users to get a Ph.D. in probability and statistics. Finally, machine learning is here to stay, and there has been an increasing trend of using machine learning techniques to improve database operations. Being a proper DB conference, SIGMOD also had its share of traditional DB research, where some interesting contributions in transaction processing and query optimization stood out.

Let’s take these trends one by one:

New Hardware:

When talking about new hardware, an increasingly influential venue is DaMoN (one of the workshops collocated with SIGMOD). We are always interested in it, as it features very interesting publications (e.g., http://dl.acm.org/citation.cfm?id=1995446 and http://dl.acm.org/citation.cfm?id=2619229). This year DaMoN started with a super-interesting keynote on high-performance computing and machine learning by Dr. Eng Lim Goh from HPE. Dr. Goh shared his experiences and vision on the opportunities and challenges of running large-scale machine learning tasks in HPC environments. The paper presentations covered a wide range of interesting topics on analyzing and using cutting-edge technologies to solve database problems.

There were six full paper presentations and a flash poster session for the short papers. One of the presentations showed an analysis of the power consumption of ever-growing server memory for both transactional and analytical workloads. The study is particularly interesting as it uses a custom-built power measurement apparatus that directly measures memory power consumption, rather than relying on power measurement counters. There were several talks on GPUs, ranging from analyzing the complete micro-architectural behavior of GPUs, to dealing with the random data access problem on GPUs, to a lightweight GPU decompression library. The study examining random data access patterns on GPUs focused particularly on the TLB structures of GPUs and proposes a TLB-conscious algorithm for several database operators.

We find that study particularly interesting because it tackles the random data access problem on GPUs. As OLTP workloads also include a large number of random data accesses, the study can have an impact on designing OLTP engines on GPUs. Another GPU study presented a lightweight decompression algorithm for GPU processing. We also saw a study on specialized hardware for analytical database operators, and a study on a high-level database system design targeted at heterogeneous computing environments. With dark silicon down the road, energy-efficient specialized hardware is a popular topic in computer architecture research, and we find it exciting to have papers at DaMoN that approach such inter-disciplinary problems from different perspectives. Another study presented an analysis of NVRAM on a particular database system, Google's LevelDB. As NVRAM is getting more and more popular, this study presents some of the earliest results on using NVRAM as a storage layer below DRAM.

Another study presented a Petri-net-based core allocation mechanism for alleviating NUMA effects in multi-core database operations. The study is interesting in that its proposed mechanism does not require modifying the implementation of the database operation, i.e., it is non-intrusive. Last but not least, we saw an interesting use of vectorization to accelerate creating and querying a particular index type, column imprints (see the sketch below). DaMoN ended with an impressive keynote on GPU architectures by Nikolay Sakharnykh from Nvidia. Nikolay presented the advancements in GPU architectures over recent years and the developments in the evolving GPU programming models. Overall, we found this year's DaMoN super-interesting, with super-useful system analysis works as well as impressive hardware/software designs for solving database problems.
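
To make the column-imprint idea concrete, here is a minimal Python sketch of ours (illustrative only: the paper's contribution is accelerating exactly these bitwise steps with SIMD vectorization, and real imprints use equi-depth bins over cache lines). An imprint is a small bitmask kept per cache-line-sized block of a column, with one bit per value range; a range scan first intersects each imprint with the query's bit pattern and skips every block that cannot contain qualifying values.

```python
import random

BLOCK = 64  # values per cache-line-sized block
BINS = 8    # one imprint bit per value range

def make_imprints(column, lo, hi):
    """One bitmask per block: bit b is set iff the block contains at
    least one value falling into bin b (values assumed in [lo, hi])."""
    width = (hi - lo) / BINS
    imprints = []
    for start in range(0, len(column), BLOCK):
        mask = 0
        for v in column[start:start + BLOCK]:
            mask |= 1 << min(int((v - lo) / width), BINS - 1)
        imprints.append(mask)
    return imprints

def range_scan(column, imprints, lo, hi, q_lo, q_hi):
    """Scan only the blocks whose imprint overlaps the query's bins."""
    width = (hi - lo) / BINS
    b_lo = max(0, min(int((q_lo - lo) / width), BINS - 1))
    b_hi = max(0, min(int((q_hi - lo) / width), BINS - 1))
    query_mask = sum(1 << b for b in range(b_lo, b_hi + 1))
    result = []
    for i, mask in enumerate(imprints):
        if mask & query_mask:  # otherwise the block cannot qualify
            result.extend(v for v in column[i * BLOCK:(i + 1) * BLOCK]
                          if q_lo <= v <= q_hi)
    return result

col = [random.uniform(0, 100) for _ in range(10_000)]
imps = make_imprints(col, 0, 100)
print(len(range_scan(col, imps, 0, 100, 98.5, 100)), "matching values")
```

For selective predicates, most blocks are skipped without ever being read, which is where the scan speedup comes from.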

Following a similar line to the research presented at DaMoN, we saw several papers on new hardware and two tutorials on non-volatile memory (NVRAM) in this year's SIGMOD. There were two studies using FPGAs to accelerate certain database operations, such as partitioning and string matching. Similarly, there were two studies using GPUs to accelerate database operations: one on accelerating hybrid radix sort on GPUs, the other on a top-down templating methodology providing a high-level framework for CPU-GPU co-processing. Lastly, we saw a study on optimizing database performance on Flash storage, particularly targeting its low write performance. As for the tutorials, the first covered the design of the entire database system stack in light of NVRAM as a middle layer between DRAM and hard disk, whereas the second went into the deep technical challenges of using NVRAM as global memory and presented the design and implementation of NVRAM-based algorithms and data structures. Overall, we found the tutorials and the papers on new hardware in this year's SIGMOD challenging and informative.

Hybrid Transactional and Analytical Workloads (HTAP)

It became apparent from a number of talks, including a keynote (A. Ailamaki) and a very interesting tutorial by IBM (Fatma Ozcan, Pinar Tozun, and Yuanyuan Tian), that database workloads in multiple research and industrial projects require the functionality of both transactional and analytical processing. We have to shed the bottleneck of the ETL process and be able to always execute queries on fresh data efficiently.

HTAP transactions may contain a set of insert/update/delete statements along with complex OLAP queries, all of which need to execute on the same dataset, as the toy example below illustrates.
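
As a toy illustration (our own example, using SQLite purely for brevity; no relation to any of the systems presented), short update transactions and an analytical aggregate run against the very same, fresh data, with no ETL step in between:

```python
import sqlite3

# Toy HTAP mix on a single dataset (illustrative only; real HTAP engines
# use far more sophisticated storage and execution designs).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")

# OLTP side: short transactions with inserts/updates.
with db:  # the context manager commits the transaction on success
    db.execute("INSERT INTO orders VALUES (1, 'EU', 120.0)")
    db.execute("INSERT INTO orders VALUES (2, 'US', 80.0)")
    db.execute("UPDATE orders SET amount = 90.0 WHERE id = 2")

# OLAP side: an analytical query over the same, fresh data,
# with no ETL step between the transactional and analytical views.
for region, total in db.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
```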

A number of solutions to this problem have been built around different architectures; however, it seems that the main driver of the current trend is advances in modern hardware, such as multi-core processors, new memory technologies, and the advent of general-purpose co-processors. As seen in the tutorial, most existing systems combine multiple products in innovative ways to support this functionality, and thus each has its pros and cons. Being aware of novel research efforts that support HTAP workloads inherently (Caldera), we look forward to the new systems that will come up.

Approximation

The ever-growing gap between computing power and data sizes is leading the community to search for alternative approaches to gain performance. In the 90's and early 2000's, Approximate Query Processing (AQP) was on the rise with systems such as Online Aggregation, AQUA and, later, STRAT. In 2017, we face the same problems in a slightly different context: the increasing data sizes no longer allow analyzing the data prior to query execution, which was a prerequisite for those earlier approaches, and thus increase the complexity of the problem. This has led the community to build on different approaches. Surajit discussed the need for efficient online approximation, introducing approximation operators that allow efficient sampling over joins. Tim Kraska advocated a more intrusive approach where the application is built to encapsulate approximation; the application in question is exploratory analysis, where a user incrementally constructs queries, giving the system enough time to build samples. Barzan, during his keynote, shared a number of insights from his commercial effort, SnappyData. Apart from the keynotes, there was a very interesting tutorial on sampling techniques by Ke Yi, which gave a great overview of sampling and some of its mathematical background.
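
To give a flavor of the sampling-based approach, here is a minimal online-aggregation sketch of ours (a simplification in the spirit of Online Aggregation, not code from any of the keynotes): tuples are consumed in random order, and after every batch the system reports a running SUM estimate with a CLT-based confidence interval that tightens as more tuples are seen.

```python
import random
import statistics

def online_sum(table, batch=20_000, z=1.96):
    """Stream the table in random order and, after each batch, yield a
    running estimate of SUM with a ~95% confidence interval (in the
    spirit of classic online aggregation; the finite-population
    correction is omitted for simplicity)."""
    n = len(table)
    seen = []
    for i, idx in enumerate(random.sample(range(n), n), 1):
        seen.append(table[idx])
        if i % batch == 0 or i == n:
            est = statistics.fmean(seen) * n  # scale sample mean to SUM
            half = z * statistics.stdev(seen) * n / (i ** 0.5)
            yield i, est, half

table = [random.expovariate(1 / 50) for _ in range(100_000)]
for i, est, half in online_sum(table):
    print(f"after {i:6d} tuples: SUM ~ {est:12.1f} +/- {half:10.1f}")
```

A real engine would also need to push such sampling through joins, which is precisely where the approximation operators Surajit described come in.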

Machine Learning:

Another hot topic of this year's SIGMOD was the integration of data management and machine learning systems, with several papers and two tutorials on machine learning. The first tutorial covered both directions of the intersection: using machine learning in data systems and using data management techniques for machine learning challenges; the second tutorial focused on handling data management challenges in machine learning systems, such as understanding, cleaning and validating the data. The papers covered a set of interesting topics. We saw a study proposing a schema-independent relational learning algorithm that learns novel relations from the existing relations in the database. We also saw a study proposing a declarative language, BUDS, for distributed machine learning algorithms. In a similar vein, another study took a step towards a declarative language for machine learning by proposing a cost-based gradient descent optimizer that selects the best gradient descent algorithm for a given machine learning task. Last but not least, we saw a study improving the performance of kernel density estimation for classifying points into low- and high-density regions (see the sketch below). Overall, we found these studies an exciting step towards integrating machine learning and database systems.
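
As a rough illustration of that last idea, here is our own naive version (not the paper's optimized algorithm): a Gaussian kernel density estimate scores each point by how crowded its neighborhood is, and thresholding that score splits the points into low- and high-density regions.

```python
import math
import random

def kde_score(points, x, bandwidth=1.0):
    """Gaussian kernel density estimate at x. Naive O(n) per query;
    the SIGMOD work is precisely about making such scoring fast."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

random.seed(42)
# A dense cluster around 0 plus sparse, uniform background noise.
data = [random.gauss(0, 1) for _ in range(900)] + \
       [random.uniform(-10, 10) for _ in range(100)]

threshold = 0.05  # density cutoff separating the two regions
labels = ["high" if kde_score(data, x) >= threshold else "low"
          for x in data]
print(labels.count("high"), "of", len(data), "points lie in high-density regions")
```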

Traditional DBs (QO/Transactions):

Another exciting topic of this year's SIGMOD was query optimization, with works covering a wide range of topics. One line of research explored optimization opportunities for a particular query type and/or query operator. We saw a study comparing scan and index-probe operations on column stores, taking both concurrency and selectivity into account. Another study optimized disjunctive queries on column stores by taking branch misprediction cost into account. We also saw a study presenting a framework for optimizing iceberg queries, i.e., queries that filter large amounts of data and return only the small subset passing a certain threshold (a minimal example follows this paragraph). Last but not least, we saw a study focusing on efficiently handling updates in real-time analytics systems. Another line of research proposed general query optimization frameworks. We saw a study converting the join ordering problem into a mixed integer linear program, a study proposing a combinator-based nested relational algebra that improves query optimization and compilation, a study designing a reuse-aware query optimizer that takes advantage of already materialized internal data structures for upcoming queries, and a study proposing a novel online parameterized query optimization (PQO) technique satisfying the three major requirements of online PQO: bounded cost, low optimization overhead, and a small number of stored plans.
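
For readers unfamiliar with the term, here is a minimal iceberg query sketch of ours (a naive plan; the presented framework optimizes such queries far more aggressively): aggregate, then keep only the groups above the threshold, the tip of the iceberg.

```python
from collections import Counter

# Iceberg query: SELECT item, COUNT(*) FROM sales
#                GROUP BY item HAVING COUNT(*) >= threshold
def iceberg_count(rows, threshold):
    """Naive one-pass hash aggregation followed by the HAVING filter."""
    counts = Counter(rows)
    return {item: c for item, c in counts.items() if c >= threshold}

sales = ["apple"] * 5000 + ["pear"] * 3 + ["plum"] * 7000 + ["fig"] * 1
print(iceberg_count(sales, threshold=1000))
# Only 'apple' and 'plum' surface: the tip of the iceberg.
```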

On the other hand, we saw several studies focusing on the predictability of database system performance. One such study examined the major sources of performance variance in database systems and proposed a novel algorithm, variance-aware transaction scheduling, to mitigate it. Similarly, another study focused on auto-tuning database systems using machine learning techniques. Last but not least, we saw a comprehensive experimental comparison of bitmap and inverted list compression, providing many useful insights for designing and developing compression algorithms. Overall, we found these studies particularly interesting: the hardware and software landscape of database systems is getting increasingly heterogeneous, and performance predictability can help in dealing with this heterogeneity, which inherently contributes to performance variance.

Overall, we found this year's SIGMOD very exciting and informative. We learned and discovered a lot, and we enjoyed the nice neighborhood and the food of Chicago.

If this article looks interesting, ping us 😉

Utku & Matt

P.S.:

Our blog post sounded far too positive, so we have to add a couple of negative comments:

1) During most of the talks (or at least the ones that interested me), either I was standing or there was a commotion as people tried to find a seat mid-talk. Maybe it is time to use the amazing wonders of computer science to predict the audience size at a given talk.

2) During this SIGMOD I constantly asked myself: why do we attend a conference talk? Is it to understand the core ideas of the paper? Is it to motivate us to read the paper?

During most talks, I was left on a cliffhanger with a trillion unanswered questions. Should talks be 5 minutes longer to go into more depth? Or have longer Q&As?

Schloss Dagstuhl Seminar: "Robust Performance in Database Query Processing"

Angelos and Tahir attended the Dagstuhl seminar on “Robust Performance in Database Query Processing” from May 28 to June 2, 2017 (https://www.dagstuhl.de/en/program/calendar/semhp/?semnr=17222). Schloss Dagstuhl is a meeting center for computer science research located in Saarland, Germany. Since its foundation in 1990, it has hosted numerous seminars and workshops, bringing together researchers and practitioners from both industry and academia. Each seminar is typically one week long, from Sunday afternoon to Friday noon. Since participants have living and working facilities in the same building, they get the chance to spend the whole week working and living in close quarters. This helps stimulate not only their research, but also their social interaction. The venue is surrounded by a forest, which provides a great recreational opportunity for long hikes in the woods and on nearby walking trails.

The seminar on “Robust Performance in Database Query Processing” is the successor of two older seminars on the same topic, which took place in 2012 and 2010. Organized by Renata Borovica-Gajic, Goetz Graefe and Allison Lee, the goal of the seminar was to come up with ideas for research projects to enable robust performance in database systems. The term “robust performance” was itself controversial; in the end, everyone agreed on an informal definition of “good performance every time”. So the goal of each research project was to reduce, or ideally eliminate, performance disruptions in database systems that may arise for a variety of reasons.

In the seminar that we attended, there were 25 participants, who split into four working groups, focusing on (1) the optimal sequencing of operators in query execution, (2) database updates and associated robustness issues, (3) the parallelization of workloads in the face of severe skew, and (4) the application of machine learning to better understand the performance of database systems during query execution. Each working group was responsible for delivering performance metrics and benchmarks and framing solutions for its problem. Days were split into sessions where people met only within their group, and sessions where all of the participants met together to share their findings, ask questions of other groups and receive feedback on how to proceed. This gave us the opportunity to get a good glimpse of what everybody had been doing and grasp the basic concepts behind their ideas. We were involved in the working groups focusing on parallelization and skew, and on learning to identify “non-robust” behavior, so we will describe our experiences in these two groups.

Angelos was part of the working group that focused on parallelization and skew. The group first approached the issue of having the right benchmark, as well as the appropriate metric, to stress and evaluate the robustness of a database system running parallel joins in the presence of severe skew. The group members, consisting of researchers from both academia and industry, came up with novel ideas on how to assess the robustness of a database system; although the approach primarily focuses on skewed workloads, it can potentially be extended to a more general context. Moreover, they started the specification of a benchmark that can generate data with various forms of skew, together with a concrete workload model that can be used to stress and evaluate the robustness of a system in the presence of those forms of skew. Finally, the group members outlined work in the literature, going back to the 90's, that has addressed the problem of skewed workloads from various perspectives.

Tahir participated in a working group whose goal was to automatically identify queries exhibiting unexpectedly slow performance and fix the underlying reasons for the slow performance. It soon became clear that this was a massive undertaking, since even the definition of a “slow” query was unclear and there was a whole host of reasons why a query might be slow. So, the group focused instead on making slow query performance explainable to users, where a slow query was defined as a query whose performance a user complained about. Based on this definition, the idea was to collect statistics and build performance models for each operator in a query, so that a user could be shown a visual explanation for a slow query. To flesh out this idea further, the group spent substantial amounts of time creating a taxonomy of possible causes of slow performance, coming up with possible benchmarks for experimental validation and reviewing related work in the area of modeling query performance.

Overall, in our opinion, the seminar was a great success. All of the participants were excited about the progress they made during their one week there, sharing their research ideas and trying to provide solutions in an important, high-impact area of database systems. Moreover, social interactions among the participants were stimulating, good-natured and positive, making it a very enjoyable experience overall.

 

by Tahir Azim & Angelos Anadiotis

Schloss Dagstuhl Seminar:" Rack-scale Computing"

Schloss Dagstuhl is a venue in the Saarland area of southwest Germany that specializes in week-long seminars in computer science. The seminars typically start on a Sunday evening and last until the following Friday, with an audience of around 40 participants from the international worlds of academia and industry. The activities are subsidized by the German government, which keeps expenses low; anyone can propose a seminar on a specific topic, and timeslots are usually filled around 1.5 years in advance. A typical seminar program includes lectures and small group discussions with plenty of opportunities for interaction. One feature of the organization worth mentioning is that, during meals, participants sit at randomly assigned places around the tables, which promotes unplanned interactions.

Recently, I had the opportunity to attend a Dagstuhl seminar on “Rack-scale Computing” (http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=15421) that featured a very mixed crowd of hardware and software people and inspired many wide-ranging discussions. The seminar grew out of the discussions at the first rack-scale computing workshop, co-located with the EuroSys conference in 2014. Rack-scale computing is the emerging research area focusing on how to design and program the machines used in data centers. Interestingly, after a week of discussions there was no consensus on what rack-scale systems will look like, but there was a shared sense that they will inspire many exciting research questions.

On the hardware side, the majority of discussions focused on processors, memory and networking. We can expect to see processors with many cores featuring lean out-of-order designs and crossbars for communication. Power limitations are also inspiring renewed interest in accelerators, both fixed-function and programmable (FPGA), and we can expect to see further integration in system-on-chip designs. With compute scaling faster than memory bandwidth and capacity, memory is becoming a bottleneck; in enterprise environments, customers sometimes buy additional CPUs just to benefit from more memory. Memory bandwidth bottlenecks can be avoided by 3D stacking, while capacity can be significantly increased with emerging non-volatile memory technologies, which are expected to bring very high density without converging with traditional storage. On the storage side, 3D-stacked flash and novel magnetic disk technologies will enable a continued increase in storage capacity. How non-volatile memory will be accessed by the processors remains an open question. On the networking side, 100 Gb/s links are already faster than any current software can utilize; hence, silicon photonics interconnects are not expected to bring direct improvements in latency, but they will bring higher bandwidths. Also, networks will move to distributed switching fabrics.

High-bandwidth networks, large main memories and dynamic application requirements are motivating the disaggregation of resources, where compute, memory and storage can be combined on demand to best fit application requirements. HP's “The Machine” and UC Berkeley's Firebox are two early proposals for rack-scale (or datacenter-scale) designs with thousands of cores, large non-volatile memory pools and photonic interconnects.

One of the most interesting topics for me personally was the one on applications. While all participants agreed that we still cannot identify a single “killer-app” for rack-scale hardware platforms, we heard of a diverse range of applications that can benefit from such platforms: data analytics, graph analytics, traditional high-performance computing (HPC) applications, as well as applications that have elastic resource requirements. The operating system in this environment will be decentralized and should support diverse services, including fault tolerance and resource isolation.

Programming models also remain an open question, as rack-scale platforms are likely to combine the worst properties of multicores and distributed systems, which will make programming challenging. One of the main issues is whether to ship data or functions in order to extract locality from the application. Transactions will be a very useful abstraction in the rack-scale context, and they can benefit a great deal from hardware support.

Finally, power efficiency is one of the main goals of rack-scale designs, as it is currently a major problem for datacenter operators. Even though many people have expected ARM64 processors to become a standard in datacenters over the last few years, experience has shown that the transition from current datacenter architectures will be slow. One of the main reasons for the slow adoption of any new technology is the amount of time it takes to rewrite software; large cloud-computing providers, however, are continuously experimenting with new hardware platforms.

Overall, the seminar was a success, and everyone agreed that we learned a lot from each other during the week. The format was well received, although many people wanted a bit more time for discussions in smaller groups. I really enjoyed this event and can strongly recommend it to any fellow computer scientist.

                                                                                                              

by Danica Porobic

Impressions from HPTS 2015

HPTS (High Performance Transaction Systems, http://hpts.ws/) is a series of informal events that brings together a diverse group of database system researchers and practitioners. It started out 30 years ago with a focus on transaction processing systems and evolved to encompass all aspects of large-scale systems. The main attraction of the workshop is its small size (fewer than 100 participants) and the mix of people from industry and academia at all levels of seniority, which results in many lively discussions that often stretch long into the night. The event takes place every odd year in early fall in Asilomar, CA, and spans from Sunday afternoon until Wednesday morning.

This year's event was held from the 27th until the 30th of September. In contrast to previous editions, it didn't feature any panel discussions, but instead devoted more time to presentations, which were of very high quality. Despite the word “transactions” appearing in the title, only one of the long-talk sessions was devoted to transactions, with discussions around high-performance distributed transaction processing, optimizations for flash storage, and testing the correctness of emerging distributed transaction implementations for cloud environments. However, transactions were much more popular in the gong show session, with many talks discussing a variety of system and application aspects. Despite many concerns that we're just revisiting the same old problems solved decades ago, and that our systems provide enough performance for all human-generated (high-value) transactions, participants identified many application areas in need of systems that combine efficient short transactions with other types of data processing, including long updates, complex analytics and machine learning. However, the semantics of transactions in this context, and how to use this information to build efficient systems, remain open problems.

An overarching theme of many presentations this year was the inevitability of moving data management systems to the cloud. In this environment, one needs to treat security, fault tolerance and elasticity as first-class citizens when designing systems. Fine-grained instrumentation and monitoring are essential tools for achieving this, and we heard many talks about the different challenges in achieving predictability and ensuring quality of service in distributed systems. Operational aspects, including deployment, configuration and debugging, remain challenging, but containers and the associated orchestration technology promise to solve many of these issues.

The proliferation of monitoring applications, both in the context of the Internet of Things and in datacenters, has led to renewed interest in streaming applications. In contrast to previous generations of streaming systems, modern systems support a wider variety of analytical computations in real time, which eliminates the need to ingest the data into a data analytics system.

The growing size of data stored in a multitude of different systems is emphasizing the need for integrating data from various sources. One issue that has been challenging for a long time is ensuring data quality, which still requires manual data processing in many domains. It is generally acknowledged that it is not enough to just dump the data into the Hadoop data lake and expect that systems higher up the stack will be able to process it efficiently. In practice, this often means that they need to convert the data to a more suitable format, which creates the same issues as traditional data warehouses. Modern systems take a more dynamic approach by keeping data in situ and either integrating it in a middleware querying layer or even generating query processing pipelines just-in-time for maximum efficiency.

Finally, efficiency was a goal that everyone was aiming at, although it had different meanings in different contexts. In particular, we heard talks about system designs that exploit abundant parallelism and the features of modern processors, such as hardware transactional memory, as well as emerging non-volatile memory. In the distributed systems space, efficiency concerns were mostly about resource utilization, which inspired designs for better storage layouts, the use of code compilation techniques, and fine-grained memory management due to the unpredictability of default garbage collection mechanisms.

Overall, this year's HPTS was a great event with a lot of opportunities for discussions with other researchers and practitioners from both academia and industry, and we're looking forward to the next workshop in 2017.

by Danica Porobic
