Another year has gone by and early summer is the time for one of the most exciting data management conferences of the season - SIGMOD/PODS.
If for some reason you wandered through the web and landed here by chance:
SIGMOD/PODS is the leading international ACM conference covering the area of database management and data engineering.
We are lucky to have had the opportunity to attend most SIGMODs of the past 4 years (thank you Anastasia). And this year's SIGMOD actually ranks close to the top. (Matt’s No1 pick SIGMOD@Snowbird)
But let’s take it in a more structured way. This year's SIGMOD took place in the city center of Chicago, a wonderful (and windy!) classic American city (find all the procedural stuff here). Chicago was a last-minute decision (find here why), thus the organizers had limited options for venue. Eventually, they chose the Chicago Hilton which, in our opinion, was underwhelming.
However, (thankfully) this had no impact on the quality and innovation of the work presented. The work was interesting and well presented. The attendance was high and it was great to meet some of the most well-known scientists and fellow students. The conference had many super-interesting sessions and workshops along with several enlightening keynotes on future challenges of transaction processing systems and approximate query processing.
Amongst the multitude of papers, however, we recognized a few patterns in where current research is heading. As always, novel hardware sparks interest in the community, with the main drivers being the prospective advent of non-volatile memory and the increasing use of GP-GPUs and FPGAs. Anastasia Ailamaki and Andy Pavlo, during their respective keynotes, discussed the progress of traditional research on transaction processing. Andy concentrated on the need for academic research to turn to real-life problems rather than make assumptions that do not necessarily hold, and Anastasia concluded with the increasing popularity of hybrid transaction and analytical workloads (HTAP) and how we should treat them efficiently. During the second keynote session, Surajit Chaudhuri, Barzan Mozafari, and Tim Kraska discussed the reincarnation of the need for approximation due to ever-increasing datasets, and how we can achieve it without requiring DB users to get a Ph.D. in probability and statistics. Finally, machine learning is here to stay, and there has been an increasing trend of using machine learning techniques to improve database operations. Being a proper DB conference, SIGMOD also had its share of traditional DB research, where some interesting contributions in transaction processing and query optimization stood out.
Let’s take these trends one by one:
When talking about new hardware, an increasingly influential venue is Damon (one of the workshops co-located with SIGMOD). We are always interested in it as it has very interesting publications (e.g., http://dl.acm.org/citation.cfm?id=1995446 , http://dl.acm.org/citation.cfm?id=2619229 ). This year Damon started with a super-interesting keynote on high-performance computing and machine learning by Dr. Eng Lim Goh from HPE. Dr. Goh shared his experiences and vision on the opportunities and challenges of running large-scale machine learning tasks in HPC environments. The paper presentations covered a wide range of interesting topics on analyzing and using cutting-edge technologies for solving database problems.
There were six full paper presentations and a flash poster session for the short papers. One of the presentations showed an analysis of the power consumption of ever-growing server memory for both transactional and analytical workloads. The study is particularly interesting as it uses a custom-built power measurement apparatus that directly measures the memory power consumption, rather than relying on power measurement counters. There were several talks on GPUs, ranging from analyzing the complete micro-architectural behavior of GPUs, to dealing with the random data access problem on GPUs, to a lightweight GPU decompression library. The study examining random data access patterns on GPUs focused particularly on the TLB structures of GPUs, and proposed TLB-conscious algorithms for several database operators.
We find the study particularly interesting in that it deals with the random data access problem on GPUs. As OLTP workloads also include a large number of random data accesses, the study can have an impact on designing OLTP engines on GPUs. Another GPU study presented a lightweight decompression algorithm for GPU processing. We also saw a study on specialized hardware for analytical database operators, and a study on a high-level database system design targeted at heterogeneous computing environments. With dark silicon down the road, energy-efficient specialized hardware is a popular topic in computer architecture research. We find it exciting to have papers in our Damon workshop that tackle inter-disciplinary problems from different perspectives. Another study presented an analysis of NVRAM on a particular database system, Google’s LevelDB. As NVRAM is getting more and more popular, the study presents one of the earliest results on using NVRAM as a storage layer below DRAM.
Another study presented a Petri net-based core allocation mechanism for alleviating NUMA effects in multi-core database operations. The study is interesting in that its proposed mechanism does not require modifying the implementation of the database operation, i.e., it is non-intrusive. Last but not least, we saw an interesting use of vectorization to accelerate creating and querying a particular index type, column imprints. Damon ended with an impressive keynote on GPU architectures by Nikolay Sakharnykh from Nvidia. Nikolay presented the advancements in GPU architectures over recent years and the developments in the evolving GPU programming models. Overall, we found this year’s Damon super-interesting, with super-useful system analysis works as well as impressive hardware/software designs for solving database problems.
Following a line of research similar to that presented at Damon, we saw several papers on new hardware and two tutorials on non-volatile memory (NVRAM) in this year’s SIGMOD. There were two studies using FPGAs to accelerate certain database operations such as partitioning and string matching. Similarly, there were two studies using GPUs to accelerate database operations. While one of these studies was on accelerating hybrid radix sort on GPUs, the other was about a top-down templating methodology providing a high-level framework for CPU-GPU co-processing. Lastly, we saw a study on optimizing database performance on Flash storage, particularly targeting the low write performance of Flash storage. As for the tutorials, the first one on NVRAM covered the design of the entire database system stack in light of NVRAM as the middle layer between DRAM and hard disk, whereas the second went deep into the technical challenges of using NVRAM as a global memory, and presented the design and implementation of NVRAM-based algorithms and data structures. Overall, we found the tutorials and the papers on new hardware very challenging and informative in this year’s SIGMOD.
Hybrid Transaction and analytic workloads (HTAP)
It is becoming apparent from a number of talks, including a keynote (A. Ailamaki) and a very interesting tutorial by IBM (Fatma Ozcan / Pinar Tozun / Yuanyuan Tian), that database workloads in multiple research and industrial projects require the functionality of both transaction and analytic processing. We have to shed the bottleneck that is the ETL process and be able to efficiently execute queries that always run on fresh data.
HTAP transactions may contain a set of insert/update/delete statements along with complex OLAP queries which need to be executed on the same dataset.
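To make the shape of such a mixed transaction concrete, here is a toy sketch using Python's built-in sqlite3 module. SQLite is of course not an HTAP engine, and the orders table, its columns, and the function names are made up for illustration. The sketch only shows writes and an analytical aggregate running inside one transaction over the same data, without an ETL step in between.

```python
import sqlite3


def make_db():
    """Create a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 20.0), ("alice", 2.0), ("bob", 30.0)],
    )
    conn.commit()
    return conn


def run_mixed_transaction(conn):
    """Run insert/update/delete statements and an aggregate in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (customer, amount) VALUES (?, ?)", ("carol", 40.0)
        )
        conn.execute(
            "UPDATE orders SET amount = amount + 5 WHERE customer = ?", ("alice",)
        )
        conn.execute("DELETE FROM orders WHERE amount < ?", (10.0,))
        # The analytical part sees the writes made earlier in this transaction.
        rows = conn.execute(
            "SELECT customer, SUM(amount) FROM orders "
            "GROUP BY customer ORDER BY customer"
        ).fetchall()
    return rows
```

Calling run_mixed_transaction(make_db()) returns the per-customer totals computed over the freshly modified rows. A real HTAP system has to do this at scale and under concurrency, which is where the architectural differences discussed in the tutorial come in.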
A number of solutions to this problem have been created with different architectures; however, it seems that the main driver for the current trend is the progress in modern hardware, such as multi-core processors, advances in memory technology, and the advent of general-purpose co-processors. As seen in the tutorial, most existing systems combine multiple products in innovative ways to support the functionality, and thus each has pros and cons. Being aware of novel research efforts that inherently support HTAP workloads (Caldera), we look forward to the new systems that will come up.
The ever-growing gap between computing power and increasing data sizes has led the community to search for alternative approaches to gain performance. In the late 1990s and 2000s, Approximate Query Processing (AQP) was on the rise with systems such as AQUA and STRAT, and with Online Aggregation. In 2017, we face the same problems in a slightly different context. The increasing data sizes do not allow analyzing the data prior to query execution, which is a prerequisite for these earlier approaches, and this increases the complexity of the problem. This has led the community to build on different approaches. Surajit discussed the need for efficient online approximation, introducing approximation operators that allow sampling efficiently over joins. Tim Kraska advocated a more intrusive approach where the application is built to encapsulate approximation. The application in question is exploratory analysis, where a user incrementally constructs queries, thus giving the system enough time to build samples. Barzan, during his keynote, shared a number of insights from his commercial effort, SnappyData. Apart from the keynote, there was a very interesting tutorial on sampling techniques by Ke Yi, which gave a great overview of sampling and some of its mathematical background.
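For readers new to the topic, the basic idea behind sampling-based approximation can be sketched in a few lines of Python. This is a textbook uniform-sampling estimator of a SUM with a normal-approximation confidence interval, not the method of any of the systems or talks mentioned above. The function name and parameters are our own.

```python
import math
import random
import statistics


def approximate_sum(values, sample_size, seed=0):
    """Estimate sum(values) from a uniform sample drawn without replacement.

    Returns (estimate, half_width), where half_width is the half-width of an
    approximate 95% confidence interval (normal approximation).
    """
    rng = random.Random(seed)
    n = len(values)
    sample = rng.sample(values, sample_size)

    mean = statistics.fmean(sample)
    stdev = statistics.stdev(sample)  # sample standard deviation

    # Scale the sample mean up to the full population.
    estimate = n * mean

    # Finite population correction: the error shrinks to zero as the
    # sample approaches the whole dataset.
    fpc = math.sqrt((n - sample_size) / (n - 1))
    half_width = 1.96 * n * stdev / math.sqrt(sample_size) * fpc
    return estimate, half_width
```

The answer comes back as an estimate plus an error bound, and the user (or the system on their behalf) trades sample size for accuracy. The hard parts discussed in the keynotes, such as sampling over joins or skewed group-bys, are exactly where this simple estimator stops being enough.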
Another hot topic of this year’s SIGMOD was the integration of data management and machine learning systems. We saw several papers and two tutorials on machine learning. The first tutorial covered both aspects of the intersection of data management and machine learning: using machine learning in data systems, and using data management techniques to address machine learning challenges. The second tutorial focused on handling data management challenges in machine learning systems, such as understanding, cleaning, and validating the data. The papers, in turn, covered a set of interesting topics. We saw a study proposing a schema-independent relational learning algorithm that learns novel relations from the existing relations in the database. We also saw a study proposing a declarative language, BUDS, for distributed machine learning algorithms. In a similar vein, another study took a step towards a declarative language for machine learning by proposing a cost-based gradient descent optimizer that selects the best gradient descent algorithm for a given machine learning task. Last but not least, we saw a study improving the performance of kernel density estimation when it is used to classify points into low- and high-density regions. Overall, we found these studies an exciting step towards integrating machine learning and database systems challenges.
Traditional DBs: (QO/Transactions)
Another exciting topic of this year’s SIGMOD was query optimization. The optimization works covered a wide range of topics. One line of research explored optimization opportunities for a particular query type and/or query operator. We saw a study comparing scan and index probe operations on column stores by taking both concurrency and selectivity into account. Another study optimized disjunctive queries on column stores by taking branch misprediction cost into account. We also saw a study presenting a framework for optimizing iceberg queries, i.e., queries that process large amounts of data and return only the subset that passes a certain threshold. Last but not least, we saw a study focusing on efficiently handling updates for real-time analytics systems. Another line of research proposed general query optimization frameworks. We saw a study converting the join ordering problem into a mixed integer linear program, a study proposing a combinator-based nested relational algebra improving query optimization and compilation, a study designing a reuse-aware query optimizer that takes advantage of already materialized internal data structures for upcoming queries, and a study proposing a novel online parameterized query optimization (PQO) technique satisfying the three major requirements of online PQO, i.e., bounded cost, low optimization overhead, and a small number of stored plans.
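In case the term is unfamiliar, an iceberg query can be illustrated with a naive Python baseline: aggregate over everything, then keep only the groups above a threshold. This is just the definition spelled out, with made-up names, and not the optimization framework from the paper.

```python
from collections import Counter


def iceberg_query(rows, key, threshold):
    """Return {group: count} for groups with at least `threshold` rows.

    Naive baseline: every group is counted in full, even though most
    groups are discarded by the threshold at the end.
    """
    counts = Counter(row[key] for row in rows)
    return {group: n for group, n in counts.items() if n >= threshold}
```

The waste is visible right in the code: all groups are fully aggregated although only the "tip of the iceberg" is returned, which is what makes these queries an attractive optimization target.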
We also saw several studies focusing on the predictability of database system performance. One such study focused on the major sources of performance variance in a database system, and proposed a novel algorithm, variance-aware transaction scheduling, to mitigate it. Similarly, another study focused on auto-tuning database systems based on machine learning techniques. Last but not least, we saw a comprehensive experimental comparison of bitmap and inverted list compression, providing many useful insights for designing and developing compression algorithms. Overall, we found these studies particularly interesting as the hardware and software landscape of database systems is getting increasingly heterogeneous, and performance predictability can be helpful in dealing with this heterogeneity, which inherently contributes to performance variance.
Overall, we found this year’s SIGMOD very exciting and informative. We learned and discovered a lot. We also liked the venue, enjoyed the nice neighborhood and the food of Chicago.
If this article looks interesting ping us 😉
Utku & Matt
Our blog post sounded far too positive, thus we have to add a couple of negative comments:
1) During most of the talks (or at least the ones that interested me) either I was standing or there was a commotion as people tried to find a seat mid-talk. Maybe it is time to use the amazing wonders of computer science and predict the audience size at a given talk.
2) During this instance of SIGMOD I constantly had the question “why do we attend a conference talk”. Is it to understand the core ideas of the paper? Is it to motivate us to read the paper?
During most talks, I was left on a cliffhanger with a trillion unanswered questions. Should the talks have 5 more minutes to go into more depth? Or have longer Q&As?