Mr. President of the Republic of Cyprus,
Mr. President of the Parliament,
Ladies and gentlemen,
As a student, I was only happy when I was solving math problems. Math was more than a set of formulae and functions for me; it was a language that I could express everything in and a world that agreed with me. Enter computational thinking.
Computer science is the ability to simplify difficult problems, and hence solve them more easily, through scientific abstraction. To implement this magic we use genius engineering. I was infatuated with several aspects of computer science – computer communications (networks), hardware (architecture) – and did not really use serious math again for a long time. I finally decided to embark on a data management project, and found a wealth of intellectual and technological challenges.
Nowadays, we hear about data a lot – mostly about its size (“big data”). In businesses and enterprises, in services and commerce, in science and education, data is collected so aggressively that data sizes are overwhelming (we hear about “data deluge”).
But, what is data?
At my grandparents’ house in Acropolis, Nicosia, my brother and I spent the better part of our childhood and adolescent summers with virtually nothing to do all day but read the scarce books and observe our family’s life around us. My grandmother, Kyria Anastasia as per the respectful grocery delivery man, would prepare trahana, shieftalia, and mahallepi and we would play in the sunroom (“iliako”) or fight over her legendary rocking chairs. There was no notion of computers or internet. If we needed information, we opened books or asked other people.
Today, by searching the World Wide Web, we can ask our question, and if someone in the world has previously thought to provide the answer, we will find it. Famous chefs have their recipes available online; airlines have detailed schedules at our fingertips; shops have all their products available for purchase and delivery at our doorstep.
Therefore, data is information.
When I looked for a job in 1995, the web was taking its first steps in Europe. Information was still scarce, as web content was not as rich as it is today. Today we have more data, and we can certainly get more information from it, but is search enough? Ask a question no one else has thought to ask: for example, which Hollywood actors are Obama supporters? Or, is the star I saw in last night’s telescope sky scan the same as the one I observed last week? Or, which US universities promote women in computer science most successfully?
To obtain an answer to these questions one needs to retrieve the needed parts of the data and then manually combine them. For example, to find the Obama supporters who are Hollywood actors we need to look for a list of Hollywood actors and for a list of Obama supporters and find the names these two lists have in common. Anything not explicitly stored as data needs further work to produce; regular search cannot provide that. For combining and filtering data into answers we use data management systems. These software systems implement various filtering tools, which are then combined through an optimization process in order to answer ad-hoc questions.
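As a minimal sketch of that manual combination, assume two small made-up lists; the names are placeholders and not real data, and the helper name names_in_common is purely illustrative. Finding the names the two lists have in common could look like this:

```python
# Hypothetical placeholder data; neither list reflects real people.
hollywood_actors = ["Actor A", "Actor B", "Actor C", "Actor D"]
obama_supporters = ["Actor B", "Person X", "Actor D", "Person Y"]


def names_in_common(first, second):
    """Return the names that appear in both lists, in the order of `first`."""
    second_set = set(second)  # a set makes each membership check cheap
    return [name for name in first if name in second_set]


actors_who_support = names_in_common(hollywood_actors, obama_supporters)
print(actors_who_support)  # ['Actor B', 'Actor D']
```

A data management system performs this kind of combination (a join) automatically, and chooses an efficient way to do it, rather than leaving it to the person asking.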
Data management software supports decisions in all of today’s businesses; it reads all collected data ahead of time and answers all possible questions about it. Knowledge from answering user questions drives decisions for business operation. It is the same for science: we prepare all of our observation (e.g., astronomy) and simulation (e.g., earthquake prediction) data, and then scientists can ask questions combining all this data, proving or disproving hypotheses, thinking of more questions, and discovering new scientific truths.
Information is therefore the knowledge that comes from processing our data. But, what is data?
Data is anything you can store and process later to generate information.
In the second year of my Ph.D. at UW-Madison, I published a paper which caught the attention of the late Jim Gray, the biggest name in data management to date (among his numerous innovations, he defined transactions as we know them from banking and shopping). I worked by his side during a cold summer in San Francisco in 1999 (although, after three years in Wisconsin, it wasn’t the coldest winter I’d ever spent – apologies, Mark Twain). We helped astronomers from the Sloan Digital Sky Survey (SDSS) to organize their data into the Microsoft SQL Server database system and then use the system’s programming language to ask questions such as, “find the fastest galaxies in the last six months”. The SDSS telescope fed databases with millions of objects from scanning the night sky. More than ten years later, the new Large Synoptic Survey Telescope (LSST) reads deeper into the sky: astronomy researcher Andreas Wicenec says that “The LSST project expects to detect more than 100 billion objects in one year, which is at least 10 times more than we’ve observed in the last 400 years of astronomy.” How can we store this data?
When I came to Switzerland I visited and worked with scientists from the European Organization for Nuclear Research (CERN). They produce tremendous amounts of data daily using the largest microscope on Earth: the Large Hadron Collider (LHC). The LHC runs with proton-proton bunch crossings, recording forty million events per second. These are reduced to about 500 events per second, which are saved “permanently” to tape and processed fully for subsequent analysis. Of course, filtering is done through advanced techniques such as classification, and keeping only the classes of results we know we want is important. Nevertheless, it would be nice to be able to save more information.
The problem is omnipresent – we cannot store all the data we produce. What’s more, businesses today use only 10-12% of all the data they collect and keep. We throw data away, because we lack the resources needed to keep all of it around, and we invest in storing data we will never use to produce information. This is the worst ROI ever.
Our data management systems are built on investing in all the data, interesting or not, and on anticipating the questions ahead of time, so we can prepare the processing tools; both are impossible today. Although data can be predictable, as it is produced by machines, there is too much data to know which will be interesting and for what. Worse, we cannot know what questions will arrive, as the possibilities are endless and the human mind is unpredictable. As a result, data management tools are really difficult to use with today’s huge data quantities: we have created glorified search engines. By preparing all data and anticipating the questions, we hit a glass ceiling.
In my 20 years working as a data management professor, I have been blessed with brilliant students and scientific collaborators. Together, we have been building technologies to adapt data management systems to emerging hardware platforms and to demanding scientific applications. Since 2010, we have invented a new data management paradigm which marks the new generation of systems. Our new systems do not need any prior knowledge of the data – hence, no resources are spent preparing it for questions, and no questions are anticipated. Instead of building tools and loading data ahead of time, we define data and operations mathematically. When the questions come, we fetch the data necessary and automatically generate the software needed to produce answers on the fly. The technology is based on detailed mathematical modeling of the data, which wastes no resources ahead of time, and on code generation technology, which allows us to create the tools only when we know what work needs to be done. The result is extremely efficient data processing with no wasted resources.
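To make the idea concrete, here is a toy sketch and not our actual system. It assumes a tiny made-up text file of galaxy speeds, and the names RAW_CSV, generate_query and read_rows are invented purely for illustration. Nothing is loaded or prepared in advance; when a question arrives, the code for that one question is generated and the raw data is read only then:

```python
import csv
import io

# Hypothetical raw data, kept in its original text form (nothing is loaded ahead of time).
RAW_CSV = "name,speed\nG1,120\nG2,450\nG3,300\n"

ALLOWED_OPS = {"<", "<=", ">", ">=", "=="}


def generate_query(column, op, threshold):
    """Generate, at question time, a function specialised for one numeric filter."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    # Build the source code of the specialised function as text.
    source = (
        "def query(rows):\n"
        f"    return [r['name'] for r in rows if float(r[{column!r}]) {op} {float(threshold)!r}]\n"
    )
    namespace = {}
    exec(source, namespace)  # turn the generated source into a callable
    return namespace["query"]


def read_rows(raw_text):
    """Parse the raw text lazily, only when a question actually needs it."""
    return csv.DictReader(io.StringIO(raw_text))


fast_galaxies = generate_query("speed", ">", 250)
print(fast_galaxies(read_rows(RAW_CSV)))  # ['G2', 'G3']
```

A real system generates far more specialised code and reads only the parts of the raw data each question touches; the sketch shows only the principle of doing the work when the question is known.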
Data is, therefore, anything that can be defined mathematically.
When I was 27 years old, I wanted to change my life. I was thinking of going to the USA for a Ph.D. (although I knew what a Ph.D. was only approximately), but I knew it would take several years – in those days, a woman at 27 was not exactly what people called “young”. At this important crossroads of my life, I called my godmother, an amazing woman who had to abandon her own dream of a Ph.D. to raise a family and pursued it again, successfully, twenty years later. She told me to leave immediately and do my Ph.D. I said, “But I will only finish when I am thirty-two!” And she said, “My dear, this will happen anyway, and it is your decision to be thirty-two with a Ph.D. or without a Ph.D.” My family sent me off to the USA with a one-way ticket and their love, and this love I will forever carry with me.
I am grateful to the many people in my life who supported me in pursuing my crazy dreams, to Cyprus where I was born and to Greece where I grew up, and to the Nemitsas Foundation for supporting me and for honoring me with this award today. Thank you, from the bottom of my heart.