Google is a pioneer in providing cloud-based services to billions of users and using their data to further enhance user experiences. As a public cloud infrastructure, Google Cloud Platform offers a host of services for businesses, data analytics being an important part of the offerings. The platform has a wide variety of data analytics and management tools and services that can be enhanced by integrating Google’s Artificial Intelligence (AI) and Machine Learning (ML) solutions to glean real-time insights and intelligence.
Understanding the Data Lifecycle for Better Data Management
As the mathematician Clive Humby put it, “data is the new oil”. With the digitization of businesses and the growth of IoT devices, the volume of data has exploded, and making strategic use of big data is now critical for businesses. Big data changes the way a business runs, but to be used successfully it needs to be collected, linked, and structured so that it becomes a decision-making tool. Such data can boost analytics capabilities and be integrated into the strategy, operations, and culture of a business. The data lifecycle on Google Cloud Platform starts with data in its raw form and lets data scientists, developers, and business executives store it, process it, extract insights from it, and make data-driven decisions with the help of its services.
Figure: Data Transformed into Actions on Google Cloud
Data Ingestion
The first step in managing data with Google Cloud services is ingestion, that is, collecting raw data from multiple sources such as:
- App: Data generated by apps and services including clickstream data, event logs, e-commerce transactions, and social network interactions.
- Streaming: Data points from Internet of Things (IoT) devices, and user events and analytics from mobile apps.
- Batch: Bulk data stored in files such as JSON, CSV, or Parquet, or in NoSQL or relational databases. These large datasets may be located on-premises or on other cloud platforms and require high bandwidth to transfer.
Google Cloud services used during the data ingestion stage include:
- Compute Engine: For running virtual machines for computing and hosting
- App Engine: Fully managed serverless platform
- Google Kubernetes Engine: For container management
- Pub/Sub: For real-time messaging between apps
- Cloud Logging: For collecting log data
- Cloud Storage Transfer Service: For managing the transfer of data
- BigQuery Data Transfer Service: For automating data movement
- Transfer Appliance: High capacity storage server
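To make the app and streaming cases more concrete, the following is a minimal sketch of how an event might be prepared before it is handed to a messaging service such as Pub/Sub, whose messages carry a byte payload plus optional string attributes. It uses only the Python standard library and does not call any Google Cloud client; the function name `build_message`, the attribute names, and the sample event are all assumptions made for illustration.

```python
import json
from datetime import datetime, timezone


def build_message(event: dict, source: str) -> tuple[bytes, dict[str, str]]:
    """Serialize an event into a (payload, attributes) pair.

    The payload is UTF-8 encoded JSON. The attributes are plain strings
    that a subscriber could use for filtering or routing.
    """
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    attributes = {
        "source": source,  # hypothetical attribute name
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    return payload, attributes


# Hypothetical clickstream event generated by a web app.
click = {"user_id": "u-123", "page": "/pricing", "action": "click"}
data, attrs = build_message(click, source="app")

print(data)             # JSON bytes ready to be published
print(attrs["source"])  # the ingestion source label
```

In a real pipeline the returned payload and attributes would be passed to whichever publishing client is in use. Keeping serialization in its own function makes it easy to test without network access.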
Data Storage
After the data has been ingested, it needs to be saved in formats and locations that make it easily accessible to other services. Google Cloud services that are helpful for storing data include:
- Cloud Storage: Managed storage for both structured and unstructured data
- Cloud SQL: Fully managed RDBMS with both MySQL and PostgreSQL engines
- Datastore: NoSQL database
- Cloud Bigtable: Fully managed NoSQL database for large workloads
- Cloud Spanner: Fully managed, horizontally scalable relational database service
- BigQuery: Managed data warehouse
- Cloud Storage for Firebase: Scalable object storage service for Firebase apps
- Cloud Firestore: Flexible NoSQL database storing JSON data
Data Processing and Analyzing
To extract insights and intelligence from the captured data, it must be processed and analyzed. Data from the source is normalized, cleaned, processed, and saved in analytical systems where it can be queried and explored. Depending on the analytics results, this data can be used for training and testing automated Machine Learning models. Google Cloud services that are used during the big data processing stage include:
- Cloud Dataproc: A fully managed service for Apache Spark, Apache Hadoop, and other open-source tools and frameworks. It can create and resize clusters and run workloads such as log processing, reporting, and Machine Learning. Dataproc can also read and write data natively in BigQuery, Cloud Bigtable, and Cloud Storage.
- Dataflow: A fully managed serverless service that unifies programming and execution models, simplifying big data processing for both batch and streaming data. It creates and autoscales resources on demand and does not require cluster sizes to be specified. It can be used as a pre-processing pipeline for Machine Learning models.
- Cloud Dataprep: This service offers a visual interface for exploring, cleaning, and preparing data for analysis. With Dataflow, it can scale automatically to deal with datasets of any size. Furthermore, its integration with other Google Cloud services such as Cloud Storage and BigQuery allows it to process data irrespective of its location. Dataprep is especially helpful in Machine Learning and analytics.
- BigQuery: A fully managed data warehouse that is also used in the storage stage of the data lifecycle on Google Cloud Platform. It supports complex schemas and SQL and is both highly scalable and highly distributed. BigQuery can be used for user analysis, device and operational metrics, and business intelligence.
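The normalize, clean, and analyze steps described at the start of this section can be sketched in a few lines of plain Python. This is only an illustration of the idea under stated assumptions, not the API of Dataflow, Dataproc, or BigQuery: the column names, the sample readings, and the helper functions `clean_rows` and `average_by_device` are hypothetical.

```python
import csv
import io

# Hypothetical raw sensor export with typical quality problems:
# stray whitespace, inconsistent casing, missing and invalid readings.
RAW_CSV = """device_id,temperature_c,recorded_at
 sensor-1 ,21.5,2024-01-01T00:00:00Z
sensor-2,,2024-01-01T00:00:05Z
SENSOR-1,22.5,2024-01-01T00:01:00Z
sensor-3,not-a-number,2024-01-01T00:01:10Z
"""


def clean_rows(raw_text):
    """Yield normalized rows, skipping any with a missing or invalid reading."""
    for row in csv.DictReader(io.StringIO(raw_text)):
        device = row["device_id"].strip().lower()  # normalize identifiers
        try:
            temperature = float(row["temperature_c"])
        except ValueError:
            continue  # drop rows that cannot be parsed as a number
        yield {
            "device_id": device,
            "temperature_c": temperature,
            "recorded_at": row["recorded_at"],
        }


def average_by_device(rows):
    """Aggregate cleaned rows into an average temperature per device."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["device_id"], (0.0, 0))
        totals[row["device_id"]] = (total + row["temperature_c"], count + 1)
    return {device: total / count for device, (total, count) in totals.items()}


cleaned = list(clean_rows(RAW_CSV))
averages = average_by_device(cleaned)
print(len(cleaned))  # number of rows that survived cleaning
print(averages)      # average reading per normalized device id
```

A managed service applies the same pattern at much larger scale, with the cleaning and aggregation steps distributed across workers instead of running in a single process.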
Explore and Visualize
The final stage of the data lifecycle involves detailed data exploration and visualization for extracting insights that can be used to make informed decisions.
- Datalab: This web-based interactive tool can be used for exploration, analysis, and visualization of data. Python programs can be written and executed in interactive web-based notebooks to process and visualize data. These notebooks can be shared with collaborators and even published on code-sharing websites such as GitHub.
- Looker: A platform offering tools to enhance data experiences from embedded analytics to modern business intelligence, custom data apps, and workflow integration.
- Data Studio: Offers interactive dashboards and reports that present live data visually in the form of charts and graphs. These can be shared with collaborators, who can further customize the data visualization using interactive controls. Data Studio can fetch data from other Google services such as Cloud SQL, BigQuery, and Google Sheets to create reports and dashboards.
- BigQuery BI Engine: A fast, in-memory analysis service for BigQuery that enables interactive analysis of large and complex datasets with sub-second query response times and a higher level of concurrency. BI Engine integrates seamlessly with Data Studio, making it possible to create rich, interactive reports and dashboards with minimal effort.
- Sheets: Sheets can be used for visualizing data in spreadsheets. Because it integrates with BigQuery, it becomes a powerful tool for analyzing billions of rows of BigQuery data without requiring SQL. BigQuery queries and data can be embedded into Sheets, and the results can be exported to CSV files to create smaller datasets if required. Native spreadsheet features such as charts, pivot tables, and formulas can also be used.
- Data Catalog: A fully managed metadata management service that scales on demand and helps users quickly discover, manage, and understand data assets. Data Catalog uses Google search technology and a simple search interface to provide a comprehensive view of data assets. Its integration with Cloud Data Loss Prevention and Cloud Identity and Access Management (IAM) helps ensure compliance and security.
Orchestration Layers and Workflows
Orchestration is necessary to integrate all elements of the data lifecycle, whose workflows can range from simple to highly complex. Orchestration layers help start and stop tasks, provide dashboards for monitoring processing, and even copy files and send notifications. Apart from custom orchestration apps, Cloud Composer, a fully managed workflow orchestration service, can be used for authoring, scheduling, and monitoring pipelines across clouds. Its built-in integration with various Google Cloud services, including BigQuery, Cloud Storage, Datastore, Dataflow, Dataproc, Pub/Sub, and AI Platform, makes Cloud Composer a powerful service for orchestration on Google Cloud.
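At its core, an orchestration layer runs tasks in an order that respects their dependencies. Cloud Composer is built on Apache Airflow, where such workflows are expressed as directed acyclic graphs; the sketch below illustrates the same idea using only Python's standard `graphlib` module (Python 3.9 or later) and is not Airflow's API. The task names, the `PIPELINE` mapping, and the `run_pipeline` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical lifecycle tasks; each maps to the set of tasks it depends on.
PIPELINE = {
    "ingest": set(),
    "store": {"ingest"},
    "process": {"store"},
    "analyze": {"process"},
    "visualize": {"analyze"},
    "notify": {"analyze"},
}


def run_pipeline(dependencies, actions):
    """Run each task's action once all of its dependencies have completed."""
    completed = []
    for task in TopologicalSorter(dependencies).static_order():
        actions[task]()  # a real orchestrator would also retry and monitor
        completed.append(task)
    return completed


# Stand-in actions that only record that they ran.
log = []
actions = {name: (lambda name=name: log.append(f"ran {name}")) for name in PIPELINE}

completed = run_pipeline(PIPELINE, actions)
print(completed)
```

The relative order of independent tasks such as `visualize` and `notify` is not significant here; the only guarantee is that every task runs after the tasks it depends on. A managed orchestrator adds scheduling, retries, monitoring dashboards, and notifications on top of this ordering.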
Conclusion
While Google Cloud Platform trails Amazon Web Services and Microsoft Azure in cloud computing market share, it is gradually making progress. When it comes to big data and analytics, however, Google has a proven track record of managing data efficiently and intelligently in its own popular services. Because businesses get access to the very same infrastructure and services that Google uses internally, Google Cloud Platform is a suitable choice for data management and analytics.