My notes for the GCP Data Engineering exam
Mar 14, 2019
- video streaming — use GCE
- real time analysis with windowing — use dataflow
- high throughput low latency — use big table
- cloud function triggered by — http, storage object change, pub/sub, firestore, firebase, stackdriver logging
- GCE live migration — VM transfer, for hardware maintenance, bios upgrade, system change e.g. root partition…
- GCE global resources — VPC, firewall, route, template, images
- regional resources — static external IP, subnet
- zonal resources — vm, disk
- stackdriver logging export to buckets, BQ, pub/sub
- deployment manifest file — YAML
- GAE standard env (runtime function)— reliable under heavy load, rapid scaling, startup in sec, big up/down traffic, can scale to zero instance
- GAE flexible env (docker) — startup in min, consistent traffic, gradually up/down traffic, min instance = 1
- cloud storage cannot append or truncate
- datalab — analyse and transform
- data studio — visualize
- composer = workflow DAG
- Machine learning:
- Precision = TP / (TP + FP): the fraction of positive predictions that are actually positive.
- Recall = TP / (TP + FN): the fraction of actual positives that the model predicted as positive.
- (e.g. 100% precision if you predict 1 positive and it is correct, but there are 10 positives in total in the data, so recall is only 10%; worked through in the sketch after these ML notes)
- accuracy = correct predictions / N, error rate = wrong predictions / N = 1 - accuracy
- feature cross = synthetic feature (construct new features by multiplying existing features, e.g. x3 = x1 × x2; crossing a feature with itself gives x3 = x1²)
- bucketing = value →bin bucket (similar to histogram)
- Your model is underfitting if the accuracy on the validation set is higher than the accuracy on the training set, or if the model performs poorly overall.
- fix underfitting by increasing model complexity and removing dropout
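A tiny worked version of the precision/recall example above, in Python; the dataset size of 100 and the 90 true negatives are assumptions added only to complete the accuracy line:

# 1 predicted positive and it is correct; 10 positives exist in the data
tp, fp, fn = 1, 0, 9
precision = tp / (tp + fp)    # 1.0 -> 100% of positive predictions are right
recall = tp / (tp + fn)       # 0.1 -> only 10% of actual positives are found

# accuracy counts all correct predictions (positive and negative) over N
tn = 90                       # assumed true negatives to complete the example
n = tp + fp + fn + tn         # 100
accuracy = (tp + tn) / n      # 0.91
error_rate = 1 - accuracy     # 0.09
print(precision, recall, accuracy, error_rate)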
- streaming — scalability, fault tolerant, latency, instant insight
- pub/sub — at-least-once delivery, max retention = 7 days
- ackDeadline — the time the subscriber has to acknowledge an outstanding message; once the deadline passes, the message is no longer considered outstanding and Pub/Sub will attempt to redeliver it
- cannot deliver expired messages, or messages published before the subscription was created
- pubsub at least once for every subscription
- PubSubIO deduplicates messages = exactly-once processing
- pull has higher throughput than push
- pubsub seek: replay already-acked messages, or purge (seek forward in time)
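A minimal publish/pull sketch with the google-cloud-pubsub Python client, illustrating at-least-once delivery, ackDeadline, and application-level dedup; the project, topic, and subscription names are placeholders and must already exist:

from google.cloud import pubsub_v1

PROJECT, TOPIC, SUB = "my-project", "my-topic", "my-sub"  # placeholder names

# publish: attach an application-level event_id so subscribers can dedupe,
# because messageId changes if the publisher itself retries the publish
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)
publisher.publish(topic_path, b"payload bytes", event_id="evt-0001").result()

# pull: delivery is at-least-once, so the callback may see the same event twice
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUB)

def callback(message):
    print(message.data, message.attributes.get("event_id"))
    message.ack()  # ack before the subscription's ackDeadline,
                   # otherwise the message is redelivered

streaming_pull = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen briefly, then shut down
except Exception:
    streaming_pull.cancel()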
- Dataflow:
- ParDo — apply a function to each element in parallel (parallel do)
- UDF — can be any language
- combine — aggregate/combine value after map ( similar to reduce in MapReduce)
- accumulation — combine with add value
- DirectRunner — run pipeline in local mode
- DataflowRunner — run pipeline on cloud
- event time — not actual processing time, just data timestamp at event
- watermark — expected arrival time of all msg in the window
- triggered by — the watermark passing the end of the window, event time, processing time at a stage, or data-driven triggers
- windowing — fixed, sliding, session (idle time separation and min gap duration)
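A minimal Beam (Python SDK) sketch of the three windowing strategies; the sample elements, the 60 s window size, and the 10-minute session gap are arbitrary assumptions:

import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:  # DirectRunner: runs locally
    events = (
        p
        | beam.Create([("user1", 10), ("user1", 70), ("user2", 200)])
        # use the second field as the event-time timestamp (seconds)
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
    )
    fixed = events | "fixed" >> beam.WindowInto(window.FixedWindows(60))
    sliding = events | "sliding" >> beam.WindowInto(window.SlidingWindows(60, 15))
    sessions = events | "sessions" >> beam.WindowInto(window.Sessions(10 * 60))
    # e.g. per-key counts within each fixed window
    counts = fixed | beam.combiners.Count.PerKey()

Because a PCollection is immutable, the same events collection can feed all three windowings.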
- A PCollection is immutable, so you can apply multiple transforms to the same one.
- Within a DoFn, always use a try-catch block around activities like parsing data. In the exception block, rather than just logging the issue, send the raw data out as a SideOutput into a storage medium such as BigQuery or Cloud Bigtable, using a String column to store the raw unprocessed data (sketch after the retry note below).
- DoFns should be functions that do not depend on hidden or external state, have no observable side effects, and are deterministic.
- When running in batch mode, bundles including a failing item are retried 4 times. The pipeline will fail completely when a single bundle has failed 4 times. When running in streaming mode, a bundle including a failing item will be retried indefinitely, which may cause your pipeline to permanently stall.
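A minimal sketch (Python SDK) of the try/catch + side-output pattern mentioned above, which keeps bad records out of the endless retry loop by routing them to a dead-letter output; the tag name and sample records are assumptions:

import json
import apache_beam as beam

class ParseJson(beam.DoFn):
    DEAD_LETTER_TAG = "dead_letter"  # hypothetical tag name

    def process(self, element):
        try:
            yield json.loads(element)
        except Exception:
            # rather than just logging, emit the raw record to a side output
            # so it can be written to BigQuery/Bigtable as a raw String column
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER_TAG, element)

with beam.Pipeline() as p:
    parsed, dead_letter = (
        p
        | beam.Create(['{"id": 1}', "not valid json"])
        | beam.ParDo(ParseJson()).with_outputs(
            ParseJson.DEAD_LETTER_TAG, main="parsed")
    )
    # parsed -> normal downstream transforms; dead_letter -> raw-record sink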
- optimizations can include fusing multiple steps or transforms in your pipeline’s execution graph into single steps.
- When you apply a GroupByKey or other aggregating transform, the Dataflow service automatically performs partial combining locally before the main grouping operation.
- For bounded data, the service favors efficiency and will perform as much local combining as possible. For unbounded data, the service favors lower latency and may not perform partial combining (as it may increase latency).
- You can disable autoscaling by explicitly specifying the flag --autoscaling_algorithm=NONE (options sketch below)
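A minimal sketch of passing these service flags to a Beam pipeline; the project, region, bucket, and worker counts are placeholder assumptions:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",       # DirectRunner would run the same code locally
    "--project=my-project",          # placeholder
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    "--autoscaling_algorithm=NONE",  # disable autoscaling
    "--num_workers=5",
    "--max_num_workers=5",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)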
- Dynamic Work Rebalancing feature
- run up to 25 concurrent Dataflow jobs per Google Cloud project
- JSON job requests must be 20 MB in size or smaller
- a job's graph size must not exceed 10 MB
- a maximum of 1000 Compute Engine instances per job.
- For best results, use n1 machine types. Shared-core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement.
- 15 persistent disks per worker instance when running a streaming job
- Dataflow’s Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. Streaming Engine offers smoother, more granular scaling of workers. Streaming Engine works best with smaller worker machine types
- The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend. ( — experiments=shuffle_mode=service)
- Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible virtual machine (VM) instances and regular VMs.
- When you submit a FlexRS job, the Dataflow service places the job into a queue and submits it for execution within 6 hours of job creation. Dataflow finds the best time to start the job within that time window, based on the available capacity and other factors.
- FlexRS jobs have a scheduling delay. Therefore, FlexRS is most suitable for workloads that are not time-critical, such as daily or weekly jobs that can complete within a certain time window.
- Worker VM logs — available through the Logs Viewer or the Dataflow monitoring interface
- When you update a job on the Dataflow service, you replace the existing job with a new job that runs your updated pipeline code. The Dataflow service retains the job name but runs the replacement job with an updated Job ID. Updating requires a transform mapping.
- There are two possible commands you can issue to stop your job: Cancel and Drain.
- Beam's state API models state per key. Inside the ParDo, these state variables can be used to write or update state for the current key, or to read previous state written for that key.
- BigQuery — SELECT COUNT(*) is answered from table metadata rather than a column scan, so it does not consume storage bandwidth
- BQ SQL UNNEST, ARRAY, STRUCT, cross join usage.
- Access array elements with either OFFSET (zero-based indexes) or ORDINAL (one-based indexes); see the client snippet after the query below.
-- double every element of each array
WITH sequences AS
  (SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
   UNION ALL SELECT 2, [2, 4, 8, 16, 32]
   UNION ALL SELECT 3, [5, 10])
SELECT some_numbers,
       ARRAY(SELECT x * 2
             FROM UNNEST(some_numbers) AS x) AS doubled
FROM sequences;

-- return the ids of rows whose array contains the value 2
WITH sequences AS
  (SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
   UNION ALL SELECT 2, [2, 4, 8, 16, 32]
   UNION ALL SELECT 3, [5, 10])
SELECT id AS matching_rows
FROM sequences
WHERE 2 IN UNNEST(sequences.some_numbers)
ORDER BY matching_rows;
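And a minimal OFFSET vs ORDINAL sketch through the BigQuery Python client (assumes application-default credentials; the inline array is just sample data):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT some_numbers,
       some_numbers[OFFSET(1)]  AS second_via_offset,   -- zero-based index
       some_numbers[ORDINAL(1)] AS first_via_ordinal    -- one-based index
FROM UNNEST([STRUCT([0, 1, 1, 2, 3, 5] AS some_numbers)])
"""

for row in client.query(sql).result():
    print(row.some_numbers, row.second_via_offset, row.first_via_ordinal)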
- BigTable — single key lookup
- similar to HBase, Cassandra, DynamoDB; throughput scales linearly with the number of nodes
- no joins, append-only, not transactional; transactions only within a single row
- solve hot spot by — avoid sequential id, reverse key, field promotion (column data →key), salting (construct artificial key, e.g. hash)
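A minimal sketch of those key-design techniques (field promotion, reversed timestamp, salting); the id, timestamp, and bucket count are arbitrary assumptions:

import hashlib

def row_key(sensor_id: str, timestamp_ms: int, salt_buckets: int = 8) -> str:
    """Compose a Bigtable row key that avoids hot-spotting on sequential writes."""
    # field promotion: move a high-cardinality column (sensor_id) into the key
    # reversed timestamp: newest rows no longer pile up on one growing key range
    reversed_ts = 10**13 - timestamp_ms
    # salting: a short hash prefix spreads writes across salt_buckets key ranges
    salt = int(hashlib.sha1(sensor_id.encode()).hexdigest(), 16) % salt_buckets
    return f"{salt}#{sensor_id}#{reversed_ts}"

print(row_key("sensor-42", 1552521600000))  # e.g. '<salt>#sensor-42#8447478400000'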
- wide table — for dense data, bunch of properties
- narrow table — for sparse data, e.g. avoid listing every product for every user; e.g. time series
- row size < 100MB
- sorted string table (SST) key value
- access rights are controlled at the instance level
- access via the HBase client or the cbt CLI
- BigTable/BQ (PB) , datastore (TB), cloud SQL (GB)
- datastore — ACID, no joins, slow writes (~1/sec per entity group) because indexes must be updated; document NoSQL (e.g. user profiles, game state)
- hierarchies, e.g. the user is the root entity and all their orders are child entities
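A minimal sketch of that hierarchy with the google-cloud-datastore client; the kind names, IDs, and properties are placeholders:

from google.cloud import datastore

client = datastore.Client()  # assumes application-default credentials

# the user is the root entity; each order is a child in the same entity group
order_key = client.key("User", "alice", "Order", 1001)
order = datastore.Entity(key=order_key)
order.update({"total": 42.5, "status": "shipped"})
client.put(order)

# ancestor query: strongly consistent fetch of all orders under this user
ancestor = client.key("User", "alice")
orders = list(client.query(kind="Order", ancestor=ancestor).fetch())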
- VPC — default auto subnet is 1 subnet per region.
- VMs on the same VPC can ping each other even when they are in different regions
- a VM with multiple NICs (one per VPC) can ping VMs in the same subnet of each attached VPC
- a VM with multiple NICs cannot ping VMs in other subnets (or regions, for an auto-mode VPC) of a secondary VPC, because only the primary NIC gets a default route
- linux command troubleshoot: ip route
- Use BigQuery for storage. Use Dataproc to run the transformations. ANSI SQL queries require the use of BigQuery.
- composer/airflow: GoogleCloudStorageToBigQueryOperator
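A minimal Composer/Airflow DAG sketch around that operator (Airflow 1.x import path; the bucket, dataset, and schedule are placeholder assumptions):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG(dag_id="gcs_to_bq_daily",
         start_date=datetime(2019, 3, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    load_events = GoogleCloudStorageToBigQueryOperator(
        task_id="load_events",
        bucket="my-bucket",                            # placeholder bucket
        source_objects=["events/{{ ds }}/*.csv"],      # templated daily path
        destination_project_dataset_table="my-project.analytics.events",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )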
- cloud speech to text api: synchronous mode is recommended for short audio files
- duplication in this case could be caused by a publisher-side (client) retry, in which case messageId could be different for the same event
- the client application must include a unique identifier to disambiguate possible duplicates due to push retries
- HIPAA protects healthcare information; COPPA protects children's information (under 13)
- GDPR — data protection and privacy in the EU. It also addresses the transfer of personal data outside the EU and EEA areas. Data breaches must be reported within 72 hours.
- When data is collected, data subjects must be clearly informed about the extent of data collection, the legal basis for processing of personal data, how long data is retained, if data is being transferred to a third-party and/or outside the EU, and any automated decision-making that is made on a solely algorithmic basis.
- DLP scans and discovers sensitive data in Cloud Storage, BigQuery, and Datastore; can automatically mask your data
- Data Catalog: tag and search your data from BigQuery, Pub/Sub, and Cloud Storage
External Links
- lab: https://www.qwiklabs.com/quests/25
- reference: https://medium.com/weareservian/google-cloud-data-engineer-exam-study-guide-9afc80be2ee3
- Dataflow use cases: https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
- https://medium.com/@raigonjolly/dataflow-for-google-cloud-professional-data-exam-9efd59377068
- https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays?hl=zh-tw