My notes for the GCP Data Engineering exam
Mar 14, 2019
- video streaming — use GCE
- real time analysis with windowing — use dataflow
- high throughput low latency — use big table
- cloud function triggered by — http, storage object change, pub/sub, firestore, firebase, stackdriver logging
- GCE live migration — VM transfer, for hardware maintenance, bios upgrade, system change e.g. root partition…
- GCE global resources — VPC, firewall, route, template, images
- regional resources — static external IP, subnet
- zonal resources — vm, disk
- stackdriver logging export to buckets, BQ, pub/sub
- deployment manifest file — YAML
- GAE standard env (runtime function)— reliable under heavy load, rapid scaling, startup in sec, big up/down traffic, can scale to zero instance
- GAE flexible env (docker) — startup in min, consistent traffic, gradually up/down traffic, min instance = 1
- cloud storage cannot append or truncate
- datalab — analyse and transform
- data studio — visualize
- composer = workflow DAG
- Machine learning:
- Precision = TP / (TP + FP): the fraction of positive predictions that are actually positive.
- Recall = TP / (TP + FN): the fraction of actual positives that the model predicted as positive.
- (e.g. 100% precision if you predict 1 positive and it is correct, but there are 10 positives in total in the data, so recall is only 10%; worked through in the sketch after these ML notes)
- accuracy = correct predictions / N, error rate = wrong predictions / N = 1 - accuracy
- feature cross = synthetic feature (construct new features by multiplying existing features, e.g. x3 = x1 × x2; crossing a feature with itself gives x3 = x1²)
- bucketing = value →bin bucket (similar to histogram)
- Your model is underfitting if the accuracy on the validation set is higher than the accuracy on the training set, or if the model performs poorly overall.
- fix underfitting by increasing model complexity and removing dropout
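A tiny worked version of the precision/recall example above, in Python; the dataset size of 100 and the 90 true negatives are assumptions added only to complete the accuracy line:

# 1 predicted positive and it is correct; 10 positives exist in the data
tp, fp, fn = 1, 0, 9
precision = tp / (tp + fp)    # 1.0 -> 100% of positive predictions are right
recall = tp / (tp + fn)       # 0.1 -> only 10% of actual positives are found

# accuracy counts all correct predictions (positive and negative) over N
tn = 90                       # assumed true negatives to complete the example
n = tp + fp + fn + tn         # 100
accuracy = (tp + tn) / n      # 0.91
error_rate = 1 - accuracy     # 0.09
print(precision, recall, accuracy, error_rate)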
- streaming — scalability, fault tolerant, latency, instant insight
- pub/sub — at-least-once delivery, max retention = 7 days
- ackDeadline — the time the subscriber has to acknowledge an outstanding message; once the deadline passes, the message is no longer considered outstanding and Pub/Sub will attempt to redeliver it
- cannot deliver expired messages, or messages published before the subscription was created
- pubsub at least once for every subscription
- PubSubIO deduplicates messages = exactly-once processing
- pull has higher throughput than push
- pubsub seek: replay already-acked messages, or purge (seek forward in time)
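A minimal publish/pull sketch with the google-cloud-pubsub Python client, illustrating at-least-once delivery, ackDeadline, and application-level dedup; the project, topic, and subscription names are placeholders and must already exist:

from google.cloud import pubsub_v1

PROJECT, TOPIC, SUB = "my-project", "my-topic", "my-sub"  # placeholder names

# publish: attach an application-level event_id so subscribers can dedupe,
# because messageId changes if the publisher itself retries the publish
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT, TOPIC)
publisher.publish(topic_path, b"payload bytes", event_id="evt-0001").result()

# pull: delivery is at-least-once, so the callback may see the same event twice
subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT, SUB)

def callback(message):
    print(message.data, message.attributes.get("event_id"))
    message.ack()  # ack before the subscription's ackDeadline,
                   # otherwise the message is redelivered

streaming_pull = subscriber.subscribe(sub_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen briefly, then shut down
except Exception:
    streaming_pull.cancel()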
- Dataflow:
- ParDo — apply a function to each element in parallel (parallel do)
- UDF — can be any language
- combine — aggregate/combine value after map ( similar to reduce in MapReduce)
- accumulation — combine with add value
- DirectRunner — run pipeline in local mode
- DataflowRunner — run pipeline on cloud
- event time — not actual processing time, just data timestamp at event
- watermark — expected arrival time of all msg in the window
- triggered by — the watermark passing the end of the window, event time, processing time at a stage, or data-driven triggers
- windowing — fixed, sliding, session (idle time separation and min gap duration)
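A minimal Beam (Python SDK) sketch of the three windowing strategies; the sample elements, the 60 s window size, and the 10-minute session gap are arbitrary assumptions:

import apache_beam as beam
from apache_beam import window

with beam.Pipeline() as p:  # DirectRunner: runs locally
    events = (
        p
        | beam.Create([("user1", 10), ("user1", 70), ("user2", 200)])
        # use the second field as the event-time timestamp (seconds)
        | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
    )
    fixed = events | "fixed" >> beam.WindowInto(window.FixedWindows(60))
    sliding = events | "sliding" >> beam.WindowInto(window.SlidingWindows(60, 15))
    sessions = events | "sessions" >> beam.WindowInto(window.Sessions(10 * 60))
    # e.g. per-key counts within each fixed window
    counts = fixed | beam.combiners.Count.PerKey()

Because a PCollection is immutable, the same events collection can feed all three windowings.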
- A PCollection is immutable, so you can apply multiple transforms to the same one.
- Within a DoFn, always use a try-catch block around activities like parsing data. In the exception block, rather than just logging the issue, send the raw data out as a SideOutput into a storage medium such as BigQuery or Cloud Bigtable, using a String column to store the raw unprocessed data (sketch after the retry note below).
- DoFns should be functions that do not depend on hidden or external state, have no observable side effects, and are deterministic.
- When running in batch mode, bundles including a failing item are retried 4 times. The pipeline will fail completely when a single bundle has failed 4 times. When running in streaming mode, a bundle including a failing item will be retried indefinitely, which may cause your pipeline to permanently stall.
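A minimal sketch (Python SDK) of the try/catch + side-output pattern mentioned above, which keeps bad records out of the endless retry loop by routing them to a dead-letter output; the tag name and sample records are assumptions:

import json
import apache_beam as beam

class ParseJson(beam.DoFn):
    DEAD_LETTER_TAG = "dead_letter"  # hypothetical tag name

    def process(self, element):
        try:
            yield json.loads(element)
        except Exception:
            # rather than just logging, emit the raw record to a side output
            # so it can be written to BigQuery/Bigtable as a raw String column
            yield beam.pvalue.TaggedOutput(self.DEAD_LETTER_TAG, element)

with beam.Pipeline() as p:
    parsed, dead_letter = (
        p
        | beam.Create(['{"id": 1}', "not valid json"])
        | beam.ParDo(ParseJson()).with_outputs(
            ParseJson.DEAD_LETTER_TAG, main="parsed")
    )
    # parsed -> normal downstream transforms; dead_letter -> raw-record sink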
- optimizations can include fusing multiple steps or transforms in your pipeline’s execution graph into single steps.
- When you apply a GroupByKey or other aggregating transform, the Dataflow service automatically performs partial combining locally before the main grouping operation.
- For bounded data, the service favors efficiency and will perform as much local combining as possible. For unbounded data, the service favors lower latency and may not perform partial combining (as it may increase latency).
- You can disable autoscaling by explicitly specifying the flag --autoscaling_algorithm=NONE (options sketch below)
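A minimal sketch of passing these service flags to a Beam pipeline; the project, region, bucket, and worker counts are placeholder assumptions:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",       # DirectRunner would run the same code locally
    "--project=my-project",          # placeholder
    "--region=us-central1",
    "--temp_location=gs://my-bucket/tmp",
    "--autoscaling_algorithm=NONE",  # disable autoscaling
    "--num_workers=5",
    "--max_num_workers=5",
])

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)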
- Dynamic Work Rebalancing feature
- run up to 25 concurrent Dataflow jobs per Google Cloud project
- JSON job requests must be 20 MB in size or smaller
- a job's graph size must not exceed 10 MB
- a maximum of 1000 Compute Engine instances per job.
- For best results, use n1 machine types. Shared-core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement.
- 15 persistent disks per worker instance when running a streaming job
- Dataflow’s Streaming Engine moves pipeline execution out of the worker VMs and into the Dataflow service backend. Streaming Engine offers smoother, more granular scaling of workers. Streaming Engine works best with smaller worker machine types
- The service-based Dataflow Shuffle feature, available for batch pipelines only, moves the shuffle operation out of the worker VMs and into the Dataflow service backend. ( — experiments=shuffle_mode=service)
- Dataflow FlexRS reduces batch processing costs by using advanced scheduling techniques, the Dataflow Shuffle service, and a combination of preemptible virtual machine (VM) instances and regular VMs.
- When you submit a FlexRS job, the Dataflow service places the job into a queue and submits it for execution within 6 hours of job creation. Dataflow finds the best time to start the job within that time window, based on the available capacity and other factors.
- FlexRS jobs have a scheduling delay. Therefore, FlexRS is most suitable for workloads that are not time-critical, such as daily or weekly jobs that can complete within a certain time window.
- Worker VM logs — available through the Logs Viewer or the Dataflow monitoring interface
- When you update a job on the Dataflow service, you replace the existing job with a new job that runs your updated pipeline code. The Dataflow service retains the job name but runs the replacement job with an updated Job ID. Updating requires a transform mapping.
- There are two possible commands you can issue to stop your job: Cancel and Drain.
- Beam's state API models state per key. Inside the ParDo, these state variables can be used to write or update state for the current key, or to read previous state written for that key.
- BigQuery — SELECT COUNT(*) is answered from table metadata rather than a column scan, so it does not consume storage bandwidth
- BQ SQL UNNEST, ARRAY, STRUCT, cross join usage.
- Access array elements with either OFFSET (zero-based indexes) or ORDINAL (one-based indexes); see the client snippet after the query below.
-- double every element of each array
WITH sequences AS
  (SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
   UNION ALL SELECT 2, [2, 4, 8, 16, 32]
   UNION ALL SELECT 3, [5, 10])
SELECT some_numbers,
       ARRAY(SELECT x * 2
             FROM UNNEST(some_numbers) AS x) AS doubled
FROM sequences;

-- return the ids of rows whose array contains the value 2
WITH sequences AS
  (SELECT 1 AS id, [0, 1, 1, 2, 3, 5] AS some_numbers
   UNION ALL SELECT 2, [2, 4, 8, 16, 32]
   UNION ALL SELECT 3, [5, 10])
SELECT id AS matching_rows
FROM sequences
WHERE 2 IN UNNEST(sequences.some_numbers)
ORDER BY matching_rows;
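And a minimal OFFSET vs ORDINAL sketch through the BigQuery Python client (assumes application-default credentials; the inline array is just sample data):

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT some_numbers,
       some_numbers[OFFSET(1)]  AS second_via_offset,   -- zero-based index
       some_numbers[ORDINAL(1)] AS first_via_ordinal    -- one-based index
FROM UNNEST([STRUCT([0, 1, 1, 2, 3, 5] AS some_numbers)])
"""

for row in client.query(sql).result():
    print(row.some_numbers, row.second_via_offset, row.first_via_ordinal)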
- BigTable — single key lookup
- similar to HBase, Cassandra, DynamoDB; throughput scales linearly with the number of nodes
- no joins, append-only, not transactional; transactions only within a single row
- solve hot spot by — avoid sequential id, reverse key, field promotion (column data →key), salting (construct artificial key, e.g. hash)
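A minimal sketch of those key-design techniques (field promotion, reversed timestamp, salting); the id, timestamp, and bucket count are arbitrary assumptions:

import hashlib

def row_key(sensor_id: str, timestamp_ms: int, salt_buckets: int = 8) -> str:
    """Compose a Bigtable row key that avoids hot-spotting on sequential writes."""
    # field promotion: move a high-cardinality column (sensor_id) into the key
    # reversed timestamp: newest rows no longer pile up on one growing key range
    reversed_ts = 10**13 - timestamp_ms
    # salting: a short hash prefix spreads writes across salt_buckets key ranges
    salt = int(hashlib.sha1(sensor_id.encode()).hexdigest(), 16) % salt_buckets
    return f"{salt}#{sensor_id}#{reversed_ts}"

print(row_key("sensor-42", 1552521600000))  # e.g. '<salt>#sensor-42#8447478400000'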
- wide table — for dense data, bunch of properties
- narrow table — for sparse data, e.g. avoid listing every product for every user; e.g. time series
- row size < 100MB
- sorted string table (SST) key value
- access rights are controlled at the instance level
- access via the HBase client or the cbt CLI
- BigTable/BQ (PB) , datastore (TB), cloud SQL (GB)
- datastore — ACID, no joins, slow writes (~1/sec per entity group) because indexes must be updated; document NoSQL (e.g. user profiles, game state)
- hierarchies, e.g. the user is the root entity and all their orders are child entities
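A minimal sketch of that hierarchy with the google-cloud-datastore client; the kind names, IDs, and properties are placeholders:

from google.cloud import datastore

client = datastore.Client()  # assumes application-default credentials

# the user is the root entity; each order is a child in the same entity group
order_key = client.key("User", "alice", "Order", 1001)
order = datastore.Entity(key=order_key)
order.update({"total": 42.5, "status": "shipped"})
client.put(order)

# ancestor query: strongly consistent fetch of all orders under this user
ancestor = client.key("User", "alice")
orders = list(client.query(kind="Order", ancestor=ancestor).fetch())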
- VPC — default auto subnet is 1 subnet per region.
- VMs on the same VPC can ping each other even when they are in different regions
- a VM with multiple NICs (one per VPC) can ping VMs in the same subnet of each attached VPC
- a VM with multiple NICs cannot ping VMs in other subnets (or regions, for an auto-mode VPC) of a secondary VPC, because only the primary NIC gets a default route
- linux command troubleshoot: ip route
- Use BigQuery for storage. Use Dataproc to run the transformations. ANSI SQL queries require the use of BigQuery.
- composer/airflow: GoogleCloudStorageToBigQueryOperator
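A minimal Composer/Airflow DAG sketch around that operator (Airflow 1.x import path; the bucket, dataset, and schedule are placeholder assumptions):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.gcs_to_bq import GoogleCloudStorageToBigQueryOperator

with DAG(dag_id="gcs_to_bq_daily",
         start_date=datetime(2019, 3, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    load_events = GoogleCloudStorageToBigQueryOperator(
        task_id="load_events",
        bucket="my-bucket",                            # placeholder bucket
        source_objects=["events/{{ ds }}/*.csv"],      # templated daily path
        destination_project_dataset_table="my-project.analytics.events",
        source_format="CSV",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )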
- cloud speech to text api: synchronous mode is recommended for short audio files
- duplication in this case could be caused by a publisher-side (client) retry, in which case messageId could be different for the same event
- the client application must include a unique identifier to disambiguate possible duplicates due to push retries
- HIPAA protects healthcare information; COPPA protects children's information (under 13)
- GDPR — data protection and privacy in the EU. It also addresses the transfer of personal data outside the EU and EEA areas. Data breaches must be reported within 72 hours.
- When data is collected, data subjects must be clearly informed about the extent of data collection, the legal basis for processing of personal data, how long data is retained, if data is being transferred to a third-party and/or outside the EU, and any automated decision-making that is made on a solely algorithmic basis.
- DLP scans and discovers sensitive data in Cloud Storage, BigQuery, and Datastore; can automatically mask your data
- Data Catalog: tag and search your data from BigQuery, Pub/Sub, and Cloud Storage
External Links
- lab: https://www.qwiklabs.com/quests/25
- reference: https://medium.com/weareservian/google-cloud-data-engineer-exam-study-guide-9afc80be2ee3
- Dataflow use cases: https://cloud.google.com/blog/products/data-analytics/guide-to-common-cloud-dataflow-use-case-patterns-part-1
- https://medium.com/@raigonjolly/dataflow-for-google-cloud-professional-data-exam-9efd59377068
- https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays?hl=zh-tw