Phat Nguyen
Phat Ng

Phat Ng

Prepare for DP-900

Phat Nguyen's photo
Phat Nguyen
·Mar 27, 2022·

11 min read

ANALYTICS

  • descriptive Analytics

    • what occurred in the past
    • what is current happening
  • cognitive

    • transcribing audio files
    • text-to-speech
    • speech service
    • PDF
    • bulk data processing
  • diagnostic

    • why did it happen?
  • prescriptive

    • how can we make it happen?
  • predictive

    • what will happen

DATABASE

  • normalization

    • reduce data redundancy
    • improves data integrity (toàn vẹn dữ liệu)
  • star schemas

    • đơn giản là 1 fact table ở giữa và các nhanh dimension table xung quanh
  • snowflake schemas

    • 1 fact table ở giữa và dimension là star schemas
  • non-relational DB

    • disavantages:
      • Data consistency
      • don't perform ACID transaction
  • DDL, DML, DCL, TCL

    • DDL (Data Definition Language): CREATE, DROP, ALTER, RENAME, COMMENT and TRUNCATE.
    • DML (Data Manipulation Language): SELECT, INSERT INTO, DELETE, UPDATE.
    • DCL (Data Control Language): REVOKE & GRANT.
    • TCL (Transaccional Control Language): COMMIT & ROLLBACK
  • relational DB

    • optimized for writes
    • consistency and availability
    • flexibility
    • data integrity
    • data retrival
  • key/ value: application users and theirs language

  • object: images and metadata
  • columar/ document: similar just 2 column and rows
  • columnar:

    • Recommendations
    • Personalization
    • Sensor data
    • Telemetry
    • Messaging
    • Social media analytics
    • Web analytics
    • Activity monitoring
    • Weather and other time-series data
  • SQL Server in a VM

    • 100% compatibility
    • fewest changes when migrating existing on-premises solution
  • can use index same way in SQL and non-SQL

  • Manual sharding

    • when DB is high volume, its time consuming
    • difficult method of distributing those daa transaction over mul server

ETL, ELT

  • for Azure Data Factory
  • ETL

    • data that is fully processed before being loaded to the target store
    • Folow
      • extract data from **original server
      • transfrom data in **Transform Engine
      • load data in **Warehouse
  • ELT

    • target data store powerful
    • Flow
      • extract data from **original server
      • load data in **Warehouse
      • transfrom data in **Warehouse
    • support **Data Lake
    • **large amount of data

PROCESSING

  • batch processing

    • latency
    • output data to file store, relational and non DB
    • subjected to detailed analysis (data được phân tích chi tiết)
  • massively parallel processing (MPP)

    • distributes across compute nodes
    • synapse and hive on HD
  • cluster index

    • sorts and stores the data rows

AZURE STORAGE

  • Cloud storage (no relational DB)

    • geographically distributed writes
  • Blog storage

    • files
  • Table storage

    • NoSQL
    • key/value store
    • just multiple read replicas
    • key: partiion key & row key
    • the cheapest storage for key/value
    • 5 PB = 500 TB max
    • direct mounting Window, Mac, Linux
  • Cosmos Germin API

    • Graph
    • key/value store
    • relationship related with Graph
  • Data Lake Storage Gen2

    • hierachy: Subscription > Resource group > Storage account > container
    • POSIX-like access control lists (ACLs)
      • folder/ file level
      • must enable hierachy namespace
    • Azure role-based access control (Azure RBAC)
    • raw data
    • need to create storage account
    • use blob, table, queue, and file storage in the same Azure Storage account
    • multi region can incur bandwith costs
    • built on top of Blob storage
  • SQL Database

    • PaaS
    • relational DB
    • processing read and write: OLTP
    • config firewall to restrict IP address (a server-level firewall)
    • db firewall level take precidence than server rule
    • include managed backup service
    • built-in high availability
    • Azure Defender
    • **no need 365 subscription
    • use existing SQl server licenses to reduce cost
    • admin, login account first time
    • MFA using Azure Active Directory (AD)
    • public endpoint --> no user
    • not fully compatible with SQL Server
    • benefit: not in-DB machine
    • TDL enabled by default, can't change
    • can't choose SQL version
  • SQL Managed Instance

    • native support for cross-database queries and transactions
    • read-only: generate reports without affecting the transactional workload
    • 100% compatibility with SQL Server running in your own environment
    • features: Agent, Database Mail
  • Cosmos DB, HDInsight: **non-relational DB

  • Synapse Analytics: data warehousing, not OLTP

    • external format
    • performing compute-intensive tasks
    • external resources
    • pause SQL pool to reduce cost
    • polybase: query data from external sources, T-SQL
    • SQL pool -> polybase
  • Transparent data encryption:

    • helps protect Azure SQL Database, Azure SQL Managed Instance, and Azure Synapse Analytics
    • by encrypting data at rest (DB encrypt)
  • File storage

    • shared folder, windows server
    • Server Message Block (SMB) protocol That means that multiple VMs can share the same files with both read and write access
    • To enable users at different sites to share files
  • Blob

    • doesn't support shared folders
    • Use a block blob for discrete objects that change infrequently.
    • Use a page block for blobs that require random read and write access.
    • Objec storing and retriving large binary files or blobs, such as images, videos, text files and audio files
    • lowest cost: LRS - 3 copies -> default of Table storage
  • Warehouse

    • structured data
    • The entire process of loading data into the warehouse
    • Performing data warehouse analysis and reporting
    • Managing data in the data warehouse
    • Exporting data from the data warehouse
    • data mining -> type of workload -> OLAP
  • SQL DATABASE SERVERLESS

    • optimize scaling, pausing automaticly
  • Dataset

    • data structures
  • Pipline
    • logical grouping of activities that performs a unit work, can be scheduled
    • not data structure
    • can excute other piplines
  • Linked service

    • The information used to connect external resources
  • Transaction Optimized

    • 25% - 33% cost of Hot storage but expensive GB per storae
  • File Share

    • Standard: disk-based storage
    • Premium: solid-state disks
    • LRS, ZRS, not GRS
    • premium can not create in File Storage account

VISUALIZATION

  • Represent trends and patterns over time

  • Communicate the significance of data

  • treemap

    • charts of colored rectangles, with size representing the relative value of each item. They can be hierarchical, with rectangles nested within the main rectangles.
  • catter and bubble

    • display relationships between 2 (scatter) or 3 (bubble) quantitative measures -- whether or not, in which order, etc.
  • key influencer

    • displays the major contributors to a selected result or value.
  • Interactive report: drill-downs, filters and sorting

AZURE ACCOUNT

  • Data in account replicate outside the Azure region automatically

    • read-access geo-redundant storage (RA-GRS)
    • geo-redundant storage (GRS)
  • ARM

    • automate the creation of resources
  • Azure Storage Explorer

    • access data from Storage account
  • Zone Redundant Storage (ZRS) has 3 copies

  • Geo Redundant Storage (GRS) has 6 copies

  • secure all data come from and to: secure transfer required

... AS A SERVICE

docs.microsoft.com/en-us/learn/modules/fund..

  • PaaS

    • less setup and configuration effort
    • Operating system is cloud manages, not user
    • PaaS DB can not be pause, just can export and delete DB
    • case: **Azure DB for PostgreSQL
    • features:
      • built-in high availability
      • scaling options
      • reduce managed hardware, reduce administrative
    • access to lastest features
    • handle upgrading, patching, backups, and monitoring without user involvement.
  • IaaS

    • PostgreSQl on Azure VMs

COMMAND LINE

  • can use command-line scripts using Powershell, CLI

  • az storage ... create -> just in blob storage

  • sqlcmd: query Azure SQL database

  • bcp: copies data between an instance of Microsoft SQL Server and a data file in a user-specified format

  • Azcopy file from multi sources

  • Azure cli: command-line tools for building and managing Azure resources

  • psql: CLI for Postgre

  • Postgre: port 5432

COSMOS DB API

  • cosmos account -> database -> container(table) -> item (entities)

  • not support partition keys and rows key

  • default security: authorization token

  • full control: a primary key model

  • limit read-access: resource token

  • must create separate API for each DB

  • Cosmos API:

    • supports multiple read replicas
    • supports multiple write regions
  • Table API:

    • supports multiple write regions
  • Core API: for SQL

    • container level: throughput, partition key
    • also support JSON format
    • best choice when start new project support document, key/value
    • any specific choice like table, mongo is only use for existing project
  • API for MongoDB: for MongoDB, document structures

  • Cassandra API: for column-oriented schema

  • Gremlin API: for Graph, edges and vertices

  • Mongo DB API: BSON format

  • provision throughput

    • containers
    • databases
  • account level:

    • consistency level
    • DB API
  • minimum number of Request Units per second (RU/s): 400 RU/s (pay regardless how many uses)

  • RU: 1KB with 10 fields

  • costs: Provisioned throughput, number of regions, number of availability zones, consumed storage

  • IP Policy-based access control -> limit from internet

  • type of account by selecting API type

  • serverless mode for all DB APIs

  • 1 free tier per Azure sub

STUDIO AND TOOL

  • SSMS: graphical tool for manage SQL server

    • query **Synapse Analytics warehouse
    • support Always on DB configuration
  • Azure Data Studio:

    • lightweight editor that can run on-demand SQL queries
    • query M SQL server **big data cluster
    • intelliSense
    • Notebook
    • restore a database
    • access data stored in SQL Database
    • not support Always on DB configuration
    • can not provide statistics and live query
  • SQL Server Data Tools (SSDT):

    • development tool for building SQL Server relational databases, databases in Azure SQL, Analysis Services data models, Integration Services packages, and Reporting Services reports
    • offline database project and implement schema changes by adding, modifying or deleting the definitions of objects (represented by scripts) in the project
  • AZure firewall

    • is a managed, cloud-based network security service that protects your Azure Virtual Network resources
  • Transparent Data Encryption (TDE):

    • the DB to protect data at rest

Data Factory

  • data ingestion tool

  • ETL and ELT

  • Activities

    • Control - Until
    • Data movement - Copy
    • Data transformation - Mapping data flow
    • either sequentially or parallel
  • compute enviroment actitites

    • integration runtime
  • inititates the pipline

    • a trigger
  • orchestrate activities

    • pipline
  • orchestrate pipline

    • Control flow
  • A tumbling window can run a job using data for a specific period of time, and not before or after that.

  • transfer data with process it along the way

  • SSIS

Apache Spark

  • Synapse Analytics

  • Databricks

    • can consume data from alot of sources
    • end-to-end ML
    • all data roles in place
    • ouput must be written to Blob or Data lake first
  • HDInsight

    • big data proces with Spark, Hadoop, Kafka

SECURITY

  • Transparent Data Encryption

    • used to encrypt data at rest, including database, logs, and backups, without requiring changes to the application
    • used to protect sensitive data by limiting access to data at rest,
  • Transport layer security

    • used to encrypt data in motion between the database server and clients using certificate-based encryption
  • Dynamic data masking

    • by designating how much of the sensitive data can be revealed
    • some fields hide data: phone xxx-xxx-xxx
  • Always encrypted

    • in movement, and in use to client applications that have appropriate access to keys
    • used to limit exposer of sensitive data to non-privileged users
    • encrypted end to end, and never in a decrypted state at any time outside the client machine
  • security principal

    • An object that represents a user, group, service, or managed identity that is requesting access to Azure resources
  • Zero Trust Model: assump we have breached

  • secure data at rest

    • Azure Disk Encryption

Ingest, Analyze, Deliver

  • ingest

    • hub, Iot, event
  • analyze

    • SQL, DB, Blob store
    • ML service
  • deliver

    • the rest
  • data ingestion

    • Capturing raw data streaming from various sources and storing it

FLOW Warehouse

  • Ingest data: Data Factory

  • Data store: Data lake

  • prepare and transform: Databricks

  • model and serve: synapse analytic

BI

  • BI service

    • report sharing and distribution
    • report and dashboard creation
    • can not design data model
  • BI desktop

    • data modeling
    • data acquisition and preparation
  • can adding

    • report page
    • visualization from report
    • images
    • text
    • display visualz from Excel workbook
    • associated with a single workplace
  • dashboard: tell whats going on, the most important elements

  • paginate report -> can be printed -> an invoice

  • paginate report -> Any time your client asks for detail, in rows and columns

  • APP is collection of ready-made visuals pre-arranged on reports and dashboard

  • Interactive report: A report based on single set using BI desktop

  • A tile: rectangular box contains a sigle visual for use with a report or dashboard support interaction

  • paginated report made by BI report builder

  • Dashboard is single canvas pin tiles, single-page collection of visual

    • can not filter and slice
  • visualization are a visual represention of your data

OLTP

  • heavy writes and moderate reads

  • schema on write

  • normalize data

    • Data dependencies are logical, all related data items are stored together
    • Reduces data duplication

Least Privileged Access

  • JIT access

Redis Cache

  • Region, pricing tier, hours

Azure Shared Responsibility

  • data security: between Azure and client

Sources: Internet