Skip to content

Latest commit

 

History

History
164 lines (140 loc) · 7.02 KB

File metadata and controls

164 lines (140 loc) · 7.02 KB

Big data services

Azure Search

  • PaaS / Search as a Service over scoped text content.
  • Stores your data in an index that can be searched through full text queries.
    • Can be created in Azure Portal or with SDK/REST APIs.
    • Can be auto-generated from SQL Database or Cosmos DB.
  • Search is not only indexing!
    • Search needs to be smarter, have language understanding, understand uniqueness.
  • Advanced search behaviors
    • Type-ahead query suggestions based on partial term input
    • Hit-highlighting
    • Faceted navigation
    • Natural language support using linguistic rules for the specified language.
  • Pricing
    • Free: Shared with other Azure Search subscribers
    • Higher
      • Dedicated resources only used by your service
      • More search units & partitions
        • Search unit
          • How fast it indexes
            • Each partition and replica counts as one SU
            • E.g. 3 replicas + 3 partitions = 9 SU (as each partition have its own replica)

Azure Cosmos DB

  • Multi-model database service.
  • Globally distributed: users access data centers closer to them.
  • Four different APIs:
    • Document DB (SQL) API,
    • MongoDB API
    • Graph (Gremlin) API
    • Tables (Key/Value) API
  • The Azure Cosmos DB Data Migration tool
    • Open-source solution that imports data to Azure Cosmos DB
    • Sources, include: JSON files, MongoDB, SQL Server, CSV files, Azure Cosmos DB collections.

Consistency levels

  • Automatic data distribution across partitions.
  • Same partition key ensures data will be stored within same partition.
  • 📝 Consistent-prefix
    • Data versions can be behind by are always ordered in a right way based on when they're written.
  • Four consistency levels from stronger guarantees to better performance and availability:
  • From strongest to loosest: Strong => Bounded Stateless => Session => Eventual
  • Consistency strategy
    • Stronger consistency -> Write is slower
    • Looser -> Performance is at peak but read can lag behind to older versions of data

Strong

  • Write operation on primary database => it's replicated to the replica instances.
  • The operation is only committed (and visible) on the primary after it has been committed and confirmed by ALL replicas.
  • Always read most recent copy from master r/w node.

Bounded Stateless

  • Reads honor consistent-prefix guarantee.
  • Established through 2 metrics: time/version.
    • E.g. goal: 90%>
  • Difference from Strong is you can configure how stale documents can be within replicas.
    • Staleness : Quantity of time (or version count) a replica document can be behind the primary document.

Session

  • Reads honor consistent-prefix guarantee.
  • Guarantees all queries are served by same server per session.
    • You always see the latest change you make.
  • All read & write are consistent & monotonic across primary and replica instances within a user session.

Eventual

  • Commits any write operation against the primary immediately.
  • Transactions are handled asynchronously and will eventually (over time) be consistent with the primary.
  • Most performant as you don't wait for replicas to commit to finalize its transactions.

Azure Synapse Analytics

  • Cloud-based Enterprise Data Warehouse (EDW)
  • Formerly known as SQL Data Warehouse.
  • Main difference from SQL is that it's underlying infrastructure is that it's a columnar storage.
    • Tables are still relational
  • Quickly run complex queries across petabytes of data.
    • Leverages Massively Parallel Processing (MPP).
  • E.g. use as a key component of a big data solution.
  • Supports PolyBase T-SQL queries
    • PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
    • It allows you to run queries on external data in Hadoop/Spark or Azure blob storage.
  • 💡 Choose if you run many aggregated queries.
  • Data flow
    • Ingest: Data orchestration and monitoring
    • Store: Big data store
    • Prep & train: Hadoop/Spark and Machine Learning
    • Model & serve: Data warehouse

Azure Data Lake Store

  • Hyper-scale repository for big data analytic workloads.
  • Stores data of any size, type and ingestion speed in a one single place for operational and exploratory analytics.
  • It's a distributed file system
  • Can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs.
    • WebHDFS => Web Hadoop File System for IaaS.
      • Used also by Kafka.
    • HDInsight
      • Compute layer of the data lake store.
      • Managed hadoop cluster
  • Includes security (e.g. OAuth, 777), manageability, scalability, reliability and availability.

File systems

  • Single machine: Fat NFS
  • Network file-systems: NFS, SMB
    • Distributed lock.
    • Different nodes can mount same drive without corruption.
  • Cluster file-systems: ZFS, ADFS, Oracle RAC
    • They respect each others lock.
    • Allows multiple machines to manipulate files within them concurrently.
    • Oracle RAC is active-active.
      • Machines share their caches.
      • They cross replicates to create illusion that they're active nodes, but the reality is that there's a single master node.
        • If master node fails -> mastership is delegated to another node.
  • Distributed file system
    • Connects machines.
    • Instead of everyone mounting a shared device -> Every machine has a local storage and distributed across others.
    • Allows distributing jobs so it can be distributed.
      • Parallel processing, e.g. Hadoop.
    • It's not much about distributing files but parallel processing.
  • More, e.g. storage based on blockchain.

Azure Data Lake Analytics

  • Biggest challenge of Hadoop is that it's a very big ecosystem.
    • So many components, hard to get stuff done.
  • Simplify big data analytics.
  • Can handle jobs of any scale instantly by setting the dial for how much power you need.
  • Pay only for your job when it is running, making it cost-effective.
  • The analytics service supports Azure Active Directory letting you manage access and roles, integrated with your on-premises identity system.
  • Includes U-SQL
    • Similar to C# and SQL.
    • A language that unifies the benefits of SQL with the expressive power of user code.
    • Scalable distributed runtime enables you to efficiently analyze data in the store and across SQL Servers in Azure, Azure SQL Database, and Azure SQL Data Warehouse.

Azure Data Factory

  • Used for data integration
  • Pipeline solution to schedule data-driven workflows.
    • Typically a pipeline is:
      • Connect & Collect / Ingest
      • Store: E.g. Azure Store
      • Transform & Enrich
        • Process in Spark, Data Lake Analytics and Machine Learning.
        • Prep & train with Azure Databricks.
      • Publish
        • Load into analytics engine (e.g. Azure SQL database) to be used by BI tools.
      • Monitor
        • Check failure/success rates
  • Handles
    • Extract-transfer-load
    • Data integration
      • Can ingest data from data stores.
      • Send output data to data stores.
      • E.g. file shares, FTP web, databases, SaaS services.
    • Hybrid extract-transfer-load