- PaaS / Search as a Service over scoped text content.
- Stores your data in an index that can be searched through full text queries.
- Can be created in Azure Portal or with SDK/REST APIs.
- Can be auto-generated from SQL Database or Cosmos DB.
- Search is not only indexing!
- Search needs to be smarter: it should have language understanding and recognize what is unique in the content.
- Advanced search behaviors (query sketch after this list)
- Type-ahead query suggestions based on partial term input
- Hit-highlighting
- Faceted navigation
- Natural language support using linguistic rules for the specified language.
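
A minimal sketch of these behaviors using the `azure-search-documents` Python SDK. The service endpoint, index name, API key and the suggester name `sg` are placeholder assumptions, not from these notes:

```python
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

# Hypothetical service, index and key; a suggester named "sg" is assumed to exist on the index.
client = SearchClient(endpoint="https://<service>.search.windows.net",
                      index_name="hotels",
                      credential=AzureKeyCredential("<api-key>"))

# Full-text query with hit-highlighting and faceted navigation.
results = client.search(search_text="beach resort",
                        highlight_fields="description",
                        facets=["category"])
for doc in results:
    print(doc.get("@search.highlights"))

# Type-ahead suggestions from partial term input.
for s in client.suggest(search_text="bea", suggester_name="sg"):
    print(s["text"])
```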
- Pricing
- Free: Shared with other Azure Search subscribers
- Higher tiers
- Dedicated resources only used by your service
- More search units & partitions
- Search unit
- How fast it indexes
- Each partition and replica counts as one SU
- E.g. 3 partitions × 3 replicas = 9 SUs, since SU count = partitions × replicas (see the sketch below)
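
As referenced above, the SU billing math is just multiplication; a tiny sketch:

```python
def search_units(partitions: int, replicas: int) -> int:
    # Each replica of each partition counts as one search unit (SU).
    return partitions * replicas

assert search_units(3, 3) == 9  # the 3 partitions + 3 replicas example above
```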
- Multi-model database service.
- Globally distributed: users access data centers closer to them.
- Four different APIs:
- Document DB (SQL) API
- MongoDB API
- Graph (Gremlin) API
- Tables (Key/Value) API
- The Azure Cosmos DB Data Migration tool
- Open-source solution that imports data to Azure Cosmos DB
- Sources include: JSON files, MongoDB, SQL Server, CSV files, Azure Cosmos DB collections.
- Automatic data distribution across partitions.
- Same partition key ensures data will be stored within the same partition (see the sketch below).
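
A minimal sketch with the `azure-cosmos` Python SDK (SQL API); the account URL, key, database, container and item values are all placeholder assumptions:

```python
from azure.cosmos import CosmosClient, PartitionKey

# Hypothetical account and key.
client = CosmosClient(url="https://<account>.documents.azure.com:443/", credential="<key>")
db = client.create_database_if_not_exists(id="appdb")

# Items sharing the same /customerId value land in the same logical partition.
container = db.create_container_if_not_exists(
    id="orders", partition_key=PartitionKey(path="/customerId"))

container.upsert_item({"id": "1", "customerId": "c42", "total": 9.99})

# Single-partition query: scoped to one partition key value.
items = list(container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @cid",
    parameters=[{"name": "@cid", "value": "c42"}],
    partition_key="c42"))
```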
- 📝 Consistent-prefix
- Data versions can lag behind, but reads are always ordered correctly based on when the data was written.
- Four consistency levels, trading stronger guarantees for better performance and availability:
- From strongest to loosest: Strong => Bounded Staleness => Session => Eventual
- Consistency strategy
- Stronger consistency -> Write is slower
- Looser -> Performance is at peak but read can lag behind to older versions of data
- Write operation on primary database => it's replicated to the replica instances.
- The operation is only committed (and visible) on the primary after it has been committed and confirmed by ALL replicas.
- Always read most recent copy from master r/w node.
- Reads honor consistent-prefix guarantee.
- Established through 2 metrics: lag time or version count.
- E.g. goal: >90%
- Difference from Strong is you can configure how stale documents can be within replicas.
- Staleness: quantity of time (or version count) a replica document can be behind the primary document.
- Reads honor consistent-prefix guarantee.
- Guarantees all queries are served by same server per session.
- You always see the latest change you make.
- All reads & writes are consistent & monotonic across primary and replica instances within a user session.
- Commits any write operation against the primary immediately.
- Transactions are handled asynchronously and will eventually (over time) be consistent with the primary.
- Most performant, as you don't wait for replicas to commit before finalizing transactions (client sketch below).
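
In the `azure-cosmos` Python SDK the consistency level is picked per client (it can only be relaxed below the account's default, not strengthened). A sketch with a placeholder account and key; `"Eventual"` and `"Session"` are the SDK's level names:

```python
from azure.cosmos import CosmosClient

# Weaker consistency than the account default, for the fastest reads.
fast_client = CosmosClient(url="https://<account>.documents.azure.com:443/",
                           credential="<key>",
                           consistency_level="Eventual")

# Session guarantees: monotonic reads/writes within this client's session.
session_client = CosmosClient(url="https://<account>.documents.azure.com:443/",
                              credential="<key>",
                              consistency_level="Session")
```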
- Cloud-based Enterprise Data Warehouse (EDW)
- Formerly known as SQL Data Warehouse.
- Main difference from SQL Database is the underlying infrastructure: it uses columnar storage.
- Tables are still relational
- Quickly run complex queries across petabytes of data.
- Leverages Massively Parallel Processing (MPP).
- E.g. use as a key component of a big data solution.
- Supports PolyBase T-SQL queries
- PolyBase is a technology that accesses and combines both non-relational and relational data, all from within SQL Server.
- It allows you to run queries on external data in Hadoop/Spark or Azure Blob storage (sketch below).
- 💡 Choose if you run many aggregated queries.
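
A sketch of querying an assumed PolyBase external table from Python via `pyodbc`; the server, database, credentials and table names are hypothetical, and the inner query is ordinary T-SQL:

```python
import pyodbc

# Hypothetical dedicated SQL pool connection string.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<dw>;Uid=<user>;Pwd=<password>")

cursor = conn.cursor()
# ext.Logs is assumed to be a PolyBase external table over files in Blob storage/Hadoop;
# it can be joined with regular relational tables like dbo.Customers.
cursor.execute("""
    SELECT TOP 10 c.Region, COUNT(*) AS Events
    FROM ext.Logs AS l
    JOIN dbo.Customers AS c ON c.Id = l.CustomerId
    GROUP BY c.Region
    ORDER BY Events DESC
""")
for row in cursor.fetchall():
    print(row.Region, row.Events)
```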
- Data flow
- Ingest: Data orchestration and monitoring
- Store: Big data store
- Prep & train: Hadoop/Spark and Machine Learning
- Model & serve: Data warehouse
- Hyper-scale repository for big data analytic workloads.
- Stores data of any size, type and ingestion speed in one single place for operational and exploratory analytics.
- It's a distributed file system
- Can be accessed from Hadoop (available with HDInsight cluster) using the WebHDFS-compatible REST APIs (sketch below).
- WebHDFS => Web Hadoop File System for IaaS.
- Used also by Kafka.
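
A sketch of the WebHDFS-style REST surface against an assumed Data Lake Store (Gen1-style) account; the account name and bearer token are placeholders:

```python
import requests

# Hypothetical ADLS account; auth uses an Azure AD bearer token.
ACCOUNT = "<account>"
TOKEN = "<aad-access-token>"

# LISTSTATUS is a standard WebHDFS operation: list files under /data.
resp = requests.get(
    f"https://{ACCOUNT}.azuredatalakestore.net/webhdfs/v1/data?op=LISTSTATUS",
    headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"])
```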
- HDInsight
- Compute layer of the data lake store.
- Managed Hadoop cluster
- Includes security (e.g. OAuth, POSIX-style 777 permissions), manageability, scalability, reliability and availability.
- Single machine: FAT, NTFS
- Network file-systems: NFS, SMB
- Distributed lock.
- Different nodes can mount same drive without corruption.
- Cluster file-systems: ZFS, ADFS, Oracle RAC
- They respect each other's locks.
- Allows multiple machines to manipulate files within them concurrently.
- Oracle RAC is active-active.
- Machines share their caches.
- They cross-replicate to create the illusion that all nodes are active, but in reality there's a single master node.
- If the master node fails -> mastership is delegated to another node.
- Distributed file system
- Connects machines.
- Instead of everyone mounting a shared device -> every machine has local storage, and data is distributed across the others.
- Allows distributing jobs so work runs where the data lives.
- Parallel processing, e.g. Hadoop (toy sketch below).
- It's not so much about distributing files as about parallel processing.
- More, e.g. storage based on blockchain.
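
A toy, Hadoop-free illustration of the parallel-processing idea above: map a function over per-node data chunks in parallel, then reduce the partial results:

```python
from collections import Counter
from multiprocessing import Pool

def map_count(chunk: str) -> Counter:
    # "Map" step: count words in one locally stored chunk.
    return Counter(chunk.split())

if __name__ == "__main__":
    # Stand-ins for file blocks living on different machines.
    chunks = ["big data big", "data lake data", "big lake"]
    with Pool() as pool:
        partials = pool.map(map_count, chunks)  # conceptually runs where each chunk lives
    total = sum(partials, Counter())            # "Reduce" step: merge partial counts
    print(total.most_common())
```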
- Biggest challenge of Hadoop is that it's a very big ecosystem.
- So many components, hard to get stuff done.
- Simplify big data analytics.
- Can handle jobs of any scale instantly by setting the dial for how much power you need.
- Pay only for your job when it is running, making it cost-effective.
- The analytics service supports Azure Active Directory, letting you manage access and roles, integrated with your on-premises identity system.
- Includes U-SQL (job sketch below)
- Similar to C# and SQL.
- A language that unifies the benefits of SQL with the expressive power of user code.
- Scalable distributed runtime enables you to efficiently analyze data in the store and across SQL Servers in Azure, Azure SQL Database, and Azure SQL Data Warehouse.
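
A hypothetical U-SQL script held in a Python string; in practice it would be submitted with the Azure CLI (`az dla job submit`) or the Data Lake Analytics SDK. Paths and column names are made up:

```python
# C#-flavored types/expressions plus SQL-style set operations: the U-SQL mix described above.
usql_script = r"""
@searchlog =
    EXTRACT UserId int,
            Query  string
    FROM "/input/searchlog.tsv"
    USING Extractors.Tsv();

@top =
    SELECT Query, COUNT(*) AS Hits
    FROM @searchlog
    GROUP BY Query;

OUTPUT @top
TO "/output/top_queries.csv"
USING Outputters.Csv();
"""
```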
- Used for data integration
- Pipeline solution to schedule data-driven workflows.
- Typically a pipeline is:
- Connect & Collect / Ingest
- Store: E.g. Azure Storage
- Transform & Enrich
- Process in Spark, Data Lake Analytics and Machine Learning.
- Prep & train with Azure Databricks.
- Publish
- Load into analytics engine (e.g. Azure SQL database) to be used by BI tools.
- Monitor
- Check failure/success rates
- Handles
- Extract-transform-load (ETL)
- Data integration
- Can ingest data from data stores.
- Send output data to data stores.
- E.g. file shares, FTP, web services, databases, SaaS services (copy-pipeline sketch below).
- Hybrid extract-transform-load
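
A sketch of the ingest/copy step with the `azure-mgmt-datafactory` Python SDK, mirroring the blob-to-blob copy from the SDK quickstart; the subscription, resource group, factory and dataset names are all placeholders assumed to already exist:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# One copy activity: read from an input blob dataset, write to an output blob dataset.
copy = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(reference_name="RawBlobDataset")],
    outputs=[DatasetReference(reference_name="StagingBlobDataset")],
    source=BlobSource(),
    sink=BlobSink())

client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "IngestPipeline",
    PipelineResource(activities=[copy]))
```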