Posts in the Big Data category

A Brief Summary of Iceberg

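Branches and tags; schema, partition, and sort-order evolution; snapshot maintenance (compaction, removing orphan files, etc.); the two-level layout of manifest lists and manifest files. Problems with Hive and how Iceberg improves on them: file listing is O(1), fine-grained partitioning, OCC, concurrency conflicts, single-node planning. Hidden partitioning, time travel, version rollback. Supported engines: Spark, Flink, Hive, Trino, ClickHouse, Presto, Dremio, StarRocks, Athena, EMR, Impala, Dori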

Read more

Analyzing and Comparing Lakehouse Storage Systems

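Discusses what makes lakehouse systems hard to design: adding transactional guarantees on top of immutable, high-latency object storage. All three systems (Delta Lake, Hudi, and Iceberg) use OCC for isolation and implement transactions with MVCC. For metadata management, Delta Lake and Hudi use a tabular format, while Iceberg uses a hierarchical layout (processed on a single node). For data updates, all three support CoW (suited to read-heavy, write-light workloads), and Hudi and Iceberg also support MoR (suited to write-heavy workloads).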

Read more

Doris Advanced

Pipeline Execution Engine, Nereids (the Brand-New Planner), High-Concurrency Point Query, Materialized View, Statistics, Join Optimization, Multi-Catalog, Spark Doris Connector, Other Connector, Plugin Development Manual, CloudCanal Data Import, DBT Doris Adapter, UDF, Cluster Management, Data Admin, Other Manager, Maintenance and Monitor, Metadata Operations and Maintenance

Read more

Doris Basic

An introduction to Doris, including: data models (Aggregate Model, Unique Model, Duplicate Model), data partitioning (Rollup), and indexes (Inverted Index, BloomFilter Index, NGram BloomFilter Index, Bitmap Index). Also covers import scenarios, import methods (Broker Load, Routine Load, Spark Load, Stream Load, MySQL Load, S3 Load, Insert Into, importing data in JSON format, Min Load Replica Num), export, and update and delete

Read more

Impala Tuning Summary

Impala tuning and architecture. Tuning: joins, statistics, caching, coordinators, web UI. Admission control, administration configuration, security. SQL statements and data types, built-in functions, UDFs, EXPLAIN commands, file formats, supported tables and storage

Read more

The History of Big Data

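From Google's three papers to the birth of Hadoop, followed by a succession of open-source projects; Hive's usability improvements over MapReduce; the three major Hadoop vendors; Google's three newer papers, which gave rise to interactive query engines (released by the three vendors) and various open-source storage formats; the emergence of Spark and various stream-processing systems; Netflix also demonstrating the power of the cloud; unified stream and batch processing and various distributed scheduling systems; the arrival of cloud-based data warehouse products; replacing HDFS by moving to the cloud, the rise of containerization, fully managed data warehouses and the Modern Data Stack, and the impact of deep learning on Hadoop; the acquisition of the three major vendors; the emergence of the three open table formats; several metadata management products; several new scheduling frameworks; and the emergence of the lakehouse and similar cloud products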

Read more