Published by appbalac, 2023-06-24 20:04:23

Data Engineering with Python

Keywords: Data Engineering with Python

Index

C
camelCase 118
Clojure 8
cluster
    testing, with messages 266, 267
Cluster Coordinator 315
columns
    creating 118-122
    dropping 115-118
    modifying 118-122
Comma-Separated Values (CSV)
    about 46
    building, to JSON data pipeline 55-61
    reading 46-49
    reading, with pandas DataFrames 49-51
    working, with NiFi processors 62
    writing 46
    writing, with pandas DataFrames 49-51
    writing, with Python CSV Library 46-48
counters
    using 206-210
crontab 11

D
dashboard
    creating 146-150
data
    analyzing 111-114
    backfilling 138
    cleaning, with Airflow 125-128
    downloading 104
    enriching 123-125
    exploring 104-110
    extracting, from PostgreSQL 81, 82, 97
    inserting 77-79
    inserting, into Elasticsearch 84, 85
    inserting, into PostgreSQL 75
    inserting, into staging 248
    inserting, with helpers 85-87
    processing, with PySpark 296-298
    staging 156
    staging, in databases 159-161
    staging, in files 156-159
    transforming, for Elasticsearch 135, 136
    validating 156
    validating, with Great Expectations 161
databases
    creating 240-242
    data, staging in 159, 160
    handling, with NiFi processors 96
data engineering
    about 4, 6
    tools 7
    versus data science 7
data engineering pipelines
    about 11
    building, with Apache Airflow 11, 12
    building, with Apache NiFi 14, 15
data engineering tools
    about 7
    databases 8, 9
    data pipelines 11
    data processing engines 10
    programming languages 8
data engineers
    skill and knowledge requisites 6
    tasks, performing 4-6
DataFrames
    used, for extracting data 83
data issues
    handling, with pandas 114


data lake
    populating 243, 244
    reading 245
    scanning 247, 248
data pipelines
    building 130
    building, in Apache Airflow 55, 91, 92
    building, with Kafka 275
    building, with NiFi 275
    creating, with Kafka consumer 278-281
    creating, with Kafka producer 276, 277
    deploying 232
    deploying, in production 255
    finalizing, for production 222
    middle strategy, using 234-237
    multiple registries, using 237
    running 100, 101
    simplest strategy, using 232, 233
    versioning 189-191
data science
    versus data engineering 7
data type
    mapping 130, 131
Directed Acyclic Graph (DAG)
    about 11, 30, 55, 92
    running 94, 95
distributed data pipeline
    building 322, 323
    managing 323-326
Domain-Specific Language (DSL) 10

E
Elasticsearch
    about 9
    configuring 34, 35
    data, inserting into 84, 85
    data, transforming for 135, 136
    installing 34, 35, 84
    querying 87-89
EvaluateJsonPath processor 246
Event Time 283
ExecuteSQLCommand processor
    configuring 97-99
ExecuteSQLRecord processor 249, 254
ExecuteStreamCommand 253
exploratory data analysis (EDA)
    about 104
    performing, in Python 104
Extract, Transform, and Load (ETL)
    about 4
    versus Extract, Load, and Transform (ELT) 160

F
files
    data, staging in 156-159
    handling, with NiFi processors 61
    reading, in Python 46
    writing, in Python 46
fixed window 282

G
GetFile processor 246
git-persistence
    using, with NiFi Registry 194-199
Google BigQuery 9
Great Expectations
    data validation failure 175-177
    installing 162
    NiFi, combining with 172-175
    used, for validating data 161, 162
    using 163-170
    using, outside pipeline 170, 171


Groovy 8
GUI
    used, for monitoring NiFi 201

H
Hadoop Distributed File System (HDFS) 243
helpers
    used, for inserting data 85-87

I
idempotent data pipelines
    building 178, 179
Ingest Time 283
Insert Warehouse 254
International Organization for Standardization (ISO) 5

J
Java 8
Java Database Connectivity (JDBC) 27
JavaScript Object Notation (JSON)
    about 51
    reading, with pandas DataFrame 53, 54
    working, with NiFi processors 68-72
    writing, with pandas DataFrame 53, 54
    writing, with Python 51-53
Java Virtual Machine (JVM) 8
Jolt transformations 71
JSON data pipeline
    CSV, building to 55-61
JSON Language for Transform 71
Jython 8

K
Kafka
    configuring 262-264
    consumers 273-275
    downloading 261, 262
    logs, maintaining 272
    producers 273-275
    topics 272
    URL 260
    used, for building data pipelines 275
Kafka cluster
    creating 260, 261
    starting 265
    testing 265, 266
Kafka consumer
    creating, with Python 284
    used, for creating data pipeline 278-281
    writing, in Python 286, 287
Kafka producer
    creating, with Python 284
    used, for creating data pipeline 276, 277
    writing, in Python 284-286
Kibana
    configuring 36-40
    installing 36-40
Kibana dashboard
    building 139, 140

L
logs
    about 270, 271
    maintaining, in Kafka 272


M
messages
    cluster, testing with 266, 267
messages, sending with producer
    asynchronous 273
    Fire and Forget 273
    synchronous 273
Microsoft SQL Server 8
MiNiFi
    setting up 306-308
MiNiFi task
    building, in NiFi 308-313
mogrify method 77
multiple records
    inserting 79, 80
MySQL 8

N
NiFi
    combining, with Great Expectations 172-175
    counters, using 206-210
    MiNiFi task, building in 308-313
    monitoring, with GUI 201
    monitoring, with PutSlack processor 210-213
    monitoring, with status bar 202-206
    Registry, adding to 188, 189
    Registry, using 187
    used, for building data pipelines 275
NiFi cluster
    building 315-322
NiFi clustering
    basics 315, 316
NiFi data flow
    about 21
    FlowFiles, content 26
    FlowFiles, in queue 24, 26
    GenerateFlowFile processor, configuring 23
    output 26
    processors, adding to canvas 22
NiFi processors
    CSV, working with 62
    databases, handling 96
    data, extracting from flowfile 66, 67
    file, reading with GetFile 63
    flowfile attributes, modifying 67
    flowfile, saving to disk 67
    JSON, working with 68-72
    records, filtering with QueryRecord processor 66
    records, splitting into distinct flowfiles 64, 65
    relationships, creating between processors 68
    used, for handling files 61
NiFi Registry
    configuring 184-187
    git-persistence, using with 194-199
    installing 184-186
NiFi REST API
    Python, using 214-219
    reference link 214
NiFi variable registry
    using 230-232
NoSQL database data
    extracting, from Python 83
    inserting, in Python 83


P
page
    obtaining 136, 138
pandas
    data issues, handling 114
pandas DataFrames
    used, for reading CSV 49-51
    used, for reading JSON 53, 54
    used, for writing CSV 49-51
    used, for writing JSON 53, 54
Papermill 171
pgAdmin 4
    installing 41, 42
    server, adding 42, 43
    table, creating 44
    using 240-242
pipeline
    triggering 131, 132
PostgreSQL 8
    about 74
    configuring 41
    database, creating 74, 75
    data, extracting from 81, 82, 97
    data, inserting into 75
    installing 41
    tables, creating 74, 75
PostgreSQL driver 27
Primary Node 316
Processing Time 283
processor groups
    improving 225-229
production
    data pipeline, deploying to 255
    data pipelines, finalizing for 222
production data pipeline
    building 244
production environment
    creating 240
psycopg2
    installing 76
PutElasticsearchHttp processor
    configuring 100
PutSlack processor
    used, for monitoring NiFi 210-213
PutSQL processor 249, 254
PySpark
    about 294
    configuring 295, 296
    data, processing 296-298
    installing 295, 296
Python
    exploratory data analysis, performing 104
    files, reading 46
    files, writing 46
    Kafka consumer, writing 286, 287
    Kafka producer, writing 284-286
    NoSQL database data, extracting 83
    NoSQL database data, inserting 83
    relational data, extracting 74
    relational data, inserting 74
    used, for connecting to PostgreSQL 76, 77
    used, for creating Kafka consumers 284
    used, for creating Kafka producers 284
    used, for writing JSON 51-53
    using, with NiFi REST API 214-219
Python CSV Library
    used, for writing CSV 46-48

R
records
    splitting, into distinct flowfiles 64, 65


Registry
    adding, to NiFi 188, 189
    using, in NiFi 187
regular expressions (regex) 63
relational data
    extracting, from Python 74
    inserting, in Python 74
relational databases 8
Resilient Distributed Datasets (RDDs) 10
RouteOnAttribute processor 250
rows
    dropping 115-118

S
Samza 11
Scala 8
scroll
    using 90
SeeClickFix
    querying 132-135
session window 283
sliding window 283
Spark
    about 290
    for data engineering 298-302
    installing 291-294
    running 290
SplitText processor
    configuring 99
SQLAlchemy 75
staging
    data, inserting into 248
staging data
    validating 250-253
staging database
    querying 249
status bar
    used, for monitoring NiFi 202-206
stream processing
    versus batch processing 282, 283
Structured Query Language (SQL) 8, 74

T
test environment
    creating 240
topics 272
tumbling window. See also fixed window

U
unbounded data 282
UpdateCounter processor 247

V
versioned pipeline
    editing 191-194
visualizations
    creating 140-145

W
windowing method 282

Z
Zero-Master Clustering 315, 325
Zookeeper
    configuring 262-264
    URL 261
Zookeeper cluster
    creating 260, 261
    starting 265

