cloud

1. cloud

1.1. Google Cloud

1.1.1. Assorted notes 2021-03-17

Pub/Sub acts as glue between systems (~Kafka)
Apache Beam, Dataflow (Transformer) Spark, Flink
BigQuery, BigTable, CloudSQL, AI Platform
(MongoDB, InfluxDB)
Streaming/Batch Pipeline
Dataflow SQL → write SQL joins between Pub/Sub streams and SQL tables from elsewhere
https://cloud.google.com/dataflow/docs/guides/sql/dataflow-sql-intro
Dataproc → Hadoop + Spark (can also run on Kubernetes, in beta)
Data Fusion: integrations between everything, visual editor
(under the hood, a Dataproc cluster with Spark) (CDAP open source)
Data Catalog centralizes metadata
Data lineage → where does each piece of data come from?
Apache Airflow (build graphs to connect everything)

1.1.1.1. Cache BigQuery

The mapping is not one-to-one

Very Hot (~1 GB) → BI Engine (under 1 s, compressed)
Hot (~10 TB) → Materialized views
Warm (~100 TB) → Partitioning and clustering
Cold (~1 PB) → Stateless compute workers

BQ Fair scheduler

1.1.1.2. Joins

Broadcast (lightweight) and Shuffle (heavy)
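
As a mental model (a sketch, not BigQuery's actual implementation): a broadcast join ships the small table whole to every worker, while a shuffle join hash-partitions both sides on the join key so that matching rows meet on the same worker.

```python
# Toy model of the two join strategies; all names are illustrative.
N_WORKERS = 3
big = [("a", 1), ("b", 2), ("a", 3)]      # large fact table
small = {"a": "x", "b": "y"}              # small dimension table

# Broadcast (lightweight): copy the whole small table to each worker;
# the big table's rows never move.
broadcast = {w: small for w in range(N_WORKERS)}

# Shuffle (heavy): hash-partition BOTH sides on the join key so rows
# with the same key land on the same worker before joining.
def worker_for(key):
    return hash(key) % N_WORKERS

shuffled_big = {}
for key, val in big:
    shuffled_big.setdefault(worker_for(key), []).append((key, val))
```

Broadcast avoids moving the big table at all, which is why it is cheap when one side is small; shuffle moves everything, which is why it is heavy.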

1.1.1.3. Query Analysis

How long does each stage take ⇒ BQ Visualizer

1.1.3. Google Cloud is not stable

https://link.medium.com/6KbbLYIiqib

Google is telling anybody who will listen that the APIs and functionalities built into its Google Cloud platform will remain stable over time and will not fall victim to arbitrary decisions by the company, a fiction designed to avoid discussion of the company’s longstanding disregard for its users, which has led it, over the years, to ruthlessly eliminate countless services that had large user bases.

A culture of contempt for the user that goes far beyond simply removing or adding products, and that affects features, pricing policies and many other things, including SEO: Google is simply a company that I would never recommend anyone depend on, or if you do, make sure you have a full backup, which is the last thing you need in a cloud computing provider.

1.1.5. Installing binary dependencies in Google Colab

1.1.6. Cloud Functions are very limited

Make sure it will stay within the quotas:
https://cloud.google.com/functions/quotas
A common pattern to work around them:

  1. Call the function with no arguments; this call acts as the orchestrator
  2. That function then calls itself N times, with arguments
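
The two steps above can be sketched like this (all names are hypothetical; a real deployment would register the entry point with functions_framework and fan out with parallel HTTP calls to the function's own URL instead of a direct recursive call):

```python
# Fan-out pattern sketch for quota-limited Cloud Functions.
N_CHUNKS = 4  # chosen so each chunk fits within the per-invocation quotas

def handle(payload=None):
    if payload is None:
        # Orchestrator invocation: no arguments, so split the work and
        # re-invoke this same function once per chunk (here: directly;
        # in reality: N parallel HTTP requests).
        return [handle({"chunk": i}) for i in range(N_CHUNKS)]
    # Worker invocation: process one chunk within the time/memory limits.
    return f"processed chunk {payload['chunk']}"
```

Each worker invocation gets its own quota budget, which is the whole point of the pattern.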

1.1.8. Custom gcloud filter and format

These are two flags shared by many gcloud commands:
https://cloud.google.com/sdk/gcloud/reference/topic/filters
https://cloud.google.com/sdk/gcloud/reference/sql/operations/list → with --format=flattened it prints every key you might need for --filter or --format=value(name1, name2)
https://cloud.google.com/sdk/gcloud/reference/topic/resource-keys
https://cloud.google.com/sdk/gcloud/reference/topic/formats

1.1.9. Getting IP addresses for SQL and Compute (VMs)

gcloud compute instances list --project=gcloud-project-name --filter=name:vm-instance-name --format="csv[no-heading](INTERNAL_IP)"
gcloud compute instances list --project=gcloud-project-name --filter=name:vm-instance-name --format="csv[no-heading](EXTERNAL_IP)"

gcloud sql instances list --project=gcloud-project-name --filter=name:sql-instance-name --format="csv[no-heading](PRIVATE_ADDRESS)"
gcloud sql instances list --project=gcloud-project-name --filter=name:sql-instance-name --format="csv[no-heading](PRIMARY_ADDRESS)"

1.1.10. Getting logs out of gcloud logging properly

res=$(gcloud logging read --project gcloud-project-name --freshness=10d --format=json 'resource.type="k8s_container"
resource.labels.project_id="gcloud-project-name"
resource.labels.location="europe-west1-b"
resource.labels.cluster_name="standard-cluster-1"
resource.labels.namespace_name="k8s-namespace"
labels.k8s-pod/job-name:"cronjob-daily-" severity>=DEFAULT'); printf '%s' "$res" | jq '.[] | .timestamp + " " + .textPayload' -r | tac > log-k8s-namespace-daily1.txt

# With streaming:
gcloud logging read --project gcloud-project-name --freshness=10d --format=json 'resource.type="k8s_container"
resource.labels.project_id="gcloud-project-name"
resource.labels.location="europe-west1"
resource.labels.cluster_name="cluster-1"
resource.labels.namespace_name="k8s-namespace"
labels.k8s-pod/job-name:"cronjob-daily-" severity>=DEFAULT' | jq --stream 'flatten | select(.[] == "textPayload" or .[] == "timestamp") | {(.[1]) : .[2]} | to_entries | map(select(.value != null)) | select(. != []) | from_entries' | jq --stream '[., input, input, input, input, input, input, input] | flatten | map(select(. != "textPayload" and . != "timestamp")) | .[1] + " " + .[0]' -r | less
# Not sure the timestamps actually match up in the streaming version, but oh well. I think it gets confused by multi-line logs
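
The non-streaming jq pipeline above just pairs each entry's timestamp with its textPayload and reverses the order. For comparison, a Python sketch over made-up sample entries (the log lines here are invented):

```python
import json

# Hypothetical sample of what `gcloud logging read --format=json` returns
# (newest entry first), trimmed to the fields we care about.
raw = """[
  {"timestamp": "2021-03-17T10:00:01Z", "textPayload": "second line"},
  {"timestamp": "2021-03-17T10:00:00Z", "textPayload": "first line"},
  {"timestamp": "2021-03-17T09:59:59Z", "jsonPayload": {"msg": "skipped"}}
]"""

# Keep only entries that have a textPayload, join timestamp + text,
# then reverse so the oldest line comes first (like piping through tac).
lines = [f'{e["timestamp"]} {e["textPayload"]}'
         for e in json.loads(raw) if "textPayload" in e]
lines.reverse()
print("\n".join(lines))
```

Unlike the jq one-liner, this loads the whole JSON array into memory, which is exactly the problem the `--stream` variant tries to avoid.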

1.1.12. Use FireBase⚾, never FireStore🧯

1.2. Cloudera

  • It is optimized for hybrid and multi-cloud environments, delivering the same data management capabilities across bare metal, private, and public clouds
  • It allows multiple analytic functions to work together on the same data at its source, eliminating costly and inefficient data silos
  • It maintains strict enterprise data security, governance, and control across all environments
  • It is 100 percent open source, with open compute and open storage, ensuring zero vendor lock-in and maximum interoperability

1.3. List of Apache Software Foundation projects - Wikipedia

The ones I know or have heard of:

  • Airflow: Python-based platform to programmatically author, schedule and monitor workflows
  • Avro: a row-oriented remote procedure call and data serialization framework developed within Apache’s Hadoop project.
  • Beam: an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing.
    Beam pipelines are defined using one of the provided SDKs and executed in one of Beam’s supported runners (distributed processing back-ends), including Apache Flink, Apache Samza, Apache Spark, and Google Cloud Dataflow
  • Cassandra: a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low latency operations for all clients.
  • CouchDB: an open-source document-oriented NoSQL database, implemented in Erlang. It uses JSON to store data, JavaScript as its query language using MapReduce, and HTTP for an API.
  • Drill: an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets
  • Flink: an open-source, unified stream-processing and batch-processing framework. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink provides a high-throughput, low-latency streaming engine as well as support for event-time processing and state management.
  • Hadoop: the core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, which is a MapReduce programming model.
  • Helix: one of the several notable open source projects developed by LinkedIn. It is a stable cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of systems
  • Hive: a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
  • Ignite: a distributed key-value in-memory (disk optional) database management system for high-performance computing.
    The database component distributes key-value pairs across the cluster in such a way that every node owns a portion of the overall data set. Data is rebalanced automatically whenever a node is added to or removed from the cluster.
  • Impala: an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012
  • Kafka: a distributed event store and stream-processing platform, written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a “message set” abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This “leads to larger network packets, larger sequential disk operations, contiguous memory blocks […] which allows Kafka to turn a bursty stream of random message writes into linear writes.”
  • Parquet: a column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop (such as pandas). It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
  • Spark: an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance.

1.4. Tech company Cloudflare launches its own cloud, a ‘low cost’ service that aims to become the world’s fourth cloud platform

1.7. Sites for deploying serverless web apps

1.7.1. Glitch

https://glitch.com/
Build fast, full-stack web apps in your browser for free

1.7.2. Streamlit — The fastest way to build custom ML tools

1.7.3. Cloud Application Platform | Heroku

1.7.5. Alternatives

Some alternatives:

  • Darklang is still free, if you’re into learning a new functional programming language and way of testing and deploying stuff.
  • There’s also Fly.io which has a “trial” tier that seems decent.
  • Railway has a pretty good looking free plan (more memory than some of the other options at least).
  • seems to be entirely free – I just had a browse around the main page and couldn’t figure out what the catch is, other than it’s limited to Python and Node.
  • https://www.youtube.com/watch?v=prjMJtXCR-g

1.8. Using FUSE volumes with Azure/Google Cloud/AWS

FUSE needs to run as a privileged user in Docker, which is not always possible.
In Kubernetes, for example, containers are usually unprivileged. /dev/fuse must exist.
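
A minimal sketch of what that means for a pod spec (the image name is hypothetical; privileged: true is the blunt instrument and many clusters forbid it by policy):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fuse-example
spec:
  containers:
    - name: app
      image: my-image-with-gcsfuse   # hypothetical image bundling a FUSE client
      securityContext:
        privileged: true             # simplest way to get access to /dev/fuse
        # A narrower alternative that sometimes suffices:
        # capabilities:
        #   add: ["SYS_ADMIN"]
```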

Author: Julian Lopez Carballal

Created: 2024-09-16 Mon 04:59