What is a YAML file and where is it used in a machine learning context?

I am not entirely sure if this is on-topic here, so please let me know if it is not. I keep seeing the idea of YAML files pop up while reading machine learning literature. My question is, what exactly is a YAML file, and how does it relate to machine learning and data science projects?



Besides storing data, a YAML file is often used as a human-readable configuration file. Here is an example of a Docker Compose YAML file:

# version: "3.5"
version: '3'
# version: '2'
services:


  nodejs-admin:
    image: ${GKE_APP_IMAGE_ADMIN}
    hostname: admin
    container_name: loud_admin
    restart: always
    depends_on:
      - loudmongo
      # - loudmail
      - loudmail
    volumes:
      - /cryptdata4/var/log/loudlog-admin:/loudlog-admin
      - /cryptdata5/var/log/blobs:/blobs
      - /cryptdata5/var:/cryptdata5/var
      - /cryptdata5/var/tools:/tools
      - /cryptdata6/var/log/loudlog-enduser:/loudlog-enduser
      # - ${TMPDIR_GRAND_PARENT}/curr/loud-build/${PROJECT_ID}/webapp/admin/bundle:/tmp
      - ${TMPDIR_GRAND_PARENT}/${bundleNormal}/loud-build/${PROJECT_ID}/webapp/admin/bundle:/tmp
      # - $SOURCE_REPO_DIR/tests:/tmp/tests
    environment:
      - MONGO_SERVICE_HOST=loudmongo
      - MONGO_SERVICE_PORT=$GKE_MONGO_PORT
      - MONGO_URL=mongodb://loudmongo:$GKE_MONGO_PORT/admin
      - METEOR_SETTINGS=${METEOR_SETTINGS}
      # when sending to port 587 will NOT get routed through email content filter however port 25 will get routed
      - MAIL_URL=smtp://support@${GKE_DOMAIN_NAME}:ignore_this@loudmail:587/
      # - MAIL_URL=smtp://support@${GKE_DOMAIN_NAME}:ignore_this@loudmail:25/
      - GKE_NOTIF_TASK_OVERDUE=$GKE_NOTIF_TASK_OVERDUE
      - GKE_NOTIF_PLANS_RECUR=$GKE_NOTIF_PLANS_RECUR
      - GKE_NOTIF_SHIFTS_RUN=$GKE_NOTIF_SHIFTS_RUN
      - GKE_NOTIF_LOAD_RUN=$GKE_NOTIF_LOAD_RUN
      - GKE_NOTIF_ORGS_HEALTH_RUN=$GKE_NOTIF_ORGS_HEALTH_RUN
    links:
      - loudmongo
      # - loudmail
      - loudmail
    ports:
      - 127.0.0.1:${PORT_ADMIN}:3001
    # networks:
    #   - loudthink-network
    working_dir: /tmp
    command: /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf

  nodejs-enduser:
    # image: ${GKE_APP_REPO_PREFIX}/${PROJECT_ID}/loudweb-enduser
    image: ${GKE_APP_IMAGE_ENDUSER}
    hostname: enduser
    container_name: loud_enduser
    restart: always
    depends_on:
      - nodejs-admin
      - loudmongo
      # - loudmail
      - loudmail
    volumes:
      - /cryptdata6/var/log/loudlog-enduser:/loudlog-enduser
      - /cryptdata5/var/log/blobs:/blobs
      # - ${TMPDIR_GRAND_PARENT}/curr/loud-build/${PROJECT_ID}/webapp/enduser/bundle:/tmp
      # - ${TMPDIR_GRAND_PARENT}/curr/loud-build/${PROJECT_ID}/webapp/admin/bundle/programs/server/assets/app/config/apn-cert.pem:/private/config/apn-cert.pem
      # - ${TMPDIR_GRAND_PARENT}/curr/loud-build/${PROJECT_ID}/webapp/admin/bundle/programs/server/assets/app/config/apn-key.pem:/private/config/apn-key.pem
      - ${TMPDIR_GRAND_PARENT}/${bundleNormal}/loud-build/${PROJECT_ID}/webapp/enduser/bundle:/tmp
      - ${TMPDIR_GRAND_PARENT}/${bundleNormal}/loud-build/${PROJECT_ID}/webapp/admin/bundle/programs/server/assets/app/config/apn-cert.pem:/private/config/apn-cert.pem
      - ${TMPDIR_GRAND_PARENT}/${bundleNormal}/loud-build/${PROJECT_ID}/webapp/admin/bundle/programs/server/assets/app/config/apn-key.pem:/private/config/apn-key.pem
    environment:
      - MONGO_SERVICE_HOST=loudmongo
      - MONGO_SERVICE_PORT=$GKE_MONGO_PORT
      - MONGO_URL=mongodb://loudmongo:$GKE_MONGO_PORT/admin
      - METEOR_SETTINGS=${METEOR_SETTINGS}
      # when sending to port 587 will NOT get routed through email content filter however port 25 will get routed
      - MAIL_URL=smtp://support@${GKE_DOMAIN_NAME}:ignore@loudmail:587/
      # - MAIL_URL=smtp://support@${GKE_DOMAIN_NAME}:ignore@loudmail:25/
    links:
      - loudmongo
      # - loudmail
      - loudmail
    ports:
      - 127.0.0.1:${PORT_ENDUSER}:3000
    # networks:
    #   - loudthink-network
    working_dir: /tmp
    command: /usr/bin/supervisord -c /etc/supervisor/conf.d/supervisord.conf

  loudmongo:
    # image: mongo
    # image: mongo:3.6
    image: $GKE_MONGO_IMAGE
    hostname: mongo
    container_name: loud_mongo
    restart: always
    ports:
      - 127.0.0.1:$GKE_MONGO_PORT:$GKE_MONGO_PORT
    # logpath:
    #   - /data/logs/mongo.log
    volumes:
      - /cryptdata7/var/data/db:/data/db
      # - /cryptdata7/var/data/logs:/data/logs




  loud-devops:
    # image: dind
    image: localhost:5000/hygge/loudweb-dind/1904082202
    hostname: devops
    container_name: loud_devops
    restart: always
    ports:
      - 127.0.0.1:9000:9000
    # expose:
    #   - 9000
    # networks:
    #   - loudthink-network
    # environment:
    #   - SSH_AUTH_SOCK=$SSH_AUTH_SOCK
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /usr/bin/docker:/usr/bin/docker
      # - $SSH_AUTH_SOCK:/ssh-agent
      - /usr/lib/x86_64-linux-gnu/libltdl.so.7:/usr/lib/x86_64-linux-gnu/libltdl.so.7
      - /usr/lib/x86_64-linux-gnu/libgpm.so.2:/usr/lib/x86_64-linux-gnu/libgpm.so.2
      - /home/khepri/src/github.com/loudthink:/inner_home/khepri/src/github.com/loudthink
      - /home/khepri/.docker:/inner_home/khepri/.docker
      - /cryptdata5/var/tools/inner_home/khepri:/inner_home/khepri
      - $GKE_DIND_SUPERVISOR_LOG_DIR:/var/log/supervisor
      - /cryptdata5/var/tools/usr/local/go:/usr/local/go
      - /cryptdata6/var/log/tmp/khepri01:/cryptdata6/var/log/tmp/khepri01
      - /cryptdata6/var/log/tmp/shared:/cryptdata6/var/log/tmp/shared
      - /cryptdata5/var/tools/usr/local/bin:/usr/local/bin
      - /cryptdata:/cryptdata
      - /cryptdata2:/cryptdata2
      - /cryptdata4:/cryptdata4
      - /cryptdata5:/cryptdata5
      - /cryptdata6:/cryptdata6
      - /cryptdata7:/cryptdata7
      # - /etc/letsencrypt/live:/etc/letsencrypt/live
      # - /etc/letsencrypt/archive/medssenger-dev.medstack.net:/etc/letsencrypt/archive/medssenger-dev.medstack.net
      - /usr/local/ssl:/usr/local/ssl
      # following line HOME is only used during install as referenced in loudspeed/build/bin/setup_devops_sudo_cmds.sh to copy over .ssh files
      # - ${HOME}:${HOME}
      # - /var/lib/docker:/var/lib/docker
    command: /usr/bin/supervisord -c /etc/supervisor/supervisord.conf



  loudmail:
    image: tvial/docker-mailserver:latest
    hostname: mail
    domainname: ${GKE_DOMAIN_NAME}
    container_name: loud_mail
    restart: always
    environment:
      - PERMIT_DOCKER=network
      - SSL_TYPE=letsencrypt
      - ONE_DIR=1
      - DMS_DEBUG=1
      - SPOOF_PROTECTION=0
      - REPORT_RECIPIENT=1
      - ENABLE_SPAMASSASSIN=0
      - ENABLE_CLAMAV=0
      - ENABLE_FAIL2BAN=1
      - ENABLE_POSTGREY=0
    cap_add:
      - NET_ADMIN
      - SYS_PTRACE
    ports:
      - "25:25"
      - "587:587"
      - "465:465"
    volumes:
      - ${GKE_MAIL_EHOOK_DIR}/data/:/var/mail/
      # - ${GKE_MAIL_EHOOK_DIR}/state/:/var/mail-state/
      - ${GKE_MAIL_EHOOK_DIR}/state/:${GKE_MAIL_EHOOK_INNER_STATE_DIR}/
      - ${GKE_MAIL_EHOOK_DIR}/config/:/tmp/docker-mailserver/
      - ${GKE_MAIL_EHOOK_OUTPUT_DIR}/:${GKE_MAIL_EHOOK_INNER_OUTPUT_DIR}/
      - ${GKE_MAIL_EHOOK_HOST_DIR}/:/local_ehook/
      # - /etc/letsencrypt/:/etc/letsencrypt/
      - ${GKE_LETSENCRYPT_DIR}/:/etc/letsencrypt/
      - ${GKE_MAIL_EHOOK_DIR}/log/:/var/log/mail/

As with any configuration text file, the notion of a template is useful. For example, suppose your YAML file, whose path is stored in the shell variable GKE_COMPOSE_YAML, contains:

foobar: some_token

You can replace the hardcoded some_token with a string search-and-replace during your pre-processing stage:

sed -i -- "s|some_token|${my_current_token}|g" "$GKE_COMPOSE_YAML"

Depending on what your process uses to parse the YAML file, it may permit embedding environment variables directly in the file, as in the following snippet:

    foobar: $my_current_token

This works when your environment contains the variable my_current_token, which may have been defined using:

export my_current_token=something.cool.here
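
If the tool that reads your YAML does not expand variables itself, the same effect can be had in a small pre-processing step. The following is only a minimal sketch using Python's standard library `string.Template`; the `render` helper and the inline template text are hypothetical names made up for illustration, not part of any particular tool.

```python
from string import Template

# Hypothetical template text, mirroring the one-line snippet above.
yaml_template = "foobar: $my_current_token\n"


def render(template_text, variables):
    """Fill $name and ${name} placeholders from a mapping.

    safe_substitute leaves unknown placeholders untouched instead of
    raising, which is convenient when the file also contains variables
    that some later tool is expected to expand.
    """
    return Template(template_text).safe_substitute(variables)


rendered = render(yaml_template, {"my_current_token": "something.cool.here"})
print(rendered, end="")
```

To use real environment variables, pass `os.environ` as the mapping instead of the literal dictionary.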

There are numerous ways to serialise data: JSON, XML, and CSV are common approaches, and YAML is another. YAML is human-readable, meaning the file contains plain text that you can read and edit in any text editor. This contrasts with binary serialisation: when you open a binary file, you cannot read what is written, because the data is stored in a binary format rather than a human-readable one. YAML is mostly used for configuration files. In an ML context, you may encounter different file formats depending on your task. For instance, CSV is one of the most widely used formats for supervised ML tasks, though others use their own preferred encodings. You can also define your own serialisation format; this is possible for in-company use, but it is rare.
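
To make the text-versus-binary distinction concrete, here is a rough sketch that serialises the same settings both ways. It uses JSON as the human-readable format only because Python's standard library has no YAML parser; the `config` dictionary and its hyperparameter names are invented for illustration.

```python
import json
import pickle

# Hypothetical hyperparameters for a training run.
config = {"learning_rate": 0.001, "batch_size": 32, "layers": [128, 64]}

# Text serialisation: the result is a str you can read and edit by hand.
as_text = json.dumps(config, indent=2)

# Binary serialisation: the result is bytes, not meant for human eyes.
as_binary = pickle.dumps(config)

print(as_text)
print(as_binary[:20])

# Both forms round-trip back to the original dictionary.
assert json.loads(as_text) == config
assert pickle.loads(as_binary) == config
```

The text form is what makes formats like YAML attractive for configuration: you can review and change the values without writing any code.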
