If you have worked on a Big Data project, you have probably seen, and maybe used, some shell scripts. Honestly, I love hearing “The future is now” while talking about a bunch of scripts scheduled by Oozie, but it seems like we couldn’t create a data project in 2018 without some lets-run-it.sh file. For the last 7 years I have seen many people writing x-SH scripts for various reasons, but the main reason today (at least on Big Data projects) is Oozie and its shell action. Let’s be clear: I don’t have any problem with Bash/KSH or whatever SH you want, I just think that most of the time they aren’t the best tool for what you are trying to do. Shell scripts always start small but quickly grow as the project progresses, and end up as a huge unmaintainable mess. I really like this quote from Herbert Mayer:
No programming language is perfect. There is not even a single best language; there are only languages well suited or perhaps poorly suited for particular purposes.
Good code is good, and bad code is bad – whether it is written as a Bash script, a Windows CMD file, in Python, Ruby, Perl, Basic… Let’s see why you should stop writing shell scripts (at least when you can).
Although you can find some testing frameworks for Bash, like the excellent Bats, there is no simple way to verify your scripts and mock the behavior of commands like hdfs, hbase or oozie. Sure, you can try, good luck with that.
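To give an idea of what "trying" looks like, here is a minimal sketch of the usual trick: shadowing the real binary with a shell function of the same name. The function path_exists, the variable HDFS_CALLS and the paths are hypothetical names made up for this illustration; only hdfs dfs -test -e is the real command being faked.

```shell
#!/usr/bin/env bash
# Hypothetical function under test: succeeds when the HDFS path exists.
path_exists() {
  hdfs dfs -test -e "$1"
}

# Mock: a shell function named hdfs shadows the real binary in this shell.
# It records its arguments and pretends that only /data/in exists.
hdfs() {
  HDFS_CALLS="${HDFS_CALLS:-}$*;"
  [ "$3" = "-e" ] && [ "$4" = "/data/in" ]
}

if path_exists /data/in; then
  echo "found /data/in"
fi
if ! path_exists /data/missing; then
  echo "missing /data/missing"
fi
echo "calls: ${HDFS_CALLS}"
```

The catch is that the fake only lives in the current shell: a script started as a separate process will call the real hdfs unless the function is exported (export -f in Bash) or a fake executable is put first on the PATH. That is a lot of plumbing for every command you want to mock.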
Most shell scripts are poorly written, and things can quickly get rather wild. It is not due to the language, but rather to the main use of a scripting language like Bash or KSH: writing glue code that gets the stuff done. And as quickly as possible. So, sadly, you end up maintaining scripts without functions, not even indented, with few or no comments, and using huge multiple-overkill-commands-chained-in-many-pipes-doing-all-the-stuff.
If you are not a solid shell user, have a look at shell guidelines, like the ones from Dan Fruehauf or Brian Clements, and you’ll get 90% of the job done in terms of maintainability and readability. None of this will prevent you from making stupid mistakes, but at least you will be able to find them more quickly.
Logging in sh is not very difficult: you can find many gists providing examples of basic logging commands for your shell scripts. The real problem is trusting the commands inside your script: many Hadoop CLIs print log entries on stdout, where you were expecting the results of your hdfs/hbase/oozie command. We all know how this ends: grepping stdout in order to filter out the noise.
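For the part you do control, a basic logging command can look like the sketch below. The names log and count_lines are hypothetical, and the only idea shown is to send log lines to stderr so that stdout carries nothing but the result. It does nothing about third-party commands that log on stdout.

```shell
#!/usr/bin/env bash
# log: hypothetical helper that writes timestamped messages to stderr,
# so that stdout only carries the data a caller wants to capture.
log() {
  local level="$1"
  shift
  printf '%s [%s] %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$level" "$*" >&2
}

# count_lines: hypothetical function whose only stdout output is its result.
count_lines() {
  log INFO "counting lines"
  printf 'a\nb\nc\n' | wc -l | tr -d ' '
}

# Command substitution captures stdout only; the log line still goes to stderr.
result="$(count_lines)"
echo "result=${result}"
```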
Bash, KSH and the like do not exist on every Unix system. If you want to maintain any sort of portability, Bash is not the best choice. There are indeed advantages to using Bash over plain sh. Plain sh is a very minimalistic programming language. Bash has features that make programming more convenient and closer to other modern programming languages, like scoped local variables and arrays. So, are you really ready to write more lines of code because of the limited syntax and features of sh? Another caveat you should consider: different operating systems may have different versions of the same shell commands.
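As a small illustration of the two features mentioned above, here is a sketch using an array and function-scoped variables. The function join_paths and the paths are hypothetical; the point is only that this style relies on Bash and goes beyond what POSIX sh specifies.

```shell
#!/usr/bin/env bash
# Relies on Bash features (arrays, local) that POSIX sh does not specify.
join_paths() {
  local separator="$1"   # "local" keeps the variable scoped to the function
  shift
  local joined=""
  local path
  for path in "$@"; do
    # Prepend the separator only once something has already been joined.
    joined="${joined:+${joined}${separator}}${path}"
  done
  printf '%s\n' "$joined"
}

inputs=("/data/raw" "/data/clean" "/data/agg")   # hypothetical paths in an array
separator="unchanged"                            # global, not touched by the function

join_paths "," "${inputs[@]}"
echo "count=${#inputs[@]} separator=${separator}"
```

Writing the same thing in plain sh means giving up the array and being careful about variable names leaking between functions, which is exactly the extra code the question above is about.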
5. Performance and security
Straight from the Linux Documentation Project, when not to use shell scripts:
- Resource-intensive tasks, especially where speed is a factor
- Mission-critical applications upon which you are betting the future of the company
- Situations where security is important, where you need to guarantee the integrity of your system and protect against intrusion, cracking, and vandalism
- Extensive file operations required (Bash is limited to serial file access, and that only in a particularly clumsy and inefficient line-by-line fashion.)
A good knowledge of shell scripting is essential to anyone wishing to become proficient as a sysadmin or devops engineer, or just for your own culture. But if you can avoid it in your next Big Data project, especially if you are using Oozie, just do it:
- most shell actions used in Oozie can be translated into a Java or Hive action. It’s best to use a built-in action type instead of the Shell Action. For instance, if you want to run a Hive script, use the Hive action rather than a Shell Action running Hive.
- if you need to execute some Python code, consider wrapping it inside a dedicated action. Oozie doesn’t actually provide a Python action, but you can easily customize it by adding your own implementation of
- if you have at least two hdfs commands in your script, simple Java code will be cleaner and shorter than a script that stays readable, correctly handles the exit codes of each command and logs things properly
And if you really, really, can’t avoid a shell script:
- Use :- (as in ${VAR:-}) if you want to test variables that could be undeclared
- Use set -o pipefail in scripts so that a pipeline returns the exit status of the last command that exited with a non-zero code
- Use set -o nounset to force an exit when trying to use an uninitialized variable
- Use set -o errexit to force an exit if any program fails
- Use long options in order to improve readability
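Put together, the options above can be sketched as follows. The variable TARGET_DIR and its default value are hypothetical and only there to show the :- expansion; the rest is standard Bash behavior.

```shell
#!/usr/bin/env bash
set -o errexit   # exit as soon as a command fails
set -o nounset   # exit when an undeclared variable is used
set -o pipefail  # a pipeline fails if any command in it fails

# ${TARGET_DIR:-} expands to an empty string when TARGET_DIR is undeclared,
# so this test does not trip nounset.
if [ -z "${TARGET_DIR:-}" ]; then
  TARGET_DIR="/tmp/example"   # hypothetical default
fi

# With pipefail, the pipeline status reflects the failing command (false),
# not the last one (cat). "|| status=$?" keeps errexit from aborting here.
status=0
false | cat || status=$?
echo "target=${TARGET_DIR} pipeline_status=${status}"
```

Without pipefail the pipeline would report the status of cat, which is 0, and the failure would go unnoticed.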
Just remember not to push the limits of shell scripts. Scripts are supposed to be glue code. If you catch yourself doing something tricky, or needing obscure features, chances are you’re better off moving to a more powerful language like Java or Python.