Friday, July 28, 2017

Learning Fortran is a waste of your time

Suppose you are an advanced student of physics, and you're just beginning your first research experience with a new professor.  You're excited about the opportunities this means for your future, fascinated by the implications of the research, and anxious to please.  Your professor tells you you need to run a computer simulation, and this is also exciting; you've never used a computer as a computer before -- usually just as an internet browser.  To get you started, he points you to a reference text with some sample code, or sends you one he has on his hard drive somewhere.

This code will be written in Fortran.  They all are.  All legacy codes are Fortran.

So what is Fortran?  Fortran is a high-level programming language designed back in the 50's.  Here "high-level" means that it is not assembly language, and the programmer does not directly interact with machine elements like bits, bytes, or memory addresses, though Fortran is much "lower-level" than most modern languages like Java or Python, meaning it is only just a step or two above machine code.  The name "Fortran" is short for "Formula Translation," as the language was intended to more directly translate a mathematical formula into computer code; Fortran allowed, for instance, an entire mathematical expression to be written out as a single line of code, as opposed to assembly, which would require multiple lines of register swapping and simple operations to achieve the same result.

You'll get the sample code and look at it, and it will be completely incomprehensible.  They all are.  All Fortran codes are incomprehensible.

Why is this?  There are a lot of things that contribute to the issue.



[Figure: Actual Fortran legacy code I received once.  All four of the comments are in French.  I dare you to figure out what it means.]

One is the obvious issue that this code on your screen was written by a graduate student at 4 AM who was just trying to get it to work.  He didn't care about whether it was organized -- and absolutely didn't care about adding comments -- but just whether it produced the numbers he wanted.  Once it got numbers, he never touched it again.

The other issue is that Fortran is a terrible programming language whose actual implementation requires the use of spaghetti code, and it would be a waste of your time to learn how to program in it, beyond being able to take a piece of Fortran and rewrite it in a real programming language (and even then, software can mostly do that for you).

So here is the quandary.  The sample code is impossible to read.  The purpose of the sample code is to demonstrate how some numerical calculation might be performed.  And it probably would demonstrate that if it weren't in Fortran.  But it is in Fortran, and thus fails at its single job.

To get out of this quandary, I would like to offer two pieces of advice:

  1. To physics students: don't waste your time learning Fortran.
  2. To physics professors: stop coding in this outdated language, or stop requiring it of your students.

But, you're thinking, what's so bad about Fortran?

The central problem with Fortran is that it either requires or encourages you to program using syntax and style that is universally accepted as bad.

The most obvious fault is the over-reliance that Fortran has on GOTO statements.  There has been a fierce debate in computer science circles since at least the late 1960s about GOTO statements: one side says that they should never be used at all, the other that they are generally harmful and dangerous but might sometimes be acceptable in two or three special circumstances.  Neither side of this debate proposes using GOTO as the primary (or in some cases only) way to control the flow of the program.  Many modern languages -- such as Java or Python -- actually excise GOTO statements from the language entirely to stop their use.  Fortran, however, was very late in adopting control structures, and so most Fortran code relies on this hallmark of unreadable code.

A GOTO is a statement in a programming language that causes the control of the program to jump to some other part of the program.  In assembly programming, GOTO and IF are really all you have, and you can implement any calculation with just these two and cleverness.  But the need for cleverness is the problem.

Very complicated calculations can be done entirely with IF and GOTO, but these very complicated calculations require the flow of the program to jump back and forth across the page.  Figuring out when a statement gets executed and what the state of the computer is at that moment becomes intractable -- it becomes what is commonly called "spaghetti code" -- code with the same clean logical structure as a plate of spaghetti [*].

The majority of Fortran legacy code relies largely or entirely on GOTO statements.  Fortran has long had the very modest DO loop, but the classic DO really isn't that much better, being just a slightly polished GOTO -- it still relies on the same unclear syntax as GOTO, making use of labels and jumps, and often still requiring a GOTO (or a dozen GOTOs) to end the loop.  Block IF did not arrive until Fortran 77, and END DO and DO WHILE were not standardized until Fortran 90.  Due to the late adoption of even these modest control structures, the Fortran programming community adopted this mandatory specification of spaghetti code, which proliferated widely in the majority of programs written in Fortran -- even those written after structured alternatives became available.
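
To make the contrast concrete, here is a minimal sketch in C++ (not Fortran) of the same small calculation written twice: once with labels and jumps in the style described above, and once with a structured loop.  The function names are made up for this illustration, and the jump-based version is only an imitation of the legacy style, not a translation of any real legacy code.

```cpp
// Sum the integers 1..n twice: once with label-and-jump control flow in the
// style of legacy code, once with a structured loop.

// Jump-based version: the reader has to trace the labels to find the loop.
long sum_with_goto(int n) {
    long total = 0;
    int i = 1;
top:
    if (i > n) goto done;   // exit test hidden in the middle of the jumps
    total += i;
    ++i;
    goto top;               // jump backwards to repeat
done:
    return total;
}

// Structured version: start, stop, and step are all visible on one line.
long sum_with_loop(int n) {
    long total = 0;
    for (int i = 1; i <= n; ++i) {
        total += i;
    }
    return total;
}
```

Both functions return the same value for any n; the only difference is how much work it takes a reader to see that.  With one loop the jump version is still followable, but the effort grows quickly once several loops and branches share the same set of labels.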

But that isn't the only problem with Fortran.

Fortran also makes use of implicit variable declaration.  Languages like assembly, C, or Java, require you to state explicitly the name of each variable and its type (integer, floating point, etc).  Fortran doesn't require this.  Need a new variable?  Just write what you want it to be called, and start using it!  While this seems like a time-saving blessing at first, it becomes a logical nightmare, especially when coupled with the scoping problem (more on that later).

One problem with this is: what should the type of the variable be?  Fortran is required to guess.  Some higher-level languages with implicit declaration also use implicit typing -- the type of the variable is how you use it.  Fortran does not.  Instead, for Fortran, the type of a variable is determined entirely by what you call it.  The convention is (usually) that names beginning with letters A-H or O-Z are implicitly floating point, while names beginning with letters I-N are implicitly integers.  Why those?  I guess because math formulas tend to use i,j,k as indices and n, m to hold integers.  If you see an i in a math formula, you assume it's an integer, so the computer should do the same.  But what if you don't follow this convention?  What if you want to talk about the Mass of an object, or the current (I, J) in a wire?  Obviously you have to either change the name, or explicitly declare type in contradiction to the convention.

Contradicting the convention for only a handful of variables can be confusing -- if you see Mass in the code but not its declaration much earlier, you would be justified in assuming this is also an integer.

But following the convention by changing the name doesn't solve the main problem -- unless you follow an entire program and memorize every single variable name, you can never know whether any given instance of a variable (say A1 or A2 -- two very popular Fortran variable names as it happens) is the first time it is used, or whether it has a previous value.

The much worse problem is: what if you give your variable the name AMASS, expecting the usual convention to hold and this to be a floating point, but for whatever reason the compiler does not recognize the convention (say a hidden flag is set, or you're using a different computer with a different compiler that has a slightly different convention).  Later division or multiplication could be integer operations, which will lose precision.  This will introduce bugs which will be very difficult to track, as you will always assume that AMASS is a floating point.

The way around this problem, which is commonly recommended in computational physics classes, is to explicitly declare the type of every variable.  This way, you will never have to worry about compiler quirks, you'll always know the first instance of a variable, and you'll be able to identify its type simply by finding its first instance.
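
The precision loss described above is easy to reproduce in any language with integer division.  As a small hedged sketch in C++ (the function names are hypothetical and chosen only for this example), here is the same formula applied to a mass stored as an integer and to a mass explicitly declared as floating point:

```cpp
// The same "half the mass" formula with two different storage types.

// Mass stored as an integer: mass / 2 is integer division, so the
// fractional part is discarded before the result is converted to double.
double half_mass_integer(int mass) {
    return mass / 2;
}

// Mass explicitly declared as double: the division is floating point
// and the fractional part survives.
double half_mass_double(double mass) {
    return mass / 2;
}
```

Called with a mass of 7, the first function gives 3 and the second gives 3.5.  The formula is identical in both; only the declared type differs, which is exactly why having the declaration written out where the reader can find it matters.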

However, the fact that Fortran allows implicit declaration means that lazy physicists typically use implicit programming, and so most Fortran code uses entirely implicit type declaration, unless a name that breaks convention is really desired.

Further, Fortran encourages programmers to ignore the very important issue of scope.  Scope defines what variables are accessible to a given section of code.  While it is possible to limit scope, it is also possible to unlimit the scope, allowing any variable anywhere in the program to effectively be within global scope.  This floods the namespace, ruins encapsulation, and creates more spaghetti code and possibly intractable bugs since it is not obvious which parts of the program might have edited your variable.

Encapsulation is the principle in programming that segments of code should be separated and self-contained, and hidden from other parts of the program.  Each segment does what it is supposed to do and knows what it needs to do its job, but the other segments of code don't need to know anything about what happens "under the hood", and only about how to interact with the segment.  If you have two functions, say calculateA() and calculateB(), then calculateB() doesn't need to be able to see all the local variables and logic inside of calculateA().  All calculateB() needs to know is what parameters to pass to calculateA() for it to work, (for instance, two doubles) and what the output will be (for instance, an integer).

The simplest use of encapsulation is somewhat "automatic".  Variables defined in a function(/subroutine/method) are local to that function.  They only have space in memory while that function is running, their value is only accessible by that function, and they cease to exist or hold a value when the function stops calculating.  Further, functions can usually only access local variables, or global variables.  Your function calculateA() cannot see the variable A inside of your main program, and your main program can't see the local variable A in calculateA() either.  This is how programming languages normally work.

However, Fortran includes a way to subvert this, by use of COMMON blocks.  COMMON blocks allow for potentially every single variable to be exposed to global scope.  While this simplifies the practice of writing the code since you don't have to figure out how to get your variables into the subroutine, it vastly complicates the process of reading the code, since any given variable used in a subroutine might be defined in a completely different (and unstated) part of the program.

Further, consider how this compounds with the problem of implicit declaration.  When a reader sees a new variable suddenly appear in the code this could be:

  1. a variable from the COMMON block being used outside of scope in the subroutine
  2. a new variable being implicitly defined.

The only way to really tell for sure is to read the entire program (with all included files) and make note of every variable name.

The main issue, however, is the bugs this can cause.  Functions are able to "silently" change variables that are not in global scope.  In the main program, you may have a variable C whose value you need to know.  If calculateB() includes code that accesses C out of scope and changes its value (perhaps errantly) it will not be obvious where the change is coming from.  It won't be obvious that calculateB() interacts with C at all.  Only after combing through all of the subroutines in the program will the errant line in calculateB() be discovered.
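
The same kind of bug can be imitated in C++ with a global variable, which I use here as a loose stand-in for a COMMON block entry (an analogy only, not how COMMON actually works under the hood).  The names calculateB_hidden and calculateB_explicit are made up for this sketch:

```cpp
// A global variable stands in for a variable shared through a COMMON block.
double C = 1.0;

// Nothing in this signature says that C is read or changed, yet it is both.
double calculateB_hidden(double x) {
    double result = C * x;
    C = 0.0;               // silent side effect on shared state
    return result;
}

// Encapsulated alternative: everything the function uses arrives through
// its parameters, and it changes nothing outside itself.
double calculateB_explicit(double x, double c) {
    return c * x;
}
```

After one call to calculateB_hidden(), the value of C in the rest of the program has changed, and the call site gives no hint of it.  With calculateB_explicit(), the caller can see at a glance that C is involved, and C keeps its value.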

There are of course other small complaints.  The lack of braces like {} makes it less clear how blocks of code are separated -- instead, DO and IF blocks are set apart by keywords (like ENDIF) and labels (also used for GOTOs), which are not nearly as obvious.  Fortran does not use explicit and aesthetically pleasing comparison operators like <,>,==,!=, &, etc., and instead opts for more obscure and inelegant named operators that are combinations of periods and letters like .GT. and .LT.  The language has been through so many changes with lots of archaic deadweight legacy syntax that isn't always consistent with modern implementations.  The common use of ALL CAPITAL LETTERS FOR EVERY SINGLE THING IN THE PROGRAM kind of strains the eyes and is less readable than lowercase or mixed case.  While objects exist in later versions of the language, I've never encountered them in codes I've seen, leading me to suspect their use isn't very widespread, even though objects offer a very powerful tool for parallelization and encapsulation.

All of this makes Fortran a bad language to read, to use, and especially to learn.  You will learn universally scorned practices and style and spend days at a time tangled in the densest spaghetti code trying to figure out what variables came from where and when and why this part is here and there and what happens next.

But this isn't even the primary problem with Fortran.  Sure, it's ugly, but at the end of the day it works.  It will run and produce the results.  You can always get a style guide and follow it and not fall prey to the terrible convention that dominates the Fortran community.

The primary problem with Fortran -- what makes it truly a waste of time -- is that Fortran is a language made by physicists, for physicists, used by physicists, and used by precisely no one else on the planet but physicists.

Think about this.

You are learning a new skill.  Programming, like reading or mathematics, goes well beyond any particular applications.  The same programming skills used to evolve stellar dust clouds or calculate the Ising model work just as well in building iPhone apps or analyzing the stock market or managing web servers.  Learning a programming language opens up the entirety of the computer to you, and with the ability to program you gain the potential to do anything with your computer.

You will eventually graduate as a physicist.  Either you're getting your BS and going into industry, or you went through graduate school and got your PhD.  It is very unlikely that you will work in academia, even if you get your PhD.  Every year there are hundreds more PhD graduates in physics than the handful of professor positions that open up; the job market just can't support all the graduates.  Most physicists get industry jobs, usually with little connection to what they formally studied.  The jobs aren't hiring "physicists" as such; they are hiring people with experience in quantitative reasoning with a background in technology who can use computers to solve complicated problems.  And, as a physicist working in computation or simulations, you are just what they're looking for.

Unless, that is, the only language you can use is Fortran.

Fortran is not used by data analysts, or by app developers, or by web content creators, or by video game developers, or by Wall Street quants.  While these jobs do require several years of programming experience, they usually require it in real programming languages that are used by non-physicists.

One very common such language is C++.

There are a lot of other languages (such as Python or R or PHP or Ruby) that are useful in the world beyond physics calculations, but I'm going to make the case for why C++ is the language you should be learning instead of Fortran.

C++ is an extension of the simpler language C, which was originally created to write the UNIX operating system and its utilities.  Even though C was originally made for the single task of systems programming, programmers found it useful for a host of other things beyond this.  The usefulness of C led to its wide adoption by programmers, which in turn meant that the syntax and style conventions of C became standard in most computer-based fields.  Programmers came to expect C-like language, so that when new languages were developed beyond C (such as Java, Rust, JavaScript, Python, Perl, PHP), they implemented similar syntax.

This means that, generally speaking, anyone who has experience in any given programming language can look at a sample program written in ANSI C and understand exactly what it means, line by line.  For instance, most introductory computer science courses (college-level or AP) teach Java, and the Java syntax is effectively identical to C++.  If you took a course in computer science in high school or college then you most likely already "know" most of the major features of the C language.  Contrariwise, if you learn C, you will also know much of how to program in Java.

Like Fortran, C is also considered a high-level language, though it is actually closer to the metal than Fortran.  With C, you have direct access to memory addresses and registers and bit-level manipulation of data.  While Fortran is often said to be faster, that is largely because old Fortran codes were written in a style that closely imitates assembly language -- similarly written C programs will run at similar speeds, though of course if you're going to write assembly-style programming anyway, what's the point of using "formula translation"?

The style conventions of the C community are much more respectable than those of Fortran.  While it is definitely possible to write spaghetti code in C, it's not as common or as easy.  This means, when you read sample programs written in C, you will be reading sample programs written in good style (as defined by numerous style guides), thus learning and absorbing how to write programs well, in a way that is readable, tractable, organized, and aesthetically pleasing.

That is not the case with Fortran.  The idiosyncratic conventions used in Fortran - like defining every variable in global scope, using GOTOs for everything from loops to branches, never explicitly declaring anything, even just the amount of effort it takes to write a comment - aren't just not adopted by later languages, but are universally acknowledged to be bad practice.  Learning Fortran actually teaches you to program badly.

It is also actually impossible in C to write spaghetti as tangled as in Fortran.  C demands scope encapsulation and it requires explicit type declaration of all variables.  You will always know exactly which variables are available in each scope, where the first instance of a variable occurs, and exactly what the type is.

It's also just more work to write spaghetti code in C.  While C does have a GOTO statement, it is very obscure and near-universally derided, and almost never appears in any program.  Many books teaching C don't even mention that it exists.  Instead programs nearly exclusively use the FOR, WHILE, DO-WHILE, and SWITCH control structures which are all infinitely easier to use and make the flow of control much more obvious at a glance.  This is in contrast to Fortran, where I believe it is literally impossible to not use GOTO in any but the simplest program.

And while Fortran is still somewhat the "industry" standard in physics, C also has a long-standing and acknowledged use within physics and numerical computation.  For instance, while the earliest edition of the famous Numerical Recipes was for BASIC and Pascal, when they went for more modern languages with the 2nd edition, they rolled out both a Fortran and a C version at the same time, and the latest edition is exclusively for C++.  C is also starting gradually to replace Fortran as the language of choice of younger physicists for serious computational tasks.

Directly related to C (and often used interchangeably with it) is C++, which implements the same base language of C, but adds object-oriented programming capabilities.  The use of objects seems like an over-complication to people who grew up in procedural programming, but should actually be fairly intuitive to physicists.  The purpose of objects, in effect, is to grant you increased encapsulation.  Multiple pieces of data and multiple functions that act on the data can be combined into single self-consistent chunks of code that can be manipulated as a whole.  For instance, to model a particle, information on its mass, charge, its three Cartesian coordinates and the three components of its velocity can all be grouped together into a single object of code, along with functions that move it, calculate the force in it, etc.  You can then make an array of particles, each with these defined properties, and keep track of each particle through one array.  This is in contrast to a usual approach which would treat all the coordinates and velocities and masses as different variables in different arrays, and have to track them all separately.
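
As a rough sketch of the particle example just described, here is one way it might look in C++.  The type name Particle, its field names, and the helper move_all() are all my own choices for illustration, and the motion is the simplest possible (constant velocity over one time step), not a real integrator.

```cpp
#include <array>
#include <vector>

// Hypothetical particle type: all names here are illustrative.
struct Particle {
    double mass;
    double charge;
    std::array<double, 3> position;  // Cartesian x, y, z
    std::array<double, 3> velocity;  // vx, vy, vz

    // Advance the position by one time step dt at constant velocity.
    void move(double dt) {
        for (int k = 0; k < 3; ++k) {
            position[k] += velocity[k] * dt;
        }
    }
};

// Move every particle in the collection; one container tracks all the data.
void move_all(std::vector<Particle>& particles, double dt) {
    for (Particle& p : particles) {
        p.move(dt);
    }
}
```

The whole system is then a single std::vector<Particle>.  The non-object alternative would be eight separate arrays (mass, charge, x, y, z, vx, vy, vz) that all have to be kept the same length and indexed consistently by hand.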

C++ is very commonly used in game programming, in iPhone programming, and in stock market predictions.  C++ was used in NASA satellites and Mars rovers, and continues to see use in other embedded systems.  C++ has been used to write a number of software platforms, even those used in scientific research.  Further, the GNU Scientific Library, a C library that can be called directly from C++, offers a huge selection of pre-implemented and tested routines for a number of tasks in computational physics.

In short, C++ is an excellent choice of language to learn for your project.  It will teach you good programming style and standards that extend beyond physics calculations.  The centrality of control structures makes it easy to use, easy to read, easy to track, and very hard to foul up terribly.  It has widespread use in science, engineering, and software development, and widespread and accepted use in numerical computations.

Learning C++ will enable you to do the exact same numerical calculations, study the exact same physics, get the exact same numerical results and graphs, and prove or disprove the same hypotheses.  But you will also do so with cleaner code, and get some resume-building experience to boot.

And sure, if you use Fortran, you can later learn C really quickly.  But that doesn't impress an HR department hirer, who can only go by what's on paper; and what's on paper is you have 0 years of experience in any practical language at all.

How did Fortran become so deep-seated?  If it's so ubiquitous, then it must have some advantages over C that made it gain such favor, right?

In part, it's because it came first.  But admittedly there are a few things about Fortran that commend it over C as a tool in numerical physics calculations.

For one, Fortran has a standard operation for exponentiation to small powers, something which C embarrassingly lacks.  If you want to take a^3 in Fortran, it is a simple a**3.  In C, you have to multiply a*a*a, or call pow(a,3) from the math library, both of which are rather clumsy implementations of a very simple task.
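
For reference, the two options look like this in C++.  This is only a small sketch; the function names are made up, and std::pow is the C++ spelling of the math library's pow.

```cpp
#include <cmath>

// Cube by repeated multiplication.
double cube_by_multiplication(double a) {
    return a * a * a;
}

// Cube via std::pow from the math library header <cmath>.
double cube_by_pow(double a) {
    return std::pow(a, 3);
}
```

Both give the cube of a.  The multiplication version gets unwieldy as the power grows, and the library call hides a simple operation behind a general-purpose function, which is the complaint above.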

For two, Fortran uses 1-indexed arrays, which is more intuitive, whereas C uses 0-indexed arrays.  This means if you have an array of length N, in Fortran the elements are at indexes 1, 2, 3, ..., N, which is the usual convention used in formulas.  In C, the 0-indexing means the elements of an array are 0, 1, 2, ..., N-1.  You have to start at 0, and you end one before you think you should.  If you have an array y of length N holding values of some function, it's natural to think that y[N] is the last value -- but in C, it is y[N-1], whereas the value of y[N] will usually be undefined or randomly defined, possibly introducing random error into the calculation.
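
Here is a short C++ sketch of what 0-indexing means in practice, using std::vector in place of a raw array.  The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Return the last element of y.  For a container of length N the valid
// indices are 0 .. N-1, so the last element is y[N - 1], not y[N].
// Assumes y is non-empty.
double last_value(const std::vector<double>& y) {
    const std::size_t N = y.size();
    return y[N - 1];
}

// Sum all N elements: the loop runs i = 0, 1, ..., N-1 and stops before N.
double sum_values(const std::vector<double>& y) {
    double total = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        total += y[i];
    }
    return total;
}
```

Note the loop condition i < y.size() rather than i <= y.size().  Writing <= is the classic off-by-one mistake, and it reads one element past the end of the data.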

For three, Fortran implements complex numbers as a primitive type, whereas C only adopted support for complex numbers late, and treats them as an extension to the language and not as a primitive.
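
In C++, complex numbers come from the standard library header <complex> rather than from the core language.  A minimal sketch of what that looks like (the two function names are my own, while std::complex and std::norm are standard):

```cpp
#include <complex>

// Multiply two complex numbers using the library type std::complex.
std::complex<double> complex_product(std::complex<double> a,
                                     std::complex<double> b) {
    return a * b;
}

// Squared magnitude |z|^2, via the library function std::norm.
double magnitude_squared(std::complex<double> z) {
    return std::norm(z);
}
```

For example, multiplying 1+2i by 3+4i gives -5+10i, and the squared magnitude of 3+4i is 25.  It works well enough, but it is a library type with template syntax rather than a built-in like Fortran's COMPLEX.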

Lastly, C is not intended primarily for math, and so many mathematical operations like logarithms or exponents are not part of the basic language, and require importing the math library to implement them.

Despite those valid criticisms of C/C++, it is still the obvious choice of language, and will not turn out to have been a waste of time when you apply for a job outside of the university's walls.

If your professor is trying to get you to program in Fortran, I advise you to politely ask if you could use a different language instead.  Explain that you may end up applying in industry, where languages like C or Python are more standard and Fortran is not used, and using those would get you extra resume material.

However, understand that your professor may want you to use Fortran anyway, for one supremely important reason: your professor only knows how to program in Fortran.  This is really the number one reason to learn Fortran (a**3 is the other good reason).  Your professor will be able to advise you on your program, bug check, give you legacy codes, and even share subroutines with you.  If you're using a different language, he or she will not be able to give you that kind of involved help.

Know that despite not actually knowing C, a Fortran-only professor will still be able to hobble through C a lot easier than you will be able to hobble through Fortran.  It will take looking over your shoulder and maybe some explaining about brackets, but it is still parseable.  With sample codes from your professor in Fortran, you will have more time to go through those and figure out what the logic means -- I have personally developed a routine for converting legacy Fortran into compilable C, then into good C, then into C++, and by now it's a mostly unthinking task.  So even though this is a very good reason to use Fortran, it's not as definitive as it might seem.

Also know that if you accept some kind of compromise of starting in Fortran then switching to C++ later, that you will never, ever switch to C++ later.  Ask the railroad companies about the Roman cart wheel.  Once you have the majority of your research calculation completed in Fortran, there will not be any justification for rewriting the entire thing from scratch in a different language.  If you're going to use C++ (or Python, or R, or whatever), you have to use it from the beginning.

If you are going to make a compromise, one might be to contemporaneously write a Fortran and C++ program to do the same thing, then once it's established the Fortran code does what it's supposed to do, and that the C++ code does the same thing as the Fortran code (to the bit), then start developing in C++.
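
If you go this route, the "to the bit" check can be automated.  The following is only a sketch under the assumption that you have already loaded the numerical output of both programs into two arrays of doubles (how you get them there is up to you); the function name same_to_the_bit is hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Return true only if both result sets have the same length and every
// double has exactly the same bit pattern.  This is stricter than ==:
// for example, 0.0 and -0.0 compare equal with == but differ in their bits.
bool same_to_the_bit(const std::vector<double>& a,
                     const std::vector<double>& b) {
    if (a.size() != b.size()) {
        return false;
    }
    if (a.empty()) {
        return true;
    }
    const std::size_t bytes = a.size() * sizeof(double);
    return std::memcmp(a.data(), b.data(), bytes) == 0;
}
```

Bit-for-bit agreement is a demanding standard, since different compilers may reorder floating-point operations.  If the two codes agree only to within rounding, comparing with a small tolerance may be the more realistic test.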

All in all, my humble request to the entire physics community is to just let Fortran give up the ghost.  It has served its purpose in the past, providing a convenient means of expressing physics formulae in computer terms, and without its help, and its effective use by countless physicists around the world, the state of the field would not be where it is today.  Yet Fortran programming is no longer a practical skill.  It serves no purpose outside of physics, where most of the graduating physics majors will ultimately end up.

If you are a budding young physicist, it wouldn't hurt to learn the basics of the syntax, but only so you can convert it into C++ or some other modern language.  Otherwise, it's not worth your time to build entire computational codes in Fortran as a serious part of your research.  When you go to apply to a job in industry and can put "Three years experience in C++ programming" on your resume, you will be glad you didn't use Fortran.

That's all I have to say about Fortran.  But for completeness, here are some other programming languages in high demand in industry.  Consider squeezing some in, if you can:

1. Python.  I don't know much about Python, but it is seeing wide use in introductory physics.  It's allegedly very easy to learn, but it is an interpreted language several layers of abstraction above C or Fortran, meaning it will run much slower.  However, for small calculations, it can be a great tool, especially if you need a lot of trial and error.  Python seems to be commonly used in web applications to handle server-side tasks.

2. SQL.  Experience in this is requested by almost every single data analysis job posting.  SQL is a language for handling databases and retrieving specific data from huge arrays of memory.  It is not directly related to numerical computation, but to get some experience in this, rather than storing your data from calculations in .csv files in nested folders, try storing them in an SQL database, just to say you have the experience using it.

3. R.  This is a language built for statistics, in high demand in data analysis and quantitative finance.  It has a ton of built-in features for data analysis, and emulates a vector-pipe architecture (meaning entire arrays are calculated at once, rather than element-by-element).  Like Python, this is an interpreted language several layers of abstraction above C or Fortran, and so runs slower.  However, it works great for smaller calculations, post-calculation data analysis, and generating plots to test and examine data.

4. Java.  This language is very much like C++, with some simplifications.  Java is great for web development, but seriously: don't use Java for any computational work.  Every Java program runs inside of a virtual machine in your computer, meaning it is very resource-expensive.  It's great for making cross-platform applications with ease, but not a serious contender for numerical calculation.

5. Objective-C/iOS programming.  Objective-C is also an extension of C (like C++) that uses the NeXTSTEP/Smalltalk language to incorporate objects.  It is a horrible, disgusting, Frankenstein-monster of a language that makes me sick to think about, but it is the default language used by Apple for all their products.  If you plan on making GUI programs for Mac or iPhone, you will have to deal with this filthy language somehow.  Luckily, the base logic of the language is standard ANSI C, with the Smalltalk stuff almost literally duct-taped to it, so if you've learned C already, you know the basics of Objective-C.  Unless you want to make a research program with GUI interaction (instead of Terminal interaction), there isn't any reason to use Objective-C instead of C++.  C++ implements objects in a way more consistent with the rest of the C language.  If a job says it wants experience in iPhone programming, it likely means experience writing in Objective-C.

Those are all worthwhile languages to learn.  However, none of them approach the speed and generality of C, making C the best choice for computation-heavy calculations that are expected to run for longer than ten minutes.  But for other jobs, consider those above.
