Saturday, September 9, 2023

The necessity of teaching better programming practice to physics PhDs

I am a numerical physicist.  I graduated a little over a year ago, and have since gone through the process of applying and interviewing for jobs.  This topic -- what can I do after I graduate -- has been a concern for me for well over a decade, and I've put a lot of thought into how to make sure I get the most out of my time as a graduate student.

I was fortunate, in that my advisor also had a lot of this in mind and had me use many standard best practices, such as git, building from makefiles, and testing code output, and he allowed me to work in C++ instead of Fortran.  But not all grad students are this fortunate.

I decided to write out some of my thoughts.  As written, this is directed at advisors, but is obviously applicable to grad students in planning how to do their research.  If you are a student, consider the advice here, and bring it up with your advisor.

Why we must change focus

To put it bluntly, Professor of Physics is no longer a job.

In the 1950s, 60s, 70s, and on into the 80s and 90s, when most tenured and chaired physics faculty at most universities were first hired, it was sensible that the chief goal of a graduate education in physics would be training the next set of professors.  It would have been reasonable to assume all graduate students, upon completion of their dissertations, would go on to work in universities as professors themselves.  This is the model that physics graduate education was built around, and it remains the assumption of most active and tenured professors.

Today, this is not a reasonable assumption.

The post-War funding boom into fundamental physics research is long gone.  The enrollment boom from the children of baby boomers has slowed way, way down with the drop in birth rates among Gen-Xers.  Physics departments are no longer expanding, and may in fact be actively trying to shrink down the number of faculty by encouraging retirements.  They are not hiring professors.

If they are hiring, then they are hiring non-tenured, contracted lecturers.  This position is like being a professor, except it pays half as much, offers no tenure, has to be renewed every three years, involves a full teaching load of multiple courses every semester, and has no provision for research, either in budget or in time.  Actually, it's not like being a professor at all.  It's more like being a high school teacher, except you work at a university.  For any of your students who want to teach at the college level, this is the only job they might be able to get.  I say "might" because, despite it all, these are still very competitive positions.

I will grant that each year notifications of openings for professor positions go out on the email lists.  It is not worth encouraging your students to waste the tail-end of their 20s ostensibly preparing for the five positions that open globally each year, contested by the thousands of PhDs around the world who apply.  Some might call it cruel to foster this kind of outlandish hope, if it's treated as the only foreseen outcome.  It is a possible outcome.  So is winning the lottery.  I wouldn't base a pedagogy around preparing students to win the lottery.

Your students should not be prepared with the goal of performing your job, because your job effectively no longer exists.

Some of your students will be hired at major research laboratories, either national labs or private labs of large industrial companies.  This is more true for experimental physicists, who can more easily find jobs in the manufacturing sector.  Many experimentalists may even be able to leverage their basic laboratory skills to get positions working with tangential domain knowledge, such as chemistry or biology.

But not all of your students will be hired as lecturers or lab researchers.  Not even most will be hired in these positions.

Most will be hired in computer technology, serving essentially as software engineers or developers.

The change in focus

A degree in physics communicates more than subject-matter expertise.  It communicates a desire to tackle challenging problems, the work ethic to stick through the difficulty, the quantitative skills to handle the subject, and the ability to learn very difficult material by reading.

Many of your students may not even want to be professors.  There are many jobs, such as quantitative finance or data science, which treat a PhD in physics, specifically, as a basic prerequisite.  Your students may be studying physics to qualify for one of these positions.  

The end goal of graduate education should always be to train the next generation of research scientists.  A graduating PhD should be ready to work as a full-time researcher within the chosen specialty, and be equipped to pursue new research areas.

But because most won't actually become research scientists, it is also important to treat their time as graduate assistants as work experience, and use it to build skills that are also useful outside of academia.

This is not an either/or proposal, but a both/and proposal.  Students can be taught to become the next generation of research scientists, while also learning skills used widely in industry.  In many ways these goals are complementary, as best practices in industry will often also lead to better organization in conducting and presenting research.

Below are some suggestions.

Data in SQL databases

Modern industries organize their data using relational databases, which allow querying with SQL. Even a cursory glimpse at job postings for physics grads will show SQL as a major required skill.  And yet, except for researchers working directly with large shared datasets, as in astronomy, most research data is handled with insecure, ad-hoc methods: folders stuffed with .csv files, or results written in pen in a lab notebook.

Have your students start thinking about data storage early, and have them create good data solutions.  This will avoid duplicated efforts and errant presentation of results.

Whether in theory, computation, or experiment, your students will have to generate data in some form.  Data can of course be a classic table of numbers, destined for a table or graph in a paper.  In experiment or computation, this is the most likely form.  But it can be even more general.  Data might be the coefficients of a series expansion.  Data might be a cell of Mathematica output.  Data might be the result of a particular equation.  All of this is data.

If this data is not stored properly, avoidable delays and errors creep into the research.  A result previously obtained becomes lost, and effort is wasted regenerating it.  Results obtained in separate instances might disagree, with the differing assumptions behind each left unrecorded, leaving doubt on both until they can be rederived.  Data generated under one set of circumstances might be confused with data generated under another, and errantly used in the wrong table or figure or calculation, leading to errors in presented results that are hard to notice, to track down, or to correct.

This is all to make it clear how important it is to think of all research output as a form of data, and make sure there are good methods to track it.

For some of this, old methods might work fine.  However, it will be useful to begin implementing something like a SQL database as a solution.  The SQL database will enable you to record data, and the relationships between data, and easily pull up old results.  It will also produce valuable work experience in using SQL on databases, which will help make your students stand out to hiring companies.

A database is not simply a gigantic table of results.  It is a representation of relationships between results.

Consider data generated by a simple experiment that varies the temperature and pressure of a material and measures its density.  It is wrong to think of the database as simply this table, with columns for each of the three variables.  It is better to think of this entire set of measurements as a single entry in the database, with other columns describing everything else about it.  They will list the start/stop times, the material used, the process for generating the material, the machine used, the settings of the machine when it was used, and which researchers worked on the experiment.  All of these things, and more, can be added as additional columns.  Now you can easily pull up data generated with two different machine settings, and compare.
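
As a sketch of what that can look like in sqlite: one table describes each experimental run, and a second table holds the individual measurements, linked back to the run that produced them.  This splits the single-table picture above into two linked tables, which is the standard relational way to express the same relationships.  All table and column names here are made up for illustration.

    -- One row per experimental run, holding everything about how it was done.
    CREATE TABLE runs (
        run_id           INTEGER PRIMARY KEY,
        started_at       TEXT,   -- ISO 8601 timestamp
        stopped_at       TEXT,
        material         TEXT,
        material_process TEXT,   -- how the sample was prepared
        machine          TEXT,
        machine_settings TEXT,   -- e.g. a blob of dial settings
        researcher       TEXT
    );

    -- The measurements themselves, each pointing back at its run.
    CREATE TABLE measurements (
        run_id      INTEGER REFERENCES runs(run_id),
        pressure    REAL,  -- Pa
        temperature REAL,  -- K
        density     REAL   -- kg/m^3
    );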

If the data is listed out so that each row of the experimental result has pressure, temperature, and density together with all of this accompanying information, it becomes possible to analyze the data in ways not considered when it was collected.  Suppose you come to suspect that two researchers are doing something slightly different, unintentionally impacting results.  It is very easy, using SQL, to query all data from one researcher and all data from the other, and compare them while holding all other factors (pressure, temperature, machine, time, etc.) constant.  Suppose you wish to use only data taken between 2 and 3 pm; this is also very easy to do with a SQL query.  All of your data becomes usable in new analyses, in easily performed in-house macro-studies that can at minimum suggest a new test or experiment to perform.
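
With tables like the sketch above, those comparisons are each a single query (sqlite syntax; names are again hypothetical):

    -- All measurements taken by one researcher, with the run metadata attached,
    -- so they can be compared against another researcher's under like conditions.
    SELECT m.pressure, m.temperature, m.density, r.machine, r.machine_settings
    FROM measurements AS m
    JOIN runs AS r ON r.run_id = m.run_id
    WHERE r.researcher = 'researcher_a';

    -- All measurements from runs started between 2 and 3 pm, on any day.
    SELECT m.*
    FROM measurements AS m
    JOIN runs AS r ON r.run_id = m.run_id
    WHERE strftime('%H', r.started_at) = '14';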

Experimental data is the obvious case, so consider a different kind of data.  In my research advisor's group, most of the other students focused on using Mathematica to generate thousandth-order coefficients in series-expansion solutions for the gravitational waves generated by two orbiting black holes.  A run sometimes required weeks of time on parallel computer clusters, and the output was a cell in a Mathematica notebook.  These coefficients are data, and they can be recorded in a database.  Alongside them could be recorded such things as: orbital eccentricity, orbital inclination, black hole mass ratio, number of full orbits, start/stop time, total run time, computer cluster node used, total order of the calculation, and approximate time per coefficient generated.  All of this could be important in diagnosing problems, in reusing results, or in avoiding duplication of effort.

Consider even pen-and-paper derivations of formulae.  It might be that deriving from one set of starting equations leads to one form, and deriving from another set leads to a second form.  A database can be an easy way to track which equations were used in a derivation, so conflicting results can later be compared.

Not only will this make the research more efficient, but it will let the students gain experience setting up a SQL database and using SQL to fetch results.  Setting up a database may sound daunting.  With sqlite it's as easy as creating a single local file that can be queried.  Push it to github for additional persistence.  Setting up a cloud-hosted database is more involved, but the budget for it can easily be justified, and it is valuable learning experience.
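
For a sense of how little setup that is, here is a minimal sketch in python using the sqlite3 module from the standard library (the file name and columns are placeholders):

    import sqlite3

    # Opening (or creating) the database is just opening a local file.
    con = sqlite3.connect("research_data.db")

    con.execute("""
        CREATE TABLE IF NOT EXISTS runs (
            run_id     INTEGER PRIMARY KEY,
            started_at TEXT,
            researcher TEXT
        )
    """)

    # Record a run, then query it back.
    con.execute("INSERT INTO runs (started_at, researcher) VALUES (?, ?)",
                ("2023-09-09T14:00:00", "researcher_a"))
    con.commit()

    for row in con.execute("SELECT * FROM runs WHERE researcher = ?",
                           ("researcher_a",)):
        print(row)

    con.close()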

Putting data into a database, even if just for the sake of using a SQL database, is still a worthy enterprise, as it is building experience that can help make the students more competitive in job applications outside of academia.

Version control in git

Students will have to create things on a computer.  This is not limited to code.  It includes posters, papers, graphs, plots, tables, and eventually their theses.  All of these have to be made on a computer; the typewriters you used in the 80s no longer even exist.  The problem with typing everything on a computer is that, unlike pen on paper, computer files are mutable.  They can be edited and changed, and leave no record of when or how.  Without version control, if an edit leaves the file worse off than before, the earlier state is lost for good, unless by careful steps the student can recreate it.

This problem is solved with version control, which documents the state of a directory of files at a point in time, and preserves that state for easy restoration or history tracking.  The most popular software for version control and remote collaboration is git.

Learning how to use git is most important for computer code.  But even if the students aren't making code, git can still be valuable.

My doctoral dissertation and one of my papers were written using git.  My dissertation was hosted on github, and the paper on bitbucket.   My research group also stores an enormous bibtex file on bitbucket, which is used across all of our papers by downloading the remote file every time we compile the LaTeX for the paper.

Git works as a command line tool and can be easily integrated into most text editors for programmers.

Suppose your student is writing a paper, and obviously needs your input.  Rather than emailing files back and forth, the paper can be hosted in a remote git repository.  The student creates a new branch, makes edits in that branch, pushes the branch to the remote repository, and opens a pull request (PR) from it.  You can go to the PR, see the exact changes made compared to the original, and leave comments on exactly those edits, line by line.  You can send things back with comments, request changes, or, if everything is good, approve the changes.  Once approved, they are merged into the paper, replacing the parts that were changed.
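
From the student's side, that workflow is only a handful of commands; the branch and file names below are made up for illustration, and the PR itself is opened on the hosting site (github, bitbucket, etc.) once the branch is pushed:

    # Make a branch for this round of edits.
    git checkout -b revise-introduction

    # ...edit introduction.tex...

    # Record the change and publish the branch to the shared remote.
    git add introduction.tex
    git commit -m "Tighten the introduction"
    git push -u origin revise-introduction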

If you need to work on a section at the same time as the student, you can make your own branch, and you each work on separate parts of the paper.  At the end, each PR is merged into the main version of the paper.  If two edits conflict, git flags the conflict, and it has to be reconciled by choosing which edits to finally include.

This process of merging in edits allows multiple people to work on the same paper, at the same time, without needing continual internet access.  You could imagine a paper involving three students, a colleague at another institution, and yourself, now being effortlessly spread out for the collaboration.

If the grad students are also required to produce code to analyze numerical results, then using git for version control becomes even more useful.  Tracking wording edits in a paper is one thing, but tracking edits in research code can be the difference between results produced before and after a bug was introduced.  Version control with git allows "rewinding" the code to prior states when results were correct, seeing every edit made since then, tracking who made each change, and easily reverting PRs that caused errant changes.
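
The day-to-day commands for that kind of archaeology are short as well; the commit hash and file name here are placeholders:

    # Show the history of changes, who made them, and when.
    git log --oneline

    # See exactly what changed in a file since a known-good commit.
    git diff a1b2c3d -- cross_section.cpp

    # Undo a specific bad commit, while recording that the undo happened.
    git revert a1b2c3d

    # Or check the whole project out as it was at that commit, to rerun old results.
    git checkout a1b2c3d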

This process is the same one used when large teams all edit the same code.  It is used at every major software company, and any company trying to make software without this process is not going to be major.  Knowing how to use this process of branching, pushing, and merging code, even if the "code" is only some TeX, will be very useful to the graduate students on graduation.

Unit testing and CI/CD

I once took classes under a very well-known computational physicist.  One line this professor repeated, and which I now repeat as well, is:

You have written your code, and it gave you a number... but how do you know the number it gave you is correct?

This question gets at the problem of testing code to ensure it produces accurate results.  The kind of testing suggested by the question represents a first step.

Here is a related question: 

You edited your code, and made changes to it... but how do you know your changes didn't make things worse?

In modern code production, the answer to both questions has been systematized, automated, and tuned to allow highly efficient code editing while keeping results correct.

Unit testing refers to the process of testing each individual function of a program to ensure that it behaves correctly.  This includes positive as well as negative testing.  Positive testing makes sure that when a function is used correctly, it generates what it is expected to generate.  Negative testing ensures that when it is used incorrectly, such as with invalid inputs or in unsupported situations, the code fails in a controlled way that does not break the entire program.

Unit testing requires writing code with more segmentation of functionality, so that each individual step can be separately tested for correct behavior.  If your code is meant to calculate a scattering cross-section, rather than one single enormous subroutine, break it up.  One subroutine calculates one particular channel.  Test that one channel with several different types of input, valid and invalid.  Test all of the other channels and subroutines the same way, and verify that everything each subroutine produces is as expected.  Test that the code that combines the channels combines them properly, by taking the output of each channel, combining them separately by hand, and comparing that to the code's combination.  All of this sounds like more work, but it saves work in the end: the code becomes robust and reliable, and you stop worrying that one of the channels is a source of error.

Having these unit tests around parts of a program helps guarantee that behavior stays fixed even as the program is edited.  Consider a subroutine to calculate a determinant.  Unit tests might pass the subroutine several matrices with known determinants, including singular matrices and matrices with NaNs, and ensure the known answer is returned.  If it works on these known cases, and the known cases represent a wide range of possibilities, then it increases confidence that the subroutine works in untested cases too.  Now, if the underlying subroutine is changed, say from a recursive expansion over permutations (dumb) to row reduction (smarter), then the same tests producing the same results indicate that even though the underlying code changed, its results did not.  The code is still correct.
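
As a sketch of what such tests look like in python, using numpy and the pytest framework (my_determinant is a stand-in for whatever routine the student actually wrote, imported from a hypothetical module):

    import numpy as np
    import pytest

    from mycode import my_determinant  # hypothetical module under test


    def test_identity_matrix():
        # Positive test: a simple case with a known answer.
        assert my_determinant(np.eye(3)) == pytest.approx(1.0)


    def test_known_2x2():
        # det([[1, 2], [3, 4]]) = 1*4 - 2*3 = -2
        m = np.array([[1.0, 2.0], [3.0, 4.0]])
        assert my_determinant(m) == pytest.approx(-2.0)


    def test_singular_matrix():
        # A singular matrix must give zero, to within tolerance.
        singular = np.array([[1.0, 2.0], [2.0, 4.0]])
        assert my_determinant(singular) == pytest.approx(0.0, abs=1e-12)


    def test_rejects_nonsquare():
        # Negative test: invalid input should fail in a controlled way.
        # (Assumes the chosen policy is to raise ValueError.)
        with pytest.raises(ValueError):
            my_determinant(np.ones((2, 3)))

Running the single command pytest from the project directory finds every function named test_* and runs the whole suite, so the checks rerun with no manual effort after each change.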

These sorts of unit tests could be a big list of inputs and outputs, run and checked manually.  But that's dumb.  It's a computer.  Use it to compute.  Create separate programs that automatically run all of these checks for you, and run them every time you change the code.  If a test fails, keep fixing the code until they all pass.  Then you never need to worry that you accidentally broke something in the last edit.

This process is a first step towards what is called CI/CD.  The letters stand for continuous integration and continuous deployment (or delivery), and each of those means something particular and distinct.  But in practice, the term CI/CD is used to refer to an automated process combining version control in git with automatic testing of code changes before they are merged.

The student will create a branch in git.  The student will make changes within that branch.  At the end, the student runs the unit tests locally.  The student then pushes the changes to the remote repository.  On the remote repository, servers automatically pull the code and run the checks (ideally on several different systems, such as Linux, macOS, and Windows).  If all checks pass, then the code can be approved; otherwise it must be fixed until all checks pass.

This makes sure research code is only ever edited in ways that make it better.  It ensures the numbers your code produces are always checked, without requiring mental or physical effort from you to go through and check them all.

If this sounds involved, it is surprisingly easy to set up with most remote repository hosting services.  Github, Gitlab, and Bitbucket all offer ways to set up automatic CI/CD workflows, and your institution may have an account with one of them.  I know for sure Github offers this for free for open-source projects.
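
As an illustration, a minimal GitHub Actions workflow that runs a python test suite on every push and pull request looks roughly like this (assuming pytest-style tests; the file would live at .github/workflows/tests.yml):

    name: tests
    on: [push, pull_request]

    jobs:
      tests:
        strategy:
          matrix:
            os: [ubuntu-latest, macos-latest, windows-latest]
        runs-on: ${{ matrix.os }}
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.11"
          - run: pip install numpy pytest
          - run: pytest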

Testing is not limited to unit testing.  There are also system tests and stability tests.  Unit tests only check that individual pieces have the expected behavior.  System tests ensure that multiple components (or the whole program) together produce expected results, which is more akin to comparing against known results or literature values.  Stability tests (often called regression tests) compare current output to previous output, to ensure the same results are generated even as the code changes.  This is important when there is no known answer, but you want to ensure the output of the program doesn't change as the code does.
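
A stability test can be as simple as comparing fresh output against output committed alongside the code.  Here is a sketch in python, where the entry point run_simulation, the reference file, and the tolerance are all placeholders:

    import numpy as np

    from mycode import run_simulation  # hypothetical entry point of the program


    def test_output_matches_reference():
        # Reference output generated by a trusted earlier version of the code
        # and committed to the repository next to the tests.
        reference = np.loadtxt("tests/reference_output.txt")

        # Fixed inputs, so the run is reproducible.
        current = run_simulation(seed=42)

        np.testing.assert_allclose(current, reference, rtol=1e-10)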

Any job in programming will require doing this, and having experience with this level of testing is the difference between being considered for intern roles and staff roles.  After 4+ years of working on research code and earning the PhD, you'd like your students to be hired in the staff role.

Using real programming languages

Even in the year 2023, physicists continue to insist on using Fortran.  And I'm sorry, guys, it is time to let it give up the ghost.  There is no reason for new physics students to learn Fortran.  They should instead be learning C++ and python, as these are the main languages in use by everyone, everywhere in the world, outside of a physics department's walls.

Learning Fortran is a waste of time.

There are only a few reasonable justifications for using Fortran.  One, and an important one, is that you already know Fortran and don't know anything else.  You have expertise in this language, valuable expertise, which can help direct a student.  That's not worth passing over.  But your expertise in Fortran will still be valuable in other languages, as most of the important concepts translate over.

A second reason offered might be that Fortran is believed to have better performance for number crunching.  This seems to be largely a myth.  Published studies testing equivalent algorithms on several platforms, across a range of programming languages, found that plain C beats every other language, and that C++ always beats Fortran.  C++ beats Fortran in speed, beats it in memory consumption, and in most tasks also beats it in energy consumption.  C and C++ are, simply, better at number crunching than Fortran.

A stupid reason to offer for Fortran is essentially an in-group prejudice of physicists for the physicist language.  The argument that physicists should use Fortran because they've always used Fortran because it's what physicists use, is not dissimilar from the argument that church should only be conducted in Latin.  We have always used it, but should we?  Or would something else be better at meeting the same needs?

If your students are creating research code using python or C++, then they are gaining experience in languages that will be in-demand by employers while also conducting the same research.  If your students are creating research code using Fortran, then they are missing the opportunity to get this experience.

At different points, Pascal, Perl, COBOL, PHP, and others have all been important and widely-used languages.  And they have all since lost importance, ceasing to be used for anything but legacy jobs.  Fortran is also in this list.  You don't have to like that, but it's the way the job market actually is.

You may notice I'm recommending python as an important language to learn.  I think python is insecure, slow, bloated, and overly simplistic.  I dislike python quite a lot, despite being required to use it at work.  My recommendation of this language is not based on preference.  It's because companies are hiring python programmers.  They are not hiring Fortran programmers.

The community of Fortran users can resent me for saying all of this, but they need to come to terms with the fact that the universal scorn held for Fortran is a result of the Fortran community's production of illegible, confusing, unmaintainable spaghetti code.  People associate Fortran with bad code, because most of the lines of Fortran in existence are bad code.  And you, the Fortran community, wrote that bad code.

As a professor, you are not impacted by failing to adapt to new programming languages.  You have tenure and a grant, and can get the results you need in any language.  But insisting on Fortran will impact your students, who gain experience in a language considered useless rather than one considered important by hiring managers.  Rather than do this to them, tell them to use a language in widespread use.  Python is a major one; C or C++ are better for heavy calculations.  Julia, Rust, Go, and others are also commonly used for numerical work requiring high performance.  All of these are modern languages leading to reasonable job prospects.

Obviously, if your students were going to continue working in a physics department as physics professors, then it doesn’t matter what language they use.  But the job of physics professor no longer exists.

I personally suggest C++, with a warning.  While it is valuable to learn the low-level manipulation of arrays and pointers, working only with arrays and pointers will severely limit you.  Code written entirely with C-style arrays and strings is also a giant red flag distinguishing self-taught amateur physicist coders from professional coders.  Start with low-level arrays, pointers, and C-strings if you want that experience, but then move on to the higher-level language features: vectors, sets, maps, std::string, and lambdas.  They will simplify quite a lot of the program, with minimal cost in speed or memory.  They might actually make things faster.

Use a style guide and a linter

Speaking of illegible and intractable spaghetti code, it is important to stress to students that they should think of their code as a research output in itself, and not merely as a tool to perform a task.  It is a written artifact of their research, and it serves almost as much purpose being read as being run.  In fact, in some journals, such as the Journal of Open Source Software, it is possible to publish the code itself, provided it meets the editorial standards of significance and best practices.  That is, if it is written well.

Physicists are notorious for producing unreadable, poorly-written programs.  While using modern, non-Fortran languages can help with this, it still becomes necessary to standardize the style of a program to maximize readability.

It doesn't particularly matter what style guide you go with.  But it does matter that you make some effort to standardize the style of your code.  By conforming to a published style guide, and not your own, you increase the chances that the style you're enforcing will be a good one.

In fact, there exist programs called linters which automatically enforce style requirements.  Two popular tools in python are flake8, which checks code and reports style violations, and black, an automatic formatter that re-writes the code for you so that it conforms to a fixed, consistent style.
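
Both are a single command to run on a code directory (src/ here is a placeholder):

    pip install flake8 black

    flake8 src/   # report style and lint violations
    black src/    # rewrite the files in place to conform to black's style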

I personally hate linters, to be quite honest.  I despise them.  But they are in common use, and for a good reason.  They standardize style across a large code base, and ensure that code is always written the same way.  Consider requiring students to use one.

If a linter is too much, then absolutely suggest adherence to some style guide or another.  It is the difference between the code being a portfolio piece, and being a big embarrassment.

In short

Start teaching students how to program in real programming languages following standard best-practices.

  • Use git to control updates to code (and papers!).
  • Don't just write code, but write code to automatically verify the code and its output.
  • Set up code on remote repositories like github with automated CI/CD, so the code keeps producing correct results.
  • Use relational databases and SQL to track research data.
  • Learn real programming languages that are useful outside academia.  Sadly, that means not Fortran.
  • Use a standardized, good style in writing the program.

While learning this might not be necessary for your job, your job no longer exists.  The students are going to get a different job, and learning all of these skills is essential to counting their time performing research not just as education, but as valuable work experience they can list on their resume.  A student who knows git, unit testing, setting up CI/CD, managing even simple databases, and querying data will be way ahead of other physics grads when they go to apply.  It will also enhance their research, and may even help them graduate faster.

These are my recommendations.  Implemented or not, the focus of physics needs to change, away from training professors, to training independent researchers who can use industry best-practices.
