Trying to manage data for a Geographic Information System (GIS) is like herding cats: you may think you have control, but there is always another piece of data or a newer version. The problem isn’t getting the data; after all, much of it is open and easily available. The issues lie with the varying coordinate systems, the disparate authorities and the varying data structures, and all this comes before working out how to manage the same data from different providers with different resolutions and methods of capture (crowdsourced, digitized or surveyed).
Almost all geospatial data is now available online and most of it is openly licensed. The amount of data has increased dramatically too. Earlier, a specialist could just download some flood zones once a month to keep them up to date, but now there are tidal flood zones, historic flood zones, flood protection areas, five-year flooding, ten-year flooding … there is a lot more data. Don’t get me wrong, more detail is great and provides better results, but it requires more storage and more time spent herding it in the right direction for use. The United Kingdom is quite small but is made up of England, Scotland, Wales and Northern Ireland. Each part collects and stores its data in different ways. The fields don’t match, the data has different names and in some cases the resolution of the data is different. So, trying to combine it all for a UK map can be a full-time job for a GIS technician.
The solution lies in automation. Choosing the right method is going to save a lot of time and investment, but there are different types of automation and different ways of using it. For the purpose of this piece, I am going to label the types “full automation” and “semi automation”.
Semi automation is where the system may do a lot of the work but still needs to be kicked off or started. Take, for example, the Esri ModelBuilder. This is a phenomenal tool for automating huge workflows. You can create models that take points in at one end and spew 3D data out of the other, but a user still needs to intervene to open the Esri software and start ModelBuilder, and maybe even move the data if necessary. The same is true of the QGIS equivalent of ModelBuilder. For around two decades, I have built huge amounts of semi automation using Esri, QGIS, or just plain Python or batch files.
Esri ModelBuilder is fantastic. Using it, you can create flowcharts for your analysis which can be stored and rerun. Around 10 years ago, Esri created a cost weighting system which took all the national constraints relating to wind farms and then provided the least weighted areas and boundaries. Furthermore, it created a Comma Separated Values (CSV) file that gave nearest features and risks. All I had to do was to add the feature that had the county and it would run. It served as a consistent and reliable method for finding suitable renewables sites for many years and cut the time of processing the data from a week to half a day. A far improved and elegant version of this now exists in an online platform called LandHawk.uk and runs in milliseconds compared to the hour it would take using ModelBuilder.
QGIS has the Graphical Modeler. Like the ModelBuilder, it provides a graphical interface where you can drag and drop tools, flows and analysis into a flowchart, thus making it easier to read and semi automate. I have had much success in the past with creating semi automation that creates contours or raster data from points.
Another method I have used quite a lot in my career involves Windows batch files. Batch files are like a list of command shell instructions in a neat little file that you can double click and run. Using this method, I have run Python scripts or massive data conversions using OGR2OGR. If you have used OGR2OGR before, you will find batch scripts very easy; a batch script is just the commands you would type into the command shell, saved as a text file.
cd /d C:\gis
ogr2ogr -f "PostgreSQL" PG:"host=localhost user=username_here password=***** dbname=db_name_here" input_shapefile.shp
The above batch script is simple: it navigates to the required folder (c:/gis) and then runs the OGR2OGR command. The text is written in Notepad or Notepad++ and then simply saved as a .bat file, which then becomes an executable script that you can run over and over again.
Scripts can be more complex. If you have a folder of data which you maintain and have to upload every week or month, you could simply write the uploads as a script like the one above, and then you would only need to run the .bat file. For bonus points, don’t forget that you can append options and SQL to the end of OGR2OGR commands so that you can append, overwrite, move schema or even (as I do at present) add GiST indexes after the data has uploaded to your PostGIS database.
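For those who prefer Python to batch files, the folder upload described above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the connection string, the function names and the choice of naming each table after its shapefile are my assumptions for illustration. It relies only on the OGR2OGR options -f, -nln, -overwrite and -append.

```python
import subprocess
from pathlib import Path

# Assumed connection string; replace the placeholders with your own values.
# The password is deliberately left out: libpq can pick it up from the
# PGPASSWORD environment variable or a pgpass file instead.
PG_CONN = "PG:host=localhost user=username_here dbname=db_name_here"


def build_upload_commands(folder, pg_conn=PG_CONN, overwrite=True):
    """Return one ogr2ogr argument list per shapefile found in `folder`."""
    commands = []
    for shp in sorted(Path(folder).glob("*.shp")):
        commands.append([
            "ogr2ogr",
            "-f", "PostgreSQL",
            pg_conn,                       # destination datasource
            str(shp),                      # source shapefile
            "-nln", shp.stem.lower(),      # table name taken from the file name
            "-overwrite" if overwrite else "-append",
        ])
    return commands


def upload_folder(folder, pg_conn=PG_CONN):
    """Run each command in turn; stop with an error if ogr2ogr fails."""
    for cmd in build_upload_commands(folder, pg_conn):
        subprocess.run(cmd, check=True)
```

Calling upload_folder("c:/gis") would then upload every shapefile in that folder, provided ogr2ogr is on the PATH. Building the commands separately from running them makes it easy to print and check them before anything touches the database.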
Using the above methods, it is quite easy to download data to a folder, manipulate it and then upload it in a clean and consistent manner to the required data store, though it does require a little intervention.
Full automation, in the context of this piece, means not needing any intervention. There is no need to kick off any process; it will run itself as needed. Some might say that this is the holy grail of GIS data management. It is difficult to achieve because the machine it runs on needs a way to trigger tasks at specific intervals and then run the specific operation. So far, only a few methods have worked for me: Feature Manipulation Engine (FME), Python, and special development.
FME by Safe Software is like the Esri ModelBuilder on steroids. It allows the building of huge workflows as flowcharts working with a huge array of data formats; the interoperability tools in Esri are built on FME. The secret weapon in its arsenal is the ability to fetch data from a URL. This means that you can simply add an initial step that takes the data from a WFS, a database, or a URL as CSV, shapefile or any other data format. Then it becomes simply a case of applying steps for all your data management. FME is a proprietary product and similar in cost to Esri products, but I strongly believe that once it is set up, it could free up 75% of your data administration time, and that it is well worth the investment.
Python and special development may, for some, sit in the same bucket. For those who know Python, it is easy to pull together a script which can download data from URLs and place that data into a folder. Then, using OGR2OGR, it is possible to create a whole workflow. In my personal experience, wrapping the Python scripts for each step in a batch file worked better for me. There are several examples online of how to do this automation using Python. My biggest gripe (not being a natural coder) is keeping all the dependencies up to date and the struggle to find URLs that have changed because the supplier has changed a name. Within FME a failure is referenced in a log; within Python the script just fails.
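As a rough illustration of the download step, and of one way to soften the "it just fails" problem, here is a minimal sketch using only the Python standard library. The function name, the logger name and the idea of returning a list of failed files are my assumptions for illustration; the URLs you pass in would be whatever your suppliers publish.

```python
import logging
import shutil
import urllib.request
from pathlib import Path

logger = logging.getLogger("gis_download")  # hypothetical logger name


def download_all(sources, dest_dir):
    """Download each {filename: url} entry into dest_dir.

    Returns the list of filenames that could not be downloaded, so one
    changed URL is logged and reported instead of stopping the whole run.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    failed = []
    for name, url in sources.items():
        try:
            with urllib.request.urlopen(url, timeout=60) as response, \
                    open(dest / name, "wb") as out:
                shutil.copyfileobj(response, out)
            logger.info("Downloaded %s", name)
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            logger.error("Could not download %s from %s: %s", name, url, exc)
            failed.append(name)
    return failed
```

Calling logging.basicConfig(filename="download.log", level=logging.INFO) before download_all(...) would write these messages to a log file, which gets a little closer to the way FME records a failure.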
How do you schedule FME or the batch file/Python script to run so that it is fully automated? For this I found that the built-in Windows Task Scheduler is fantastic, given that most of us have our machines on all the time. It is great on a desktop, and even better when put on a virtual machine.
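Tasks can be created through the Task Scheduler interface, but Windows also ships a command line tool, schtasks, that does the same job. The sketch below only builds the command for a weekly task; the task name, script path, day and start time are hypothetical values chosen for illustration.

```python
import subprocess


def build_schtasks_command(task_name, script_path, day="MON", start_time="02:00"):
    """Build a schtasks command that runs script_path once a week."""
    return [
        "schtasks", "/Create",
        "/TN", task_name,      # name shown in Task Scheduler
        "/TR", script_path,    # program or .bat file to run
        "/SC", "WEEKLY",       # schedule type
        "/D", day,             # day of the week
        "/ST", start_time,     # start time, 24-hour HH:MM
        "/F",                  # replace the task if it already exists
    ]


def register_task(task_name, script_path):
    """Create the task; only meaningful on a Windows machine."""
    subprocess.run(build_schtasks_command(task_name, script_path), check=True)
```

On a Windows machine, register_task("GIS weekly upload", r"C:\gis\upload.bat") would create the task. A script path that contains spaces needs extra quoting in the /TR value, which this sketch does not handle.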
The final challenge that hasn’t been addressed is data behind a login, for example, purchased data or open data that requires an acknowledgement of license terms. For username and password logins, both FME and Python can be set up to enter the credentials automatically. It is a lot easier than you may think. I would just advise that you be careful about entering usernames and passwords in any scripts, as this makes it easy for other people to get your information.
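One common way to keep credentials out of the script itself is to read them from environment variables at run time. The sketch below assumes the data provider accepts HTTP Basic authentication, which will not be true of every site; the environment variable names and the function name are hypothetical.

```python
import base64
import os
import urllib.request


def build_authenticated_request(url,
                                user_var="GIS_DATA_USER",
                                pass_var="GIS_DATA_PASSWORD"):
    """Build a request carrying HTTP Basic credentials from the environment.

    Raises KeyError if either variable is missing, so the script never
    falls back to a password typed into the source code.
    """
    user = os.environ[user_var]
    password = os.environ[pass_var]
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
```

The returned request can be passed to urllib.request.urlopen in place of a plain URL. The variables can be set for the account that runs the scheduled task, so the password never appears in the script or the .bat file.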
There are other situations where you may need to multiselect or click a box to approve. This is where developer tools come in useful. Developers write automated test scripts which perform the same clicks and selections a human makes, so that a website can be tested in different ways. These same tools can be used to automatically click or select information on a website. One of the most popular tools for this is Selenium WebDriver. It provides Python integration so that you can add it to the rest of your Python automation. Selenium is open source, though there are commercially licensed products built on it. If your business has a testing or devops team, it may well be worth asking for a little help.
Of course, this piece only discusses a few of the potential methods to achieve GIS zen; more are becoming available. It doesn’t discuss using the Python built into Esri or QGIS to obtain data, a method you may use if you have the knowledge. I have employed it a couple of times using Esri, only because the data came from Esri services.
One area where I feel that the main GIS providers, Esri and QGIS, need to improve is data management: the consumption and workflow of data. They provide exceptional tools to manipulate and store data. They even provide online portals and services for consuming data, but not methods for accessing data through external URLs or behind logins, nor tools to manage and adjust data before it enters the system in the way that you can with Python, OGR2OGR or FME. In my experience, the data lifecycle doesn’t flow as well as it should, though this is an area I am sure they will be keen to improve as the need for manipulating data increases and the focus shifts to data quality and efficiency.