Author: Grant Humphries of Black Bawks Data Science
The European Commission is currently developing an IT system to improve implementation of the catch certificate scheme, which is a cornerstone of EU legislation aimed at barring illegal seafood imports. Here, IT expert Grant Humphries of Black Bawks Data Science describes how a relatively straightforward algorithm can be developed to assist officials in identifying potentially suspect consignments.
Simple but powerful decision support tools based on machine learning are becoming vital in fighting many forms of crime. These tools have the potential to make a significant contribution in the fight against illegal fishing. The key application of such tools is in risk management, helping enforcement officials to make informed and transparent decisions about which fishing vessels or seafood consignments to control or inspect. This is especially important as up to a third of the world’s fish catches (11 – 26 million tonnes per year) are taken by illegal and unreported fishing activities.
In 2008, the European Union adopted its regulation to combat illegal fishing. The regulation aims to increase the costs and remove the incentives for illegal fishing activities, through several mechanisms including a catch documentation scheme, a deterrent sanctioning regime for infringements committed by EU nationals, as well as minimum standards for inspections of landings in EU ports.
Under the EU’s catch documentation scheme, imports of seafood must be accompanied by a certificate that confirms the products were taken in accordance with relevant fisheries rules. Member states are required to inspect fishing vessels in their ports and to carry out verifications of catch certificates on the basis of risk management – i.e. the risk that the catches stem from illegal fishing activities. The EU receives some 250,000 of these catch certificates each year, equating to roughly 1,000 certificates per working day across all member states. Risk analysis can therefore help ensure that limited enforcement resources are targeted effectively and efficiently to detect products from illegal fishing.
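The per-day figure can be checked with a quick calculation. This is only a back-of-the-envelope sketch in Python: the number of working days per year (250) is an assumption and is not stated in the text.

```python
# Back-of-the-envelope check of the daily catch certificate load.
CERTIFICATES_PER_YEAR = 250_000  # figure quoted in the text
WORKING_DAYS_PER_YEAR = 250      # assumption: roughly 50 weeks x 5 days

per_day = CERTIFICATES_PER_YEAR / WORKING_DAYS_PER_YEAR
print(f"{per_day:.0f} certificates per working day")
```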
The EU’s illegal fishing regulation sets out a number of risk criteria to guide member state authorities in selecting vessels and consignments for further controls and inspections. Based on these criteria, a coalition of NGOs in Europe has put together a position paper which identifies potential categories of information that could be used to develop an automated tool for deciding which seafood imports to verify and inspect. The NGOs recommend that such a tool be integrated into the EU-level IT system that is currently being developed by the European Commission to improve implementation of the EU catch certificate scheme. The aim of the IT system, originally planned for 2016 but now expected in 2018, is to facilitate cross-checks of incoming catch certificates by member state authorities, and the assessment of risk that imports stem from illegal fishing.
In this blog, I present a sample application that shows how a relatively simple decision support tool could be built into the EU IT system referred to above, to assist in the selection of fishing vessels or seafood consignments for controls aimed at detecting illegal fishing. The example below combines the predictive power of Random Forests – a powerful machine learning algorithm – with the open source web application builder “R Shiny”. This decision support tool allows users to select a test fishing vessel with set parameters and predict its probability of engaging in illegal fishing. Users can also create their own series of parameters and predict the probability of a ship engaging in illegal fishing.
The test application can be found at: https://blackbawks.shinyapps.io/IUUFishing/.
The code and technical details for the application/simulation are available on Github: https://github.com/Blackbawks/IllegalFishing
Development of the application
The steps involved in developing the risk management application were as follows:
Step 1: Creation of a simulated dataset of ships and characteristics with pre-set relationships between illegal fishing and our simulated predictor variables (see the table below)
Step 2: Training of the Random Forests (machine learning) algorithm
Step 3: Building of a web-based application in R Shiny that allows users to input data
Step 4: Use of the information in the trained Random Forests algorithm to predict the probability of the ship engaging in illegal fishing
In the real world, the dataset on which we would train the algorithm would be stored in a central, password- and firewall-protected database, which could be accessed through the web-based application. A proposed model could look like this:
Step 1: The data
The scenario in the test application has:
- Five fictional countries: Sidonia, Avalon, Noordilund, Slagovnia, and Tortuga.
- Five fictional owners: SparkleFish, FishRGud, KungFuFish, ScummyFishCo and FishARRRies
- Five classes of ship (classed by length): 1 (60 – 100m), 2 (101 – 130m), 3 (131 – 170m), 4 (171 – 220m) and 5 (221 – 300m).
- Five possible destinations of goods: LaLaLand, BetaZed, The Shire, Alpha Centauri, and Kings Landing
- Five fish species: Raricus fishica, Commonae eatedie, Billidae nyiecus, Donaldus trumpfishii and Fishica maximus
I next simulated 3000 fictional ship IDs, with the assumption that the five countries have submitted all known data for their ships. These 3000 ships form the basis by which we “teach” our model (training) to “learn” the patterns / relationships.
With the 3000 ships, I created a data table with the following columns:
| Column | Description |
|---|---|
| Owner | Name of ship owner |
| Country | Name of country the vessel is registered under |
| Ship type | Class of vessel (as defined above) |
| Year built | The year a simulated ship was constructed |
| Ship size | The size of the vessel in m (related to class) |
| Species | The species of fish reported as being fished |
| CITES | The CITES listing for the species listed |
| Year | The year (each ship has annual records – a time series – spanning from when it was constructed to the present) |
| Country Flag | If a country has any past infringements (0 = none, 1 = minor, 2 = major) |
| Owner Flag | The number of times the owner has been flagged for illegal fishing |
| AIS | If the ship had AIS activated while at sea (Yes / No) – this would involve linking to real time AIS tracking |
| Illegal | Did the vessel engage in illegal fishing? – derived from relationships between all the other variables |
| Shipswitch | Is the ship sending goods to its usual end-point? Or are the goods going elsewhere? |
I simulated this dataset under 10 assumptions:
- Assumption 1) The largest (class 5) and smallest (class 1) vessels are slightly more likely to engage in illegal fishing. (Note: this creates a bimodal distribution of ship sizes among vessels engaging in illegal activities, which demonstrates the non-parametric nature of the algorithm, i.e. it does not depend on assumed statistical distributions.)
- Assumption 2) Vessels registered in “responsible” countries with strong illegal fishing laws are less likely to engage in illegal fishing. In our dataset, Sidonia and Noordilund have strong regulations, Avalon is in the middle, and Slagovnia and Tortuga have little or no regulation.
- Assumption 3) Companies with sustainable practices will almost never engage in illegal fishing. In our example, SparkleFish and KungFuFish are the most sustainable, FishRGud is moderate, while ScummyFishCo and FishARRRies are the least sustainable.
- Assumption 4) Older fishing vessels are more likely to engage in illegal fishing, as they are more likely to be used by organizations that cut costs by not prioritizing safety features (such organizations are likely to be more corrupt).
- Assumption 5) Raricus fishica is the species most likely to be caught illegally. Billidae nyiecus looks like another species, so we also score it higher because illegal fishing could be associated with it.
- Assumption 6) Species listed in CITES Appendix II are more likely to be associated with illegal fishing.
- Assumption 7) If an owner has been flagged for illegal fishing in the past, this increases the likelihood a vessel is fishing illegally
- Assumption 8) If a country has been flagged for illegal fishing in the past, illegal fishing is more likely
- Assumption 9) If a ship has switched its trade route, it is more likely to be fishing illegally
- Assumption 10) If a ship has not switched on its AIS, it is more likely to be fishing illegally
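To make the idea of “pre-set relationships” concrete, here is a minimal Python sketch of how such a dataset could be simulated. It is not the code behind the application (which is written in R and available in the repository linked above). Every weight, the year range and the AIS and route-switch rates are hypothetical values chosen for illustration, and only some of the ten assumptions are encoded.

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable

# Hypothetical risk weights per category (illustrative values only).
OWNER_RISK = {"SparkleFish": 0.0, "KungFuFish": 0.0, "FishRGud": 0.1,
              "ScummyFishCo": 0.25, "FishARRRies": 0.25}
COUNTRY_RISK = {"Sidonia": 0.0, "Noordilund": 0.0, "Avalon": 0.1,
                "Slagovnia": 0.2, "Tortuga": 0.2}
SPECIES_RISK = {"Raricus fishica": 0.2, "Billidae nyiecus": 0.1,
                "Commonae eatedie": 0.0, "Donaldus trumpfishii": 0.0,
                "Fishica maximus": 0.0}


def simulate_ship(ship_id):
    """Draw one fictional ship and decide whether it fished illegally."""
    owner = random.choice(list(OWNER_RISK))
    country = random.choice(list(COUNTRY_RISK))
    species = random.choice(list(SPECIES_RISK))
    ship_class = random.randint(1, 5)
    year_built = random.randint(1960, 2016)  # hypothetical range
    ais_on = random.random() < 0.8           # hypothetical rate
    ship_switch = random.random() < 0.2      # hypothetical rate

    # Add up the pre-set relationships into a single risk score.
    risk = 0.05 + OWNER_RISK[owner] + COUNTRY_RISK[country] + SPECIES_RISK[species]
    if ship_class in (1, 5):   # Assumption 1: smallest and largest vessels
        risk += 0.05
    if year_built < 1985:      # Assumption 4: older vessels
        risk += 0.1
    if ship_switch:            # Assumption 9: changed trade route
        risk += 0.1
    if not ais_on:             # Assumption 10: AIS switched off
        risk += 0.15

    risk += random.gauss(0, 0.1)     # deliberate noise
    risk = min(max(risk, 0.0), 1.0)  # keep the score a valid probability
    illegal = random.random() < risk

    return {"ship_id": ship_id, "owner": owner, "country": country,
            "species": species, "ship_class": ship_class,
            "year_built": year_built, "ais_on": ais_on,
            "ship_switch": ship_switch, "illegal": illegal}


ships = [simulate_ship(i) for i in range(3000)]
illegal_share = sum(s["illegal"] for s in ships) / len(ships)
print(f"{len(ships)} ships simulated, {illegal_share:.0%} flagged as illegal")
```

Because the “Illegal” column is drawn from a probability rather than set by a hard rule, two ships with identical characteristics can end up with different labels, which is one way of building noise into the data.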
Step 2: The analysis
Note: in this case, I was not interested in testing the hyperparameters of Random Forests (i.e. the settings that help tune the algorithm), so I left these at their default values.
Random Forests works by way of decision trees (i.e. a souped-up series of conditional “if” / “then” statements) to make predictions on a target variable. It creates those conditional statements by “learning” the relationships between the target variable (here, illegal fishing) and the predictor variables (the variables we want to use to predict the target – see table above). Using the data simulated in Step 1, we used the “Illegal” column as our target variable – in other words, we were interested in predicting if a ship was engaged in illegal activity based on the other columns (owner, country, etc…).
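The “if / then” idea can be illustrated with a toy example in Python. In the sketch below the three trees are written by hand purely for illustration; a real Random Forests model learns many trees automatically from random subsets of the training data and predictor variables. The field names and split values are hypothetical. The predicted probability is taken here as the share of trees that vote “illegal”.

```python
# Three hand-written "trees"; a real forest would learn these from data.

def tree_a(ship):
    # First split on owner history, then on AIS or a changed route.
    if ship["owner_flags"] > 0:
        return (not ship["ais_on"]) or ship["ship_switch"]
    return False


def tree_b(ship):
    # First split on species, then on country record or AIS.
    if ship["species"] == "Raricus fishica":
        return ship["country_flag"] >= 1
    return not ship["ais_on"]


def tree_c(ship):
    # First split on vessel age, then on route or owner history.
    if ship["year_built"] < 1985:
        return ship["ship_switch"]
    return ship["owner_flags"] > 2


FOREST = [tree_a, tree_b, tree_c]


def predict_probability(ship):
    """Share of trees in the forest that vote 'illegal' for this ship."""
    votes = [tree(ship) for tree in FOREST]
    return sum(votes) / len(votes)


example_ship = {"owner_flags": 1, "ais_on": True, "ship_switch": False,
                "species": "Raricus fishica", "country_flag": 1,
                "year_built": 2000}
print(predict_probability(example_ship))
```

For the example ship only the second tree votes “illegal”, so one vote out of three gives the predicted probability.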
We used a cross-validation technique to check that the model was predicting our data well. In this case, the Random Forests model we used had an accuracy of 74%, meaning that it correctly predicted whether or not a ship was engaged in illegal fishing 74% of the time. This value could be improved considerably through tuning of the model (e.g. tuning of hyperparameters, use of ensemble models, deep learning methodology, or other techniques). I purposefully programmed “noise” into our dataset to ensure that we didn’t achieve a perfect model. The goal here is to demonstrate how the proposed system could work, rather than to perfect the model.
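The text does not say which cross-validation scheme was used, so the Python sketch below assumes a common one, k-fold cross-validation: the data are split into k parts, the model is trained on k-1 of them and scored on the part that was held out, and the k accuracy scores are averaged. The `train_majority` function is a deliberately trivial stand-in for the model and is not Random Forests; all names here are hypothetical.

```python
import random


def k_fold_accuracy(rows, labels, train, k=5, seed=0):
    """Average accuracy of the model returned by train() over k held-out folds."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal parts

    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        model = train([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        # Score only on rows the model has not seen during training.
        correct = sum(model(rows[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def train_majority(train_rows, train_labels):
    """Stand-in 'model': always predicts the most common training label."""
    majority = sum(train_labels) * 2 >= len(train_labels)
    return lambda row: majority


# Toy data: 100 rows, a quarter of which are labelled illegal (True).
rows = list(range(100))
labels = [i % 4 == 0 for i in rows]
accuracy = k_fold_accuracy(rows, labels, train_majority)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

A baseline like this is a useful yardstick: a model is only informative if its cross-validated accuracy is clearly better than always guessing the most common outcome.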
Step 3: User input and prediction
On the front end, the user is given the option to select from one of five ships, which fills in the pertinent data such as “owner”, “country” and “ship length”. For example, the “Christian Bale” is a ship owned by ScummyFishCo and registered in Slagovnia. She is 192m long, making her a class 4 ship. The ship was built in 1975 and normally sends product to LaLaLand. If that ship comes into port and the user tells the front end that the ship was catching Raricus fishica, that the shipment is being sent to Kings Landing, and that the AIS was active since last at port, we find that the probability of this ship engaging in illegal fishing is 0.93 (93%) – in this case, we would likely board the ship for inspection.
Another potential ship a user could pick in our application is the Bruce Lee. She is owned by KungFuFish and registered in Noordilund. The ship is 83m long, making her a class 1, and was built in 2014, normally shipping to LaLaLand. If on an excursion, the Bruce Lee returns with Commonae eatedie being shipped to LaLaLand, and had her AIS on, the probability of the ship engaging in illegal fishing would be 0.03 (3%), so we would not likely inspect the ship so thoroughly.
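The step from the user's selections to a row of predictor values can be sketched as follows in Python. The preset details are taken from the two ships described above; the function and field names are hypothetical, and the real application does this in R Shiny. The ship class is derived from the length bands listed in Step 1, and the route-switch variable from comparing the declared destination with the usual one.

```python
# Length bands in metres for the five ship classes defined in Step 1.
SHIP_CLASSES = {1: (60, 100), 2: (101, 130), 3: (131, 170),
                4: (171, 220), 5: (221, 300)}

# Preset details for two of the selectable ships, as described in the text.
PRESET_SHIPS = {
    "Christian Bale": {"owner": "ScummyFishCo", "country": "Slagovnia",
                       "length_m": 192, "year_built": 1975,
                       "usual_destination": "LaLaLand"},
    "Bruce Lee": {"owner": "KungFuFish", "country": "Noordilund",
                  "length_m": 83, "year_built": 2014,
                  "usual_destination": "LaLaLand"},
}


def ship_class(length_m):
    """Return the class whose length band contains length_m."""
    for cls, (low, high) in SHIP_CLASSES.items():
        if low <= length_m <= high:
            return cls
    raise ValueError(f"no ship class covers a length of {length_m} m")


def build_input(ship_name, species, destination, ais_on):
    """Combine preset ship data with the user's entries into one model input."""
    preset = PRESET_SHIPS[ship_name]
    return {
        **preset,
        "ship_class": ship_class(preset["length_m"]),
        "species": species,
        "ais_on": ais_on,
        # The route counts as switched if the goods are not going to the usual end-point.
        "ship_switch": destination != preset["usual_destination"],
    }


row = build_input("Christian Bale", "Raricus fishica", "Kings Landing", ais_on=True)
print(row["ship_class"], row["ship_switch"])
```

A row like this is what would be passed to the trained model to obtain the probability shown to the user.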
Step 4: Using the information
The real question is what threshold we should use when deciding whether or not to inspect a ship. For example, if the probability is 51%, do we board? The precautionary principle would suggest we do, but this would increase the number of inspections, which may not be commercially viable. One school of thought could be to only inspect ships with a very high likelihood of engaging in illegal fishing (e.g. 80% or more).
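The trade-off can be seen by counting how many inspections different thresholds would trigger. In the Python sketch below, the list of predicted probabilities is made up for illustration and does not come from the application.

```python
def inspections_at(threshold, probabilities):
    """Number of ships whose predicted probability meets the threshold."""
    return sum(p >= threshold for p in probabilities)


# Hypothetical predictions for eight incoming ships.
predicted = [0.03, 0.12, 0.35, 0.51, 0.56, 0.62, 0.80, 0.93]

for threshold in (0.5, 0.8):
    count = inspections_at(threshold, predicted)
    print(f"threshold {threshold:.0%}: inspect {count} of {len(predicted)} ships")
```

With these made-up numbers, lowering the threshold from 80% to 50% more than doubles the inspection workload, which is the cost that has to be weighed against the precautionary principle.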
No matter what approach is taken, decision support tools that take advantage of sophisticated algorithms are showing great promise. Using them to combat illegal fishing will automate decision-making in a transparent way that can be scaled from local to global solutions. Furthermore, data integrity can be secured through centralized databases with specifically designed access.
There is still much work to be done to develop these tools in a way that is agreed upon by the global community, but we are at a stage now to begin the process.
Additional Q and A with the author:
- Why was Random Forests selected as the chosen algorithm in this example? What alternatives are available?
- Random Forests is a powerful classification algorithm. Its advantages include its flexibility (e.g. how fast the model “learns”) and its overall accuracy when compared to similar algorithms. Methods such as gradient boosted trees or deep learning (neural nets) could also be used. However, within the R environment, Random Forests is easy to implement (deep learning methods are still evolving here) and is more accurate than gradient boosted methods.
- Why was R Shiny selected as the platform for development of the web application? What alternatives are available?
- For this case, I used R Shiny because it is very easy to go from raw code to an online implementation. The time it takes (depending on the complexity of the application) can be from a few hours (or less) to a couple of days.
- However, R Shiny is somewhat limited when it comes to scaling up to large applications. As such, I would recommend that Django (which is a web framework available through the Python language) be used for a scaled version of this application.
- Is it possible to introduce additional variables (risk criteria) over time?
- Yes, it is possible to test the application on a few criteria (e.g. for the purposes of a pilot project), and to leave the application open-ended for further criteria to be added at a later stage.
- Is it possible to “add” the application to an existing database?
- Yes, the application can be built around an existing database. This would require limited changes to the existing database with regards to its structure. However, there would have to be some coordination between the database, contributors, and the front end to ensure data security.
- What are the limitations associated with such applications?
- As noted above, the interpretation of probability is one of the most difficult issues. For “grey” scenarios (e.g. a probability of 50 or 60%), discussion as to thresholds will be necessary.
- The model relies on good quality data. It is important to ensure the data are clean and missing information is kept to a minimum.
- The modeling aspect (Random Forests) is sometimes referred to as a “black box” and can be difficult to interpret. However, this can be overcome by drawing on the expertise of those familiar with the algorithm.
- What options are available in terms of output/provision of results?
- As soon as the user enters the information in a catch certificate, the application can give a yes or no result for inspection/verification, or a probability (e.g. 0.56 or 0.80 – as explained above).
- It is also possible to program certain characteristics that would require inspections in 100% of cases (e.g. catches from a vessel included on an RFMO IUU vessel list).
- The application can also be programmed to provide a PDF report containing the results of the probability calculation.
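The first two output options above can be combined in a few lines. In this Python sketch, the vessel list entry, the default threshold and the function name are all hypothetical: a vessel on the list is always flagged for inspection, and every other case falls back to comparing the model probability with the threshold.

```python
# Placeholder for an RFMO IUU vessel list; the entry is made up.
IUU_LISTED_VESSELS = {"Hypothetical Vessel X"}


def inspection_decision(vessel_name, probability, threshold=0.8):
    """Return a yes/no result together with the probability behind it."""
    if vessel_name in IUU_LISTED_VESSELS:
        # Hard rule: listed vessels are inspected regardless of the model output.
        return {"inspect": True, "probability": probability,
                "reason": "vessel is on the IUU vessel list"}
    return {"inspect": probability >= threshold, "probability": probability,
            "reason": "model probability compared with threshold"}


print(inspection_decision("Hypothetical Vessel X", 0.10))
print(inspection_decision("Bruce Lee", 0.03))
```

Returning the reason alongside the result keeps the decision transparent, since an official can see whether a hard rule or the model triggered the inspection.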