Procurement recommender systems: how much better before we trust them? -- re García Rodríguez et al (2020)


How great would it be for a public buyer if an algorithm could identify the likely best bidder/s for a contract it sought to award? Pretty great, agreed.

For example, it would allow targeted advertising or engagement of public procurement opportunities to make sure those ‘best suited’ bidders came forward, or to start negotiations where this is allowed. It could also enable oversight bodies, such as competition authorities, to screen for odd (anti)competitive situations where well-placed providers did not bid for the contract, or only did so in worse than expected conditions. If the algorithm was flipped, it would also allow potential bidders to assess for which tenders they are particularly well suited (or not).

It is thus not surprising that there are commercial attempts being developed (eg here) and interesting research going on trying to develop such recommender systems—which, at root, work similarly to recommender systems used in e-commerce (Amazon) or digital content platforms (Netflix, Spotify), in the sense that they try to establish which of the potential providers are most likely to satisfy user needs.

An interesting paper

On this issue, on which there has been some research for at least a decade (see here), I found this paper interesting: García Rodríguez et al, ‘Bidders Recommender for Public Procurement Auctions Using Machine Learning: Data Analysis, Algorithm, and Case Study with Tenders from Spain’ (2020) Complexity Art 8858258.

The paper is interesting in the way it builds the recommender system. It follows three steps. First, an algorithm trained on past tenders is used to predict the winning bidder for a new tender, given some specific attributes of the contract to be awarded. Second, the predicted winning bidder is matched with its data in the Companies Register, so that a number of financial, workforce, technical and location attributes are linked to the prediction. Third and finally, the recommender system is used to identify companies similar to the predicted winner. Such identification is based on similarity with the added attributes of the predicted winner, which are subject to some basic filters or rules. In other words, the comparison is carried out at supplier level, not directly in relation to the object of the contract.
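Stripped to its essentials, that three-step pipeline can be sketched as follows. This is a minimal illustration under my own assumptions, not the authors' implementation: all names (`recommend_bidders`, `winner_model`, and so on) are hypothetical.

```python
def recommend_bidders(new_tender, winner_model, companies_register,
                      similarity_filters, basket_size=5):
    # Step 1: predict the likely winner from the tender's attributes,
    # using a model trained on past tenders.
    predicted_winner = winner_model.predict(new_tender)

    # Step 2: enrich the prediction with the winner's registry profile
    # (financial, workforce, technical and location attributes).
    winner_profile = companies_register[predicted_winner]

    # Step 3: recommend companies whose profiles pass the manually set
    # filters relative to the predicted winner's profile.
    candidates = [
        company for company, profile in companies_register.items()
        if company != predicted_winner
        and all(f(profile, winner_profile) for f in similarity_filters)
    ]
    return [predicted_winner] + candidates[:basket_size - 1]
```

Note that the similarity comparison in step 3 runs against the *predicted winner's profile*, not against the tender itself, which is the supplier-level comparison described above.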

Importantly, the filters used to sieve through the comparison need to be given numerical values, and this is done manually (i.e. set at rather arbitrary thresholds which, for some categories such as technical specialism, make little intuitive sense). This would in principle allow the user of the recommender system to tailor the parameters of the search for recommended bidders.

In the specific case study developed in the paper, the filters are:

  • Economic resources to finance the project (i.e. operating income, EBIT and EBITDA);

  • Human resources to do the work (i.e. number of employees);

  • Specialised work which the company can do (based on code classification: NACE2, IAE, SIC, and NAICS); and

  • Geographical distance between the company’s location and the tender’s location.
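A minimal sketch of how the four case-study criteria might be expressed as filter predicates, purely for illustration: the field names and numeric thresholds are my own placeholders, since the paper sets its cut-off values manually.

```python
def economic_filter(profile, winner, tolerance=0.5):
    # Economic resources: operating income within a tolerance band of the
    # predicted winner's (the paper also uses EBIT and EBITDA).
    return (abs(profile["operating_income"] - winner["operating_income"])
            <= tolerance * winner["operating_income"])

def workforce_filter(profile, winner, tolerance=0.5):
    # Human resources: headcount within a tolerance band of the winner's.
    return (abs(profile["employees"] - winner["employees"])
            <= tolerance * winner["employees"])

def specialism_filter(profile, winner):
    # Specialised work: shared activity codes (e.g. NACE2, SIC) as a
    # coarse proxy for technical specialism.
    return bool(set(profile["activity_codes"]) & set(winner["activity_codes"]))

def distance_filter(profile, winner, max_km=300):
    # Geographical distance between company and tender location,
    # assumed precomputed here; max_km is an arbitrary placeholder.
    return profile["distance_to_tender_km"] <= max_km
```

The thresholds (`tolerance`, `max_km`) are exactly the manually set numerical values discussed above, so different users could dial them up or down when tailoring a search.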

Notably, in the case study, distance ‘is a fundamental parameter. Intuitively, the proximity has business benefits such as lower costs’ (at 8).

The key accuracy metric for the recommender system is whether it is capable of identifying the actual winner of a contract as the likely winning bidder or, failing that, whether it is capable of including the actual winner within a basket of recommended bidders. Based on the available Spanish data, the performance of the recommender system is rather meagre.
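That metric is, in effect, top-k accuracy: the share of test tenders whose actual winner appears among the k recommended bidders. A minimal sketch (the function name and inputs are hypothetical):

```python
def top_k_accuracy(recommendations, actual_winners, k=5):
    # recommendations: one ranked list of recommended bidders per tender.
    # actual_winners: the company that actually won each tender.
    # Returns the fraction of tenders whose winner is in the top-k basket;
    # k=1 gives the single-prediction accuracy.
    hits = sum(
        winner in basket[:k]
        for basket, winner in zip(recommendations, actual_winners)
    )
    return hits / len(actual_winners)
```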

The poor results can be seen in the two scenarios developed in the paper. In scenario 1, the training and test data are split 80:20 and the 20% is selected randomly. In scenario 2, the data is also split 80:20, but the 20% test data is the most recent one. As the paper stresses, ‘the second scenario is more appropriate to test a real engine search’ (at 13), in particular because the use of the recommender will always be ‘for the next tender’ after the last one included in the relevant dataset.
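The difference between the two scenarios is simply how the held-out 20% is chosen, which can be sketched as follows (illustrative code, not the paper's):

```python
import random

def split_random(tenders, test_frac=0.2, seed=0):
    # Scenario 1: hold out a random 20% of tenders.
    shuffled = tenders[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def split_temporal(tenders, test_frac=0.2):
    # Scenario 2: hold out the most recent 20%, mirroring real use,
    # where the recommender always predicts 'the next tender'.
    ordered = sorted(tenders, key=lambda t: t["date"])
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]
```

The random split lets the model 'peek' at tenders that postdate some of its test cases, which is why the temporal split is the more honest benchmark.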

For that more realistic scenario 2, the recommender has an accuracy of 10.25% in correctly identifying the actual winner, and this only rises to 23.12% if the recommendation includes a basket of five companies. Even for the less realistic scenario 1, the accuracy of a single prediction is only 17.07%, and this goes up to 31.58% for 5-company recommendations. The most accurate performance with larger baskets of recommended companies only reaches 38.52% in scenario 1, and 30.52% in scenario 2, although the much larger number of recommended companies (approaching 1,000) also massively dilutes the value of the information.

Comments

So, with the available information, the best performance of the recommender system gives about a 1 in 10 chance of correctly identifying the most suitable provider, or roughly a 1 in 5 chance of having it included in a basket of five recommendations. Put the other way round, the best performance of the realistic recommender is that it fails to identify the actual winner of a tender 9 out of 10 times, and it still fails around 4 out of 5 times when it is given five chances.

I cannot say how this compares with non-automated searches based on looking at relevant company directories, other sources of industry intelligence or even the anecdotal experience of the public buyer, but these levels of accuracy could hardly justify the adoption of the recommender.

In that regard, the optimistic conclusion of the paper (‘the recommender is an effective tool for society because it enables and increases the bidders participation in tenders with less effort and resources’ at 17) is a little surprising.

The discussion of the limitations of the recommender system sheds some more light:

The main limitation of this research is inherent to the design of the recommender’s algorithm because it necessarily assumes that winning companies will behave as they behaved in the past. Companies and the market are living entities which are continuously changing. On the other hand, only the identity of the winning company is known in the Spanish tender dataset, not the rest of the bidders. Moreover, the fields of the company’s dataset are very limited. Therefore, there is little knowledge about the profile of other companies which applied for the tender. Maybe in other countries the rest of the bidders are known. It would be easy to adapt the bidder recommender to this more favourable situation (at 17).

The issue of the difficulty of capturing dynamic behaviour is well put. However, there are more problems (below), and disclosure of the other participants in a tender would not straightforwardly benefit the accuracy of the recommender system unless it covered not only the identity of the other bidders but also the full evaluations of their tenders, which is an unlikely scenario in practice.

There is also the unaddressed issue of whether it makes sense to compare companies on the specific attributes selected in the study (mostly, it does not), a selection driven by the available data rather than by any theoretical rationale.

What is ultimately clear from the paper is that the data required for the development of a useful recommender is simply not there, either at all or with sufficient quality.

For example, it is notable that, due to data quality issues, the database of past tenders shrinks from 612,090 recorded tenders to 110,987 usable ones, and shrinks further to 102,087 because of additional quality issues in matching the tender information with the Companies Register.

It is also notable that the information of the Companies Register is itself not (and probably cannot be, period) checked or validated, despite the fact that most of it is simply based on self-declarations. There is also an issue with the lag with which information is included and updated in the Companies Register—e.g. under Spanish law, company accounts for 2021 will only have to be registered over the summer of 2022, which means that a use of the recommender in late 2022 would be relying on information that is already a year old (as the paper itself hints, at 14).

And I also have an inkling that recommender systems such as this one would be problematic in at least two respects, even if all the necessary data were available.

The first one is that the recommender seems by design incapable of comparing the functional capabilities of companies with very different structural characteristics, unless the filtering parameters are given such a wide range that the basket of recommendations approaches four digits. For example, even if two companies were the closest ones in terms of their specialist technical competence (even if captured only by the very coarse and in themselves problematic codes used in the model), which seems to be the best proxy for identifying suitability to satisfy the functional needs of the public buyer, they could differ significantly in everything else, especially if one of them is a start-up. Whether the recommender would put both in the same basket (of a useful size) is an empirical question, but it seems extremely unlikely.

The second issue is that a recommender such as this one seems quite vulnerable to the risk of perpetuating and exacerbating incumbency advantages, and/or of consolidating geographical market fragmentation (given the importance of eg distance, which cannot generate the expected impact on eg costs in all industries, and can increasingly be entirely irrelevant in the context of digital/remote delivery).

So, all in all, it seems like the development of recommender systems needs to be flipped on its head if data availability is driving design. It would in my view be preferable to start by designing the recommender system in a way that makes theoretical sense, and then make sure that the required data architecture exists or is created. Otherwise, the adoption of suboptimal recommender systems would not only be likely to generate significant issues of technical debt (for a thorough warning, see Sculley et al, ‘Hidden Technical Debt in Machine Learning Systems’ (2015)), but would also risk significantly worsening the quality (and effectiveness) of procurement decision-making. And any poor implementation in ‘real life’ would deal a severe blow to the prospects of sustainable adoption of digital technologies to support procurement decision-making.