How to tackle Google's monopoly while making search better for everyone

[Cover image: a crowd of perplexed magnifiers]

Yes, #Google has been a monopolist in the search and ads space for the past two decades.

And its self-fed profit machine should have been stopped before it became too big to fail and too big to compete with.

And yes, Google got such a stronghold that it actively invested energy in enshittifying its own results by prioritizing revenue-making entries, to the point that results got noticeably worse for everyone – and yet it's unlikely to see a single percentage point of its market share eroded.

And yes, Google is probably going to appeal the decision anyway, using the same arguments they've been using for years (“it's not that we're monopolists, it's that others like our money and want us to be there”).

But the crazy part of the story is that, while everybody agrees that Google is a monopolist that gained its position by basically bribing everyone away from competing with it, nobody knows how to deal with it.

That's because everybody knows that crawling the web and building a decent search engine compliant with modern constraints is hard. It's even harder to do while staying profitable. And there's usually a relation of inverse proportionality between the profitability of a search engine and the quality of its results.

That's why Google prefers to pay $21B a year to the likes of Apple, Samsung, Motorola and friends just to be the default search engine – in their browsers or on their devices. It's not necessarily that those companies don't have enough resources to build a competing search engine in house. It's just that they estimated how much it would cost them to make and maintain their own search engine, versus how much Google would pay them to stay out of the business and let it be the default, and they realized that they'd better just shut up and take Google's money.

Now lawmakers are coming and saying “hey, Google has bribed you to stay out of its business, now it's time to stop accepting its money”. So what? Force Apple or Samsung to build their own search engine from scratch, and end up like an even worse version of Bing, or like Apple's first version of maps?

On second thought, why not?

Why not establish that, if you're either a big creator or a big consumer of online content, then you should also contribute your little part to keeping the web searchable? That ensuring that the house is clean enough for people to find things is everybody's job – especially if you make a lot of money from at least a piece of that house?

Metasearch to the rescue

Why not take the meta-search approach of #Searxng and make it the norm?

Maybe we don't need many search engines that try to compete with the largest player on the market – one that built its monopoly over 25 years – or, worse, that try to reinvent the business model from scratch in a short time, solely because regulation forces them to, preferably with ads and sponsored content playing the smallest possible part.

But we can all benefit from many companies each playing their little part to keep the web searchable. Public resources need public protocols and low barriers to access and sharing if you also want competition.

Large tech companies could all contribute, for example, by running crawlers that index and aggregate results from their clouds, their user-generated content and their clients – or even by running them on larger portions of the Internet. Those crawlers and their algorithms should preferably be open-source, but they probably don't have to be – although the APIs they expose, and their formats, should be open and uniform for compatibility reasons.

That's because their results can then be aggregated by open meta-engines like Searxng. You could also easily fine-tune them, just like you would with a local Searxng instance – more results from search engine A? Fewer from B? Results from C only for Spanish? Results from D only for academic papers? Let your search algorithms work the way you like, and let thousands of custom search engines based on open protocols flourish. Let them compete on the quality of their results, or on the business niches they decide to invest in, or experiment with new business models, and let open implementations aggregate their results and provide them to the users.
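As a toy illustration – assuming a hypothetical uniform search API where every engine exposes the same endpoint and the same JSON result schema (no such standard exists today) – the weighting and merging logic of an aggregator could be as simple as:

import requests

# Hypothetical engines implementing a shared open search API, each
# with the relevance weight assigned by the aggregator admin (or user)
ENGINES = {
    'https://search.engine-a.example': 1.0,   # full relevance
    'https://search.engine-b.example': 0.5,   # fewer results from B
    'https://papers.engine-d.example': 0.8,   # academic papers only
}

def metasearch(query: str) -> list[dict]:
    results = []
    for base_url, weight in ENGINES.items():
        rsp = requests.get(f'{base_url}/search', params={'q': query}, timeout=5)
        for entry in rsp.json().get('results', []):
            # Re-rank each entry by the per-engine weight
            entry['score'] *= weight
            results.append(entry)

    # Merge everything into a single ranked list
    return sorted(results, key=lambda e: e['score'], reverse=True)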

A given search engine decides to enshittify? It starts boosting extremist results? It starts returning only low-quality sponsored content on its first page? It gets purchased by Evil Corp? Then the admin of a search aggregator can simply reduce the relevance of its results, or remove them entirely. And users could choose whatever search aggregator they prefer – or even tune the relevance of results themselves from their own settings. No need for regulators to scratch their heads over how to stop a monopoly. No need for anyone to be forced into accepting bribes. No need to ask every Big Tech company to build a general-purpose search engine from scratch rather than shipping Google as a default – or, worse, to turn a monopoly into an oligopoly. Give your users all the freedom they want. Let them run their own aggregators, or sign up to a search aggregator, free or commercial. Let them tune results from multiple search engines the way they like. And let a rebalanced mechanism of supply and demand, based on open protocols but competing implementations, regulate the market the way a healthy market is supposed to work.

Or maybe revive the Semantic Web

Maybe we could even dust off semantic protocols (the “old”, and probably real, Web 3.0), such as RDFS and OWL, to make web content truly machine-readable. Those could even make the whole concept of a search engine built on HTML/natural-language scrapers obsolete.

The main obstacle to their adoption two decades ago was the burden of having to manually annotate your content just to make it machine-readable – or, worse, to maintain your own ontologies. Modern AI technologies have definitely changed those constraints – and this could be a good application for them.
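Just to give an idea, here's a minimal sketch (using Python's rdflib, with a made-up post URL) of the kind of annotations an AI-assisted annotator could emit for a blog post:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

SCHEMA = Namespace('https://schema.org/')
post = URIRef('https://blog.example.org/posts/tackle-google-monopoly')

g = Graph()
# Triples that could be generated automatically instead of by hand
g.add((post, RDF.type, SCHEMA.BlogPosting))
g.add((post, DCTERMS.title, Literal("How to tackle Google's monopoly")))
g.add((post, SCHEMA.about, Literal('search engines')))

# Serialized as Turtle, the content is machine-readable with no
# HTML or natural language scraping involved
print(g.serialize(format='turtle'))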

The dinosaur in the room

Then there's still the #Mozilla problem on the table. If nobody should accept Google's bribes anymore, that includes Mozilla. And we all know that the $500M/year bribe that Google pays to Mozilla to be the default search engine in Firefox is basically the only serious recurring annual revenue that keeps Mozilla (barely) afloat.

Paradoxically, Google needs Mozilla because the existence of Firefox is the only thing that allows them to say “you see? there's an alternative browser with an alternative rendering engine out there! granted, it has <5% of the market, but it's there, so you can't say that Chrome/Blink has a complete market monopoly”.

And Mozilla needs Google because, well, without their voluntary bribe they would be bankrupt – making the web actually even less competitive.

In their defense, it's not like Mozilla hasn't tried its best to diversify its offering and break free from its awkward relationship with Google. But whatever they've tried (from MDN, to their VPN, to Pocket, to Relay, to sponsored links on the home page, to the more recent moves in the field of data-collection defaults and ads) has proved either too ambitious for their resources, too underwhelming compared to established alternatives, or too controversial for their privacy-aware user base.

So maybe this could be the right chance to acknowledge that not only do public resources need public protocols, but that non-profit organizations that keep both competition and open protocols alive also need public funding – it's not fair to let a non-profit compete with a business giant by the giant's market rules.

Now that we all agree on who the bad guy has been all this time, this can be the right chance to do things right. Get past the “we know it's not right, but we all benefit from it” phase, get bold, and think of new solutions. Because all the building blocks are already out there.

Python, datetime, timezones and messy migrations

[Cover image: two perplexed clocks that can't seem to agree on what time it is]

datetime.datetime.utcnow() has been deprecated since #Python 3.12:

>>> from datetime import datetime as dt
>>> dt.utcnow()
DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).

And I see why. I've always wondered why that method for some reason returned a timezone-naive datetime in the first place.

Like, if I'm asking for UTC, can't you just set tzinfo to UTC before I accidentally compare the result with a datetime.datetime.now() (which is timezone-naive) in another method, and suddenly end up comparing a timestamp in L.A. with one in London? Who thought it was a good idea not to have any guard rails in the interpreter to prevent me from comparing apples to bananas?

The officially suggested alternative is to go for datetime.datetime.now(datetime.UTC) instead – i.e. to explicitly set the UTC timezone if you need a monotonic datetime series.
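For reference, the new recommended form looks like this:

from datetime import datetime, UTC  # datetime.UTC requires Python >= 3.11

now = datetime.now(UTC)
print(now.tzinfo)  # UTC – the returned object is timezone-aware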

It's a sensible change that should have been made years ago.

What about Python <= 3.11?

Except that datetime.UTC is an alias (of datetime.timezone.utc) introduced only in Python 3.11. On older versions:

Python 3.9.2 (default, Mar 12 2021, 04:06:34)
[GCC 10.2.1 20210110] on linux
>>> import datetime
>>> datetime.UTC
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'datetime' has no attribute 'UTC'

So the officially suggested solution only works with Python versions released since October 2022. They could have at least suggested datetime.timezone.utc, which has been around much longer. So, if you happen to maintain code that is still supposed to support Python <= 3.11, you may want to ignore the documentation and use this solution instead:

import datetime as dt

now = dt.datetime.now(dt.timezone.utc)

I'm not sure why nobody noticed that the solution suggested by the interpreter is incompatible with any Python version more than about two years old, when it doesn't even have to be that way.

The storage problem

However, the biggest issue is how badly it breaks back-compatibility with all the code that has been written (and, most importantly, the data that has been stored) before.

Take this code for example:

import datetime as dt

...

def create_token(...):
    token = SessionToken(...)
    token.expires_at = (
        dt.datetime.utcnow() + dt.timedelta(days=365)
    )

    db.save(token)
    return token

def login(token):
    ...

    if token.expires_at < dt.datetime.utcnow():
        raise ExpiredSession(...)

You've been running this code for a while, you've created a bunch of session tokens, and since you used utcnow all those timestamps have been stored as offset-naive UTC.

Now you go and replace all the occurrences of utcnow with now(UTC). What happens?

Breaking what was already broken

Well, if your code used to always compare offset-naive to offset-naive timestamps generated via utcnow, everything should be ok. You were comparing apples to apples before, now you're comparing bananas to bananas.

If instead you were comparing a utcnow() with a now() somewhere, your code will start breaking:

TypeError: can't compare offset-naive and offset-aware datetimes

And this is actually good. You were comparing apples to bananas before, and the interpreter didn't say anything about it. Now it does. That JIRA ticket about the weird issue where session tokens generated by your users in L.A. expired 8 hours later than they were supposed to can finally be closed.

Breaking what wasn't supposed to break

But what happens when you load from the db a token that was saved with the previous, offset-naive utcnow implementation? Well, your code will suddenly break:

TypeError: can't compare offset-naive and offset-aware datetimes

You did the migration right, you were comparing apples to apples before and bananas to bananas now, but your database still has some apples that it wants to compare. Now what? The solution may not be pretty:

# This check needs to be here for as long as there are
# timestamps stored or transmitted in the old format
if not token.expires_at.tzinfo:
    # Note: replace() returns a new object, it doesn't mutate in place
    token.expires_at = token.expires_at.replace(tzinfo=dt.timezone.utc)

if token.expires_at < dt.datetime.now(dt.timezone.utc):
    ...

And that's assuming that I know that all the offset-naive timestamps that were stored on the db were always stored in UTC (and that's often a big if).

But who's going to deal with the mess of mixed-format timestamps on the db, especially if you have many APIs that also return those timestamps? Time to change all of your API response schemas too, I guess.

Not to mention the case where, like on Postgres, you often explicitly create timestamp columns with or without timezones. Migrating to the new logic means having to migrate all of your TIMESTAMP WITHOUT TIME ZONE columns to TIMESTAMP WITH TIME ZONE – for every table on your db that has a timestamp column. Otherwise, change all the occurrences of utcnow in your code to something like dt.datetime.now(dt.UTC).replace(tzinfo=None).
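If migrating all of those columns isn't an option, a drop-in replacement for utcnow() along those lines could look like this sketch (the helper name is mine):

import datetime as dt

def utcnow_naive() -> dt.datetime:
    # Same semantics as the deprecated dt.datetime.utcnow(): current UTC
    # time with the tzinfo stripped, so that it stays compatible with
    # TIMESTAMP WITHOUT TIME ZONE columns and old offset-naive data
    return dt.datetime.now(dt.timezone.utc).replace(tzinfo=None)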

I'm not sure everyone in the community is aware of the consequences of the new implementation yet, nor that migrating to Python >= 3.12 should be considered a breaking change. You should pay special attention if your code deals with persisted datetime objects.

Sometimes you have to break eggs in order to make an omelette, I guess.

Why stop now()?

Another thing I've never understood is why Python returns datetime objects that are offset-naive by default anyway.

The utcnow() problem has been solved. But what about now()? Why does it still return an offset-naive object when called with no arguments? Why doesn't it bother to fill the tzinfo attribute with the timezone configured on the local machine? If I need a monotonic series, I can just call now(UTC) anyway, or time(), which is even more portable. It's ok to break the code that does risky comparisons, but why not prevent those risky comparisons upfront, with sensible defaults that do their best to enforce apple-to-apple comparisons?
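For the record, the local timezone is one call away – now() just doesn't bother attaching it:

>>> from datetime import datetime
>>> datetime.now().tzinfo is None   # offset-naive by default
True
>>> datetime.now().astimezone().tzinfo  # e.g. on a machine in CEST
datetime.timezone(datetime.timedelta(seconds=7200), 'CEST')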

I feel like many cumulative years of suffering experienced by countless Python developers dealing with dates and times could have been spared if only the language had implemented timezone-aware datetime defaults from day 1.

It's about time()

Good abstractions may be conceptually elegant, but most abstractions come with a maintenance cost even when they are good. And datetime abstractions in languages like Python or Java are no exception.

The utcnow() issue is only the latest in a long string of problems caused by such abstractions that I've had to deal with. And whenever I encounter one of these issues, I can't help asking myself how much simpler things would be if all date and time representations were always calculated, compared and stored using a simple time().

It can be converted on the fly to a datetime abstraction when you need to display it to the user, or to return it in an API response. But your database and your code should probably only ever talk in terms of the number of seconds passed since Jan 1st 1970. A UNIX epoch is probably all you need, most of the time.
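A quick sketch of what that looks like in practice:

import datetime as dt
import time

# Store and compare plain UNIX epochs: always UTC, no tzinfo ambiguity
expires_at = time.time() + 365 * 24 * 3600
expired = expires_at < time.time()

# Convert to a timezone-aware datetime only at the presentation layer,
# e.g. to display it to the user or return it in an API response
print(dt.datetime.fromtimestamp(expires_at, tz=dt.timezone.utc).isoformat())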

German administrations love open-source, but some initiatives could benefit from a more pragmatic approach.


Big kudos to the German state of Schleswig-Holstein!

Another German administration is breaking out of Microsoft's glass cage, and at first read the scope of this initiative seems more ambitious than many I've witnessed in the past.

Both the ArsTechnica article and the original announcement omit a few details that would help better estimate the odds of success of this initiative, though.

The announcement follows previously established plans to migrate the state government off Microsoft Office in favor of open source LibreOffice.

I hope that there's a Web-based offering somewhere on the horizon. Fewer and fewer employees nowadays run Word/Excel directly on their machines: most of them use Google Docs or Microsoft's office cloud. Giving them a stand-alone app that limits the possibilities for online collaboration may be met with resistance, especially now that many of them are already getting used to online AI assistants. I read that #NextCloud is involved – I hope there's a plan to run #CollaboraOffice (which is more or less the LibreOffice engine running as a service), #OnlyOffice, or some equivalent alternative.

Due to the high hardware requirements of Windows 11, we would have a problem with older computers. With Linux we don't have that

Very sensible decision that will probably save taxpayers a lot of money. But it'd also be interesting to know which #Linux distro has been selected. Hopefully the administration won't repeat Munich's past mistakes and won't try to build and maintain its own distro. Better to get into talks with a popular distro (probably not Red Hat – but hey, isn't SuSE German?) and orchestrate a deal where the State funds its development and in exchange gets development support. It's a win-win: a distro not managed by a giant like Red Hat or Canonical gets consistent direct funding from a public administration (which is what many of us have been advocating for years anyway), and the local administration enjoys the support of a well-documented distro like OpenSuSE, Mint, Pop_OS or Manjaro without having to reinvent the wheel and scramble for its own developers/packagers/maintainers – all while minimizing the risk of going from one vendor lock-in (Microsoft) to another (IBM or Canonical).

The government will ditch Microsoft Sharepoint and Exchange/Outlook in favor of open source offerings Nextcloud and Open-Xchange, and Mozilla Thunderbird

Same issue as with LibreOffice: folks today are used to webmail and mobile apps. Thunderbird definitely fills the gap on the stand-alone side, especially now that it's getting more love and support than before. But it still lacks an official mobile app – K-9 is almost there, but not quite yet. And it doesn't solve the “I'm used to the GMail/Outlook interface, where I set all of my filters and do my advanced searches from a webview” problem. There's actually a big gap there. What's a decent open webmail UI that can compete with GMail/Outlook nowadays? RoundCube feels ancient and has barely changed in 15 years. SnappyMail is a bit better – it's what I use as my webmail client too – but it's still lightyears behind GMail/Outlook. NextCloud Mail is slowly getting there, but it only integrates with a NextCloud installation. Let's admit that there's a gap that needs to be filled fast if we don't want employees with years of email muscle memory trained in specific environments to doom the project.

Schleswig-Holstein is also developing an open source directory service to replace Microsoft's Active Directory and an open source telephony offering.

Please, don't. Just don't. A local administration, no matter how well-intentioned and initially well-funded, just won't have the resources necessary to invent such big wheels. And, even if it somehow manages to bake something together, it'll eventually be a patchy solution that they'll have to maintain themselves for years to come, and that is unlikely to find adoption outside of its initial borders.

Invest in #OpenLDAP to fill the gaps left by ActiveDirectory on the LDAP side instead. That project needs a lot more love. And leverage WebDAV for almost everything else. If you are already planning to use NextCloud, it'll already do a lot of the heavy lifting for you on that side, without having to write new software or come up with new protocols.

Same for telephony: look into iPXE and other open implementations of the PXE and SIP protocols. Telephony protocols are hard and well-established; reinventing the wheel should be avoided at all costs.

I think there's a lot of potential in initiatives like these, but only a clear definition of their scope and a clear plan of execution, with continuous user feedback, can help prevent something like the failure of the early Munich experiments.

RE: https://arstechnica.com/information-technology/2024/04/german-state-gov-ditching-windows-for-linux-30k-workers-migrating/

Many ambitious voice projects have gone bust in the past couple of years, but one seems to be more promising than it was a while ago.


I've picked up some development on Picovoice these days, as I'm rewriting some Platypush integrations that haven't been touched in a long time (and Picovoice is among them).

I originally worked with their APIs about 4-5 years ago, when I did some research on STT engines for Platypush.

Back then I kind of overlooked Picovoice. It wasn't very well documented, the APIs were a bit clunky, and their business model was based on a weird “send us an email with your use-case and we'll get back to you” process (definitely not the kind of thing you'd want other users to reuse with their own accounts and keys).

Eventually I did just enough work to get the basics working, and then both my first and my second article on voice assistants focused more on other solutions – namely Google Assistant, Alexa, Snowboy, Mozilla DeepSpeech and Mycroft's models.

A couple of years down the line:

  • Snowboy is dead
  • Mycroft is dead
  • Mozilla DeepSpeech isn't officially dead, but it hasn't seen a commit in 3 years
  • Amazon's AVS APIs have become clunky and it's basically impossible to run any logic outside of Amazon's cloud
  • The Google Assistant library has been deprecated without a replacement. It still works on Platypush after I hammered it a lot (especially when it comes to its dependencies from 5-6 years ago), but it only works on x86_64 and Raspberry Pi 3/4 (not aarch64).

So I was like “ok, let's give Picovoice another try”. And I must say I'm impressed by what I've seen. The documentation has improved a lot. The APIs are much more polished. They also have a Web console that you can use to train your hotword models and intent logic – no coding involved, similar to what Snowboy used to have. The business model is still a bit weird, but at least now you can sign up from a Web form (and still explain what you want to use Picovoice products for), and you immediately get an access key to start playing with on any platform. The product isn't fully open-source either (only the API bindings are). But at first glance it seems that most of the processing (if not all of it, with the exception of authentication) happens on-device – and that's a big selling point.

Most of all, the hotword models are really good. After a bit of plumbing with sounddevice, I've managed to implement real-time hotword detection on Platypush that works really well.
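For the curious, the plumbing boils down to something like this sketch (the access key placeholder is yours to fill in, and the keyword list is illustrative):

import struct

import pvporcupine
import sounddevice as sd

porcupine = pvporcupine.create(
    access_key='YOUR_PICOVOICE_KEY',   # from the Picovoice console
    keywords=['porcupine'],            # or keyword_paths= for custom models
)

def on_audio(indata, frames, time, status):
    # Unpack the raw 16-bit PCM buffer into the frame format Porcupine expects
    pcm = struct.unpack_from('h' * porcupine.frame_length, indata)
    if porcupine.process(pcm) >= 0:
        print('Hotword detected!')

with sd.RawInputStream(
    samplerate=porcupine.sample_rate,  # 16 kHz
    blocksize=porcupine.frame_length,  # 512 samples per frame
    channels=1,
    dtype='int16',
    callback=on_audio,
):
    input('Listening... press Enter to stop\n')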

The accuracy is comparable to Google Assistant's, while supporting many more hotwords and being completely offline. Latency is very low, and CPU usage is minimal even on a Raspberry Pi 4.

I also like the modular architecture of the project. You can use individual components (Porcupine for hotword detection, Cheetah for speech-to-text on a live stream, Leopard for speech transcription, Rhino for intent parsing...) to customize your assistant with exactly the features you want.

I'm now putting together a new Picovoice integration for Platypush that, rather than having separate integrations for hotword detection and STT, wires everything together, enables intent detection, and provides TTS rendering too (depending on the current state of Picovoice's TTS products).

I'll write a new blog article when ready. In the meantime, you can follow the progress on the Picovoice branch.