I’ve been using Pocket since 2009, back when it was still called “Read It Later”. I use it a whole freakin’ lot. Yes, I usually self-host my things, and yes, I know about wallabag and a couple of other similar tools. Pocket never gave me any trouble, and for a while I even paid for it. It worked and got out of my way. It was, essentially, exactly what I needed.
Pocket is being shut down in a couple of weeks.
And I have 32.000 entries in there that I would like to keep.
Step 1: Set up Wallabag
This is exceedingly easy. I’m using Docker for a lot of my self-hosted stuff, so I can simply go to their suggested compose file and I’m already up and running. I only made minimal changes to it (mostly integrating it with my reverse proxy). This took five minutes.
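For reference, the whole step boils down to something like the sketch below. The repository URL and directory name are assumptions from memory, so check Wallabag’s own docs for the current ones.

# clone Wallabag's suggested compose setup and bring it up
# (repo URL is an assumption; use whatever the Wallabag docs point to)
git clone https://github.com/wallabag/docker.git wallabag-docker
cd wallabag-docker
# add your reverse-proxy networks/labels here, then:
docker compose up -d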
Step 2: Import from Pocket
To import from Pocket, you create an app key in Pocket, which identifies your new Wallabag setup to Pocket. The process is well described in Wallabag’s import documentation for Pocket.
One thing that I found reassuring: you can run this importer as many times as you like. It will just skip over existing entries. So if you only have a couple hundred entries: run it a handful of times, and you’ll be good to go.
Step 3: Memory Limit
This initial setup imports roughly 300 entries at a time before the process chokes. By default, Wallabag’s PHP is configured to consume a maximum of 128 MB. That is not terrible in normal operation. But importing 32.000 entries probably does not count as “normal operation”.
Find your `docker-compose.yml` and find the `environment` line, which is followed by a lot of bullet points. Here we insert `PHP_MEMORY_LIMIT` and configure it ridiculously high. I set mine to 2GB, which is probably overkill.
services:
  wallabag:
    image: wallabag/wallabag
    restart: unless-stopped
    environment:
      - MYSQL_ROOT_PASSWORD=wallaroot
      - PHP_MEMORY_LIMIT=2048M
      - SYMFONY__ENV__DATABASE_DRIVER=pdo_mysql
      - SYMFONY__ENV__DATABASE_HOST=db
      # ... more lines here
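It doesn’t hurt to sanity-check that the new limit actually arrived inside the container. This is a rough check that assumes the CLI PHP reads the same configuration as the web process; if they use separate php.ini files, the result may differ.

# recreate the container so it picks up the changed environment
docker compose up -d wallabag
# ask PHP for its memory limit; this should print 2048M
docker compose exec wallabag php -r 'echo ini_get("memory_limit"), PHP_EOL;'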
Step 4: Redis
With the increased memory limit, you get a couple of hundred entries added on each run before it times out after roughly three minutes. The number of entries also depends on how fast each entry can be queried for its content, so you’re entirely at the mercy of dozens or hundreds of third-party webservers.
The timeout occurs because all of this work is being done within the same PHP process that tries to frantically deliver a page to your browser within three minutes (or thereabouts). Luckily, the asynchronous import documentation has notes on how to *not* do this part within the same PHP process.
Enter: Redis
In my `docker-compose.yml`, redis is configured already, so I skipped over the RabbitMQ part and went straight to the redis headline. First of all, you need to tell your site-wide configuration to *use* redis for imports. This is done in Internal Settings -> Import. Set Redis to `1`.
Part two, though, is that you’ll need an actual worker now that this is no longer done by the PHP-process. I chose to start three or four of these in tmux terminals, because importing is a one-time thing for me and I don’t need to create services or anything. Because we’re within docker compose, I start them like this:
docker compose exec wallabag php bin/console wallabag:import:redis-worker --env=prod pocket -vv
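If you want several workers without babysitting tmux panes, something along these lines should also work. The Redis service name and the queue key are assumptions based on my compose file, hence the pattern scan before looking at a specific key.

# start three detached workers instead of one per tmux pane
for i in 1 2 3; do
  docker compose exec -d wallabag php bin/console wallabag:import:redis-worker --env=prod pocket -vv
done

# check how much is still queued; scan first, because the exact key name may differ
docker compose exec redis redis-cli --scan --pattern '*import*'
docker compose exec redis redis-cli llen wallabag.import.pocket   # assumed key name, take it from the scan above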
Step 5: Nginx tuning
Redis was a VAST improvement, easily importing over a thousand entries per run. But I still got timeouts. I guess this is the importer that shovels stuff into redis, now running for too long. I am not the only one with large imports; I found this pull request that offers a config for nginx.
The key here is to edit the `nginx.conf` to let the PHP process run longer. I did not go with the big change from the pull request. Instead, I entered a shell and modified the `nginx.conf` directly. This will vanish after a Wallabag update. For me, that’s perfect because, again, I don’t need this long-term and I’d rather be on a mostly “normal” setup later on.
`docker compose exec wallabag sh` will get you a shell into this machine. I proceed to `vi /etc/nginx/nginx.conf`, find the line with `fastcgi_read_timeout`, and put some ridiculously large number behind it. It defaults to 300s; I chose 3600s instead.
As far as I know, nginx does not auto-reload when the configuration changes. In my setup, the easiest thing to do is to kill the nginx process; the environment will automatically recreate it, and that will load the changed configuration file. Within a docker container, only very few processes are visible, so running `ps` will give you maybe 10 processes. Use `kill` to kill the “oldest” nginx process.
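If you’d rather not fiddle in vi and hunt for the right process, the same edit can be scripted. This is a sketch that assumes `nginx -s reload` can find its pid file inside the container; if it can’t, fall back to the kill approach above.

docker compose exec wallabag sh -c "
  # bump the timeout regardless of its current value
  sed -i 's/fastcgi_read_timeout[^;]*;/fastcgi_read_timeout 3600s;/' /etc/nginx/nginx.conf
  nginx -t         # sanity-check the config
  nginx -s reload  # tell the running master to re-read it
"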
Step 6: Changing the $offset
This got me to roughly 8.500 entries imported, about one fourth of what I’m trying to import. So at least I’m at the right order of magnitude. Something else times out at this point.
I decided to change the default value for `$offset` in the import function here. It appears to not break anything, but the current request is still running. I will report back with more info on whether this was successful or not.
Update: it was *not* successful. Changing the offset in that particular place does not help, because of the way the function recursively calls itself. Instead of adding onto `$offset`, it is derived by multiplying `$run` with the number of expected elements. I made a change to that line as well, replacing it with `return $this->import($offset + self::NB_ELEMENTS);`. This did not do much when starting at 15.000, and I sort of assume that this might be a hard limit at the API.
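If you want to poke at the same spot, the constant mentioned above is easy to grep for from inside the container. The source path is an assumption and depends on the image and Wallabag version.

# find the Pocket importer inside the container (path varies by version)
# if src/ is not in the working directory, try grepping from /var/www instead
docker compose exec wallabag sh -c "grep -rn 'NB_ELEMENTS' src/ 2>/dev/null | head"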
Step 7: Reverse Sorting!
Well I am almost halfway through, so… maybe I can get another 15.000 if I just sort by oldest first instead? I made this change in the importer by looking at the allowed keywords in the API docs.
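I won’t pretend to remember the exact line, so treat this as a sketch: the Pocket retrieve API documents a `sort` parameter with values like `newest` and `oldest`, and a grep in the importer shows where to flip the value, or whether you have to add it to the request parameters yourself.

# find where the Pocket importer sets (or could set) a sort value
docker compose exec wallabag sh -c "grep -rni 'sort' src/ 2>/dev/null | grep -i pocket"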
Supervising the redis workers for a while, I can clearly see a lot of entries from ~2009 streaming in, so I’ll call that a win. Let’s see where we end up.
After probably three hours, I now have 31.672 entries imported. Over 15.000 entries in a single import run. That’s pretty good. There are roughly 900 entries missing compared to a very naive counting from the backup CSV files, which works out to roughly 2.8% missing entries. Link rot might account for that, but if so the figure is actually far too low (remember, the earliest links are over 16 years old)!
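For the record, my “very naive counting” was little more than this. It assumes the Pocket export unpacks into a handful of CSV files, each with a one-line header; the `part_*.csv` pattern is an assumption, so adjust it to whatever your export contains.

# count data rows across all export CSVs, skipping one header line per file
for f in part_*.csv; do tail -n +2 "$f"; done | wc -l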
I’ll run the importer once more and then call it “good enough”.
Thank you, Pocket
Thank you, Pocket, for 16 years of service. I enjoyed using this.