Similar to (and perhaps more simply than) analyzing Youtube video transcripts with language models, I wanted to apply a similar approach to webpages like articles, primarily for the purposes of determining the subject content of lengthy pieces and experimenting to see if this is useful at all.
The html2text
script is good at
extracting content from html.
When combined with a few other CLIs, we can prompt
the language model to create a summary for the cleaned HTML page.
This was my first attempt:
curl -s "<url>" | html2text | llm "summarize this article"
which gave me the following error
Traceback (most recent call last):
File "/opt/homebrew/bin/llm", line 8, in <module>
sys.exit(cli())
^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/llm/cli.py", line 268, in prompt
prompt = read_prompt()
^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/llm/0.13.1_2/libexec/lib/python3.12/
site-packages/llm/cli.py", line 156, in read_prompt
stdin_prompt = sys.stdin.read()
^^^^^^^^^^^^^^^^
File "<frozen codecs>", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 640:
invalid start byte
I solved that with this modification
curl -s "<url>" | html2text | iconv -f ISO-8859-1 -t UTF-8 | llm
"summarize this article"