Don't make tar package, improve compression rate and add .json suffix
* Still using the xz format, but only compress the JSON files; don't add them to a tar package.

* Improved the compression rate, making the file size even smaller. The only cost, on my system,
  is compression speed: instead of about 30 seconds, it now takes roughly 4-5 minutes.
  For now this is not a big problem. Time is not our enemy here, but file size is.

* JSON files now have a .json suffix.

* Also added a little note about zram in ./src/README.md
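
The trade-off described in the message above (smaller file, slower compression) can be tried out with Python's standard `lzma` module, which writes the same .xz container. This is only a minimal sketch: it assumes that preset 9 combined with the extreme flag corresponds to the `-9e` options used in cronjob.sh, and the sample records are made up for illustration, not taken from the real data.

```python
import json
import lzma

# Hypothetical sample data standing in for the real SVTPlay JSON dump.
records = [{"id": i, "title": "Episode {}".format(i)} for i in range(2000)]
raw = json.dumps(records).encode("utf8")

# Default preset versus preset 9 with the extreme flag.
default_xz = lzma.compress(raw, preset=lzma.PRESET_DEFAULT)
extreme_xz = lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

print("raw bytes:     ", len(raw))
print("default preset:", len(default_xz))
print("preset 9e:     ", len(extreme_xz))

# Both variants decompress back to the identical JSON payload.
assert lzma.decompress(default_xz) == raw
assert lzma.decompress(extreme_xz) == raw
```

On a small synthetic sample the size difference between the two presets may be negligible; the gain reported in the commit message applies to the author's much larger files.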
iwconfig committed Feb 13, 2019
1 parent db1e1da commit f03d3e0
Showing 4 changed files with 21 additions and 19 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -4,16 +4,16 @@

<p align="center">
<a target="_blank" rel="noopener noreferrer" href="https://www.svtplay.se"><img align="middle" src="https://img.shields.io/date/1548277200.svg?label=initial%20backup&logo=&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
<a href="https://github.com/iwconfig/svtplay-data/blob/master/singles_and_episodes.tar.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/singles_and_episodes.tar.xz.json&query=prettySize&label=singles_and_episodes.tar.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
<a href="https://github.com/iwconfig/svtplay-data/blob/master/title_pages.tar.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/title_pages.tar.xz.json&query=prettySize&label=title_pages.tar.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
<a href="https://github.com/iwconfig/svtplay-data/blob/master/singles_and_episodes.json.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/singles_and_episodes.json.xz.json&query=prettySize&label=singles_and_episodes.json.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
<a href="https://github.com/iwconfig/svtplay-data/blob/master/title_pages.json.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/title_pages.json.xz.json&query=prettySize&label=title_pages.json.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
<a href="https://github.com/iwconfig/svtplay-data"><img align="middle" src="https://img.shields.io/github/repo-size/iwconfig/svtplay-data.svg?logo=github&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
</p>

Every 6th hour a list of data of all content from SVTPlay is backed up in [./singles_and_episodes.tar.xz](singles_and_episodes.tar.xz) file. All available title page data is also stored in [./title_pages.tar.xz](title_pages.tar.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.
Every 6th hour a list of data of all content from SVTPlay is backed up in [./singles_and_episodes.json.xz](singles_and_episodes.json.xz) file. All available title page data is also stored in [./title_pages.json.xz](title_pages.json.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.

Extract archives using the tar command:
Extract archives using the xz command:

tar xf singles_and_episodes.tar.xz
tar xf title_pages.tar.xz
xz -dk singles_and_episodes.json.xz
xz -dk title_pages.json.xz

The code used to acquire this data is located in [./src](src).
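
As an alternative to extracting with `xz -dk`, a consumer of these files could read the compressed JSON directly. The following is a sketch using only the Python standard library; `load_json_xz` is a hypothetical helper name, not something that exists in this repository.

```python
import json
import lzma
from pathlib import Path


def load_json_xz(path):
    """Parse a .json.xz file without writing the decompressed JSON to disk."""
    with lzma.open(path, "rt", encoding="utf8") as f:
        return json.load(f)


if __name__ == "__main__":
    archive = Path("title_pages.json.xz")
    # Only attempt to load when the archive is present in the working directory.
    if archive.is_file():
        pages = load_json_xz(archive)
        print("Loaded {} entries".format(len(pages)))
```

This avoids keeping a second, uncompressed copy on disk, at the cost of decompressing on every read.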
3 changes: 2 additions & 1 deletion src/README.md
@@ -1,5 +1,5 @@
# svtplay-data
Every 6th hour a list of data of all content from SVTPlay is backed up in [../singles_and_episodes.tar.xz](../singles_and_episodes.tar.xz) file. All available title page data is also stored in [../title_pages.tar.xz](../title_pages.tar.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.
Every 6th hour a list of data of all content from SVTPlay is backed up in [../singles_and_episodes.json.xz](../singles_and_episodes.json.xz) file. All available title page data is also stored in [../title_pages.json.xz](../title_pages.json.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.

### If you for some reason want to use this in your own fork
Add the following rule to crontab with `crontab -e` (not the explanation tree)
@@ -20,6 +20,7 @@ Next, run this and enter your credentials

and from here on after you login automatically when pushing to repo.

I also use zram in order to optimize LZMA/LZMA2 compression which consumes a lot of memory. Just use [this](https://github.com/novaspirit/rpi_zram) and you're good to go.

## TODO/CONSIDER

12 changes: 6 additions & 6 deletions src/cronjob.sh
@@ -80,19 +80,19 @@ git fetch --all || error "Could not fetch the latest from remote repository!"
git reset --hard origin/master

echo "Decompressing data files..."
for file in *.tar.xz; do
tar xf "$file" || error "Decompression failed!"
for file in *.json.xz; do
xz -dfk "$file" || error "Decompression failed!"
done

echo "Running gather_data.py..."
if nice -12 ./src/gather_data.py; then
echo "Data gathering went fine."

function get_size { stat -c %s $1 || error "Could not get size of $1!"; }
for file in {singles_and_episodes,title_pages}; do
for file in {singles_and_episodes.json,title_pages.json}; do
if [ $(get_size $file) -gt $(get_size ${file}.bak) ]; then
echo "Compressing $file"
tar cJf ${file}.tar.xz $file || error "Compression failed!"
xz -vvkf9eT2 $file || error "Compression failed!"
else
echo "$file is unchanged. Leaving ${file}.tar.xz as is..."
fi
@@ -101,7 +101,7 @@ if nice -12 ./src/gather_data.py; then
done

echo "Making commit..."
git add singles_and_episodes.tar.xz title_pages.tar.xz
git add singles_and_episodes.json.xz title_pages.json.xz
git commit -m "Daily data update: $(date '+%Y-%m-%d %H:%M:%S')" -m "These archives contain all data collected since 2019-01-23 at circa 21:00 hours."

echo "Pushing changes to remote repo"
@@ -117,7 +117,7 @@ if nice -12 ./src/gather_data.py; then
git reset --hard origin/master

echo "Removing old compressed data files from earlier commits with BFG tool..."
nice -12 java -jar /tmp/bfg.jar -D '*.tar.xz' --private . || error "Java execution failed!"
nice -12 java -jar /tmp/bfg.jar -D '*.json.xz' --private . || error "Java execution failed!"

echo "Cleaning reflogs and collecting repo garbage"
git reflog expire --expire=now --all || error "git reflog command failed! Could not cleanup reflogs."
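
The compress-only-if-larger step in this script can also be sketched in Python for readers who prefer it to shell. This is an illustration under assumptions, not the script's actual implementation: `compress_if_grown` is a hypothetical name, and the `-T2` threading option is not reproduced here.

```python
import lzma
import shutil
from pathlib import Path


def compress_if_grown(datafile, bakfile):
    """Write <datafile>.xz only when datafile is larger than bakfile.

    Mirrors the `-gt` size test in the shell script. The input file is
    kept and an existing archive is overwritten, similar to `xz -kf`.
    Returns True when a new archive was written.
    """
    datafile, bakfile = Path(datafile), Path(bakfile)
    if datafile.stat().st_size <= bakfile.stat().st_size:
        return False
    target = datafile.with_name(datafile.name + ".xz")
    with datafile.open("rb") as src, \
            lzma.open(target, "wb", preset=9 | lzma.PRESET_EXTREME) as dst:
        shutil.copyfileobj(src, dst)
    return True
```

A call such as `compress_if_grown("title_pages.json", "title_pages.json.bak")` would then write `title_pages.json.xz` only when the new file has grown.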
13 changes: 7 additions & 6 deletions src/gather_data.py
@@ -56,20 +56,21 @@ def main():
pass

data = [
(singles_and_episodes, Path('singles_and_episodes')),
(title_pages, Path('title_pages'))
(singles_and_episodes, Path('singles_and_episodes.json')),
(title_pages, Path('title_pages.json'))
]

logging.info('Data retrieval is complete')

for data, datafile in data:
if datafile.is_file():
datafile.rename(datafile.with_suffix('.bak'))
bakfile = datafile.with_suffix('.json.bak')
datafile.rename(bakfile)

logging.info('Processing file: {}'.format(datafile.with_suffix('.bak')))
logging.info('Processing file: {}'.format(bakfile))

with datafile.with_suffix('.bak').open(encoding='utf8') as bakfile:
bakdata = [json_cleanup(x) for x in json.load(bakfile)]
with bakfile.open(encoding='utf8') as f:
bakdata = [json_cleanup(x) for x in json.load(f)]

data.extend(bakdata)
data = list({v.get('id') or v.get('articleId'):v for v in data}.values())
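
Two details of this hunk are easy to miss, so here is a small standalone sketch of both. It first shows why the suffix passed to `with_suffix` had to change once the data files gained a `.json` extension, then how the dict-based merge treats duplicate keys. The entries and the `merge_by_key` name are hypothetical and only for illustration.

```python
from pathlib import Path

datafile = Path("title_pages.json")

# with_suffix() replaces only the last suffix, so '.bak' would drop '.json'.
print(datafile.with_suffix(".bak"))       # title_pages.bak
print(datafile.with_suffix(".json.bak"))  # title_pages.json.bak


def merge_by_key(entries):
    """Keep one entry per 'id', falling back to 'articleId'."""
    return list({e.get("id") or e.get("articleId"): e for e in entries}.values())


# Hypothetical entries: fresh data first, then entries loaded from the backup.
fresh = [{"id": 1, "title": "new"}, {"articleId": "a", "title": "article"}]
backup = [{"id": 1, "title": "old"}, {"id": 2, "title": "only in backup"}]

merged = merge_by_key(fresh + backup)
print(merged)
```

For a repeated key, the entry that appears later in the list is the one kept, while its position in the result is that of the first occurrence. As far as this hunk shows, `bakdata` is appended last, so an entry from the .bak file would replace a freshly gathered entry with the same key.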
