Don't make tar package, improve compression rate and add .json suffix
* Still using the xz format, but the JSON files are now compressed directly instead of being added to a tar package.

* Improved the compression ratio, making the files even smaller. The only cost, on my system,
  is compression speed: instead of about 30 seconds, it now takes roughly 4-5 minutes.
  For now this is not a big problem. Time is not our enemy here, but file size is.

* JSON files now have a .json suffix.

* Also added a little note about zram in ./src/README.md
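
The size-versus-time trade-off described above can be reproduced on a small scale with Python's standard `lzma` module. This is only a sketch: the sample records are made up, and `9 | lzma.PRESET_EXTREME` is assumed to be the library counterpart of the `-9e` flags passed to `xz` in `src/cronjob.sh`. Actual sizes and timings depend on the data and the machine.

```python
import json
import lzma
import time

# Hypothetical sample data standing in for the real SVTPlay JSON dump.
records = [{"id": i, "title": f"Episode {i}", "tags": ["svt", "play"]} for i in range(5000)]
raw = json.dumps(records).encode("utf8")


def compress_timed(payload, preset):
    """Compress payload as a bare .xz stream (no tar wrapper) and time it."""
    start = time.perf_counter()
    packed = lzma.compress(payload, format=lzma.FORMAT_XZ, preset=preset)
    return packed, time.perf_counter() - start


# Preset 6 is the library default; 9 | PRESET_EXTREME trades time for size.
default_xz, default_secs = compress_timed(raw, 6)
extreme_xz, extreme_secs = compress_timed(raw, 9 | lzma.PRESET_EXTREME)

print(f"raw:      {len(raw)} bytes")
print(f"preset 6: {len(default_xz)} bytes in {default_secs:.2f}s")
print(f"9e:       {len(extreme_xz)} bytes in {extreme_secs:.2f}s")
```

On highly repetitive JSON the two presets may land close together; the gain from the slower preset tends to show up on larger, more varied input.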
iwconfig committed Feb 13, 2019
1 parent db1e1da commit f03d3e0
Showing 4 changed files with 21 additions and 19 deletions.
12 changes: 6 additions & 6 deletions README.md
@@ -4,16 +4,16 @@

<p align="center">
<a target="_blank" rel="noopener noreferrer" href="https://www.svtplay.se"><img align="middle" src="https://img.shields.io/date/1548277200.svg?label=initial%20backup&logo=data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAGAAAABgCAMAAADVRocKAAACiFBMVEUAAAAAAAAAAQAAAgAAAwAABQAABgAABwAACAAACQAACwAADAAADwB4eHiWlpb+/v7///8ABwAACQAACwAADAAADQAADgAAEAAAEwCWopYADgAADwAAEQAAFQAAGADm+eYAFQAAGQCWqZYAEwAAFgAAFwAAGwAAHAAAGgAAHwAAIAAAIQDm+eYAGwAAHwAAHgAAIQAAIAAAJAAAJQAAJAAAJQAAJgAAJwB4lXgAKQAALQAAKwAALAAALQB4tHgALwAAMQAAMwAAOgAAOwAANAAARAAANQAAOAAAQAAARQAASwAAPAAAPgAATgAAUQAATAAARQAASAAAVwAAWQAAVwDE8sQAVwAAXADE8sQAYQAAYwAAXgAAZAAAYwAAZAAAZAAAaQAAaQAAawAAbwAAagAAbwAAcQAAbgAAbwAAcQAAdgAAdwAAeAAAdwAAfAAAeAAAfQAAfQAAgAAAggAAggAAgwAAhAAAgwAAhgAAiAAAigAAiwAAjQAAjwAAjQAAkAAAlAAAlAAAlgAAoQAAlQAAlQAAnQAAqAAAmwAAnAAAmgAAmwAAnwAApwAAngAAsAAApAAArAAAoQAAtgAAswAArAAAsgAAsQAAsgAAugAAtwAAsgAAuAAAtAAAuQAAvQAAtgAAuAAAuQAAvAAAtwAAuQAAugAAwAAAuwAAvwAAwAAAuwAAugAAuwAAwAAAwwAAvwAAwAAAwQAAvwAAvQAAvwAAwwAAxgAAwAAAwAAAwwAAxwAAwAAAxAAAxQAAxAAAwQAAxAAAxQAAxwAAxQAAwwAAxwAAxAAAxQAAxwAAxAAAxwAAxAAAxQAAxgAAxwAAxgAAxwAAxgAAxwAAxgAAxwAAxgAAxwCl++qjAAAA1nRSTlMAAgICAgICAgICAgICAgICAgQEBAQEBAQEBAYGBgYGBggICAoKCgoKDAwMDAwODhAQEhISFBQUFBYYGBoaHB4gICAgICIiJiYoKCgqLi4wMjQ2ODg6Ojw8PD4+QEBERkhISkpQUlJSVFRaWmBgYmRmZmxscHJycnZ2fn5+g4OFi4uRkZGTlZWbnZ2hoaGjp6epq62vsbW1t7e5u729wcHDxcXFx83Nzc3Pz9PV19fX19vb293f4eHh4+Xl5+np6evt7e3t7/Hx8/Pz9fX39/f3+fn7+/39HO7J0QAABy1JREFUaN6lWgWDHDUUTtrjCoW9BCt6eIu7Hu4ULc7gbgWKWymuxd2Lu7sWKU6Bg8olf4eJvOTFZmdv3+Vms3O7k+TLe9+THCFauHkh664xRJygbl/CzdNZ/UOOu29nSihRTQvVP6pje9T80dyNbhL3Pcbs95n55erK1UiV/PiUAf3FsBH33IY7pfkrgU4l5OI506j+hv2imWZGSO6OGpSruXIzYw7Y2BVVQkjxynS7+nDd490BwjixKNUDSCGl/OOytQkCoWH9rQbQEKkLU3tQj1CP8eiudg8BC0JCYAx2xH/CbnV9UY8zk4btRVLvgVQoia/OWrb75FouDI+jIdJt7I7NCZ4vIV1bsxrVACktkl7emIE1KdgRYwSgpXDT9DpEQ6QAZxyZg9UiNIBceM1wb1gUd1ljVG8JZx4i3Z7er381JVp9QGo70CLt6/zzJiMLAm1CBkyIUynbOpZ+OBiax4d5iMwKlMzdHsNslgMMBW8prJNGkMNbtR9MA1X5p0N7/3j8sN71lDs1MlokJAAE139mrx+RDym81WvrWHawj4yWFGiRk3kHu+lSx9U9LMS
qq8Io0iJoP16yAnocGsNfaWkYrlyB0iRtHJXFRlrG0Myk7j08EtsaRe4g0CJLaxgi0gUiJZ+d1kbnaYmv9QIYAzUV6MmgsYtu2TigHtqSjHgg2uFoeMwL4KSBeu1wPFFQfmBqz0VADT6W8F6tCiYfy6+Xr96eIZiHCKupaBxBPrEHzYQEsd/3WqkhYuChraH5IVxfSIDq63OWBwoiQQdxERguOP12WgSy5K5Nm/1LBiwG67EOJzW0QN49ugBRQ1RhLM0NIIJhpLM6806MXjtMS2INjfduaIE8d0DZtjrg5WG3uQ6I9JVrQxPAEqjFwC24cFVKIjdBuxoa06EjMmV8jbZi7IFtImWC1onU1NgYhMMpREWz+PTkCWUaYpHTZ4BUwZJFjsUX3TQ1QIZGo/AwLLJxkXNlUnh98l5OBHvz0iERF4GhkdjQWDYu6i5/XjqlFSsxc2Gc9DiAlI+MZCzLEwQHLgKcEF1LAEVaz6ZRkvaO1WYpvjl9UjcuYigJ7HkFNTndvmULn69brasFp58lKNfeOgLiALMCE/bapp7PbXwKdB0QdaROgIz/c90bvXpNG4GDFtnkNUfXAtuvcDYgkGVn+k/uFdoAS5wasTmakOOT784fxBCZHWYmJsJ8VEXWi+ASMLbnbXxH3r2tpWvCHVMXPZpoS0hYPjgmxgiA0vpjNr2Sfch/szfMUIVdi3P6eN4iA5hMPCnqvHAgDBBrEetiaCIN+ETsK9RlwUUrxt7eujcO0bUUsi95aBf9XOaLLgz23EUVpUm32vV5e3r6J+N2+sVR/75+ndSloT6vGr7cor13bMbh6PyYO48mgaCtRYkgzjAht3SkLnyuIsTY3O0QFzHQIoaXUNQiEW+AELGOzj93cpaLmPPOtiA1Tnlq37xPBhOzNTsR+QEh2inuwquGSRcuYtkEpOWCXp+RSz8YKjzqpkPHcWA0dutmhKDIjuHMxpUcwdBs4ipjUhahF0Kdz8+YZANIvwepR2NtuEgkXKTqe7sXU1iGnb7+rXpG6PdZq7kyBXg0hlrO6UeRRGwKgUd7eTqqhiCnz72h9cNFS+ZMy8fXDHk0wiB4ZyqNbY6LQlb66KSBIPcvGRqqNFfxdqIczUUxFp+xe3fK5mgE6LoUFyHzigKIwH/9NHPlsOhciouYLdepBnbQmMYaeX6fQjKbc/o2SirERajvhhu9bj1KXUk71SLOCW+rRSLDSO8cNbFNKcRdeOj003A6op47t2qoRsXlHHDOKHQsUDUs5duzB31BiuL6lIaqA5FWT04fjfP4bonm+MqsWUFOmZhLApvJ6LcrprhH06hiW0hAuOEjyNGcyguvQH4v3jxsYlCWDa95LkqdfrYmpeuCN09tW3JkGaevXssJSP2HL09dpvEIpJCIu+jaHNSV/eKDO6YF67CvS/1WTbMFNc2mPg8OPOPPFy/XroDWSY5VwD1bQ8uHhS8eNKGx3oir7zFEkHa684NU/r1hrdJDaVq54y4owgkIc+dogTPQ3Q9PGKB9nKNhzaq82jtZes/W3cGhFBf4g6oyVO70ujIQfX/BICnBkR2Be2Q4SU5Agvmr12f27gUFiriI5T5QRZX9v65cBeisBUTlE3deKOe8fSQ+KyPBwTTN32xlaDaJWXrbJr2e+sFBHYGwDrgI4mu8gi/OHAjR7dawRwv/2YEnWlTvxGMj/hSI0FZiPssTQ2Mo04ey5i+zVhqnRXVINpHlAJHR0lcPzYFAM31C44/x+AjEMZ3LkxffuFGiHqTxDu3GRYEWfXJiSjk0hxbtHULl9O/fIT0g68fQvBrpqOKHmQNhtEMT3NN/P3Fk5arhluHSuOjZ/QO9z/YbIHJcxMJ/oHGyxQb5JQ4NtWhZZP4HniSyAIdZgS8AAAAASUVORK5CYII=&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
-<a href="https://github.com/iwconfig/svtplay-data/blob/master/singles_and_episodes.tar.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/singles_and_episodes.tar.xz.json&query=prettySize&label=singles_and_episodes.tar.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
-<a href="https://github.com/iwconfig/svtplay-data/blob/master/title_pages.tar.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/title_pages.tar.xz.json&query=prettySize&label=title_pages.tar.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
+<a href="https://github.com/iwconfig/svtplay-data/blob/master/singles_and_episodes.json.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/singles_and_episodes.json.xz.json&query=prettySize&label=singles_and_episodes.json.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
+<a href="https://github.com/iwconfig/svtplay-data/blob/master/title_pages.json.xz"><img align="middle" src="https://img.shields.io/badge/dynamic/json.svg?url=https://img.badgesize.io/iwconfig/svtplay-data/master/title_pages.json.xz.json&query=prettySize&label=title_pages.json.xz%20size&logo=json&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
<a href="https://github.com/iwconfig/svtplay-data"><img align="middle" src="https://img.shields.io/github/repo-size/iwconfig/svtplay-data.svg?logo=github&logoColor=00C700&colorA=0b0c0d&colorB=00C700&style=popout"></a>
</p>

-Every 6th hour a list of data of all content from SVTPlay is backed up in [./singles_and_episodes.tar.xz](singles_and_episodes.tar.xz) file. All available title page data is also stored in [./title_pages.tar.xz](title_pages.tar.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.
+Every 6th hour a list of data of all content from SVTPlay is backed up in [./singles_and_episodes.json.xz](singles_and_episodes.json.xz) file. All available title page data is also stored in [./title_pages.json.xz](title_pages.json.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.

-Extract archives using the tar command:
+Extract archives using the xz command:

-tar xf singles_and_episodes.tar.xz
-tar xf title_pages.tar.xz
+xz -dk singles_and_episodes.json.xz
+xz -dk title_pages.json.xz

The code used to acquire this data is located in [./src](src).
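
Since the files are now plain `.json.xz` rather than tar archives, they can also be read without extracting them first. The sketch below uses Python's standard `lzma` module; `load_json_xz` is a hypothetical helper name, and the tiny demo file only stands in for the real archives (whose top level is assumed to be a JSON list, as `gather_data.py` suggests).

```python
import json
import lzma
import tempfile
from pathlib import Path


def load_json_xz(path):
    """Parse an xz-compressed JSON file without writing a decompressed copy to disk."""
    with lzma.open(path, "rt", encoding="utf8") as f:
        return json.load(f)


# Self-contained demo: write a tiny stand-in archive, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "title_pages.json.xz"
    with lzma.open(demo, "wt", encoding="utf8") as f:
        json.dump([{"id": 1, "title": "demo"}], f)
    loaded = load_json_xz(demo)

print(loaded)
```

With a checkout of the repository, the same helper would be called as `load_json_xz("title_pages.json.xz")`.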
3 changes: 2 additions & 1 deletion src/README.md
@@ -1,5 +1,5 @@
# svtplay-data
-Every 6th hour a list of data of all content from SVTPlay is backed up in [../singles_and_episodes.tar.xz](../singles_and_episodes.tar.xz) file. All available title page data is also stored in [../title_pages.tar.xz](../title_pages.tar.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.
+Every 6th hour a list of data of all content from SVTPlay is backed up in [../singles_and_episodes.json.xz](../singles_and_episodes.json.xz) file. All available title page data is also stored in [../title_pages.json.xz](../title_pages.json.xz). I made this mostly just for fun but it can be useful for retrieving information that is no longer available on SVTPlay.

### If you for some reason want to use this in your own fork
Add the following rule to crontab with `crontab -e` (not the explanation tree)
@@ -20,6 +20,7 @@ Next, run this and enter your credentials

and from here on after you login automatically when pushing to repo.

+I also use zram in order to optimize LZMA/LZMA2 compression which consumes a lot of memory. Just use [this](https://github.com/novaspirit/rpi_zram) and you're good to go.

## TODO/CONSIDER

12 changes: 6 additions & 6 deletions src/cronjob.sh
@@ -80,19 +80,19 @@ git fetch --all || error "Could not fetch the latest from remote repository!"
git reset --hard origin/master

echo "Decompressing data files..."
-for file in *.tar.xz; do
-tar xf "$file" || error "Decompression failed!"
+for file in *.json.xz; do
+xz -dfk "$file" || error "Decompression failed!"
done

echo "Running gather_data.py..."
if nice -12 ./src/gather_data.py; then
echo "Data gathering went fine."

function get_size { stat -c %s $1 || error "Could not get size of $1!"; }
-for file in {singles_and_episodes,title_pages}; do
+for file in {singles_and_episodes.json,title_pages.json}; do
if [ $(get_size $file) -gt $(get_size ${file}.bak) ]; then
echo "Compressing $file"
-tar cJf ${file}.tar.xz $file || error "Comression failed!"
+xz -vvkf9eT2 $file || error "Comression failed!"
else
echo "$file is unchanged. Leaving ${file}.tar.xz as is..."
fi
@@ -101,7 +101,7 @@ if nice -12 ./src/gather_data.py; then
done

echo "Making commit..."
-git add singles_and_episodes.tar.xz title_pages.tar.xz
+git add singles_and_episodes.json.xz title_pages.json.xz
git commit -m "Daily data update: $(date '+%Y-%m-%d %H:%M:%S')" -m "These archives contain all data collected since 2019-01-23 at circa 21:00 hours."

echo "Pushing changes to remote repo"
@@ -117,7 +117,7 @@ if nice -12 ./src/gather_data.py; then
git reset --hard origin/master

echo "Removing old compressed data files from earlier commits with BFG tool..."
-nice -12 java -jar /tmp/bfg.jar -D '*.tar.xz' --private . || error "Java execution failed!"
+nice -12 java -jar /tmp/bfg.jar -D '*.json.xz' --private . || error "Java execution failed!"

echo "Cleaning reflogs and collecting repo garbage"
git reflog expire --expire=now --all || error "git reflog command failed! Could not cleanup reflogs."
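
The compression step in this script only runs when a data file has grown past its `.bak` copy. A rough Python equivalent of that size check is sketched below for illustration. `recompress_if_grown` is a hypothetical helper, not part of the repository, and the `lzma` preset is assumed to approximate the `-9e` flags used with `xz`.

```python
import lzma
import shutil
import tempfile
from pathlib import Path


def recompress_if_grown(datafile, bakfile):
    """Write '<datafile>.xz' only when datafile is strictly larger than its backup."""
    if datafile.stat().st_size <= bakfile.stat().st_size:
        return False  # nothing new was gathered; keep the existing archive
    target = datafile.with_name(datafile.name + ".xz")
    with datafile.open("rb") as src, \
            lzma.open(target, "wb", preset=9 | lzma.PRESET_EXTREME) as dst:
        shutil.copyfileobj(src, dst)
    return True


# Self-contained demo with stand-in files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "title_pages.json"
    bak = Path(tmp) / "title_pages.json.bak"
    bak.write_text('[{"id": 1}]', encoding="utf8")

    data.write_text('[{"id": 1}]', encoding="utf8")
    unchanged = recompress_if_grown(data, bak)   # same size: skipped

    data.write_text('[{"id": 1}, {"id": 2}]', encoding="utf8")
    grown = recompress_if_grown(data, bak)       # larger: compressed
    archive_exists = (Path(tmp) / "title_pages.json.xz").is_file()

print(unchanged, grown, archive_exists)
```

Comparing sizes is a cheap proxy for "new data was added"; it would miss a change that leaves the file the same size or smaller.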
13 changes: 7 additions & 6 deletions src/gather_data.py
@@ -56,20 +56,21 @@ def main():
pass

data = [
-(singles_and_episodes, Path('singles_and_episodes')),
-(title_pages, Path('title_pages'))
+(singles_and_episodes, Path('singles_and_episodes.json')),
+(title_pages, Path('title_pages.json'))
]

logging.info('Data retrieval is complete')

for data, datafile in data:
if datafile.is_file():
-datafile.rename(datafile.with_suffix('.bak'))
+bakfile = datafile.with_suffix('.json.bak')
+datafile.rename(bakfile)

-logging.info('Processing file: {}'.format(datafile.with_suffix('.bak')))
+logging.info('Processing file: {}'.format(bakfile))

-with datafile.with_suffix('.bak').open(encoding='utf8') as bakfile:
-bakdata = [json_cleanup(x) for x in json.load(bakfile)]
+with bakfile.open(encoding='utf8') as f:
+bakdata = [json_cleanup(x) for x in json.load(f)]

data.extend(bakdata)
data = list({v.get('id') or v.get('articleId'):v for v in data}.values())
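
Two details of this hunk are easy to miss: why the backup suffix has to be spelled `.json.bak`, and which entry survives the dictionary-based deduplication. The sketch below illustrates both with made-up records. The `fresh` and `backup` lists are assumptions for the example, not real SVTPlay data.

```python
from pathlib import Path

datafile = Path("title_pages.json")

# with_suffix() replaces only the final suffix, so '.json' has to be repeated
# to get the '<name>.json.bak' name that cronjob.sh compares against.
short_name = datafile.with_suffix(".bak")
full_name = datafile.with_suffix(".json.bak")
print(short_name)  # title_pages.bak
print(full_name)   # title_pages.json.bak

# Hypothetical records: freshly gathered data first, then the backup contents.
fresh = [{"id": 1, "title": "new"}, {"articleId": "a7", "title": "article"}]
backup = [{"id": 1, "title": "old"}, {"id": 2, "title": "only in backup"}]

merged = fresh + backup
# Later entries overwrite earlier ones with the same key, so for id 1 the
# backup version is the one that is kept; first-seen key order is preserved.
unique = list({v.get("id") or v.get("articleId"): v for v in merged}.values())
print(unique)
```

Because the backup data is appended after the fresh data, a record present in both keeps its older content in this sketch. Reversing the concatenation order would make the fresh copy win instead.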
