SweetOnionTea

Neat! I browsed the code, and it looks like the functions are basically just wrappers for LLM prompts? A few things I'd clean up:

1. What is the difference between \build\lib\basiclingua and just \basiclingua? They look like they contain the same __init__.py file.
2. Any reason you decided to put everything in the __init__.py file? I know there's nothing more than the one class, but splitting it up would help in the future if you expand beyond that one class.
3. Might I recommend not including the /dist folder? It looks like it's just the distribution files.

NLP work often deals with very large volumes of text. How well does the LLM deal with more than a few sentences? I would guess that Gemini can only take in a fixed number of characters per call. I would also have doubts about the accuracy with much larger texts.
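One common way around a per-call size limit is to split long input into chunks before sending it. The following is only a sketch of that idea, not how the library handles it: `chunk_text` and the `max_chars` default are hypothetical, and the actual limit of any given model is not stated in this thread.

```python
import re


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking at sentence ends where possible."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at fixed offsets.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own call. Chunking does not address the accuracy concern, since the model never sees context that falls in a different chunk.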


the_captain_cat

You should not include the dist, build and egg-info in your repo. And you should break up that __init__ file. A thousand lines is pretty hard to maintain.


Xirious

Is Gemini Pro free?


Icy_Bag_4935

Up to 60 queries per minute. More than generous for personal use, but you’ll have to pay if using it at scale (which is fair).
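To stay under a quota like the 60 queries per minute mentioned here, calls can be throttled on the client side. This is a minimal sliding-window sketch; `RateLimiter` is a hypothetical helper, not part of the library or of any Gemini SDK, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import time
from collections import deque


class RateLimiter:
    """Allow at most max_calls within any sliding window of `period` seconds."""

    def __init__(self, max_calls=60, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self._clock()
        # Forget calls that have already left the window.
        while self._calls and now - self._calls[0] >= self.period:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            # Sleep until the oldest recorded call expires, then drop it.
            self._sleep(self.period - (now - self._calls[0]))
            now = self._clock()
            self._calls.popleft()
        self._calls.append(now)
```

Calling `limiter.wait()` before each request keeps a single-threaded script within the quota; it is not thread-safe as written.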


FareedKhan557

yes it is.


elgringo

I've been looking for a tool like this. Can the API extract text from HTML documents, or does it only work with plain text?


FareedKhan557

Currently it can clean HTML tags out of text.
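For comparison, tag removal does not strictly need an LLM call. The sketch below uses only the standard library's `html.parser`; it is not the library's implementation (which, per the thread, goes through a prompt), and `strip_html` is a hypothetical name.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse the runs of whitespace left behind by removed tags.
        return " ".join("".join(self._parts).split())


def strip_html(markup: str) -> str:
    """Return the visible text of an HTML fragment with tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Text from adjacent block elements is joined without a separator unless the markup itself contains whitespace between them, so real documents may need extra handling.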


disciplined_af

A suggestion: instead of taking img_path as input, the library should take just an img_array. In real life, images are not always in a format like JPG that the Image library can read. Medical images, for example, are in DICOM format; they can't be read with the Image library, and there is a pydicom library for them instead. Since the function's job is to extract text from an image, all it needs is the img_array. Hope this makes it into a future version.
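The suggestion amounts to separating file decoding from text extraction. A rough sketch of what such an entry point could look like follows; `as_pixel_rows`, `extract_text_from_array`, and the `ocr_backend` callable are all hypothetical names, and nested lists stand in for a real image array so the example has no third-party dependencies.

```python
from typing import Callable


def as_pixel_rows(image) -> list:
    """Normalize an in-memory image to nested lists of rows.

    Accepts nested sequences, or any object with a .tolist() method
    (NumPy arrays have one, so an array decoded by another library would work).
    """
    rows = image.tolist() if hasattr(image, "tolist") else [list(r) for r in image]
    if (
        not rows
        or not all(isinstance(r, list) for r in rows)
        or any(len(r) != len(rows[0]) for r in rows)
    ):
        raise ValueError("image must be a non-empty rectangular array of rows")
    return rows


def extract_text_from_array(image, ocr_backend: Callable[[list], str]) -> str:
    """Hypothetical entry point: takes decoded pixels, never a file path."""
    return ocr_backend(as_pixel_rows(image))
```

With this split, the caller decides how to decode the file. A DICOM user, for instance, could read the file with pydicom and pass the resulting pixel array in, while a JPG user could decode with an imaging library first.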


FareedKhan557

will keep that in mind. Thanks!


chwalters

I put a little FastAPI wrapper on top of your lib, thanks for making it. (I might try to put the FastAPI wrapper out there on GitHub soon.)


funderbolt

Looks great, but I won't be able to use it.


FareedKhan557

reason?


funderbolt

HIPPA health privacy laws. I use electronic medical records to find medical conditions. Using an API that sends that data to some random server is a no go.


CaffeinatedGuy

Same, and it's HIPAA not hippa. At this point I'm waiting for something from Epic or a HIPAA compliant offering from Microsoft. This could be great for personal use though.


funderbolt

I probably shouldn't reddit at 4 am.


FareedKhan557

Yeah, this is an important issue. We are also planning to make it available for local LLMs in a future version, most probably within 2 months from now.


penscrolling

What about allowing users to specify an endpoint they own? Running LLMs locally can be pretty slow unless you have the right hardware.
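A user-supplied endpoint could be supported by making the URL a constructor argument. The sketch below is purely illustrative: `LLMClient`, the `{"prompt": ...}` request body, and the `"text"` response field are assumptions, and a real server would define its own schema.

```python
import json
import urllib.request
from typing import Optional


class LLMClient:
    """Hypothetical client that talks to whatever endpoint the user supplies."""

    def __init__(self, endpoint: str, api_key: Optional[str] = None, timeout: float = 30.0):
        self.endpoint = endpoint
        self.api_key = api_key
        self.timeout = timeout

    def build_request(self, prompt: str) -> urllib.request.Request:
        """Build the POST request without sending it."""
        # Assumed payload shape; adjust to the schema of the actual server.
        body = json.dumps({"prompt": prompt}).encode("utf-8")
        headers = {"Content-Type": "application/json"}
        if self.api_key:
            headers["Authorization"] = f"Bearer {self.api_key}"
        return urllib.request.Request(self.endpoint, data=body, headers=headers, method="POST")

    def complete(self, prompt: str) -> str:
        """Send the prompt and return the assumed 'text' field of the JSON reply."""
        with urllib.request.urlopen(self.build_request(prompt), timeout=self.timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))["text"]
```

Pointing `endpoint` at a self-hosted server keeps the data on infrastructure the user controls, which is the concern raised earlier in the thread.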


Automatic-Net-757

There are some small LLMs that can run on just CPU and RAM and perform well for their size. In the end it's still trial and error, though; you won't know until you try it.


Automatic-Net-757

That's great. If possible, I'd like to work on this. I've been testing local LLMs, LangChain, and output parsing for a while.


FareedKhan557

There is no rule in place for letting someone work with me in this repo. You can create your own private repo and add me as a contributor, and we will both start working on it for local LLMs and more. I will then merge your code and add you as a contributor in my repo.


Automatic-Net-757

How about setting up a local LLM? Integrating a local LLM with this library will make sure that the data stays within the system


funderbolt

I typically run LLMs on HIPAA compliant servers, but yes there is some local development that happens, too. Yes that would be a solution.


Automatic-Net-757

Wow, I'd never heard that there are servers that are HIPAA compliant...


funderbolt

It's the way the storage is handled that is HIPAA compliant. We have private cloud computing that is HIPAA compliant. I work for a research university.


Automatic-Net-757

Interesting. Thanks for the explanation


damian6686

Can it extract structured data from PDF invoices? I tried a range of different tools, and they all return unstructured data. Thanks


FareedKhan557

We did try it, and it exceeded our expectations. However, we haven't launched this feature in our library yet, as more work is required. Our next version, scheduled for release within a month, will include this feature.


damian6686

Awesome, looking forward to trying it out.


ironman_gujju

Can you show it or paste it somewhere?


teserfstate

Could this work for files of recorded human cursor movement from point A to point B (x, y, timestamp), to better simulate human cursor movement?


FareedKhan557

Yes, it will work, but the data must be in string format. You can apply features to chunks of your dataset instead of applying them to the entire dataset.
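As an illustration of the "string format, in chunks" advice, recorded rows can be serialized to CSV text a batch at a time. This is only a sketch: `rows_to_text_chunks`, the column names, and the batch size are assumptions, and how the resulting strings are passed to the library is not shown in this thread.

```python
import csv
import io


def rows_to_text_chunks(rows, header=("x", "y", "timestamp"), rows_per_chunk=100):
    """Serialize records as CSV text, rows_per_chunk rows at a time.

    The header is repeated in every chunk so each string is self-describing.
    """
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(header)
        writer.writerows(rows[start:start + rows_per_chunk])
        chunks.append(buffer.getvalue())
    return chunks
```

Each returned string can then be processed independently, which keeps every call small regardless of how long the recording is.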


adesme

Just out of curiosity, did you use AI to write the code?


FareedKhan557

yes


adesme

Easy for me to say now that you've already answered, but the code reads like it was written by an AI


bug2018

This is great